Domain patterns for data pipeline architecture — batch processing, stream processing, ETL/ELT, DAG scheduling, data quality, schema evolution, backfill strategies, and failure modes. Use when designing or evaluating data pipelines, ETL systems, or streaming data infrastructure.
Please help me install the "system-type-data-pipeline" skill from askskill: 1. Download https://raw.githubusercontent.com/microsoft/amplifier-bundle-systems-design/main/skills/system-type-data-pipeline/SKILL.md 2. Save it as ~/.claude/skills/system-type-data-pipeline/SKILL.md 3. Once it is installed, reload the skills and tell me it is ready to use
Patterns, failure modes, and anti-patterns for batch and streaming data pipelines.
Batch processing
What it is. Process bounded datasets on a schedule: hourly, daily, or triggered. Read a full partition, transform, write output. The workhorse of data engineering.
When to use. Reporting, analytics, ML feature generation, any workload where latency of minutes to hours is acceptable. When the source data naturally arrives in chunks (file drops, database snapshots, daily exports). When transformations are complex aggregations across the full dataset.
When to avoid. When business requirements demand sub-minute freshness. When the dataset grows faster than the batch window can process it, because you'll never catch up. When downstream consumers need continuous updates, not periodic dumps.
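The read-transform-write cycle described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production job: `run_batch`, the dict-based `source` and `sink`, and the per-customer sum are all hypothetical stand-ins for a real partitioned store and a real transformation.

```python
from collections import defaultdict
from datetime import date


def run_batch(source, partition, sink):
    """Read one full partition, aggregate it, and overwrite the output partition."""
    rows = source.get(partition, [])       # read the bounded input for this partition
    totals = defaultdict(float)
    for row in rows:                       # transform: sum amount per customer
        totals[row["customer"]] += row["amount"]
    sink[partition] = dict(totals)         # write: replace the whole output partition
    return sink[partition]


# Hypothetical input: one daily partition of order rows.
source = {
    date(2024, 1, 1): [
        {"customer": "a", "amount": 10.0},
        {"customer": "b", "amount": 5.0},
        {"customer": "a", "amount": 2.5},
    ],
}
sink = {}
result = run_batch(source, date(2024, 1, 1), sink)
rerun = run_batch(source, date(2024, 1, 1), sink)
```

Because the sketch overwrites the output partition instead of appending to it, running the same partition twice leaves the sink unchanged.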
Stream processing
What it is. Process unbounded data continuously as it arrives. Events flow through a topology of operators. State is maintained in-stream.
When to use. Real-time fraud detection, live dashboards, operational alerting, CDC-based replication, any use case where "the answer must be current." When events have value that decays with time.
When to avoid. When you need complex joins across large historical windows, because stream state gets expensive. When your team has no operational experience with Flink/Kafka Streams/Spark Streaming (the failure modes are subtle and unforgiving). When the "real-time" requirement is actually "within an hour"; that's batch.
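To make "state is maintained in-stream" concrete, here is a minimal sketch of a single stateful operator that handles one event at a time. The `VelocityDetector` class, its thresholds, and the sample events are hypothetical, and the sketch assumes events arrive in timestamp order, which real streams do not guarantee.

```python
from collections import defaultdict, deque


class VelocityDetector:
    """Flags a card that produces more than max_events within window_seconds."""

    def __init__(self, window_seconds=60, max_events=3):
        self.window_seconds = window_seconds
        self.max_events = max_events
        # In-stream state: card id -> timestamps still inside the window.
        self.recent = defaultdict(deque)

    def on_event(self, card, ts):
        """Process one event as it arrives; return True if it should raise an alert."""
        timestamps = self.recent[card]
        timestamps.append(ts)
        # Evict timestamps that are window_seconds or more older than this event.
        while timestamps and timestamps[0] <= ts - self.window_seconds:
            timestamps.popleft()
        return len(timestamps) > self.max_events


detector = VelocityDetector(window_seconds=60, max_events=3)
events = [("c1", 0), ("c1", 10), ("c2", 15), ("c1", 20), ("c1", 30), ("c1", 200)]
alerts = [(card, ts) for card, ts in events if detector.on_event(card, ts)]
```

The per-card deque is the state that grows with window size and key count, which is why joins across large historical windows get expensive in this model.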
Micro-batch processing
What it is. Process small batches at very short intervals (seconds to low minutes). Spark Structured Streaming is the canonical example. Gives near-real-time latency with batch-like programming models.
When to use. When you need latency better than batch but the team's skill set is batch-oriented. When exactly-once semantics are easier to reason about in batch units. When sub-second latency is not required.
When to avoid. When you need true event-at-a-time processing with sub-second latency. When the micro-batch interval masks timing bugs that will surface under load. When the overhead of repeated job initialization dominates actual processing time.
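The idea of applying batch-style logic to short intervals can be sketched without any framework. This is an illustration only and does not use the Spark API: `micro_batches` is a hypothetical helper, and it assumes the events are already ordered by timestamp.

```python
from itertools import groupby


def micro_batches(events, interval_seconds):
    """Group time-ordered (ts, value) events into consecutive fixed-length intervals."""
    for bucket, group in groupby(events, key=lambda e: e[0] // interval_seconds):
        # Yield the interval start time and the values that fell inside it.
        yield bucket * interval_seconds, [value for _, value in group]


# Hypothetical events: (timestamp in seconds, value).
events = [(0, 1), (2, 4), (5, 3), (9, 2), (10, 7), (31, 5)]

# Each small batch is processed with ordinary batch logic (here, a sum).
sums = [(start, sum(values)) for start, values in micro_batches(events, 10)]
```

Results only appear once an interval closes, so the interval length sets a floor on latency.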
ETL (Extract, Transform, Load). Transform data before loading into the target. Traditional approach. Use when the target system is expensive (data warehouse with per-query pricing), when you need to filter/clean before storage, or when the target can't handle raw data volumes.
ELT (Extract, Load, Transform). Load raw data first, transform in the target system. Modern approach enabled by cheap storage and powerful query engines. Use when storage is cheap, when you want to preserve raw data for reprocessing, when transformations evolve frequently, or when the target system (Snowflake, BigQuery, Databricks) has strong compute for transformations.
The real tradeoff. ETL reduces storage cost and query scope at the expense of flexibility: you can't transform what you didn't keep. ELT preserves optionality at the expense of storage cost and potential query performance on raw data. Default to ELT unless you have a specific reason not to.
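The tradeoff can be seen in a small sketch where the same cleaning step runs before the load (ETL) or after it (ELT). The dict-based "warehouses", the `clean` and `revenue_by_country` functions, and the sample rows are hypothetical; a real target would be a warehouse or lakehouse table.

```python
raw_events = [
    {"user": "u1", "amount": 20, "country": "DE"},
    {"user": "u2", "amount": -5, "country": "US"},  # invalid amount
    {"user": "u3", "amount": 7, "country": "US"},
]


def clean(rows):
    """Transform: keep valid rows and project only the reporting columns."""
    return [{"user": r["user"], "amount": r["amount"]} for r in rows if r["amount"] > 0]


# ETL: transform first, then load only the cleaned rows.
etl_warehouse = {"sales": clean(raw_events)}

# ELT: load raw rows untouched, then derive the cleaned table inside the target.
elt_warehouse = {"raw_events": list(raw_events)}
elt_warehouse["sales"] = clean(elt_warehouse["raw_events"])


def revenue_by_country(rows):
    """A later requirement that needs the 'country' column."""
    out = {}
    for r in rows:
        if r["amount"] > 0:
            out[r["country"]] = out.get(r["country"], 0) + r["amount"]
    return out


# Only the ELT target still holds the column the new requirement depends on.
elt_revenue = revenue_by_country(elt_warehouse["raw_events"])
etl_can_answer = all("country" in r for r in etl_warehouse["sales"])
```

Both targets end up with the same `sales` table, but only the ELT side can serve the later requirement without going back to the source.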
Lambda architecture
What it is. Run batch and streaming pipelines in parallel. Batch layer provides complete, accurate results; speed layer provides approximate, real-time results. Merge at query time.
When to use. Almost never in new systems. Was a necessary compromise before streaming frameworks matured.
When to avoid. In most cases. Maintaining two codepaths that must produce consistent results is an operational nightmare. Logic drift between batch and speed layers is the norm, not the exception. Prefer Kappa architecture unless you have a proven need for both.
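A minimal sketch of the query-time merge shows where the maintenance cost comes from. The function names, the page-view events, and the `cutoff` timestamp are hypothetical. Note that the counting logic exists twice, once per layer, and both copies must stay in agreement.

```python
def batch_view(events, cutoff):
    """Batch layer: count page views for events before the cutoff."""
    counts = {}
    for ts, page in events:
        if ts < cutoff:
            counts[page] = counts.get(page, 0) + 1
    return counts


def speed_view(events, cutoff):
    """Speed layer: a second implementation of the same count for newer events."""
    counts = {}
    for ts, page in events:
        if ts >= cutoff:
            counts[page] = counts.get(page, 0) + 1
    return counts


def query(page, batch, speed):
    """Merge at query time: add the batch result and the speed result."""
    return batch.get(page, 0) + speed.get(page, 0)


events = [(1, "home"), (2, "home"), (3, "docs"), (11, "home"), (12, "docs")]
batch = batch_view(events, cutoff=10)
speed = speed_view(events, cutoff=10)
home_views = query("home", batch, speed)
docs_views = query("docs", batch, speed)
```

If one layer changes its definition of a view (for example, by excluding bot traffic) and the other does not, the merged number is silently wrong, which is the logic drift described above.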
Kappa architecture
What it is. Single streaming pipeline handles both real-time and historical reprocessing. Replay the log to reprocess.
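Reprocessing by replay can be sketched with an in-memory append-only log. The `Log` class, `build_view`, and the unit change from whole units to cents are hypothetical; in practice the log would be a durable system with enough retention to replay from the beginning.

```python
class Log:
    """A tiny append-only log that supports reading from any offset."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the appended event

    def read_from(self, offset=0):
        return iter(self._entries[offset:])


def build_view(events, transform):
    """The single processing codepath: total per user after applying transform."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + transform(amount)
    return view


log = Log()
for event in [("u1", 10), ("u2", 5), ("u1", 3)]:
    log.append(event)

# Original logic: sum the amounts as recorded.
view_v1 = build_view(log.read_from(0), lambda amount: amount)

# Changed logic (hypothetical: report in cents): replay the same log from offset 0.
view_v2 = build_view(log.read_from(0), lambda amount: amount * 100)
```

The same function serves both the live path and the historical rebuild, so there is no second codepath to keep consistent.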
…