system-type-data-pipeline

Name: system-type-data-pipeline
Author: microsoft

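Helps design and evaluate data pipeline architectures, covering batch and stream processing, scheduling, data quality, and failure strategies.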

Source: GitHub

Updated: 2026-07-20

// Security assessment: low risk

Prompt only; does not execute code
Open source and auditable

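// Install

Copy the install instructions below and let the AI complete the setup automatically (recommended for beginners).

Please help me install the "system-type-data-pipeline" skill from askskill:
1. Download https://raw.githubusercontent.com/microsoft/amplifier-bundle-systems-design/main/skills/system-type-data-pipeline/SKILL.md
2. Save it as ~/.claude/skills/system-type-data-pipeline/SKILL.md
3. After installing, reload skills and tell me it is ready to use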

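// Usage examples

Design a hybrid real-time and offline pipeline

Input

Please design a data pipeline architecture for an e-commerce platform that supports real-time order analytics and daily offline reports. Compare the tradeoffs of batch processing, stream processing, and ETL/ELT, and provide plans for DAG scheduling, data quality validation, backfill strategy, schema evolution, and failure recovery.

Expected output

A structured architecture proposal that explains technology choices, processing paths, key tradeoffs, and operational strategy.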

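Review an existing ETL system

Input

Please review this existing ETL system: the daily batch has high latency, task dependencies are complex, and backfills often fail. Identify risks from the perspectives of DAG design, failure modes, data quality, observability, and scalability, and propose improvements.

Expected output

A list of diagnosed problems and optimization recommendations, with priorities and actionable directions for improvement.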

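Define a schema evolution plan

Input

We are preparing to upgrade our event data model by adding fields and changing some types. Please define a schema evolution plan that keeps upstream and downstream systems compatible, and describe the versioning, data validation, historical backfill, and exception handling strategies.

Expected output

A compatibility-first schema evolution plan that spells out implementation steps, risks, and safeguards.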

// Documentation

System Type: Data Pipeline

Patterns, failure modes, and anti-patterns for batch and streaming data pipelines.

Core Patterns

Batch Processing

What it is. Process bounded datasets on a schedule: hourly, daily, or triggered. Read a full partition, transform, write output. The workhorse of data engineering.

When to use. Reporting, analytics, ML feature generation, any workload where latency of minutes to hours is acceptable. When the source data naturally arrives in chunks (file drops, database snapshots, daily exports). When transformations are complex aggregations across the full dataset.

When to avoid. When business requirements demand sub-minute freshness. When the dataset grows faster than the batch window can process it, because you'll never catch up. When downstream consumers need continuous updates, not periodic dumps.
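The "read a full partition, transform, write output" loop can be sketched in a few lines. This is a minimal illustration, not a production design: the in-memory `SOURCE` and `OUTPUT` dictionaries, the date-string partition keys, and the `run_daily_batch` function are all hypothetical stand-ins for real storage and a real scheduler.

```python
from collections import defaultdict

# Hypothetical in-memory "storage": partition key (a date string) -> list of rows.
SOURCE = {
    "2024-01-01": [
        {"customer": "a", "amount": 10.0},
        {"customer": "b", "amount": 5.0},
        {"customer": "a", "amount": 2.5},
    ],
}
OUTPUT = {}


def run_daily_batch(partition, source, output):
    """Read one full partition, aggregate it, and overwrite the output partition."""
    totals = defaultdict(float)
    for row in source.get(partition, []):
        totals[row["customer"]] += row["amount"]
    # Overwrite rather than append, so rerunning the same partition is idempotent.
    output[partition] = dict(totals)
    return output[partition]


result = run_daily_batch("2024-01-01", SOURCE, OUTPUT)
```

Writing the output by overwriting the whole partition is one common way to keep reruns and backfills safe; an append-based write would double-count if the job ran twice.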

Stream Processing

What it is. Process unbounded data continuously as it arrives. Events flow through a topology of operators. State is maintained in-stream.

When to use. Real-time fraud detection, live dashboards, operational alerting, CDC-based replication, any use case where "the answer must be current." When events have value that decays with time.

When to avoid. When you need complex joins across large historical windows, because stream state gets expensive. When your team has no operational experience with Flink/Kafka Streams/Spark Streaming (the failure modes are subtle and unforgiving). When the "real-time" requirement is actually "within an hour": that's batch.
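To make "state is maintained in-stream" concrete, here is a small sketch of a single stateful operator written as a plain Python generator. It does not use Flink, Kafka Streams, or any real streaming framework; the `flag_high_spend` name, the event fields, and the spending limit are assumptions chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def flag_high_spend(events: Iterable[dict], limit: float) -> Iterator[dict]:
    """Stateful operator: emit an alert whenever a card's running total exceeds limit."""
    running_total: Dict[str, float] = defaultdict(float)  # state held by the operator
    for event in events:  # one event at a time; the input may be unbounded
        running_total[event["card"]] += event["amount"]
        if running_total[event["card"]] > limit:
            yield {"card": event["card"], "total": running_total[event["card"]]}


events = [
    {"card": "c1", "amount": 40.0},
    {"card": "c2", "amount": 10.0},
    {"card": "c1", "amount": 70.0},
]
alerts = list(flag_high_spend(events, limit=100.0))
```

The per-card dictionary grows with the number of distinct keys and is never expired here, which is a small-scale version of the state cost mentioned above. A real framework would also have to checkpoint and restore that state after a failure.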

Micro-Batch

What it is. Process small batches at very short intervals (seconds to low minutes). Spark Structured Streaming is the canonical example. Gives near-real-time latency with batch-like programming models.

When to use. When you need latency better than batch but the team's skill set is batch-oriented. When exactly-once semantics are easier to reason about in batch units. When sub-second latency is not required.

When to avoid. When you need true event-at-a-time processing with sub-second latency. When the micro-batch interval masks timing bugs that will surface under load. When the overhead of repeated job initialization dominates actual processing time.
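The idea of cutting a stream into short, fixed intervals and handling each slice with batch-style code can be sketched as follows. This is a toy model rather than how Spark Structured Streaming is implemented; the `micro_batches` helper, the `ts` field (seconds), and the assumption that events arrive in timestamp order are all illustrative.

```python
from itertools import groupby


def micro_batches(events, interval_s):
    """Group time-ordered events into consecutive batches of interval_s seconds."""
    for window, group in groupby(events, key=lambda e: int(e["ts"] // interval_s)):
        # Yield the batch start time together with the events that fall in it.
        yield window * interval_s, list(group)


events = [
    {"ts": 0.5, "value": 1},
    {"ts": 1.9, "value": 2},
    {"ts": 2.1, "value": 3},
    {"ts": 5.0, "value": 4},
]
# Each batch is processed with ordinary batch-style code (here: a sum).
sums = [
    (start, sum(e["value"] for e in batch))
    for start, batch in micro_batches(events, 2)
]
```

Because `groupby` only merges adjacent items, out-of-order events would land in separate batches in this sketch; handling late data correctly is one of the timing concerns a longer interval can hide.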

ETL vs ELT

ETL (Extract, Transform, Load). Transform data before loading into the target. Traditional approach. Use when the target system is expensive (data warehouse with per-query pricing), when you need to filter/clean before storage, or when the target can't handle raw data volumes.

ELT (Extract, Load, Transform). Load raw data first, transform in the target system. Modern approach enabled by cheap storage and powerful query engines. Use when storage is cheap, when you want to preserve raw data for reprocessing, when transformations evolve frequently, or when the target system (Snowflake, BigQuery, Databricks) has strong compute for transformations.

The real tradeoff. ETL reduces storage cost and query scope at the expense of flexibility: you can't transform what you didn't keep. ELT preserves optionality at the expense of storage cost and potential query performance on raw data. Default to ELT unless you have a specific reason not to.
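The difference in where the transformation runs can be shown side by side with an in-memory SQLite database standing in for the target system. This is only a sketch: the table names, the `raw` sample rows, and the "paid orders only" rule are assumptions made up for the example.

```python
import sqlite3

# Hypothetical extracted rows: (order id, status, amount).
raw = [("o1", "paid", 20.0), ("o2", "cancelled", 15.0), ("o3", "paid", 5.0)]

db = sqlite3.connect(":memory:")

# ETL: transform (filter) in application code, then load only the result.
db.execute("CREATE TABLE etl_orders (id TEXT, amount REAL)")
db.executemany(
    "INSERT INTO etl_orders VALUES (?, ?)",
    [(oid, amount) for oid, status, amount in raw if status == "paid"],
)

# ELT: load everything raw, then transform inside the target with SQL.
db.execute("CREATE TABLE raw_orders (id TEXT, status TEXT, amount REAL)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)
db.execute(
    "CREATE VIEW elt_orders AS SELECT id, amount FROM raw_orders WHERE status = 'paid'"
)

etl_total = db.execute("SELECT SUM(amount) FROM etl_orders").fetchone()[0]
elt_total = db.execute("SELECT SUM(amount) FROM elt_orders").fetchone()[0]
# Only the ELT side can still answer questions about cancelled orders.
cancelled = db.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE status = 'cancelled'"
).fetchone()[0]
```

Both paths produce the same paid-order total, but only the ELT path retains the cancelled order, so a later change to the business rule can be applied by redefining the view instead of re-extracting from the source.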

Lambda Architecture

What it is. Run batch and streaming pipelines in parallel. Batch layer provides complete, accurate results; speed layer provides approximate, real-time results. Merge at query time.

When to use. Almost never in new systems. Was a necessary compromise before streaming frameworks matured.

When to avoid. In most cases. Maintaining two codepaths that must produce consistent results is an operational nightmare. Logic drift between batch and speed layers is the norm, not the exception. Prefer Kappa architecture unless you have a proven need for both.
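A toy version of the query-time merge shows why the two codepaths are a maintenance burden. Everything here is hypothetical: the `batch_view`, `speed_view`, and `query` functions, the page-view events, and the integer `cutoff` marking the last timestamp the batch layer has processed.

```python
def batch_view(events, cutoff):
    """Batch layer: complete counts for events up to and including the cutoff."""
    counts = {}
    for e in events:
        if e["ts"] <= cutoff:
            counts[e["page"]] = counts.get(e["page"], 0) + 1
    return counts


def speed_view(events, cutoff):
    """Speed layer: counts for events after the cutoff. Same logic, second copy."""
    counts = {}
    for e in events:
        if e["ts"] > cutoff:
            counts[e["page"]] = counts.get(e["page"], 0) + 1
    return counts


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch.get(page, 0) + speed.get(page, 0)


events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "home"},
    {"ts": 3, "page": "cart"},
    {"ts": 4, "page": "home"},
]
batch = batch_view(events, cutoff=2)
speed = speed_view(events, cutoff=2)
home_views = query("home", batch, speed)
```

The counting rule is written twice, once per layer. In this sketch the copies agree, but any change (for example, excluding bot traffic) has to be made in both places, which is how logic drift starts.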