Domain patterns for ML/AI serving and training systems — model serving, feature stores, training pipelines, experiment tracking, A/B testing, GPU scheduling, and failure modes. Use when designing or evaluating machine learning infrastructure, model serving platforms, or AI-powered product features.
Please help me install the "system-type-ml-serving" skill from askskill: 1. Download https://raw.githubusercontent.com/microsoft/amplifier-bundle-systems-design/main/skills/system-type-ml-serving/SKILL.md 2. Save it as ~/.claude/skills/system-type-ml-serving/SKILL.md 3. Once it is installed, reload skills and tell me it is ready to use
Patterns, failure modes, and anti-patterns for machine learning infrastructure and model serving systems.
What it is. Model receives a request, runs inference synchronously, returns a prediction. The model sits in the request path, so latency matters as much as accuracy.
When to use. User-facing predictions where the result must be immediate: search ranking, recommendation, fraud scoring at transaction time, autocomplete.
When to avoid. When predictions can be precomputed. When the model is too large to meet latency budgets. When the cost per prediction doesn't justify real-time serving.
Key concerns. Tail latency (P99, not just P50) dominates user experience. Model loading time creates cold start problems. Memory footprint determines how many models fit per node. Timeouts must be set aggressively: a slow prediction is worse than a fallback.
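The "slow prediction is worse than a fallback" point can be illustrated with a small sketch. It assumes an async serving path; `slow_model`, `FALLBACK_SCORE`, and the 50 ms default budget are hypothetical stand-ins, not part of any particular serving framework.

```python
import asyncio

FALLBACK_SCORE = 0.0  # hypothetical default returned when the model is too slow


async def slow_model(features: dict) -> float:
    """Stand-in for a real model call; sleeps to simulate inference time."""
    await asyncio.sleep(features.get("simulated_latency_s", 0.0))
    return 0.9


async def predict_with_timeout(features: dict, timeout_s: float = 0.05):
    """Return (score, used_fallback); fall back if inference exceeds timeout_s."""
    try:
        score = await asyncio.wait_for(slow_model(features), timeout=timeout_s)
        return score, False
    except asyncio.TimeoutError:
        # The inference task is cancelled by wait_for; serve the fallback instead.
        return FALLBACK_SCORE, True


def predict(features: dict, timeout_s: float = 0.05):
    """Synchronous wrapper for callers outside an event loop."""
    return asyncio.run(predict_with_timeout(features, timeout_s))
```

In a real service the fallback would more likely be a cached or heuristic prediction than a constant, and the fallback rate would be tracked as a metric alongside P99 latency.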
What it is. Run predictions over a large dataset on a schedule (hourly, daily). Write results to a store; serve precomputed predictions at request time.
When to use. Recommendations that refresh periodically. Risk scoring where real-time freshness isn't required. Any case where the input space is bounded and enumerable.
When to avoid. When input features change faster than the batch interval. When the input space is too large to precompute (e.g., arbitrary user queries). When staleness directly harms the user.
Key concerns. Batch jobs that overrun their schedule. Incomplete batches that leave stale predictions for some entities. The join between precomputed predictions and request-time serving (cache misses for new entities). Monitoring must cover prediction freshness, not just job success.
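As a rough sketch of the freshness and cache-miss concerns above: the store layout (a dict mapping entity ID to a `(score, computed_at)` pair), the function names, and the status labels are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def lookup_prediction(store, entity_id, now, max_age, default):
    """Serve a precomputed prediction and report whether it is fresh, stale, or missing."""
    record = store.get(entity_id)
    if record is None:
        # New entity that the last batch run never saw.
        return default, "miss"
    score, computed_at = record
    if now - computed_at > max_age:
        # Still served, but flagged so the caller can decide what to do.
        return score, "stale"
    return score, "fresh"


def stale_fraction(store, now, max_age):
    """Fraction of stored predictions older than max_age (a freshness metric)."""
    if not store:
        return 0.0
    stale = sum(1 for _, computed_at in store.values() if now - computed_at > max_age)
    return stale / len(store)
```

Alerting on `stale_fraction` rather than on job exit status is one way to catch an incomplete batch that still reported success.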
What it is. Model consumes events from a stream (Kafka, Kinesis), produces predictions continuously. Sits between batch and real-time: lower latency than batch, lower cost than synchronous serving.
When to use. Event-driven predictions: fraud detection on transaction streams, anomaly detection on telemetry, real-time feature updates feeding downstream models.
When to avoid. When you need sub-100ms request-response latency. When the prediction consumer expects synchronous responses.
Key concerns. Consumer lag means predictions fall behind reality. Backpressure from slow models causes event queue growth. Exactly-once semantics for predictions that trigger side effects (e.g., blocking a transaction). Reprocessing on model update: do you recompute predictions for the backlog or only apply the new model forward?
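One common way to approximate exactly-once side effects on top of at-least-once delivery is to make the handler idempotent by deduplicating on an event ID. The sketch below assumes that approach; the event fields (`event_id`, `txn_id`), the class name, and the in-memory set are hypothetical, and a real consumer would keep the processed-ID record in durable storage.

```python
class FraudConsumer:
    """Applies a model to events and blocks transactions, at most once per event."""

    def __init__(self, score_fn, threshold):
        self.score_fn = score_fn
        self.threshold = threshold
        self.processed_ids = set()  # assumption: durable storage in a real system
        self.blocked = []           # record of side effects actually applied

    def handle(self, event):
        """Return True if the event was processed, False if it was a redelivery."""
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return False
        if self.score_fn(event) >= self.threshold:
            self.blocked.append(event["txn_id"])
        self.processed_ids.add(event_id)
        return True


def consumer_lag(latest_offset, committed_offset):
    """Events not yet processed on a partition; growth signals backpressure."""
    return max(0, latest_offset - committed_offset)
```

The lag helper is only arithmetic over offsets that the streaming platform already exposes; how those offsets are fetched depends on the client library in use.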
Model-as-a-service. Centralized inference endpoint. Clear ownership, independent scaling, versioning decoupled from application deploys. But: network latency, one more service to operate, coupling on availability.
Embedded models. Model ships inside the application binary or container. No network hop. But: every application deploy includes the model, model updates require app redeployment, resource isolation is harder (the model competes with the app for memory and CPU).
The real question: How often does the model change independently of the application? If weekly or more, service extraction pays off. If quarterly, embedding avoids operational overhead.
What it is. A system that manages feature computation, storage, and serving for ML models. Separates feature engineering from model training and serving.
Online store. Low-latency key-value lookups at serving time. Backed by Redis, DynamoDB, or similar. Optimized for point lookups by entity ID.
Offline store. Historical feature values for training. Backed by a data warehouse, object storage, or lakehouse. Optimized for bulk reads with time-range filters.
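The online/offline split can be sketched with a toy in-memory store. This is an illustration only: `InMemoryFeatureStore` and its method names are hypothetical, a dict stands in for a key-value store such as Redis, and a list stands in for a warehouse table.

```python
from datetime import datetime


class InMemoryFeatureStore:
    """Toy feature store with a latest-value online view and a full offline history."""

    def __init__(self):
        self.online = {}   # entity_id -> (timestamp, features); latest value only
        self.offline = []  # append-only rows of (entity_id, timestamp, features)

    def write(self, entity_id, ts, features):
        """Record a feature value in the history and refresh the online view."""
        self.offline.append((entity_id, ts, dict(features)))
        current = self.online.get(entity_id)
        if current is None or ts >= current[0]:
            self.online[entity_id] = (ts, dict(features))

    def get_online(self, entity_id):
        """Point lookup by entity ID, as used at serving time."""
        entry = self.online.get(entity_id)
        return None if entry is None else entry[1]

    def read_offline(self, start, end):
        """Bulk read of rows with start <= timestamp < end, as used for training."""
        return [row for row in self.offline if start <= row[1] < end]
```

Writing both views from one `write` path keeps serving and training on the same feature definitions, which is the separation of concerns the section describes.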
…
Catalog of reusable architectural primitives — boundaries, contracts, state machines, queues, caches, consistency models, and more. For each: what it is, when it's right, when it's WRONG. Use when selecting patterns for a design or evaluating whether a pattern fits.
Domain-Driven Design as a lens for system architecture — bounded contexts, aggregates, ubiquitous language, context mapping, domain events, and strategic vs tactical patterns. Use when modeling complex business domains, defining service boundaries, or evaluating whether a system's structure reflects its domain.
The Unix/Linux design philosophy as a lens for system design — mechanism vs policy, composability, small tools, text streams, convention over configuration, and the principle of least surprise. Use when evaluating designs for composability, simplicity, or separation of concerns.
Object-oriented design principles as a lens for system architecture — SOLID, composition over inheritance, the actor model, design patterns (and when they're wrong), encapsulation, polymorphism, and responsibility-driven design. Use when evaluating code organization, module boundaries, or object/component relationships.
Domain patterns for Azure cloud architecture — compute selection, managed services, identity (Entra ID), networking, data platform, messaging, deployment, cost management, and operational patterns. Use when designing or evaluating a system deployed on Microsoft Azure.
Adversarial review of a system design from 6 critical perspectives -- SRE, security, staff engineer, finance, operator, and developer advocate. Produces a unified risk assessment. Use for INTERACTIVE on-demand reviews during a design conversation (/adversarial-review). For RECIPE-DRIVEN reviews (where prior step context is needed), use the systems-design-critic agent instead.