Use when running tasks in Docker containers with safety limits, watchdog monitoring for resource enforcement, orphan container recovery, sidecar container provisioning, or scripting reproducible dev stack environments.
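Copy the install instructions and let the AI complete the setup automatically · recommended for beginners
Please help me install the "container-orchestration-patterns" skill from askskill: 1. Download https://raw.githubusercontent.com/microsoft/amplifier-bundle-skills/main/skills/container-orchestration-patterns/SKILL.md 2. Save it as ~/.claude/skills/container-orchestration-patterns/SKILL.md 3. Once it is installed, reload the skills and tell me it is ready to use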
Problem: You're executing tasks in containers (one per task). Those tasks can fork-bomb, exhaust memory, run forever, or leave orphan containers after a crash. You need safety limits, monitoring, and cleanup — plus optional sidecar services (databases, caches, auxiliary APIs).
Approach: Hard container limits (PID, memory, CPU, lifetime), a watchdog loop that polls docker stats and kills violators, orphan recovery on restart, and sidecar provisioning with bind-mounted persistent data.
Pattern proven in production across multiple Python CLI tools and web services.
Safety limits exist because of a real incident: in one production deployment, over 4,000 runaway test processes consumed 103Gi of RAM and caused OOM kills across the host.
# Container safety limits — prevent fork bomb and memory exhaustion incidents.
# These values were determined after a real incident where thousands of runaway
# processes consumed all available RAM and caused OOM kills.
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"
CONTAINER_CPU_LIMIT = 2.0
MAX_INSTANCE_LIFETIME_SECONDS = 12 * 60 * 60 # 12 hours
These are passed to docker create as resource constraints. The PID limit is the most critical: it caps the number of processes in the container's cgroup, so a fork bomb inside the container cannot exhaust the host's process table.
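As a rough sketch of how these constants could map onto `docker create` flags: the helper name `build_create_args` and its `image`/`name` parameters are hypothetical and not taken from the original code, while `--pids-limit`, `--memory`, `--memory-swap`, and `--cpus` are standard Docker CLI options. Setting `--memory-swap` to the same value as `--memory` leaves the container with no swap.

```python
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"  # equal to the memory limit, so no swap is available
CONTAINER_CPU_LIMIT = 2.0


def build_create_args(image: str, name: str) -> list[str]:
    """Return a `docker create` argv with the safety limits applied.

    Hypothetical helper for illustration; the original project may build
    its command line differently.
    """
    return [
        "docker", "create",
        "--name", name,
        "--pids-limit", str(CONTAINER_PIDS_LIMIT),
        "--memory", CONTAINER_MEMORY_LIMIT,
        "--memory-swap", CONTAINER_MEMORY_SWAP_LIMIT,
        "--cpus", str(CONTAINER_CPU_LIMIT),
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_create_args("python:3.12-slim", "task-1")))
```

Returning an argv list rather than a shell string avoids quoting problems when the list is handed to a subprocess call.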
The watchdog runs as a background asyncio task, polling every 5 minutes:
async def watchdog_loop(self, instance_store, interval=300):
    while True:
        # Iterate over a snapshot so entries can be removed during the pass
        for instance_id, info in list(self._active.items()):
            await self._watchdog_check_instance(instance_id, info, instance_store)
        await asyncio.sleep(interval)
async def _watchdog_check_instance(self, instance_id, info, instance_store):
    container_name = info.container_name

    # Check 1: Lifetime (age_seconds is computed in code elided from this excerpt)
    if age_seconds > MAX_INSTANCE_LIFETIME_SECONDS:
        await self._watchdog_destroy(instance_id, ...)
        return

    # Check 2 & 3: PIDs and Memory (single docker stats call)
    rc, stdout, _ = await self._client._run_docker(
        "stats", "--no-stream", "--format", "{{.PIDs}} {{.MemPerc}}",
        container_name)
    parts = stdout.strip().split()
    # Defensive guard, not in the original excerpt (assumes rc is the exit code):
    # skip this cycle if the stats call failed or returned unexpected output.
    if rc != 0 or len(parts) < 2:
        return
    pid_count = int(parts[0])
    mem_perc = float(parts[1].rstrip("%"))

    if pid_count > _WATCHDOG_PID_THRESHOLD:  # 200
        await self._watchdog_destroy(...)
        return

    if mem_perc > _WATCHDOG_MEMORY_PERCENT_THRESHOLD:  # 80%
        await self._watchdog_destroy(...)
        return
Key design: the watchdog uses docker stats --no-stream with a format string to get both PID count and memory percentage in a single call. This minimizes Docker API overhead.
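A minimal sketch of parsing that two-field output defensively. The function name `parse_stats_line` is hypothetical; it assumes the `"{{.PIDs}} {{.MemPerc}}"` format string used by the watchdog and returns `None` for empty, malformed, or non-numeric output instead of raising.

```python
from typing import Optional, Tuple


def parse_stats_line(stdout: str) -> Optional[Tuple[int, float]]:
    """Parse '<pids> <mem%>' into (pid_count, mem_perc).

    Returns None when the output does not have exactly two numeric fields,
    so the caller can skip the check for this polling cycle.
    """
    parts = stdout.strip().split()
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), float(parts[1].rstrip("%"))
    except ValueError:
        # One of the fields was not a number.
        return None


if __name__ == "__main__":
    print(parse_stats_line("12 3.45%\n"))
    print(parse_stats_line(""))
```

Keeping the parsing in a pure function makes it testable without a Docker daemon.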
The thresholds (_WATCHDOG_PID_THRESHOLD = 200, _WATCHDOG_MEMORY_PERCENT_THRESHOLD = 80.0) are below the hard limits (CONTAINER_PIDS_LIMIT = 256, CONTAINER_MEMORY_LIMIT = "8g"). This gives the watchdog a chance to detect and kill a container before it reaches a hard limit, where new process creation starts failing at the PID cap or the kernel OOM-kills processes at the memory cap.
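The ordering of the three checks and the headroom below the hard limits can be sketched as a pure decision function. The name `watchdog_verdict` and its string return values are hypothetical; the constants are the ones quoted in the text, and the sketch assumes the memory percentage is reported relative to the container's memory limit.

```python
from typing import Optional

MAX_INSTANCE_LIFETIME_SECONDS = 12 * 60 * 60  # 12 hours
CONTAINER_PIDS_LIMIT = 256
_WATCHDOG_PID_THRESHOLD = 200
_WATCHDOG_MEMORY_PERCENT_THRESHOLD = 80.0

# The soft thresholds only help if they sit below the hard limits.
assert _WATCHDOG_PID_THRESHOLD < CONTAINER_PIDS_LIMIT
assert _WATCHDOG_MEMORY_PERCENT_THRESHOLD < 100.0


def watchdog_verdict(age_seconds: float, pid_count: int, mem_perc: float) -> Optional[str]:
    """Return the reason a container should be destroyed, or None to keep it."""
    if age_seconds > MAX_INSTANCE_LIFETIME_SECONDS:
        return "lifetime"
    if pid_count > _WATCHDOG_PID_THRESHOLD:
        return "pids"
    if mem_perc > _WATCHDOG_MEMORY_PERCENT_THRESHOLD:
        return "memory"
    return None


if __name__ == "__main__":
    print(watchdog_verdict(age_seconds=60, pid_count=201, mem_perc=5.0))
```

Because the comparisons are strict, a container sitting exactly at a threshold is left alone until the next poll.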
Destroying a container also destroys its sidecar containers:
async def _watchdog_destroy(self, instance_id, container_name, instance_store):
    # Destroy the main container
    await self._client.destroy_container(container_name)

    # Destroy sidecar if present
    # (companion containers, if your architecture uses them)
    info = self._active.get(instance_id)
    if info is not None and info.sidecar_env_id is not None:
        await destroy_sidecar(info.sidecar_env_id)

    # Update status and remove from active tracking
    instance_store.update_instance(instance_id, status="cancelled")
    self._active.pop(instance_id, None)
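To make the teardown order concrete, here is a self-contained sketch that can be run without Docker. Everything in it (`InstanceInfo`, `FakeStore`, `Manager`, `run_demo`) is a hypothetical stand-in for the real classes, which are not shown in this excerpt; the stubs only record what would have been destroyed.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InstanceInfo:
    container_name: str
    sidecar_env_id: Optional[str] = None


class FakeStore:
    """Stand-in for the instance store; records status updates in memory."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def update_instance(self, instance_id: str, status: str) -> None:
        self.status[instance_id] = status


class Manager:
    def __init__(self) -> None:
        self._active: Dict[str, InstanceInfo] = {}
        self.destroyed: List[str] = []  # names in the order they were torn down

    async def _destroy(self, name: str) -> None:
        # A real implementation would call the Docker client here.
        self.destroyed.append(name)

    async def watchdog_destroy(self, instance_id: str, store: FakeStore) -> None:
        info = self._active.get(instance_id)
        if info is None:
            return
        await self._destroy(info.container_name)   # main container first
        if info.sidecar_env_id is not None:
            await self._destroy(info.sidecar_env_id)  # then its sidecar
        store.update_instance(instance_id, status="cancelled")
        self._active.pop(instance_id, None)


def run_demo():
    manager, store = Manager(), FakeStore()
    manager._active["i-1"] = InstanceInfo("task-1", sidecar_env_id="sidecar-1")
    asyncio.run(manager.watchdog_destroy("i-1", store))
    return manager, store


if __name__ == "__main__":
    m, s = run_demo()
    print(m.destroyed, s.status)
```

Removing the entry from the active map last means a failure partway through leaves the instance tracked, so a later watchdog pass can retry the cleanup.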
…