container-orchestration-patterns

Name: container-orchestration-patterns
Author: microsoft

Helps you safely orchestrate Docker container tasks and set up reproducible development environments

Stars

★ 10

Source

GitHub

Updated

2026-07-20

// Security assessment: low risk

Prompt only; does not execute code
Open source and auditable


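// Installation

Copy the install prompt and let the AI complete the setup · recommended for beginners

Please install the "container-orchestration-patterns" skill from askskill for me:
1. Download https://raw.githubusercontent.com/microsoft/amplifier-bundle-skills/main/skills/container-orchestration-patterns/SKILL.md
2. Save it as ~/.claude/skills/container-orchestration-patterns/SKILL.md
3. Reload skills once it is installed and tell me it is ready to use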


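// Usage examples

Set safety limits for container tasks

Input

Design an orchestration plan for a Docker container that runs a data-processing script. Limit CPU and memory, set up a timeout watchdog, and automatically clean up orphan containers after an abnormal exit.

Expected output

A container orchestration plan covering resource limits, monitoring, and failure cleanup.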

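Recover containers left behind by failures

Input

I have a set of batch jobs that often leave orphan containers behind when interrupted. Give me a recovery procedure covering detection, log retention, container reclamation, and a retry strategy.

Expected output

A standard operating procedure for handling orphan containers, with an outline of an executable script.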

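Set up a reproducible development environment

Input

Design a scripted setup for a local development environment with a main service, a database, and a logging sidecar. Team members should be able to start it with one command and get a consistent environment.

Expected output

A dev-stack plan with sidecar configuration, startup scripts, and instructions for reproducing the environment.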

// Documentation

Container Orchestration & Dev Stacks

The Pattern

Problem: You're executing tasks in containers (one per task). Those tasks can fork-bomb, exhaust memory, run forever, or leave orphan containers after a crash. You need safety limits, monitoring, and cleanup — plus optional sidecar services (databases, caches, auxiliary APIs).

Approach: Hard container limits (PID, memory, CPU, lifetime), a watchdog loop that polls docker stats and kills violators, orphan recovery on restart, and sidecar provisioning with bind-mounted persistent data.

Pattern proven in production across multiple Python CLI tools and web services.
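
The Approach mentions orphan recovery on restart but this section does not show how orphans are identified. One possible shape, sketched under assumptions: containers are listed by name with something like docker ps -a --filter label=<your-label> --format "{{.Names}}" (the label is hypothetical), and any listed name the orchestrator no longer tracks is treated as an orphan. The helper names below are illustrative and not taken from the source.

```python
def parse_name_listing(stdout: str) -> list[str]:
    """Split the one-name-per-line output of a `docker ps --format "{{.Names}}"` call."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def find_orphans(listed_names: list[str], tracked_names: set[str]) -> list[str]:
    """Return containers Docker still reports that the orchestrator does not track.

    Assumption: `tracked_names` comes from whatever persistent store survives a
    restart; the source does not specify its shape.
    """
    return sorted(name for name in set(listed_names) if name not in tracked_names)
```

Keeping the set difference separate from the Docker call makes the recovery decision easy to test without a running daemon.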

Key Design Decisions

1. Container safety limits — the runaway processes incident

Safety limits exist because of a real incident: in one production deployment, over 4,000 runaway test processes consumed 103Gi of RAM and caused OOM kills across the host.

# Container safety limits — prevent fork bomb and memory exhaustion incidents.
# These values were determined after a real incident where thousands of runaway
# processes consumed all available RAM and caused OOM kills.
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"
CONTAINER_CPU_LIMIT = 2.0
MAX_INSTANCE_LIFETIME_SECONDS = 12 * 60 * 60  # 12 hours

These are passed to docker create as resource constraints. The PID limit is the most critical: it caps the number of processes in the container's cgroup, so a fork bomb cannot exhaust the host's process table.
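
As a rough illustration of how the constants could map onto docker create flags, the sketch below builds the argument list. The flags (--pids-limit, --memory, --memory-swap, --cpus) are standard Docker CLI options; the build_create_args helper and its parameters are assumptions, since the source does not show this step.

```python
# Values copied from the constants above.
CONTAINER_PIDS_LIMIT = 256
CONTAINER_MEMORY_LIMIT = "8g"
CONTAINER_MEMORY_SWAP_LIMIT = "8g"
CONTAINER_CPU_LIMIT = 2.0


def build_create_args(image: str, name: str) -> list[str]:
    """Return an argv for `docker create` with the safety limits applied.

    Hypothetical helper: the source only says the limits are passed to
    `docker create`, not how the command line is assembled.
    """
    return [
        "docker", "create",
        "--name", name,
        "--pids-limit", str(CONTAINER_PIDS_LIMIT),
        "--memory", CONTAINER_MEMORY_LIMIT,
        # --memory-swap is the combined memory + swap ceiling; setting it equal
        # to --memory leaves the container no swap to spill into.
        "--memory-swap", CONTAINER_MEMORY_SWAP_LIMIT,
        "--cpus", str(CONTAINER_CPU_LIMIT),
        image,
    ]
```

Setting the swap limit equal to the memory limit is what makes the 8g figure a true ceiling rather than a soft one.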

2. Watchdog monitoring loop

The watchdog runs as a background asyncio task, polling every 5 minutes:

async def watchdog_loop(self, instance_store, interval=300):
    while True:
        for instance_id, info in list(self._active.items()):
            await self._watchdog_check_instance(instance_id, info, instance_store)
        await asyncio.sleep(interval)

async def _watchdog_check_instance(self, instance_id, info, instance_store):
    container_name = info.container_name

    # Check 1: Lifetime
    if age_seconds > MAX_INSTANCE_LIFETIME_SECONDS:
        await self._watchdog_destroy(instance_id, ...)
        return

    # Check 2 & 3: PIDs and Memory (single docker stats call)
    rc, stdout, _ = await self._client._run_docker(
        "stats", "--no-stream", "--format", "{{.PIDs}} {{.MemPerc}}",
        container_name)
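    parts = stdout.strip().split()
    if rc != 0 or len(parts) != 2:
        return  # stats unavailable (the container may have exited); retry on the next poll
    try:
        pid_count = int(parts[0])
        mem_perc = float(parts[1].rstrip("%"))
    except ValueError:
        return  # non-numeric placeholder output; skip this poll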

    if pid_count > _WATCHDOG_PID_THRESHOLD:     # 200
        await self._watchdog_destroy(...)
        return
    if mem_perc > _WATCHDOG_MEMORY_PERCENT_THRESHOLD:  # 80%
        await self._watchdog_destroy(...)
        return

Key design: the watchdog uses docker stats --no-stream with a format string to get both PID count and memory percentage in a single call. This minimizes Docker API overhead.

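The thresholds (_WATCHDOG_PID_THRESHOLD = 200, _WATCHDOG_MEMORY_PERCENT_THRESHOLD = 80.0) sit below the hard limits (CONTAINER_PIDS_LIMIT = 256, CONTAINER_MEMORY_LIMIT = "8g"). This gives the watchdog a chance to detect and kill a container before it reaches a hard limit, where new forks start failing (PIDs) or the kernel OOM-kills its processes (memory). Because the loop polls only every 5 minutes, the hard limits remain the backstop for anything that escalates faster than that.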
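
The three checks can be separated from the Docker call so they are testable on their own. The following is a minimal sketch, assuming the stats line has the "<pids> <mem%>" shape produced by the format string above; parse_stats_line and violation are hypothetical names, and age_seconds is supplied by the caller because the source does not show how it is computed.

```python
from typing import Optional

# Values copied from the text above.
MAX_INSTANCE_LIFETIME_SECONDS = 12 * 60 * 60
_WATCHDOG_PID_THRESHOLD = 200
_WATCHDOG_MEMORY_PERCENT_THRESHOLD = 80.0


def parse_stats_line(line: str) -> Optional[tuple[int, float]]:
    """Parse '<pids> <mem%>'; return None unless both fields are numeric."""
    parts = line.strip().split()
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), float(parts[1].rstrip("%"))
    except ValueError:
        return None


def violation(age_seconds: float, stats_line: str) -> Optional[str]:
    """Return why a container should be destroyed, or None to leave it running."""
    if age_seconds > MAX_INSTANCE_LIFETIME_SECONDS:
        return "lifetime"
    parsed = parse_stats_line(stats_line)
    if parsed is None:
        return None  # no usable sample; wait for the next poll
    pid_count, mem_perc = parsed
    if pid_count > _WATCHDOG_PID_THRESHOLD:
        return "pids"
    if mem_perc > _WATCHDOG_MEMORY_PERCENT_THRESHOLD:
        return "memory"
    return None
```

Both comparisons are strict, matching the source: a container sitting exactly at 200 PIDs or 80% memory is left alone.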

3. Watchdog destroy — cleanup with sidecar awareness

Destroying a container also destroys its sidecar containers:

async def _watchdog_destroy(self, instance_id, container_name, instance_store):
    # Destroy the main container
    await self._client.destroy_container(container_name)
    # Destroy the sidecar if present (companion containers, if your architecture uses them)
    info = self._active.get(instance_id)
    if info is not None and info.sidecar_env_id is not None:
        await destroy_sidecar(info.sidecar_env_id)
    # Update status and remove from active tracking
    instance_store.update_instance(instance_id, status="cancelled")
    self._active.pop(instance_id, None)