
ECS stopped-task postmortem → claude root-cause classifier

Sweeps every cluster for recently STOPPED tasks, joins service, exit code, and stopReason for each, then hands the JSON to claude to bucket failures as code bug, OOM/resource exhaustion, or config/IAM issue, so the worker team can route the page to the right owner.

Setup
  • → brew install awscli jq
  • → aws configure # needs ecs:ListClusters, ecs:ListTasks, ecs:DescribeTasks
  • → claude /login OR export ANTHROPIC_API_KEY=sk-…
Cost per run
<$0.01
The one-liner
$ aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n' | while read -r C; do ARNS=$(aws ecs list-tasks --cluster "$C" --desired-status STOPPED --query 'taskArns[]' --output text); [ -z "$ARNS" ] && continue; echo "$ARNS" | xargs -n 100 | while read -r CHUNK; do aws ecs describe-tasks --cluster "$C" --tasks $CHUNK --query 'tasks[].{cluster:clusterArn,group:group,taskDef:taskDefinitionArn,stoppedAt:stoppedAt,stopCode:stopCode,stopReason:stoppedReason,exit:containers[0].exitCode,reason:containers[0].reason}' --output json; done; done | jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) | sort_by(.stoppedAt) | reverse | .[0:40]' | claude -p "You are an on-call SRE triaging an ECS worker fleet. Each JSON object is a stopped task. For each, classify root cause as: CODE_BUG (non-zero exit + app-level reason), OOM (container reason mentions OutOfMemory or exit 137), RESOURCE (stopCode TaskFailedToStart, no capacity, ENI), CONFIG (image pull, secrets, IAM, networking), or UNKNOWN. Group by (service, bucket), output a markdown table with: service | bucket | count | representative stopReason | next action. Keep it phone-sized."
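The fiddliest plumbing is the chunking step: `xargs -n 100` repacks the whitespace-separated ARN list into lines of at most 100, and each line becomes one describe-tasks batch. The repacking is easiest to see on toy input (hypothetical ARNs, `-n 3` standing in for `-n 100`):

```shell
# xargs -n 3 packs whitespace-separated tokens into lines of at most 3,
# the same way the one-liner packs task ARNs into describe-tasks batches.
printf 'arn1\tarn2\tarn3\tarn4\tarn5\n' | xargs -n 3
# → arn1 arn2 arn3
# → arn4 arn5
```

Each output line is then meant to expand, unquoted, into up to 100 separate --tasks arguments for a single API call.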
What each stage does
  1. [01] aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n'
    --output text returns cluster ARNs tab-separated on one line; tr converts to newlines so `while read` iterates correctly. JSON output here would force a jq detour.
  2. [02] aws ecs list-tasks --cluster "$C" --desired-status STOPPED
    STOPPED is the only desired-status that exposes failed tasks. ECS retains stopped tasks for ~1 hour, so this is a moving postmortem window — fine for triage, useless for week-old incidents (use CloudWatch Container Insights for those).
  3. [03] xargs -n 100
    describe-tasks caps at 100 task ARNs per call; chunking is mandatory in any cluster with a real workload or the API returns InvalidParameterException.
  4. [04] aws ecs describe-tasks --query 'tasks[].{exit:containers[0].exitCode,reason:containers[0].reason,stopCo…
    JMESPath projection pulls only the 7 fields the LLM needs. Full describe-tasks payloads are 4-8KB each — sending raw would blow the token budget on a busy cluster.
  5. [05] jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) …
    -s slurps the per-cluster JSON arrays into one outer array and add concatenates them; the filter drops clean shutdowns (exit 0 + EssentialContainerExited is a normal Fargate completion), then caps the list at the 40 most recent failures: plenty for diagnosis, cheap on tokens.
  6. [06] claude -p "... classify root cause as: CODE_BUG / OOM / RESOURCE / CONFIG / UNKN…
    Pre-defined buckets prevent the model from inventing a 12-category taxonomy. Grouping by service+bucket collapses 40 raw failures into ~5 actionable rows — the right shape for a pager.
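The jq stage is easy to sanity-check offline. A minimal sketch with two hypothetical per-cluster arrays (one clean Fargate completion, one OOM kill) shows the filter keeping only the failure:

```shell
# Two fake describe-tasks projections: a clean exit-0 shutdown and an exit-137 kill.
printf '%s\n' \
  '[{"exit":0,"stopCode":"EssentialContainerExited","stoppedAt":"2024-05-01T10:00:00Z"}]' \
  '[{"exit":137,"stopCode":"EssentialContainerExited","stoppedAt":"2024-05-01T11:00:00Z","reason":"OutOfMemoryError: Container killed"}]' |
jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) | sort_by(.stoppedAt) | reverse | .[0:40]'
# → a one-element array holding only the exit-137 task
```

Note that .exit != 0 also keeps tasks with a null exit code (e.g. TaskFailedToStart, where no container ever ran), because in jq null != 0 evaluates to true.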
Expected output (sample)
| service | bucket | count | representative stopReason | next action |
|---|---|---|---|---|
| etl-batch-fn | OOM | 14 | OutOfMemoryError: Container killed (exit 137) | Raise task memory from 2GB → 4GB or fix pandas merge in transform.py |
| reprocess-worker | CODE_BUG | 9 | Essential container exited (exit 1): KeyError 'order_id' | Null-safety bug after schema change; ping #payments |
| image-thumbnailer | RESOURCE | 4 | RESOURCE:MEMORY — no container instance met all requirements | Capacity provider scaled to max; raise ASG ceiling |
| auth-token-rotator | CONFIG | 2 | CannotPullContainerError: pull access denied | ECR repo policy missing task role; check IAM |
| analytics-ingest | UNKNOWN | 1 | Task failed ELB health checks | Investigate manually |
Caveats & tips
  • ECS only retains stopped tasks for ~1 hour — for older failures pull from CloudWatch Container Insights or an event-bridge archive.
  • containers[0] assumes a single-container task; sidecar-heavy task defs need containers[?essential==`true`] | [0] instead.
  • stopCode is null for tasks killed by service deployments (normal); the exit!=0 filter catches the real failures.
  • In large accounts (>50 clusters), wrap the outer loop in `xargs -P 8` for parallelism — sequential list-tasks calls dominate runtime.
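For the last tip, the fan-out shape looks like this; a stand-in sketch with hypothetical cluster names and echo where the real list-tasks/describe-tasks loop body would go:

```shell
# Up to 8 concurrent workers, one cluster per sh invocation; $0 receives the name.
printf 'payments-prod\netl-prod\nsearch-prod\n' |
  xargs -P 8 -n 1 sh -c 'echo "sweeping $0"'
# prints one "sweeping <cluster>" line per cluster, in nondeterministic order
```

Swap the echo for the per-cluster body from the one-liner (quoted as one sh -c script); output order across clusters stops mattering once everything is slurped by the trailing jq -s.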