
ECS stopped-task postmortem → claude root-cause classifier

Sweeps every cluster for recently STOPPED tasks, joins service, exit code, and stopReason for each, then hands the JSON to claude to bucket failures as code bug, OOM/resource exhaustion, or config/IAM issue, so the worker team can route the page to the right owner.

Setup
  • → brew install awscli jq
  • → aws configure # needs ecs:ListClusters, ecs:ListTasks, ecs:DescribeTasks
  • → claude /login OR export ANTHROPIC_API_KEY=sk-…
Cost per run
<$0.01
The one-liner
$ aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n' | while read -r C; do ARNS=$(aws ecs list-tasks --cluster "$C" --desired-status STOPPED --query 'taskArns[]' --output text); [ -z "$ARNS" ] && continue; echo "$ARNS" | xargs -n 100 | while read -r CHUNK; do aws ecs describe-tasks --cluster "$C" --tasks $CHUNK --query 'tasks[].{cluster:clusterArn,group:group,taskDef:taskDefinitionArn,stoppedAt:stoppedAt,stopCode:stopCode,stopReason:stoppedReason,exit:containers[0].exitCode,reason:containers[0].reason}' --output json; done; done | jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) | sort_by(.stoppedAt) | reverse | .[0:40]' | claude -p "You are an on-call SRE triaging an ECS worker fleet. Each JSON object is a stopped task. For each, classify root cause as: CODE_BUG (non-zero exit + app-level reason), OOM (container reason mentions OutOfMemory or exit 137), RESOURCE (stopCode TaskFailedToStart, no capacity, ENI), CONFIG (image pull, secrets, IAM, networking), or UNKNOWN. Group by (service, bucket), output a markdown table with: service | bucket | count | representative stopReason | next action. Keep it phone-sized."
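The fiddliest plumbing is the chunking step: `xargs -n 100` repacks the whitespace-separated ARN list into lines of at most 100, and each line becomes one describe-tasks batch. The repacking is easiest to see on toy input (hypothetical ARNs, `-n 3` standing in for `-n 100`):

```shell
# xargs -n 3 packs whitespace-separated tokens into lines of at most 3,
# the same way the one-liner packs task ARNs into describe-tasks batches.
printf 'arn1\tarn2\tarn3\tarn4\tarn5\n' | xargs -n 3
# → arn1 arn2 arn3
# → arn4 arn5
```

Each output line is then meant to expand, unquoted, into up to 100 separate --tasks arguments for a single API call.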
What each stage does
  1. [01] aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n'
    --output text returns cluster ARNs tab-separated on one line; tr converts to newlines so `while read` iterates correctly. JSON output here would force a jq detour.
  2. [02] aws ecs list-tasks --cluster "$C" --desired-status STOPPED
    STOPPED is the only desired-status that exposes failed tasks. ECS retains stopped tasks for ~1 hour, so this is a moving postmortem window — fine for triage, useless for week-old incidents (use CloudWatch Container Insights for those).
  3. [03] xargs -n 100
    describe-tasks caps at 100 task ARNs per call; chunking is mandatory in any cluster with a real workload or the API returns InvalidParameterException.
  4. [04] aws ecs describe-tasks --query 'tasks[].{exit:containers[0].exitCode,reason:containers[0].reason,stopCo…
    JMESPath projection pulls only the 7 fields the LLM needs. Full describe-tasks payloads are 4-8KB each — sending raw would blow the token budget on a busy cluster.
  5. [05] jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) …
    -s slurps the per-cluster JSON arrays into one outer array and add concatenates them; the filter drops clean shutdowns (exit 0 + EssentialContainerExited is a normal Fargate completion), then caps the list at the 40 most recent failures: plenty for diagnosis, cheap on tokens.
  6. [06] claude -p "... classify root cause as: CODE_BUG / OOM / RESOURCE / CONFIG / UNKN…
    Pre-defined buckets prevent the model from inventing a 12-category taxonomy. Grouping by service+bucket collapses 40 raw failures into ~5 actionable rows — the right shape for a pager.
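The jq stage is easy to sanity-check offline. A minimal sketch with two hypothetical per-cluster arrays (one clean Fargate completion, one OOM kill) shows the filter keeping only the failure:

```shell
# Two fake describe-tasks projections: a clean exit-0 shutdown and an exit-137 kill.
printf '%s\n' \
  '[{"exit":0,"stopCode":"EssentialContainerExited","stoppedAt":"2024-05-01T10:00:00Z"}]' \
  '[{"exit":137,"stopCode":"EssentialContainerExited","stoppedAt":"2024-05-01T11:00:00Z","reason":"OutOfMemoryError: Container killed"}]' |
jq -s 'add | map(select(.exit != 0 or .stopCode != "EssentialContainerExited")) | sort_by(.stoppedAt) | reverse | .[0:40]'
# → a one-element array holding only the exit-137 task
```

Note that .exit != 0 also keeps tasks with a null exit code (e.g. TaskFailedToStart, where no container ever ran), because in jq null != 0 evaluates to true.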
Expected output (sample)
| service | bucket | count | representative stopReason | next action |
|---|---|---|---|---|
| etl-batch-fn | OOM | 14 | OutOfMemoryError: Container killed (exit 137) | Raise task memory from 2GB → 4GB or fix pandas merge in transform.py |
| reprocess-worker | CODE_BUG | 9 | Essential container exited (exit 1): KeyError 'order_id' | Null-safety bug after schema change; ping #payments |
| image-thumbnailer | RESOURCE | 4 | RESOURCE:MEMORY — no container instance met all requirements | Capacity provider scaled to max; raise ASG ceiling |
| auth-token-rotator | CONFIG | 2 | CannotPullContainerError: pull access denied | ECR repo policy missing task role; check IAM |
| analytics-ingest | UNKNOWN | 1 | Task failed ELB health checks | Investigate manually |
Caveats & tips
  • ECS only retains stopped tasks for ~1 hour — for older failures pull from CloudWatch Container Insights or an event-bridge archive.
  • containers[0] assumes a single-container task; sidecar-heavy task defs need containers[?essential==`true`] | [0] instead.
  • stopCode is null for tasks killed by service deployments (normal); the exit!=0 filter catches the real failures.
  • In large accounts (>50 clusters), wrap the outer loop in `xargs -P 8` for parallelism — sequential list-tasks calls dominate runtime.
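For the last tip, the fan-out shape looks like this; a stand-in sketch with hypothetical cluster names and echo where the real list-tasks/describe-tasks loop body would go:

```shell
# Up to 8 concurrent workers, one cluster per sh invocation; $0 receives the name.
printf 'payments-prod\netl-prod\nsearch-prod\n' |
  xargs -P 8 -n 1 sh -c 'echo "sweeping $0"'
# prints one "sweeping <cluster>" line per cluster, in nondeterministic order
```

Swap the echo for the per-cluster body from the one-liner (quoted as one sh -c script); output order across clusters stops mattering once everything is slurped by the trailing jq -s.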