SQS fleet health snapshot → claude wedge-vs-starve diagnosis
Pulls depth and oldest-message age for every SQS queue in the region plus 1h sent-message volume from CloudWatch, then asks claude to classify each queue as wedged, starving, or healthy so the on-call worker team knows where to look first.
Setup
- → brew install awscli jq
- → aws configure # needs sqs:ListQueues, sqs:GetQueueAttributes, cloudwatch:GetMetricData
- → claude /login OR export ANTHROPIC_API_KEY=sk-…
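A quick preflight check can confirm the three CLIs are on PATH before the first run. This is a minimal sketch; the function name missing_tools is an assumption and not part of the original recipe.

```shell
# Print each named command that is not on PATH.
# Returns 0 only when every command was found.
missing_tools() {
  local tool missing
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: missing_tools aws jq claude || echo "install the tools listed above first"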
Cost per run
<$0.01
The one-liner
$ END=$(date -u +%Y-%m-%dT%H:%M:%SZ); START=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ); URLS=$(aws sqs list-queues --query 'QueueUrls[]' --output text); printf 'queue\tvisible\toldest_age_s\tsent_1h\n' > /tmp/sqs.tsv; QUERIES='[]'; i=0; for U in $URLS; do NAME=$(basename "$U"); VIS=$(aws sqs get-queue-attributes --queue-url "$U" --attribute-names ApproximateNumberOfMessages --query 'Attributes.ApproximateNumberOfMessages' --output text); QUERIES=$(echo "$QUERIES" | jq --arg n "$NAME" --arg i "$i" 'def q($id; $m; $s; $p): {Id:$id,MetricStat:{Metric:{Namespace:"AWS/SQS",MetricName:$m,Dimensions:[{Name:"QueueName",Value:$n}]},Period:$p,Stat:$s},ReturnData:true}; . + [q("s"+$i; "NumberOfMessagesSent"; "Sum"; 3600), q("a"+$i; "ApproximateAgeOfOldestMessage"; "Maximum"; 300)]'); printf '%s\t%s\t@a%s@\t@s%s@\n' "$NAME" "$VIS" "$i" "$i" >> /tmp/sqs.tsv; i=$((i+1)); done; METRICS=$(aws cloudwatch get-metric-data --start-time "$START" --end-time "$END" --metric-data-queries "$QUERIES" --output json | jq -r '.MetricDataResults[] | "\(.Id)\t\(.Values[0] // 0)"'); while IFS=$'\t' read -r ID VAL; do sed -i.bak "s|@${ID}@|$VAL|" /tmp/sqs.tsv; done <<< "$METRICS"; cat /tmp/sqs.tsv | claude -p "You are an on-call SRE getting paged at 3am about a worker fleet. The TSV is queue|visible|oldest_age_s|sent_1h. For each queue classify as WEDGED (consumers stuck: high age + high depth + sent>0), STARVING (no producers: sent=0, depth flat), DRAINING_OK, or HEALTHY. Output a phone-screen-sized table sorted by severity, one-line action per row."
What each stage does
- [01] date
END=$(date -u +%Y-%m-%dT%H:%M:%SZ); START=$(date -u -v-1H +...)
A 1-hour window in ISO 8601 UTC that works with both BSD and GNU date: the BSD -v-1H form runs first and falls back to GNU -d '1 hour ago'. The explicit Z suffix means get-metric-data never depends on the local time zone.
- [02] aws
aws sqs list-queues --query 'QueueUrls[]' --output text
Enumerates all queues in the current region. --query unwraps the QueueUrls array so the shell for-loop gets bare URLs (no JSON quoting).
- [03] aws
aws sqs get-queue-attributes --queue-url "$U" --attribute-names ApproximateNumbe…
Reads the current visible depth. Attribute names must be requested explicitly; the default returns nothing. The age of the oldest message is the signal that shows consumers are stuck, but ApproximateAgeOfOldestMessage is not a queue attribute: SQS publishes it only as a CloudWatch metric, so it is fetched in the next two stages. The row is written with @a<i>@ and @s<i>@ placeholders that are delimited on both sides, so the token for queue 1 can never match inside the token for queue 10.
- [04] jq
jq --arg n "$NAME" --arg i "$i" 'def q($id; $m; $s; $p): {Id:$id,MetricStat:{Metric:{Namespace:"A…
Builds a single batched GetMetricData payload (up to 500 metrics per call) instead of N legacy get-metric-statistics calls. Each queue adds two queries: NumberOfMessagesSent with Period 3600 and Stat Sum for the 1h send count, and ApproximateAgeOfOldestMessage with Period 300 and Stat Maximum for the most recent age. Two queries per queue means one call covers up to 250 queues.
- [05] aws
aws cloudwatch get-metric-data --metric-data-queries "$QUERIES"
One API call returns every per-queue send count and age. Legacy get-metric-statistics would need a round-trip per metric per queue and can rate-limit you in accounts with more than 50 queues. The sed loop then swaps each placeholder for its value.
- [06] claude
claude -p "You are an on-call SRE getting paged at 3am ... classify as WEDGED / …
The four-bucket taxonomy forces the model to differentiate 'consumers broken' from 'producers stopped', the two failure modes that look identical on a CloudWatch depth dashboard.
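The date fallback from stage 01 is easier to read when pulled out of the one-liner. The sketch below wraps it in a helper; the name hour_window is an assumption made for illustration, and the timestamp format is the one the one-liner already uses.

```shell
# Print "START END": two ISO 8601 UTC timestamps one hour apart.
# Tries the BSD/macOS form first, then falls back to the GNU form.
hour_window() {
  local start end
  end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  start=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null) \
    || start=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
  printf '%s %s\n' "$start" "$end"
}
```

Usage: set -- $(hour_window); START=$1; END=$2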
Expected output (sample)
SEVERITY  QUEUE                     STATE        visible / age / sent_1h  ACTION
P1        reprocess-worker-queue    WEDGED       18420 / 2734s / 22100    Consumers stuck: check ECS service reprocess-worker, likely poison pill
P1        billing-events-queue      WEDGED       812 / 1890s / 940        Visibility timeout < processing time; bump VisibilityTimeout or fix slow handler
P2        analytics-ingest-queue    STARVING     0 / 0s / 0               Upstream producer silent for >1h; check kinesis-firehose-ingest
P3        emails-outbound-queue     DRAINING_OK  1247 / 12s / 8200        Healthy burst, draining normally
--        notifications-fanout-dlq  HEALTHY      0 / 0s / 3               No action
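The taxonomy in the prompt can also be approximated with fixed rules, which gives a rough cross-check on the model's table. This is a sketch under stated assumptions: the function name classify_tsv and the AGE_MAX and DEPTH_MAX thresholds (defaults 300 s and 100 messages) are hypothetical, and a single snapshot cannot show "depth flat", so STARVING reduces to sent_1h = 0 here.

```shell
# Read the queue/visible/oldest_age_s/sent_1h TSV on stdin and print
# "queue<TAB>state" per row. Thresholds are illustrative assumptions.
classify_tsv() {
  awk -F'\t' -v age_max="${AGE_MAX:-300}" -v depth_max="${DEPTH_MAX:-100}" '
    NR == 1 { next }   # skip the header row
    {
      state = "HEALTHY"
      if ($3 > age_max && $2 > depth_max && $4 > 0) state = "WEDGED"
      else if ($4 == 0)                            state = "STARVING"
      else if ($2 > depth_max)                     state = "DRAINING_OK"
      printf "%s\t%s\n", $1, state
    }'
}
```

Usage: classify_tsv < /tmp/sqs.tsv. With the default thresholds, the five sample rows above come out as WEDGED, WEDGED, STARVING, DRAINING_OK, and HEALTHY.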
Caveats & tips
- ApproximateAgeOfOldestMessage has up to 1 minute eventual-consistency lag — don't trigger alerts on a single sample.
- Cross-account queues won't show up in list-queues; run per-account with --profile if you have a multi-account worker fleet.
- SQS pushes NumberOfMessagesSent to CloudWatch at 1-minute intervals; the 1h Sum is reliable, but sub-minute spikes are invisible.
- FIFO queues append .fifo to QueueName — basename strips the URL but keeps the suffix, which is correct for the CloudWatch dimension.
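To act on the first caveat, an alert can require several consecutive samples over the threshold instead of one. The helper below is a minimal sketch; the name sustained_over and the idea of feeding it one age reading per line (oldest first) are assumptions for illustration.

```shell
# Read one numeric sample per line on stdin (oldest first).
# Succeeds only if the most recent $2 samples all exceed threshold $1.
sustained_over() {
  awk -v t="$1" -v n="$2" '
    { run = ($1 > t) ? run + 1 : 0 }   # length of the current over-threshold streak
    END { exit !(run >= n) }'
}
```

Usage: printf '%s\n' 410 520 640 | sustained_over 300 3 && echo "age has stayed above 300s for 3 samples"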