#041·ops·aws·power

SQS fleet health snapshot → claude wedge-vs-starve diagnosis

Pulls depth and oldest-message age for every SQS queue in the region plus 1h sent-message volume from CloudWatch, then asks claude to classify each queue as wedged, starving, or healthy so the on-call worker team knows where to look first.

Setup
  • → brew install awscli jq
  • → aws configure # needs sqs:ListQueues, sqs:GetQueueAttributes, cloudwatch:GetMetricData
  • → claude /login OR export ANTHROPIC_API_KEY=sk-…
Cost per run
<$0.01
The one-liner
$ END=$(date -u +%Y-%m-%dT%H:%M:%SZ); START=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ); URLS=$(aws sqs list-queues --query 'QueueUrls[]' --output text); printf 'queue\tvisible\toldest_age_s\tsent_1h\n' > /tmp/sqs.tsv; QUERIES='[]'; i=0; for U in $URLS; do NAME=$(basename "$U"); ATTRS=$(aws sqs get-queue-attributes --queue-url "$U" --attribute-names ApproximateNumberOfMessages ApproximateAgeOfOldestMessage --query 'Attributes' --output json); VIS=$(echo "$ATTRS" | jq -r '.ApproximateNumberOfMessages // "0"'); AGE=$(echo "$ATTRS" | jq -r '.ApproximateAgeOfOldestMessage // "0"'); QUERIES=$(echo "$QUERIES" | jq --arg n "$NAME" --arg id "m$i" '. + [{Id:$id,MetricStat:{Metric:{Namespace:"AWS/SQS",MetricName:"NumberOfMessagesSent",Dimensions:[{Name:"QueueName",Value:$n}]},Period:3600,Stat:"Sum"},ReturnData:true}]'); printf '%s\t%s\t%s\tPENDING_%s\n' "$NAME" "$VIS" "$AGE" "$i" >> /tmp/sqs.tsv; i=$((i+1)); done; SENT=$(aws cloudwatch get-metric-data --start-time "$START" --end-time "$END" --metric-data-queries "$QUERIES" --output json | jq -r '.MetricDataResults[] | "\(.Id)\t\(.Values[0] // 0)"'); while IFS=$'\t' read -r ID VAL; do sed -i.bak "s|PENDING_${ID#m}\$|$VAL|" /tmp/sqs.tsv; done <<< "$SENT"; cat /tmp/sqs.tsv | claude -p "You are an on-call SRE getting paged at 3am about a worker fleet. The TSV is queue|visible|oldest_age_s|sent_1h. For each queue classify as WEDGED (consumers stuck: high age + high depth + sent>0), STARVING (no producers: sent=0, depth flat), DRAINING_OK, or HEALTHY. Output a phone-screen-sized table sorted by severity, one-line action per row."
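The date dance at the front is the only non-obvious shell in there, and it runs fine without any AWS credentials. Pulled out on its own:

```shell
# Portable 1-hour ISO-8601 UTC window: BSD date understands -v-1H,
# GNU date understands -d '1 hour ago'; whichever fails is silenced
# and the || picks up the other dialect.
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
START=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
     || date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
echo "querying CloudWatch from $START to $END"
```

Running it on either macOS or Linux prints two timestamps exactly one hour apart, both ending in Z.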
What each stage does
  1. [01] END=$(date -u +%Y-%m-%dT%H:%M:%SZ); START=$(date -u -v-1H +...)
    GNU/BSD-portable 1-hour window in ISO-8601 UTC: BSD date takes -v-1H, GNU date takes -d '1 hour ago'. CloudWatch get-metric-data expects ISO-8601/RFC3339 timestamps, and sticking to UTC avoids any offset ambiguity.
  2. [02] aws sqs list-queues --query 'QueueUrls[]' --output text
    Enumerates all queues in the current region. --query unwraps the QueueUrls array so the shell for-loop gets bare URLs (no JSON quoting).
  3. [03] aws sqs get-queue-attributes --queue-url "$U" --attribute-names ApproximateNumbe…
    ApproximateAgeOfOldestMessage is the one signal SQS itself exposes that proves consumers are stuck; CloudWatch only gives you depth deltas. Attribute names must be requested explicitly, since the call returns no attributes by default.
  4. [04] jq --arg n "$NAME" --arg id "m$i" '. + [{Id:$id,MetricStat:{Metric:{Namespace:"A…
    Builds a single batched GetMetricData payload (up to 500 metrics per call) instead of N legacy get-metric-statistics calls. Period:3600+Stat:Sum gives a 1h send count per queue.
  5. [05] aws cloudwatch get-metric-data --metric-data-queries "$QUERIES"
    One API call returns all per-queue send counts. Modern batch API; legacy get-metric-statistics would be N round-trips and rate-limit you in accounts with >50 queues.
  6. [06] claude -p "You are an on-call SRE getting paged at 3am ... classify as WEDGED / …
    The four-bucket taxonomy forces the model to differentiate 'consumers broken' from 'producers stopped' — the two failure modes that look identical on a CloudWatch depth dashboard.
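Stage 3's jq defaults matter more than they look: a queue that has never held a message can come back without the attribute you asked for. A minimal sketch against a canned response (hypothetical values, no AWS call):

```shell
# Canned GetQueueAttributes payload standing in for the real call;
# '// "0"' keeps the TSV numeric when an attribute is absent.
ATTRS='{"ApproximateNumberOfMessages":"812","ApproximateAgeOfOldestMessage":"1890"}'
VIS=$(echo "$ATTRS" | jq -r '.ApproximateNumberOfMessages // "0"')
AGE=$(echo "$ATTRS" | jq -r '.ApproximateAgeOfOldestMessage // "0"')
# An empty response falls through to the default instead of printing "null".
MISSING=$(echo '{}' | jq -r '.ApproximateAgeOfOldestMessage // "0"')
echo "$VIS $AGE $MISSING"   # → 812 1890 0
```

Without the `// "0"` guard, a brand-new queue would write the literal string null into the TSV and confuse the downstream sort.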
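Stage 4's accumulator is easier to read unrolled. Here it is with two hypothetical queue names and no AWS dependency, just jq building the batch payload:

```shell
# Accumulate one GetMetricData query per queue into a JSON array.
QUERIES='[]'; i=0
for NAME in orders-worker billing-events; do   # hypothetical queue names
  QUERIES=$(echo "$QUERIES" | jq --arg n "$NAME" --arg id "m$i" \
    '. + [{Id:$id,
           MetricStat:{Metric:{Namespace:"AWS/SQS",
                               MetricName:"NumberOfMessagesSent",
                               Dimensions:[{Name:"QueueName",Value:$n}]},
                       Period:3600,Stat:"Sum"},
           ReturnData:true}]')
  i=$((i+1))
done
echo "$QUERIES" | jq -c '[.[].Id]'   # → ["m0","m1"]
```

The Id values (m0, m1, …) are the join key: GetMetricData echoes them back in MetricDataResults, which is how the one-liner maps each Sum to its PENDING_<i> placeholder.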
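The placeholder patch after stage 5 is the easy place to get bitten: without an end-of-line anchor, the substitution for PENDING_1 also eats the front of PENDING_10 once a fleet passes ten queues. A dry run on fake data (hypothetical queue names, no AWS):

```shell
# Fake TSV with placeholder IDs 0, 1, and 10.
printf 'q-a\t1\t2\tPENDING_0\nq-b\t3\t4\tPENDING_1\nq-c\t5\t6\tPENDING_10\n' > /tmp/demo.tsv
# Patch each placeholder; the \$ anchor stops PENDING_1 matching inside PENDING_10.
printf 'm0\t100\nm1\t200\nm10\t300\n' | while IFS=$'\t' read -r ID VAL; do
  sed -i.bak "s|PENDING_${ID#m}\$|$VAL|" /tmp/demo.tsv
done
cat /tmp/demo.tsv
```

The `-i.bak` suffix keeps sed's in-place flag working on both BSD and GNU sed, at the cost of a stray .bak file in /tmp.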
Expected output (sample)
SEVERITY  QUEUE                          STATE     visible / age / sent_1h   ACTION
P1        reprocess-worker-queue         WEDGED    18420 / 2734s / 22100     Consumers stuck — check ECS service reprocess-worker, likely poison pill
P1        billing-events-queue           WEDGED      812 / 1890s /   940     Visibility timeout > processing time; bump VisibilityTimeout or fix slow handler
P2        analytics-ingest-queue         STARVING      0 /    0s /     0     Upstream producer silent for >1h; check kinesis-firehose-ingest
P3        emails-outbound-queue          DRAINING_OK  1247 /  12s /  8200   Healthy burst, draining normally
--        notifications-fanout-dlq       HEALTHY       0 /    0s /     3     No action
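If claude is unreachable at 3am, the same four buckets can be applied mechanically. This awk fallback is a sketch with made-up thresholds (15-minute age, depth over 100), not the model's judgment, but it keeps the page actionable:

```shell
# Crude rule-based stand-in for the claude prompt; thresholds are guesses.
classify() {
  awk -F'\t' 'NR > 1 {
    if ($3 > 900 && $2 > 100 && $4 > 0)  s = "WEDGED"       # old + deep + still sending
    else if ($2 == 0 && $4 == 0)         s = "STARVING"     # empty and nothing sent
    else if ($2 > 0 && $3 < 60)          s = "DRAINING_OK"  # depth but fresh messages
    else                                 s = "HEALTHY"
    print $1 "\t" s
  }'
}
# Demo row matching the WEDGED example above (hypothetical queue).
printf 'queue\tvisible\toldest_age_s\tsent_1h\nbilling\t812\t1890\t940\n' | classify
```

The demo prints the queue name and WEDGED, tab-separated; pipe /tmp/sqs.tsv through classify to run it over a real snapshot.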
Caveats & tips
  • ApproximateAgeOfOldestMessage has up to 1 minute eventual-consistency lag — don't trigger alerts on a single sample.
  • Cross-account queues won't show up in list-queues; run per-account with --profile if you have a multi-account worker fleet.
  • NumberOfMessagesSent reaches CloudWatch at one-minute granularity; the 1h Sum is reliable, but sub-minute spikes are invisible.
  • FIFO queues append .fifo to QueueName — basename strips the URL but keeps the suffix, which is correct for the CloudWatch dimension.