
SQS DLQ: peek poison pills → claude triage table

Snapshot up to 10 messages from a dead-letter queue without consuming them, then ask claude to bucket them by failure shape and emit a markdown triage table. Production-safe because visibility-timeout=0 leaves messages immediately re-pollable.

Setup
  • brew install awscli jq
  • aws configure # needs sqs:ReceiveMessage on the DLQ; sqs:GetQueueUrl for the URL lookup below; sqs:GetQueueAttributes only if you inspect queue settings
  • claude /login OR export ANTHROPIC_API_KEY=sk-…
  • Know the DLQ URL (aws sqs get-queue-url --queue-name prod-orders-dlq)
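Before peeking, it can be worth confirming that the URL really points at a dead-letter queue and not a live source queue. The sketch below uses sqs:GetQueueAttributes for that check. It assumes that a queue carrying a RedrivePolicy is a source queue, which is a heuristic: a chained DLQ with its own redrive policy would also be rejected. The helper names has_redrive_policy and assert_is_dlq are hypothetical.

```shell
# Exit 0 if the get-queue-attributes JSON on stdin contains a RedrivePolicy,
# i.e. the queue forwards failed messages elsewhere (probably a source queue).
# -s slurps stdin so that empty output (no attributes set) counts as "no policy".
has_redrive_policy() {
  jq -es 'any(.[]; .Attributes.RedrivePolicy != null)' >/dev/null
}

# Hypothetical guard: return non-zero unless queue URL $1 looks like a terminal DLQ.
assert_is_dlq() {
  dlq_attrs=$(aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names RedrivePolicy \
    --output json) || { echo "could not read attributes for $1" >&2; return 2; }
  if printf '%s' "$dlq_attrs" | has_redrive_policy; then
    echo "refusing to peek: $1 has a RedrivePolicy, so it looks like a source queue" >&2
    return 1
  fi
}
```

A possible use is `assert_is_dlq "$QUEUE_URL" && aws sqs receive-message …`, so the peek only runs when the check passes.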
Cost per run
<$0.01
The one-liner
$ aws sqs receive-message \
  --queue-url https://sqs.eu-central-1.amazonaws.com/123456789012/prod-orders-dlq \
  --max-number-of-messages 10 \
  --visibility-timeout 0 \
  --message-system-attribute-names ApproximateReceiveCount SentTimestamp \
  --message-attribute-names All \
| jq '[.Messages[]? | {
    id: .MessageId,
    receives: (.Attributes.ApproximateReceiveCount | tonumber),
    sent: (.Attributes.SentTimestamp | tonumber / 1000 | strftime("%Y-%m-%dT%H:%M:%SZ")),
    body: (.Body | try fromjson catch .)
  }]' \
| claude -p --append-system-prompt "You are an on-call SRE doing DLQ triage. Cluster by failure shape (schema mismatch, downstream 5xx, timeout, business-logic reject). Never invent fields not in the input." "Bucket these poison pills. For each bucket emit: count, one-line failure shape, smallest reproducer (1 trimmed payload), suspected root cause, and either 'redrive' or 'patch-then-redrive'. Output a single markdown table sorted by count desc."
What each stage does
  1. aws: aws sqs receive-message --visibility-timeout 0
    --visibility-timeout 0 is the key flag: received messages become visible again immediately, so you peek without hiding them from other consumers. Without it, they stay hidden for the queue's visibility timeout (30 seconds by default) and on-call may think the DLQ drained.
  2. aws: --max-number-of-messages 10
    10 is the API's hard cap per call. With short polling, SQS samples only a subset of its servers, so a single call may return fewer messages than the queue holds; for a fuller picture, repeat the call 3-4 times.
  3. aws: --message-system-attribute-names ApproximateReceiveCount SentTimestamp
    ApproximateReceiveCount reveals whether this is a fresh failure or a stuck pill (e.g. 200 receives = redrive policy is broken). The older --attribute-names flag is deprecated; use --message-system-attribute-names.
  4. jq: jq '… body: (.Body | try fromjson catch .)'
    Most producers send a JSON-stringified Body; try fromjson promotes it to a real object so claude sees structure instead of a quoted blob. It falls back to the raw string for non-JSON payloads.
  5. claude: claude -p --append-system-prompt "You are an on-call SRE …"
    The persona pins the model to triage mode. The 'never invent fields' guard is load-bearing: without it claude may hallucinate stack-trace fields that aren't in the message body.
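The "repeat the call 3-4 times" advice from stage 2 can be sketched as a small loop that merges the responses and keeps one entry per MessageId. This is an illustration under assumptions, not part of the original one-liner: peek_dlq and dedupe_messages are hypothetical helper names, and the default of 4 rounds is arbitrary.

```shell
# Merge any number of receive-message responses (concatenated JSON on stdin)
# into a single array with at most one entry per MessageId.
# Empty responses (the CLI prints nothing when no messages come back) are fine.
dedupe_messages() {
  jq -s '[.[] | .Messages[]?] | unique_by(.MessageId)'
}

# Call receive-message ${2:-4} times against queue URL $1 and print the merged sample.
peek_dlq() {
  peek_round=0
  while [ "$peek_round" -lt "${2:-4}" ]; do
    aws sqs receive-message \
      --queue-url "$1" \
      --max-number-of-messages 10 \
      --visibility-timeout 0 \
      --message-system-attribute-names ApproximateReceiveCount SentTimestamp \
      --message-attribute-names All \
      --output json
    peek_round=$((peek_round + 1))
  done | dedupe_messages
}
```

The merged output is a bare array of messages, so a downstream jq filter would iterate `.[]` where the one-liner uses `.Messages[]?`. Each round is another receive, so ApproximateReceiveCount goes up by one per message per round.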
Expected output (sample)
| Count | Failure shape | Sample id | Root cause | Action |
|-------|--------------|-----------|------------|--------|
| 6 | Schema mismatch: order.total is string, expected number | 8a3f… | Producer v2.14 dropped Decimal cast | patch-then-redrive |
| 3 | Downstream 503 from payment-svc | b91c… | Stripe webhook backpressure 14:02-14:09 UTC | redrive |
| 1 | Business-logic reject: negative quantity | 4ee0… | Frontend missing min=1 validation | patch-then-redrive |
Caveats & tips
  • --visibility-timeout 0 still increments ApproximateReceiveCount on each peek. If a redrive policy's maxReceiveCount is tight (e.g. 3), repeated peeks can push a message past the threshold. That is safe on a DLQ with no further redrive, but never run this on a live source queue.
  • SQS bodies are always text, so producers sending binary data usually base64-encode it. `try fromjson` then falls back to the raw string and you get an opaque blob. If you see that, decode inside jq with `.Body | @base64d` (jq 1.6+) before `fromjson`; piping the whole AWS response through `| base64 -d` will not work because the response itself is JSON.
  • Cost: 1 ReceiveMessage call = 1 SQS request ($0.40 per million). Claude input is ~2-3k tokens for 10 messages, well under $0.01.
  • IAM: sqs:ReceiveMessage must cover the DLQ ARN specifically. Check that any wildcard in the policy's Resource actually matches the DLQ name; policies written for the source queue often miss it.
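To handle the base64 case without a separate pass, the body decoding can be layered: try JSON first, then base64-wrapped JSON, then fall back to the raw string. The following is a minimal sketch assuming jq 1.6 or newer (for `@base64d`); decode_bodies is a hypothetical helper name, and the output keeps only id and body for brevity.

```shell
# Read one receive-message response on stdin and emit [{id, body}], where body is
# parsed JSON, or JSON decoded from base64, or the untouched raw string.
decode_bodies() {
  jq '
    def decode_body:
      . as $raw
      | try fromjson
        catch (try ($raw | @base64d | fromjson) catch $raw);
    [.Messages[]? | {id: .MessageId, body: (.Body | decode_body)}]
  '
}
```

One edge case to keep in mind: a plain-text body that happens to be valid base64 of valid JSON would be decoded too, so treat the result as a triage aid and not as a faithful copy of the payload.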