SQS DLQ: peek poison pills → claude triage table
Snapshot up to 10 messages from a dead-letter queue without consuming them, then ask claude to bucket them by failure shape and emit a markdown triage table. Production-safe because visibility-timeout=0 leaves messages immediately re-pollable.
Setup
- → brew install awscli jq
- → aws configure # needs sqs:ReceiveMessage + sqs:GetQueueAttributes on the DLQ
- → claude /login OR export ANTHROPIC_API_KEY=sk-…
- → Know the DLQ URL (aws sqs get-queue-url --queue-name prod-orders-dlq)
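Before peeking, it can help to confirm that the tools are installed and that the URL really points at a DLQ. The following is a minimal sketch, not part of the original recipe: `dlq_preflight` is a hypothetical helper name, and treating "has its own RedrivePolicy" as "looks like a source queue" is a heuristic assumption (a chained DLQ would trip it too). It relies on sqs:GetQueueAttributes, which the setup already asks for.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: check tools, then inspect the queue before peeking.
# dlq_preflight is a hypothetical helper name, not an AWS CLI command.
dlq_preflight() {
  local queue_url="$1" tool attrs depth redrive

  # Both CLIs from the setup list must be on PATH.
  for tool in aws jq; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing tool: $tool" >&2; return 1; }
  done

  # Needs sqs:GetQueueAttributes on the queue.
  attrs=$(aws sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages RedrivePolicy) || return 1

  depth=$(jq -r '.Attributes.ApproximateNumberOfMessages // "0"' <<<"$attrs")
  redrive=$(jq -r '.Attributes.RedrivePolicy // empty' <<<"$attrs")

  # Assumption: a queue with its own RedrivePolicy is acting as a source queue,
  # where each peek counts toward maxReceiveCount, so refuse to continue.
  if [ -n "$redrive" ]; then
    echo "refusing: queue has a RedrivePolicy (looks like a source queue)" >&2
    return 2
  fi

  echo "ok: ~${depth} visible messages"
}
```

One way to use it is `dlq_preflight "$DLQ_URL" && <the one-liner>`, so the peek only runs when the check passes.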
Cost per run
<$0.01
The one-liner
$ aws sqs receive-message \
--queue-url https://sqs.eu-central-1.amazonaws.com/123456789012/prod-orders-dlq \
--max-number-of-messages 10 \
--visibility-timeout 0 \
--message-system-attribute-names ApproximateReceiveCount SentTimestamp \
--message-attribute-names All \
| jq '[.Messages[]? | {
id: .MessageId,
receives: (.Attributes.ApproximateReceiveCount | tonumber),
sent: (.Attributes.SentTimestamp | tonumber / 1000 | strftime("%Y-%m-%dT%H:%M:%SZ")),
body: (.Body as $b | try ($b | fromjson) catch $b)
}]' \
| claude -p --append-system-prompt "You are an on-call SRE doing DLQ triage. Cluster by failure shape (schema mismatch, downstream 5xx, timeout, business-logic reject). Never invent fields not in the input." "Bucket these poison pills. For each bucket emit: count, one-line failure shape, smallest reproducer (1 trimmed payload), suspected root cause, and either 'redrive' or 'patch-then-redrive'. Output a single markdown table sorted by count desc."
What each stage does
- [01] aws
  aws sqs receive-message --visibility-timeout 0
  visibility-timeout=0 is the key flag: messages reappear instantly, so you peek without consuming. Without it, each received message is hidden for the queue's visibility timeout (30 s by default) and on-call may think the DLQ has drained.
- [02] aws
  --max-number-of-messages 10
  10 is the API hard cap per call. SQS short polling samples only a subset of servers, so a single call may return fewer; for a fuller picture, repeat the call 3-4 times and de-duplicate by MessageId, since the same message can come back on every round.
- [03] aws
  --message-system-attribute-names ApproximateReceiveCount SentTimestamp
  ApproximateReceiveCount reveals whether this is a fresh failure or a stuck pill (e.g. 200 receives = redrive policy is broken). attribute-names is deprecated; use message-system-attribute-names.
- [04] jq
  jq '… body: (.Body as $b | try ($b | fromjson) catch $b)'
  Most producers send a JSON-stringified Body; fromjson promotes it to a real object so claude sees structure instead of a quoted blob. The body is captured in $b first because inside catch, `.` is the error message rather than the original input; catch $b is what falls back to the raw string for non-JSON payloads.
- [05] claude
  claude -p --append-system-prompt "You are an on-call SRE …"
  The persona pins the model to triage mode. The 'never invent fields' guard is load-bearing: without it claude can hallucinate stack-trace fields that aren't in the message body.
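The "loop 3-4 times" advice from stage [02] could be sketched roughly as below. This is an illustration under stated assumptions, not the recipe's own code: `peek_rounds` is a hypothetical helper name, the default of 4 rounds is arbitrary, and every round still increments ApproximateReceiveCount on whatever it returns.

```shell
#!/usr/bin/env bash
# Sketch: peek the DLQ several times and merge the samples by MessageId.
# peek_rounds is a hypothetical helper; the default of 4 rounds is arbitrary.
peek_rounds() {
  local queue_url="$1" rounds="${2:-4}" i

  for ((i = 0; i < rounds; i++)); do
    # Same flags as the one-liner; a poll that finds nothing may print nothing.
    aws sqs receive-message \
      --queue-url "$queue_url" \
      --max-number-of-messages 10 \
      --visibility-timeout 0 \
      --message-system-attribute-names ApproximateReceiveCount SentTimestamp \
      --message-attribute-names All
  done | jq -s '[.[].Messages[]?] | unique_by(.MessageId)'
  # jq -s slurps every response into one array; unique_by keeps one copy per
  # MessageId and sorts the result by that id.
}
```

The output is a flat array of raw message objects rather than a `{"Messages": [...]}` envelope, so a shaping filter fed from it would start at `.[]` instead of `.Messages[]?`.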
Expected output (sample)
| Count | Failure shape | Sample id | Root cause | Action |
|-------|---------------|-----------|------------|--------|
| 6 | Schema mismatch: order.total is string, expected number | 8a3f… | Producer v2.14 dropped Decimal cast | patch-then-redrive |
| 3 | Downstream 503 from payment-svc | b91c… | Stripe webhook backpressure 14:02-14:09 UTC | redrive |
| 1 | Business-logic reject: negative quantity | 4ee0… | Frontend missing min=1 validation | patch-then-redrive |
Caveats & tips
- visibility-timeout=0 still increments ApproximateReceiveCount on each peek — if your redrive policy maxReceiveCount is tight (e.g. 3), repeated peeks can technically push a message past the threshold; safe on DLQs (no further redrive) but never run this on a live source queue.
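- Message Body may be base64 if the producer encoded a binary payload before calling SendMessage (SQS bodies are text-only). fromjson cannot parse that, so body will not come through as a structured object. Decode the Body field itself before parsing (for example with jq's `@base64d`, available in jq 1.6+); piping the whole aws response through `base64 -d` would corrupt the JSON envelope.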
- Cost: 1 ReceiveMessage call = 1 SQS request ($0.40 per million). Claude input is ~2-3k tokens for 10 messages, well under $0.01.
- IAM: the caller needs sqs:ReceiveMessage on the DLQ ARN specifically; wildcard policies often miss DLQs.
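For the base64 caveat above, one possible shaping filter tries plain JSON first, then base64-wrapped JSON, and only then keeps the raw string. This is a sketch under assumptions: `shape_messages` and `parse_body` are hypothetical names, `@base64d` needs jq 1.6 or newer, and the base64 attempt is a heuristic (a plain-text body that happens to decode to valid JSON would be misread).

```shell
#!/usr/bin/env bash
# Sketch: shape messages while parsing Body as JSON, then as base64-encoded
# JSON, and finally falling back to the untouched string.
# shape_messages / parse_body are hypothetical names; requires jq 1.6+.
shape_messages() {
  jq '
    def parse_body:
      . as $raw
      | try ($raw | fromjson)                            # 1. plain JSON
        catch (try ($raw | @base64d | fromjson)          # 2. base64-wrapped JSON
               catch $raw);                              # 3. raw string
    [.Messages[]? | {id: .MessageId, body: (.Body | parse_body)}]
  '
}
```

It reads a receive-message response on stdin, for example `aws sqs receive-message … | shape_messages | claude -p …`. The variable `$raw` is needed because inside `catch` the input `.` is the error message, not the original body.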