← All one-liners·#038·ops·aws·expert

CloudWatch alarm → 30-min context dump → claude RCA memo

Given a firing alarm, pull its config, the metric's last 30 min of datapoints, and matching log lines, then have claude write a one-page incident memo with a leading hypothesis. Cuts the 'open seven tabs' phase of every page.

Setup

→ brew install awscli jq
→ aws configure # needs cloudwatch:DescribeAlarms, cloudwatch:GetMetricStatistics, logs:FilterLogEvents
→ claude /login OR export ANTHROPIC_API_KEY=sk-…
→ Know the alarm name (left column of the CloudWatch Alarms console) and the log group of the resource it watches

Cost per run

<$0.01

The one-liner

$ ALARM=prod-orders-api-5xx-rate; LOGS=/aws/lambda/prod-orders-api; \
A=$(aws cloudwatch describe-alarms --alarm-names "$ALARM" --alarm-types MetricAlarm); \
NS=$(jq -r '.MetricAlarms[0].Namespace' <<<"$A"); \
MN=$(jq -r '.MetricAlarms[0].MetricName' <<<"$A"); \
DIMS=$(jq -c '.MetricAlarms[0].Dimensions' <<<"$A"); \
M=$(aws cloudwatch get-metric-statistics \
      --namespace "$NS" --metric-name "$MN" \
      --dimensions "$(jq -r '.[] | "Name=\(.Name),Value=\(.Value)"' <<<"$DIMS" | paste -sd' ' -)" \
      --start-time "$(date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '30 min ago' '+%Y-%m-%dT%H:%M:%SZ')" \
      --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
      --period 60 --statistics Average Maximum Sum); \
L=$(aws logs tail "$LOGS" --since 30m --filter-pattern '?ERROR ?Exception ?"status=5"' --format short | head -200); \
jq -n --argjson alarm "$A" --argjson metric "$M" --arg logs "$L" \
  '{alarm: $alarm.MetricAlarms[0] | {AlarmName, StateValue, StateReason, Threshold, ComparisonOperator, EvaluationPeriods, Period}, datapoints: ($metric.Datapoints | sort_by(.Timestamp)), recent_errors: $logs}' \
| claude -p --append-system-prompt "You are the IC on a live incident. Write the memo a Director would read at 3am. Prefer specific causes (deploy, throttle, dependency) over generic ones (load spike). State confidence." "Write a 5-section incident memo: (1) one-line headline, (2) leading hypothesis with confidence %, (3) supporting signal from datapoints+logs, (4) 3 alternative hypotheses ranked, (5) next 3 checks the on-call should run in order. Markdown, no preamble."

What each stage does

[01] awsaws cloudwatch describe-alarms --alarm-names "$ALARM" --alarm-types MetricAlarm
--alarm-types MetricAlarm is required because describe-alarms defaults to metric alarms only but explicit is safer once composite alarms exist in the account. StateReason returns the breaching-datapoint text — claude uses it as the anchor.
[02] jqjq -r '.[] | "Name=\(.Name),Value=\(.Value)"' | paste -sd' ' -
get-metric-statistics needs --dimensions in shorthand form (Name=foo,Value=bar). Reconstructing it from the alarm's structured Dimensions array is the non-obvious glue most people get wrong.
[03] aws--period 60 --statistics Average Maximum Sum
60s period is the finest CloudWatch keeps for standard metrics for 15 days. Pulling Avg+Max+Sum together lets claude tell a smooth rise (capacity) from a spike (single bad request).
[04] awsaws logs tail --since 30m --filter-pattern '?ERROR ?Exception ?"status=5"'
The ?term syntax is CloudWatch Logs OR — matches any of the three. Without quotes around status=5 the equals breaks the parser.
[05] jqjq -n --argjson alarm … --arg logs …
--argjson keeps structured fields parseable (so claude can see Threshold as a number), --arg keeps the raw log blob as a string so multi-line stack traces survive intact.
[06] claudeclaude -p --append-system-prompt "You are the IC on a live incident…"
Forcing a confidence % and a ranked alternatives section stops the model from hedging into uselessness. 'Director at 3am' framing produces short paragraphs over bullet soup.

Expected output (sample)

### Headline
prod-orders-api 5xx rate at 4.2% (threshold 1%) since 14:31 UTC — correlates with deploy v2.14.0

### Leading hypothesis (78%)
New code path in v2.14.0 calls DynamoDB GetItem without exponential backoff; throttles surface as 500s.

### Supporting signal
- Sum metric: 0 → 142 over 4 minutes starting 14:31, exactly the deploy timestamp
- Logs: 91 of 200 sampled errors contain 'ProvisionedThroughputExceededException'
- No matching error before 14:31

### Alternatives
1. (12%) Upstream payment-svc latency cascading — would expect timeouts not throttles
2. (7%) Lambda cold-start burst from autoscale — Max would track Avg, but Max is 3× Avg here
3. (3%) Bad input from a single client — error spread suggests not

### Next 3 checks
1. `aws lambda get-function --function-name prod-orders-api | jq '.Configuration.Version, .Configuration.LastModified'`
2. CloudWatch DynamoDB ConsumedReadCapacityUnits vs ProvisionedReadCapacityUnits on orders table, same window
3. If confirmed: rollback to v2.13.4 via `aws lambda update-alias --function-name prod-orders-api --name prod --function-version 47`

Caveats & tips

describe-alarms returns nothing for composite alarms unless you pass --alarm-types CompositeAlarm; the example uses MetricAlarm explicitly — if your page came from a composite alarm, swap the type and walk children via --children-of-alarm-name.
macOS `date -u -v-30M` vs GNU `date -u -d '30 min ago'` — the command tries macOS first then falls back. On busybox (some Alpine images) neither works; install coreutils.
get-metric-statistics with --period 60 is capped at 3 hours of lookback for sub-1min resolution metrics and 15 days for standard 1-min. Pulling more than 1440 datapoints in one call errors out.
Cost: 3 CloudWatch API calls (~$0.00003) + log scan ($0.0050/GB scanned — bounded by head -200) + claude ~5-8k input tokens. Total well under $0.01 per memo. logs:FilterLogEvents on the log group is the IAM permission most CloudWatch read-only roles miss.

← #037

SQS DLQ: peek poison pills → claude triage table

#039 →

Unattached EBS volumes + snapshots → claude cleanup plan