Glue: triage last 100 job runs for failure patterns with claude
Pulls the last 100 runs of a Glue ETL job, groups by state and error message, and hands the rollup to claude to diagnose recurring failure modes — turns 'it failed again' Slack pings into a ranked fix list.
Setup
- brew install awscli jq
- aws configure # needs glue:GetJobRuns on the job (minimal policy below)
- claude /login OR export ANTHROPIC_API_KEY=sk-…
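If the configured role lacks that permission, a minimal policy looks like this (the region wildcard and account ID 123456789012 are placeholders; scope them to taste):
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "glue:GetJobRuns",
    "Resource": "arn:aws:glue:*:123456789012:job/reprocess-events-daily"
  }]
}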
Cost per run
<$0.01
The one-liner
$ aws glue get-job-runs --job-name reprocess-events-daily --max-items 100 \
| jq '{
job: "reprocess-events-daily",
total: (.JobRuns | length),
window: { from: (.JobRuns | last | .StartedOn), to: (.JobRuns | first | .StartedOn) },
by_state: (.JobRuns | group_by(.JobRunState) | map({ state: .[0].JobRunState, count: length })),
failures: (.JobRuns | map(select(.JobRunState=="FAILED")) | map({
started: .StartedOn,
exec_s: .ExecutionTime,
attempt: .Attempt,
worker: .WorkerType,
dpu: .MaxCapacity,
error: (.ErrorMessage // "" | .[0:240])
})),
error_buckets: (.JobRuns | map(select(.ErrorMessage)) | group_by(.ErrorMessage | .[0:80])
| map({ snippet: .[0].ErrorMessage[0:80], count: length }) | sort_by(-.count) | .[0:8])
}' \
| claude -p 'You are an on-call data engineer. This is the last 100 runs of the reprocess-events-daily Glue job. Diagnose the top 3 recurring failure patterns in error_buckets — for each give: likely root cause, one config or code fix (DPU bump, bookmark reset, S3 prefix race, OOM on join, etc.), and how to confirm in CloudWatch logs. End with whether retries actually help for this job.'
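A small quality-of-life variation if you expect to iterate on the prompt: cache the rollup so reruns don't re-hit the Glue API. The /tmp paths are arbitrary, and each `…` stands for the same jq filter or prompt as above.
$ aws glue get-job-runs --job-name reprocess-events-daily --max-items 100 > /tmp/runs.json
$ jq '…' /tmp/runs.json > /tmp/rollup.json   # same rollup filter as the one-liner
$ claude -p '…' < /tmp/rollup.json           # reword the prompt as often as you like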
What each stage does
- [01] aws
aws glue get-job-runs --job-name reprocess-events-daily --max-items 100
`get-job-runs` returns `JobRuns[]` newest-first with `JobRunState`, `ErrorMessage`, `ExecutionTime`, `Attempt`, `WorkerType`, and `MaxCapacity`. `--max-items 100` caps the total returned across pagination, so this stays a single CLI invocation (no manual NextToken loop) for typical daily jobs.
- [02] jq
jq '{ by_state: (.JobRuns | group_by(.JobRunState) | map({state: .[0].JobRunState, count: length})) }'
`group_by` on `JobRunState` produces the SUCCEEDED/FAILED/TIMEOUT distribution — the first thing you want before drilling into errors. Cheap signal-vs-noise filter.
- [03] jq
jq '.JobRuns | map(select(.ErrorMessage)) | group_by(.ErrorMessage | .[0:80]) | …'
Truncating `ErrorMessage` to 80 chars before grouping collapses near-identical errors (same exception with different timestamps/paths) into real buckets. Without the truncate, every run looks unique and the LLM gets noise.
- [04] claude
claude -p 'You are an on-call data engineer. Diagnose the top 3 recurring failur…'
Forcing the 'retries help y/n' close-out is the high-value bit: Glue auto-retries cost money and time, and most teams never measure whether attempt=2 succeeds more often than attempt=1 for a given error class. The LLM has the Attempt field right there (a deterministic cross-check is sketched below).
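You don't have to take the model's word on that last point. A minimal cross-check, assuming auto-retries surface as extra runs with `Attempt > 0` (how Glue reports them in `get-job-runs`):
$ aws glue get-job-runs --job-name reprocess-events-daily --max-items 100 \
| jq '.JobRuns
      | map(select(.Attempt > 0))                  # retry runs only
      | group_by(.JobRunState)
      | map({state: .[0].JobRunState, count: length})'
If that bucket is nearly all FAILED, `MaxRetries` is just burning DPUs.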
Expected output (sample)
{
"job": "reprocess-events-daily", "total": 100,
"by_state": [{"state":"SUCCEEDED","count":71},{"state":"FAILED","count":24},{"state":"TIMEOUT","count":5}]
}
claude analysis:
1. 14 runs: 'ExecutorLostFailure ... Container killed by YARN for exceeding memory limits' — OOM on the daily skew partition. Fix: bump WorkerType G.1X → G.2X for this job only, or repartition before the join on user_id. Confirm via CloudWatch /aws-glue/jobs/error log group, search 'OutOfMemoryError'.
2. 6 runs: 'AnalysisException: Path does not exist: s3://prod-events-raw/...' — upstream Kinesis Firehose hasn't flushed the hour partition yet when the 02:00 UTC trigger fires. Fix: add 30-min buffer to schedule, or use s3:ObjectCreated event trigger instead of cron.
3. 4 runs: 'Job bookmark ... checkpoint state corrupt' — bookmark drift after schema change. Fix: one-time `--job-bookmark-option job-bookmark-pause`, reset, resume.
Retries: of the 24 FAILED runs, only 3 succeeded on Attempt=2 — retries help for transient S3 path races but not for OOMs (attempt 2 OOMs identically). Set MaxRetries=1 for this job and route alerts to on-call.
Caveats & tips
- `get-job-runs` returns runs newest-first; `--max-items 100` is the *total* returned, not per-page — pagination kicks in transparently and can take 2–3 API calls (see the loop sketch after this list if you want more than 100).
- `ErrorMessage` is null on SUCCEEDED runs and sometimes on runs cancelled by timeout — the `select(.ErrorMessage)` filter handles this, but watch for jobs where every failure carries the same generic 'Command Failed with exit code 1' (CloudWatch is the only source of the real stack trace there).
- Truncating to 80 chars groups well for Spark/Python stack traces but can over-collapse for Glue Studio visual-job errors that all start identically — bump to 160 if buckets look too coarse.
- Job runs older than 90 days are pruned by Glue — don't expect a year of history.
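When 100 runs isn't enough history (hourly schedules, a long incident window), loop on `NextToken` with `--starting-token`. A sketch; `runs.jsonl` is an arbitrary scratch file, and the 90-day cap above still applies:
token=""; : > runs.jsonl
while :; do
  page=$(aws glue get-job-runs --job-name reprocess-events-daily \
    --max-items 100 ${token:+--starting-token "$token"})
  jq -c '.JobRuns[]' <<<"$page" >> runs.jsonl        # append this page's runs
  token=$(jq -r '.NextToken // empty' <<<"$page")    # CLI emits NextToken when truncated
  [ -z "$token" ] && break
done
jq -s '{JobRuns: .}' runs.jsonl                      # rebuild the shape the rollup filter expects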