← All one-liners·#043·data·aws·power

Athena: audit last 50 queries for bytes-scanned hogs via claude

Pulls the most recent Athena query history with bytes scanned and runtime, then asks claude to flag the costliest queries and suggest partition/projection fixes — the weekly review a data-platform owner actually has time for.

Setup

→ brew install awscli jq
→ aws configure # needs athena:ListQueryExecutions + BatchGetQueryExecution on the workgroup
→ claude /login OR export ANTHROPIC_API_KEY=sk-…

Cost per run

<$0.01

The one-liner

$ IDS=$(aws athena list-query-executions --work-group analytics-prod --max-items 50 --query 'QueryExecutionIds' --output text) && \
aws athena batch-get-query-execution --query-execution-ids $IDS \
  | jq '[.QueryExecutions[] | select(.Status.State=="SUCCEEDED") | {
      id: .QueryExecutionId,
      gb_scanned: ((.Statistics.DataScannedInBytes // 0) / 1073741824 | .*100|round/100),
      runtime_s: ((.Statistics.EngineExecutionTimeInMillis // 0) / 1000),
      submitted: .Status.SubmissionDateTime,
      sql: (.Query | gsub("\\s+";" ") | .[0:280])
    }] | sort_by(-.gb_scanned) | .[0:20]' \
  | claude -p 'You are the data-platform owner reviewing last week of Athena spend (Athena bills $5 per TB scanned). For each of the top 5 queries by gb_scanned, give one concrete fix: missing partition predicate, SELECT *, no LIMIT on exploratory query, or candidate for partition projection / Iceberg. Cite the query id. End with one cross-cutting pattern.'

What each stage does

[01] awsaws athena list-query-executions --work-group analytics-prod --max-items 50 --qu…
`list-query-executions` returns IDs newest-first within the named workgroup (Athena keeps 45 days). `--query` with `--output text` flattens the JSON to space-separated IDs ready to splat into the next call.
[02] awsaws athena batch-get-query-execution --query-execution-ids $IDS
`batch-get-query-execution` accepts up to 50 IDs as positional list args (no commas) and returns full `QueryExecutions[]` including `Statistics.DataScannedInBytes` and the original SQL — this is the only call that surfaces bytes-scanned.
[03] jqjq '[.QueryExecutions[] | select(.Status.State=="SUCCEEDED") | { id, gb_scanned,…
Drop FAILED/CANCELLED rows (their bytes-scanned is misleading), convert bytes→GB with a 2-decimal round, collapse SQL whitespace and truncate to 280 chars so the model gets signal not formatting noise, then rank by cost descending and keep top 20.
[04] claudeclaude -p 'You are the data-platform owner reviewing last week of Athena spend..…
`-p` is one-shot non-interactive mode — reads stdin, prints answer, exits. The prompt forces per-query attribution (cite the id) and a single cross-cutting pattern so output stays actionable instead of a generic optimization lecture.

Expected output (sample)

1. 8f3a-c2e1: 412 GB scanned, 47s. SELECT * FROM raw_events with no WHERE on event_date — adds full historical scan. Fix: add `event_date >= current_date - interval '7' day` (partition column).
2. 1b9d-44a7: 287 GB, 31s. JOIN on user_events without partition pruning on either side. Fix: filter both tables on event_date before join.
3. 7c2f-9e08: 198 GB, 22s. Exploratory `LIMIT 10` query but predicate-less — Athena still scans all partitions before LIMIT. Fix: always pair LIMIT with date predicate.
4. 3a51-d0b4: 156 GB, 19s. JSON_EXTRACT on raw column repeated 6× in SELECT — re-parses per row. Fix: CTE that extracts once.
5. 2e88-7fa1: 142 GB, 18s. `ORDER BY ts DESC LIMIT 1` over full table to find latest — classic. Fix: max(ts) subquery, or partition projection.
Pattern: 4 of 5 are missing event_date predicates. Add a workgroup `data-usage-control` query-bytes-scanned-limit at 50 GB to fail these at submission time.

Caveats & tips

Athena keeps query history 45 days only; older queries silently drop off `list-query-executions`.
`batch-get-query-execution` caps at 50 IDs per call — for full audits across workgroups, loop with `--starting-token` paginators.
Bytes-scanned for queries against Iceberg tables can under-report manifest reads; cross-check with CloudTrail `s3:GetObject` if numbers look too clean.
Cost label assumes ~50 KB of Claude input; if you bump to 200 queries the token cost climbs into single-digit cents.

← #042

ECS stopped-task postmortem → claude root-cause classifier

#044 →

Glue: triage last 100 job runs for failure patterns with claude