← All one-liners·#043·data·aws·power

Athena: audit last 50 queries for bytes-scanned hogs via claude

Pulls the most recent Athena query history with bytes scanned and runtime, then asks claude to flag the costliest queries and suggest partition/projection fixes — the weekly review a data-platform owner actually has time for.

Setup
  • → brew install awscli jq
  • → aws configure # needs athena:ListQueryExecutions + BatchGetQueryExecution on the workgroup
  • → claude /login OR export ANTHROPIC_API_KEY=sk-…
Cost per run
<$0.01
The one-liner
$ IDS=$(aws athena list-query-executions --work-group analytics-prod --max-items 50 --query 'QueryExecutionIds' --output text) && \
aws athena batch-get-query-execution --query-execution-ids $IDS \
  | jq '[.QueryExecutions[] | select(.Status.State=="SUCCEEDED") | {
      id: .QueryExecutionId,
      gb_scanned: ((.Statistics.DataScannedInBytes // 0) / 1073741824 | .*100|round/100),
      runtime_s: ((.Statistics.EngineExecutionTimeInMillis // 0) / 1000),
      submitted: .Status.SubmissionDateTime,
      sql: (.Query | gsub("\\s+";" ") | .[0:280])
    }] | sort_by(-.gb_scanned) | .[0:20]' \
  | claude -p 'You are the data-platform owner reviewing last week of Athena spend (Athena bills $5 per TB scanned). For each of the top 5 queries by gb_scanned, give one concrete fix: missing partition predicate, SELECT *, no LIMIT on exploratory query, or candidate for partition projection / Iceberg. Cite the query id. End with one cross-cutting pattern.'
What each stage does
  1. [01] awsaws athena list-query-executions --work-group analytics-prod --max-items 50 --qu…
    `list-query-executions` returns IDs newest-first within the named workgroup (Athena keeps 45 days). `--query` with `--output text` flattens the JSON to space-separated IDs ready to splat into the next call.
  2. [02] awsaws athena batch-get-query-execution --query-execution-ids $IDS
    `batch-get-query-execution` accepts up to 50 IDs as positional list args (no commas) and returns full `QueryExecutions[]` including `Statistics.DataScannedInBytes` and the original SQL — this is the only call that surfaces bytes-scanned.
  3. [03] jqjq '[.QueryExecutions[] | select(.Status.State=="SUCCEEDED") | { id, gb_scanned,…
    Drop FAILED/CANCELLED rows (their bytes-scanned is misleading), convert bytes→GB with a 2-decimal round, collapse SQL whitespace and truncate to 280 chars so the model gets signal not formatting noise, then rank by cost descending and keep top 20.
  4. [04] claudeclaude -p 'You are the data-platform owner reviewing last week of Athena spend..…
    `-p` is one-shot non-interactive mode — reads stdin, prints answer, exits. The prompt forces per-query attribution (cite the id) and a single cross-cutting pattern so output stays actionable instead of a generic optimization lecture.
Expected output (sample)
1. 8f3a-c2e1: 412 GB scanned, 47s. SELECT * FROM raw_events with no WHERE on event_date — adds full historical scan. Fix: add `event_date >= current_date - interval '7' day` (partition column).
2. 1b9d-44a7: 287 GB, 31s. JOIN on user_events without partition pruning on either side. Fix: filter both tables on event_date before join.
3. 7c2f-9e08: 198 GB, 22s. Exploratory `LIMIT 10` query but predicate-less — Athena still scans all partitions before LIMIT. Fix: always pair LIMIT with date predicate.
4. 3a51-d0b4: 156 GB, 19s. JSON_EXTRACT on raw column repeated 6× in SELECT — re-parses per row. Fix: CTE that extracts once.
5. 2e88-7fa1: 142 GB, 18s. `ORDER BY ts DESC LIMIT 1` over full table to find latest — classic. Fix: max(ts) subquery, or partition projection.
Pattern: 4 of 5 are missing event_date predicates. Add a workgroup `data-usage-control` query-bytes-scanned-limit at 50 GB to fail these at submission time.
Caveats & tips
  • Athena keeps query history 45 days only; older queries silently drop off `list-query-executions`.
  • `batch-get-query-execution` caps at 50 IDs per call — for full audits across workgroups, loop with `--starting-token` paginators.
  • Bytes-scanned for queries against Iceberg tables can under-report manifest reads; cross-check with CloudTrail `s3:GetObject` if numbers look too clean.
  • Cost label assumes ~50 KB of Claude input; if you bump to 200 queries the token cost climbs into single-digit cents.