DuckDB on NYC taxi public Parquet
The dataset everyone benchmarks on, queryable from your laptop in seconds — DuckDB streams just the column bytes it needs from CloudFront.
Setup
- → brew install duckdb
Cost per run
free
The one-liner
$ duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT passenger_count,
ROUND(AVG(fare_amount), 2) AS avg_fare,
COUNT(*) AS trips
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
WHERE passenger_count BETWEEN 1 AND 6
GROUP BY 1
ORDER BY 1
"What each stage does
- [01] duckdb
INSTALL httpfs; LOAD httpfs;One-time install of the HTTP-filesystem extension. Cached after first run. - [02] duckdb
FROM 'https://….parquet'Streams only the columns + row groups needed for this query. A 50MB Parquet typically transfers <5MB for an aggregation like this. - [03] duckdb
GROUP BY 1 ORDER BY 1Numeric `GROUP BY` references column position — passenger_count. Reads as 'group by the first SELECT expression'.
Expected output (sample)
┌─────────────────┬──────────┬─────────┐ │ passenger_count │ avg_fare │ trips │ ├─────────────────┼──────────┼─────────┤ │ 1 │ 18.45 │ 1,742K │ │ 2 │ 20.91 │ 312K │ │ 3 │ 19.84 │ 78K │ │ 4 │ 20.12 │ 35K │ │ 5 │ 18.77 │ 62K │ │ 6 │ 19.02 │ 40K │ └─────────────────┴──────────┴─────────┘
Caveats & tips
- Swap '2024-01' for any month — TLC publishes from 2009.
- Same syntax works for `green_tripdata_*.parquet` and `fhvhv_tripdata_*.parquet`.