← All one-liners·#017·analysis·duckdb·power

DuckDB on NYC taxi public Parquet

The dataset everyone benchmarks on, queryable from your laptop in seconds — DuckDB streams just the column bytes it needs from CloudFront.

Setup
  • → brew install duckdb
Cost per run
free
The one-liner
$ duckdb -c "
  INSTALL httpfs; LOAD httpfs;
  SELECT passenger_count,
         ROUND(AVG(fare_amount), 2) AS avg_fare,
         COUNT(*)                    AS trips
  FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
  WHERE passenger_count BETWEEN 1 AND 6
  GROUP BY 1
  ORDER BY 1
"
What each stage does
  1. [01] duckdbINSTALL httpfs; LOAD httpfs;
    One-time install of the HTTP-filesystem extension. Cached after first run.
  2. [02] duckdbFROM 'https://….parquet'
    Streams only the columns + row groups needed for this query. A 50MB Parquet typically transfers <5MB for an aggregation like this.
  3. [03] duckdbGROUP BY 1 ORDER BY 1
    Numeric `GROUP BY` references column position — passenger_count. Reads as 'group by the first SELECT expression'.
Expected output (sample)
┌─────────────────┬──────────┬─────────┐
│ passenger_count │ avg_fare │  trips  │
├─────────────────┼──────────┼─────────┤
│ 1               │   18.45  │ 1,742K  │
│ 2               │   20.91  │   312K  │
│ 3               │   19.84  │    78K  │
│ 4               │   20.12  │    35K  │
│ 5               │   18.77  │    62K  │
│ 6               │   19.02  │    40K  │
└─────────────────┴──────────┴─────────┘
Caveats & tips
  • Swap '2024-01' for any month — TLC publishes from 2009.
  • Same syntax works for `green_tripdata_*.parquet` and `fhvhv_tripdata_*.parquet`.