← All one-liners·#017·analysis·duckdb·power

DuckDB on NYC taxi public Parquet

The dataset everyone benchmarks on, queryable from your laptop in seconds — DuckDB streams just the column bytes it needs from CloudFront.

Setup

→ brew install duckdb

Cost per run

free

The one-liner

$ duckdb -c "
  INSTALL httpfs; LOAD httpfs;
  SELECT passenger_count,
         ROUND(AVG(fare_amount), 2) AS avg_fare,
         COUNT(*)                    AS trips
  FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
  WHERE passenger_count BETWEEN 1 AND 6
  GROUP BY 1
  ORDER BY 1
"

What each stage does

[01] duckdbINSTALL httpfs; LOAD httpfs;
One-time install of the HTTP-filesystem extension. Cached after first run.
[02] duckdbFROM 'https://….parquet'
Streams only the columns + row groups needed for this query. A 50MB Parquet typically transfers <5MB for an aggregation like this.
[03] duckdbGROUP BY 1 ORDER BY 1
Numeric `GROUP BY` references column position — passenger_count. Reads as 'group by the first SELECT expression'.

Expected output (sample)

┌─────────────────┬──────────┬─────────┐
│ passenger_count │ avg_fare │  trips  │
├─────────────────┼──────────┼─────────┤
│ 1               │   18.45  │ 1,742K  │
│ 2               │   20.91  │   312K  │
│ 3               │   19.84  │    78K  │
│ 4               │   20.12  │    35K  │
│ 5               │   18.77  │    62K  │
│ 6               │   19.02  │    40K  │
└─────────────────┴──────────┴─────────┘

Caveats & tips

Swap '2024-01' for any month — TLC publishes from 2009.
Same syntax works for `green_tripdata_*.parquet` and `fhvhv_tripdata_*.parquet`.

← #016

Most-starred repos in any language

#018 →

DuckDB: peek at any CSV's schema