DuckDB: query a remote Parquet over HTTPS
The most impressive DuckDB trick — point it at an HTTPS URL ending in .parquet and it streams just the bytes it needs. Production-grade analytics on a stranger's file.
Setup
- → brew install duckdb
Cost per run
free
The one-liner
$ duckdb -c "
INSTALL httpfs; LOAD httpfs;
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
AVG(l_extendedprice) AS avg_price
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
GROUP BY 1, 2
ORDER BY 1, 2
"What each stage does
- [01] duckdb
INSTALL httpfs; LOAD httpfs;Load the HTTP filesystem extension (downloads once, cached forever). Enables https:// paths. - [02] duckdb
FROM 'https://….parquet'DuckDB issues HTTP Range requests for just the column chunks it needs — typically <1% of the file. A 10GB remote Parquet might transfer 30MB. - [03] duckdb
GROUP BY 1, 2Classic TPC-H Q1 aggregation. Runs in ~200ms on this 60K-row file, ~5s on the full 600M-row dataset.
Expected output (sample)
┌────────────────┬────────────────┬─────────┬───────────┐ │ l_returnflag │ l_linestatus │ sum_qty │ avg_price │ ├────────────────┼────────────────┼─────────┼───────────┤ │ A │ F │ 380,456 │ 38,237.15 │ │ N │ F │ 8,971 │ 38,284.46 │ │ N │ O │ 757,535 │ 38,249.12 │ │ R │ F │ 381,449 │ 38,250.85 │ └────────────────┴────────────────┴─────────┴───────────┘
Caveats & tips
- Same pattern works for .csv and .json over HTTPS.
- S3 / GCS / Azure work too — `INSTALL aws; LOAD aws;` and use `s3://bucket/key.parquet`.
- First run downloads the httpfs extension (~3MB).