← All one-liners·#013·analysis·self contained·power

DuckDB: query a remote Parquet over HTTPS

The most impressive DuckDB trick — point it at an HTTPS URL ending in .parquet and it streams just the bytes it needs. Production-grade analytics on a stranger's file.

Setup
  • → brew install duckdb
Cost per run
free
The one-liner
$ duckdb -c "
  INSTALL httpfs; LOAD httpfs;
  SELECT l_returnflag, l_linestatus,
         SUM(l_quantity)      AS sum_qty,
         AVG(l_extendedprice) AS avg_price
  FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'
  GROUP BY 1, 2
  ORDER BY 1, 2
"
What each stage does
  1. [01] duckdbINSTALL httpfs; LOAD httpfs;
    Load the HTTP filesystem extension (downloads once, cached forever). Enables https:// paths.
  2. [02] duckdbFROM 'https://….parquet'
    DuckDB issues HTTP Range requests for just the column chunks it needs — typically <1% of the file. A 10GB remote Parquet might transfer 30MB.
  3. [03] duckdbGROUP BY 1, 2
    Classic TPC-H Q1 aggregation. Runs in ~200ms on this 60K-row file, ~5s on the full 600M-row dataset.
Expected output (sample)
┌────────────────┬────────────────┬─────────┬───────────┐
│ l_returnflag   │ l_linestatus   │ sum_qty │ avg_price │
├────────────────┼────────────────┼─────────┼───────────┤
│ A              │ F              │ 380,456 │ 38,237.15 │
│ N              │ F              │   8,971 │ 38,284.46 │
│ N              │ O              │ 757,535 │ 38,249.12 │
│ R              │ F              │ 381,449 │ 38,250.85 │
└────────────────┴────────────────┴─────────┴───────────┘
Caveats & tips
  • Same pattern works for .csv and .json over HTTPS.
  • S3 / GCS / Azure work too — `INSTALL aws; LOAD aws;` and use `s3://bucket/key.parquet`.
  • First run downloads the httpfs extension (~3MB).