Data Profiler: Column Statistics for CSV, JSON, Parquet
Generate column-level statistics for any CSV, JSON, or Parquet file: null rates, distinct counts, min, max, average, and sample values.
Accepts .csv, .json, and .parquet. Runs DuckDB's SUMMARIZE on the file to compute per-column statistics in your browser.
Drop a tabular file onto the page and get a per-column report in seconds. The tool runs DuckDB's SUMMARIZE statement against the file in your browser and renders type, null percentage, distinct count, min, max, and average for every column, plus three non-null sample values. Use it as a first pass on any new dataset before writing transforms or loading it into a database.
Profiling is usually the first step with a new dataset, and the questions rarely change: how many rows and columns, what types, how many nulls per column, what range of values, and what a sample looks like. This tool answers all of them in one pass.
SUMMARIZE returns one row per column with the inferred type, approximate distinct count, null percentage, min, and max, plus average and standard deviation for numeric columns. We render that as a card per column alongside three actual non-null sample values pulled with a separate LIMIT 3 query, so you can eyeball encoding or format issues that summary statistics miss.
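As a rough illustration of the two queries described above, the sketch below builds the SQL text for a SUMMARIZE pass and for a three-value sample of one column. The helper names (quote_ident, summarize_sql, sample_sql) and the file name are hypothetical, and the tool's actual query text may differ; only SUMMARIZE, IS NOT NULL, and LIMIT are taken from the description.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def summarize_sql(source: str) -> str:
    """Statement that yields one statistics row per column of `source`."""
    return f"SUMMARIZE SELECT * FROM {source}"


def sample_sql(source: str, column: str, n: int = 3) -> str:
    """Query for the first `n` non-null values of a single column."""
    col = quote_ident(column)
    return f"SELECT {col} FROM {source} WHERE {col} IS NOT NULL LIMIT {n}"


if __name__ == "__main__":
    src = "read_csv_auto('orders.csv')"  # hypothetical file name
    print(summarize_sql(src))
    print(sample_sql(src, "unit price"))
```

Quoting the column name matters because CSV headers often contain spaces or punctuation that would otherwise break the sample query.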
No schema is required and nothing is uploaded. Type inference comes from DuckDB's read_csv_auto for CSV, read_json_auto for JSON, and the embedded schema for Parquet. Distinct counts come from SUMMARIZE's approx_unique column, a HyperLogLog estimate, whatever the input format: it is typically within a few percent of the true count on large datasets and at or very near the exact count on small ones. Load time is reported in milliseconds so you can compare formats: Parquet is usually 5 to 20 times faster than CSV at the same row count.
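One way to picture the format handling is a lookup from file extension to a DuckDB table function. This is a sketch, not the tool's code: source_expr and READERS are hypothetical names, and read_parquet is an assumption, since the text only says Parquet types come from the embedded schema.

```python
from pathlib import PurePosixPath

# Extension -> DuckDB table function. read_parquet is an assumption here;
# the page only names read_csv_auto and read_json_auto explicitly.
READERS = {
    ".csv": "read_csv_auto",
    ".json": "read_json_auto",
    ".parquet": "read_parquet",
}


def source_expr(filename: str) -> str:
    """Return a FROM-clause expression for a supported file name."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext not in READERS:
        raise ValueError(f"unsupported file type: {ext or filename}")
    escaped = filename.replace("'", "''")  # escape single quotes for SQL
    return f"{READERS[ext]}('{escaped}')"


if __name__ == "__main__":
    print(source_expr("orders.CSV"))
    print(source_expr("events.parquet"))
```

Matching on the lowercased suffix keeps files such as orders.CSV working without extra configuration.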
1. Pick a file
Click Choose file and pick a .csv, .json, or .parquet file. The DuckDB engine downloads on the first run and caches after that.
2. Profile runs automatically
The tool reads the file into DuckDB, counts rows, runs SUMMARIZE for per-column stats, and pulls three sample values per column.
3. Read the report
Each column card shows type, null percentage, distinct count, min, max, average, and samples. Top stats show rows, columns, file size, and load time.
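To make the card fields concrete, here is a small pure-Python sketch that computes the same kinds of numbers for one column held as a list, with None standing in for null. It is an illustration only: profile_column is a hypothetical name, the distinct count here is exact rather than approximate, and the values are assumed to share one comparable type.

```python
from statistics import fmean


def profile_column(values, n_samples=3):
    """Compute card-style statistics for one column (None means null)."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    # Treat the column as numeric only if every non-null value is int or float
    # (bool is excluded even though it subclasses int).
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    return {
        "null_percentage": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "distinct": len(set(non_null)),  # exact count, unlike approx_unique
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": fmean(non_null) if numeric else None,
        "samples": non_null[:n_samples],
    }


if __name__ == "__main__":
    print(profile_column([3, None, 1, 3]))
    print(profile_column(["b", None, "a"]))
```

The samples are simply the first non-null values in file order, which is usually enough to spot encoding or formatting problems.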
First look at a new dataset
Before writing ETL code, see types, null rates, and value ranges to plan your schema and validation rules.
Spot data quality issues
High null percentages, surprising distinct counts, or out-of-range min/max values flag columns that need cleaning.
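A minimal sketch of how such flags might be derived from one column's statistics follows. The function name, the dictionary keys, and the 20 percent null threshold are all hypothetical choices made for illustration; sensible thresholds depend on the dataset.

```python
def quality_flags(profile, row_count, max_null_pct=20.0, expected_range=None):
    """Return human-readable warnings for one column's statistics."""
    flags = []
    if profile["null_percentage"] > max_null_pct:
        flags.append("high null percentage")
    # A single distinct value across many rows is often a broken join or default.
    if profile["distinct"] <= 1 and row_count > 1:
        flags.append("constant column")
    if expected_range is not None and profile["min"] is not None:
        low, high = expected_range
        if profile["min"] < low or profile["max"] > high:
            flags.append("value outside expected range")
    return flags


if __name__ == "__main__":
    age = {"null_percentage": 42.0, "distinct": 57, "min": -5, "max": 130}
    print(quality_flags(age, row_count=1000, expected_range=(0, 120)))
```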
Compare CSV vs Parquet
Profile the same data in both formats to compare inferred types and see how much faster Parquet loads.
Verify an export matches expectations
After exporting from a database or warehouse, check that the row count and column types match what you intended to ship.
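If you keep the expected row count and column types written down, the comparison can be mechanical. The sketch below assumes you copy the observed values from the report by hand; export_mismatches and its arguments are hypothetical names, not part of the tool.

```python
def export_mismatches(expected_rows, expected_types, actual_rows, actual_types):
    """List differences between an expected and an observed file profile."""
    problems = []
    if actual_rows != expected_rows:
        problems.append(f"row count: expected {expected_rows}, got {actual_rows}")
    for col, want in expected_types.items():
        got = actual_types.get(col)
        if got is None:
            problems.append(f"{col}: missing")
        elif got != want:
            problems.append(f"{col}: expected {want}, got {got}")
    for col in actual_types:
        if col not in expected_types:
            problems.append(f"{col}: unexpected column")
    return problems


if __name__ == "__main__":
    expected = {"id": "BIGINT", "created_at": "TIMESTAMP"}
    observed = {"id": "BIGINT", "created_at": "VARCHAR"}
    print(export_mismatches(1000, expected, 998, observed))
```

A timestamp column that arrives as VARCHAR, as in the usage example, is a common sign that the export wrote dates in a format the reader could not infer.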
Does the file leave my browser?
No. The file is read into memory locally and passed to DuckDB-WASM running in a Web Worker. There is no upload step and no server call.
What does the distinct count mean?
It's DuckDB's approx_unique value from SUMMARIZE, computed with HyperLogLog. For small datasets it usually matches the exact count; for large ones it's accurate to within a few percent at a fraction of the memory cost.
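For intuition about how HyperLogLog trades accuracy for memory, here is a textbook-style sketch in pure Python. It is not DuckDB's implementation: the hash function (SHA-256 truncated to 64 bits), the register count (2^12), and the small-range correction are choices made for this illustration.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    width = 64 - p
    registers = [0] * m
    for v in values:
        digest = hashlib.sha256(repr(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash of the value
        idx = h >> width                         # top p bits choose a register
        rest = h & ((1 << width) - 1)            # remaining bits
        rank = width - rest.bit_length() + 1     # position of the leftmost 1-bit
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)           # linear counting for small sets
    return raw


if __name__ == "__main__":
    data = list(range(50_000)) * 2               # 50,000 distinct values, each twice
    print(round(hll_estimate(data)))
```

The memory cost is fixed at 4,096 small counters no matter how many rows are scanned, whereas an exact count has to remember every distinct value it has seen.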
Why is average blank for some columns?
Average is only meaningful for numeric columns. SUMMARIZE returns it for INT, DOUBLE, DECIMAL, and similar types. For strings, dates, and booleans the field is hidden.
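A sketch of how a report might decide whether to show the average, based on the column type string that SUMMARIZE reports. The function name and the exact list of type names are assumptions for illustration; the tool's own rule is not documented here.

```python
import re

# DuckDB numeric type names; parameterized forms such as DECIMAL(18,3)
# are reduced to their base name before the lookup.
NUMERIC_TYPES = {
    "TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT",
    "UTINYINT", "USMALLINT", "UINTEGER", "UBIGINT", "UHUGEINT",
    "FLOAT", "DOUBLE", "DECIMAL",
}


def shows_average(column_type: str) -> bool:
    """Return True if an average is meaningful for this column type."""
    base = re.split(r"[(\s]", column_type.strip().upper(), maxsplit=1)[0]
    return base in NUMERIC_TYPES


if __name__ == "__main__":
    for t in ["INTEGER", "DECIMAL(18,3)", "VARCHAR", "DATE", "BOOLEAN"]:
        print(t, shows_average(t))
```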
How big a file can it profile?
Up to your browser's per-tab memory limit, usually 2 to 4 GB. Parquet handles larger row counts than CSV because of compression and columnar layout.
What does load time include?
Wall-clock time from the moment you pick the file until all stats are rendered. It covers reading the file into a Uint8Array, registering it with DuckDB, running COUNT and SUMMARIZE, and pulling per-column samples.
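The measurement can be sketched as a single timer wrapped around all stages, with optional per-stage timings for diagnosis. The stage names and the timed_ms helper are hypothetical, and the sleeps stand in for real work; in the browser the equivalent clock would be performance.now() rather than time.perf_counter().

```python
import time


def timed_ms(stages):
    """Run (name, callable) stages in order; return total and per-stage ms."""
    per_stage = {}
    start = time.perf_counter()
    for name, fn in stages:
        t0 = time.perf_counter()
        fn()
        per_stage[name] = (time.perf_counter() - t0) * 1000.0
    total = (time.perf_counter() - start) * 1000.0
    return total, per_stage


if __name__ == "__main__":
    # Placeholder stages; real ones would read, register, and query the file.
    stages = [
        ("read", lambda: time.sleep(0.01)),
        ("summarize", lambda: time.sleep(0.02)),
    ]
    total, parts = timed_ms(stages)
    print(f"load time: {total:.0f} ms", parts)
```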