Parquet Metadata Inspector: Row Groups, Compression, Encodings
See file version, row counts, row groups, compression, and per-column encoded sizes in any Parquet file.
Open a .parquet file and see everything the footer exposes: format version, total row count, number of row groups, the writer's identifier, custom key-value pairs, and a per-row-group breakdown showing each column's codec, encodings, compressed bytes, uncompressed bytes, and value count. It's the same view a Parquet CLI would print, rendered as readable tables.
Every Parquet file ends with a footer that holds the schema, a list of row groups, and the column metadata inside each group. Engines like Spark and DuckDB use this metadata to plan reads: they look at row-group statistics to skip groups that can't match a filter, and at encoding and codec info to decide how to decode pages. When something is wrong, the footer is also where the answer lives.
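As a minimal sketch of where that footer sits: a Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic PAR1, so the footer's byte range can be found from the last 8 bytes alone. The Python below only locates the range; decoding the Thrift-encoded metadata inside it is left to a real reader. The name `locate_footer` and the synthetic test bytes are illustrative, not part of any library.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the Thrift-encoded footer inside `data`."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length, little-endian.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - footer_len
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, footer_len


# Synthetic stand-in for a file: magic, page bytes, footer bytes, length, magic.
fake_footer = b"\x00" * 10
fake_file = MAGIC + b"pagedata" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
offset, length = locate_footer(fake_file)
print(f"footer starts at byte {offset} and is {length} bytes long")
```

Because only the tail of the file is needed, a reader can fetch the last few kilobytes of a large file and still recover the full footer in most cases.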
This inspector parses the footer locally and surfaces the fields that matter most when triaging a file. The summary card shows the format version, total row count, number of row groups, column count, and the created_by string that identifies the writer (parquet-mr, pyarrow, DuckDB, and so on). The key-value metadata section exposes producer-specific entries such as Arrow schema dumps, geo bounding boxes, or pandas index hints.
For each row group you get the per-column compression codec (SNAPPY, GZIP, ZSTD), the encoding list (PLAIN, RLE, RLE_DICTIONARY, DELTA_BINARY_PACKED), the compressed and uncompressed byte counts, and the value count. That breakdown is enough to compute compression ratios per column, spot a missing dictionary encoding, or notice that a writer accidentally landed everything in a single oversized row group.
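The per-column arithmetic described above can be sketched in a few lines of Python. The `ColumnChunk` class is a hypothetical flattened view of one column's footer metadata, not the data structure of any particular reader, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Hypothetical flattened view of one column chunk's footer metadata."""
    name: str
    codec: str
    encodings: list[str]
    compressed_bytes: int
    uncompressed_bytes: int
    num_values: int


# Parquet has a legacy and a current dictionary encoding; either counts.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def compression_ratio(chunk: ColumnChunk) -> float:
    """Uncompressed size over compressed size; above 1.0 means the codec helped."""
    if chunk.compressed_bytes == 0:
        return 0.0
    return chunk.uncompressed_bytes / chunk.compressed_bytes


def lacks_dictionary(chunk: ColumnChunk) -> bool:
    """True when none of the chunk's encodings is a dictionary encoding."""
    return DICTIONARY_ENCODINGS.isdisjoint(chunk.encodings)


# Made-up sample data for two columns in one row group.
chunks = [
    ColumnChunk("country", "SNAPPY", ["PLAIN", "RLE", "RLE_DICTIONARY"], 2_000, 10_000, 50_000),
    ColumnChunk("comment", "SNAPPY", ["PLAIN", "RLE"], 8_000, 10_000, 50_000),
]
for chunk in chunks:
    flag = " (no dictionary)" if lacks_dictionary(chunk) else ""
    print(f"{chunk.name}: ratio {compression_ratio(chunk):.2f}{flag}")
```

A ratio below 1.0 is the case where compressed bytes exceed uncompressed bytes, which can happen on very small chunks.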
1. Open the file
Pick a .parquet file. Only the footer is read, so the operation is fast even for very large files.
2. Read the summary
Version, total rows, number of row groups, footer size, and the created_by writer identifier all surface at the top.
3. Drill into row groups
Each row group expands into a column-level table with codec, encodings, and compressed versus uncompressed byte counts.
Diagnose a slow query
If one column dominates the file size, any query that reads it has to fetch and decode most of the file's bytes. The compressed-bytes figure for each column makes that obvious.
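One way to make that check concrete is to sum each column's compressed bytes across all row groups and express the result as a share of the total. This is a sketch under the assumption that each row group has been reduced to a plain dict of column name to compressed bytes; that shape and the sample numbers are hypothetical.

```python
from collections import defaultdict


def column_size_shares(row_groups: list[dict[str, int]]) -> dict[str, float]:
    """Map column name to its fraction of total compressed bytes.

    Each row group is a hypothetical dict of {column_name: compressed_bytes}.
    """
    totals: dict[str, int] = defaultdict(int)
    for group in row_groups:
        for column, compressed in group.items():
            totals[column] += compressed
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {column: size / grand_total for column, size in totals.items()}


# Made-up sample: two row groups, three columns.
row_groups = [
    {"id": 100, "payload": 800, "ts": 100},
    {"id": 100, "payload": 700, "ts": 200},
]
shares = column_size_shares(row_groups)
heaviest = max(shares, key=shares.get)
print(f"{heaviest} holds {shares[heaviest]:.0%} of the compressed bytes")
```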
Spot a writer bug
The created_by string shows which library and version wrote the file. Match it against known issues for that parquet-mr or pyarrow release.
Confirm row-group sizing
Common guidance puts row groups at roughly 128 to 512 MB. Compare that range against the sizes reported for your file to decide whether to repartition.
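Taking the 128 to 512 MB range above as the target, a quick triage can bucket each row group as small, ok, or large. The sketch below assumes you already have one byte count per row group; which size to compare (compressed or uncompressed) is a judgment call the text does not settle, so the function just takes whatever numbers you pass in. The function names are illustrative.

```python
MIB = 1024 * 1024
LOW, HIGH = 128 * MIB, 512 * MIB  # target range taken from the guidance above


def classify_row_group(total_bytes: int, low: int = LOW, high: int = HIGH) -> str:
    """Bucket a row group's size relative to the target range."""
    if total_bytes < low:
        return "small"
    if total_bytes > high:
        return "large"
    return "ok"


def summarize_sizes(sizes: list[int]) -> dict[str, int]:
    """Count how many row groups fall into each bucket."""
    counts = {"small": 0, "ok": 0, "large": 0}
    for size in sizes:
        counts[classify_row_group(size)] += 1
    return counts


# Made-up sizes for three row groups.
sizes = [4 * MIB, 200 * MIB, 900 * MIB]
summary = summarize_sizes(sizes)
print(summary)
```

A file where nearly every group lands in "small" is a candidate for coalescing; a single "large" group holding every row suggests the writer never flushed.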
Audit Arrow or geo metadata
GeoParquet and pandas embed JSON in key-value metadata, and Arrow adds a base64-encoded schema under the ARROW:schema key. The inspector shows the values raw so you can verify CRS, geometry types, or pandas index hints.
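As a sketch of that kind of audit, the Python below assumes the key-value metadata has been collected into a plain dict of strings and that the file follows the GeoParquet convention of a JSON document under the `geo` key with a `columns` object. The function name and the sample document are illustrative only.

```python
import json


def summarize_geo(key_value_metadata: dict[str, str]) -> dict[str, dict]:
    """Pull the per-column fields worth checking out of a GeoParquet 'geo' entry."""
    raw = key_value_metadata.get("geo")
    if raw is None:
        return {}  # not a GeoParquet file, or the writer dropped the key
    geo = json.loads(raw)
    summary = {}
    for name, column in geo.get("columns", {}).items():
        summary[name] = {
            "encoding": column.get("encoding"),
            "geometry_types": column.get("geometry_types", []),
            # An absent crs is not necessarily an error: GeoParquet defines a default.
            "has_explicit_crs": "crs" in column,
            "bbox": column.get("bbox"),
        }
    return summary


# Made-up sample metadata resembling what a GeoParquet writer might emit.
sample = {
    "geo": json.dumps({
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {
                "encoding": "WKB",
                "geometry_types": ["Polygon"],
                "bbox": [-10.0, 35.0, 5.0, 45.0],
            }
        },
    })
}
geo_summary = summarize_geo(sample)
print(geo_summary)
```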
Does this upload anything to a server?
No. The parsing runs entirely in your browser using hyparquet. The file stays on your machine.
Why is the row group count so large?
Writers like Spark sometimes emit many small row groups, which hurts scan performance. The row-group count and sizes shown here tell you whether to coalesce them.
What does the encodings list mean for each column?
Parquet supports several page encodings (PLAIN, RLE, RLE_DICTIONARY, DELTA_BINARY_PACKED, and others). A column lists every encoding it actually used across its pages.
Why is the compressed size sometimes larger than uncompressed?
For tiny columns the codec headers and dictionary overhead can exceed the raw bytes. It's a real artifact of small row groups, not a bug in the reader.
Where does the created_by string come from?
The writer stamps it in the footer. Examples: 'parquet-mr version 1.13.1', 'parquet-cpp-arrow version 14.0.2', 'DuckDB'.
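To match a created_by value against known issues, it helps to split it into an application name and a version. The regular expression below is a sketch that handles the example shapes above plus an optional trailing "(build ...)" suffix; real writers may use other layouts, so treat the pattern and the `parse_created_by` name as assumptions rather than a specification.

```python
import re
from typing import Optional

# app name, then optional "version X", then optional "(build Y)".
CREATED_BY = re.compile(
    r"^(?P<app>.+?)"
    r"(?:\s+version\s+(?P<version>\S+))?"
    r"(?:\s+\(build\s+(?P<build>[^)]*)\))?\s*$"
)


def parse_created_by(value: str) -> dict[str, Optional[str]]:
    """Split a created_by string into app, version, and build (each may be None)."""
    match = CREATED_BY.match(value.strip())
    if match is None:
        return {"app": None, "version": None, "build": None}
    return match.groupdict()


for example in ("parquet-mr version 1.13.1", "parquet-cpp-arrow version 14.0.2", "DuckDB"):
    print(parse_created_by(example))
```

Strings that carry no version, such as the bare 'DuckDB' example, come back with `version` set to `None`, so callers should handle that case before comparing releases.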