The default parquet library used by Polars (parquet2) does not support run-length encoding for data pages (RLE_DICTIONARY). As a result, parquet files created from sorted, low-cardinality data are much larger than they need to be, e.g. 4,028,562 KiB vs 17 KiB (see below).
PyArrow supports RLE_DICTIONARY, so files created with write_parquet(use_pyarrow=True) benefit from this encoding; however, use_pyarrow is not available with sink_parquet().
Generate Files:
import polars as pl
num_entries = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=num_entries) for x in range(250, 350)]).alias("latency"))
rg_size = 64 * 1024 * 1024
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")
df.write_parquet("file.zstd.parq", row_group_size=rg_size, compression="zstd")
df.write_parquet("file.pya.parq", row_group_size=rg_size, compression="uncompressed", use_pyarrow=True)
Print Metadata and File Size
import os
import pyarrow.parquet as pq
metadata = pq.read_metadata("file.parq")
file_size = int(os.path.getsize("file.parq") / 1024)
print(f"File size: {file_size:,d} KiB")
print(metadata)
print(metadata.row_group(0).column(0))
Confirming that the native write_parquet/sink_parquet functions now support RLE_DICTIONARY. Uncompressed test file size has dropped from 4,028,562 KiB to 15 KiB (23 KiB with statistics). Validated with the following code:
import polars as pl
rg_size = 10_000_000
df = pl.select(pl.concat([pl.repeat(x, n=rg_size) for x in range(250, 350)]).alias("latency"))
df.write_parquet("file.parq", row_group_size=rg_size, compression="uncompressed")
pl.Config.set_streaming_chunk_size(rg_size)
lf = pl.scan_parquet("file.parq").rename({"latency": "s-latency"})
lf.sink_parquet("file-sink.parq", row_group_size=rg_size, compression="uncompressed")