What happened + What you expected to happen

What happened

I wrote data to disk via `write_parquet` and used GZIP compression via `arrow_open_stream_args={"compression": "gzip"}`. I tried to load it later via `read_parquet` and the same `arrow_open_stream_args`, but ran into the following error:
What I expected to happen

That writing and loading of GZIP-compressed Parquet data works in the same way (or that reading needs some other arguments).
Versions / Dependencies
Python 3.11.7
Ray Master
Reproduction script
```python
import ray
import ray.data
import numpy as np
import pandas as pd
import os
import pyarrow.parquet as pq

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Create a larger sample dataset
np.random.seed(42)  # For reproducibility
data = {
    'col1': np.random.randint(1, 100, size=10000),
    'col2': np.random.choice(['a', 'b', 'c', 'd', 'e'], size=10000),
    'col3': np.random.random(size=10000)
}

# Convert the data dictionary to a Ray dataset
df = pd.DataFrame(data)
ds = ray.data.from_pandas(df)

# Define directory paths
parquet_dir_compressed = 'output_compressed'
parquet_dir_uncompressed = 'output_uncompressed'

# Write to Parquet with gzip compression
ds.write_parquet(parquet_dir_compressed, arrow_open_stream_args={"compression": "gzip"})

# Write to Parquet without any compression
ds.write_parquet(parquet_dir_uncompressed)

# Function to calculate the total size of a directory
def calculate_directory_size(directory):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if os.path.isfile(fp):
                total_size += os.path.getsize(fp)
    return total_size

# Get the sizes of the Parquet directories
compressed_size = calculate_directory_size(parquet_dir_compressed)
uncompressed_size = calculate_directory_size(parquet_dir_uncompressed)

print(f"Compressed directory size: {compressed_size} bytes")
print(f"Uncompressed directory size: {uncompressed_size} bytes")

# Inspect Parquet metadata
def inspect_parquet_metadata(directory):
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if filepath.endswith('.parquet'):
                parquet_file = pq.ParquetFile(filepath)
                metadata = parquet_file.metadata
                print(f"File: {filepath}")
                print(f"Compression: {metadata.row_group(0).column(0).compression}")
                print(f"Total Rows: {metadata.num_rows}")
                print(f"Row Group Count: {metadata.num_row_groups}")

# Inspect metadata of the compressed and uncompressed Parquet files
print("\nCompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_compressed)
print("\nUncompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_uncompressed)

# Shutdown Ray to clean up resources
ray.shutdown()
```
Issue Severity
Medium: It is a significant difficulty but I can work around it.
simonsays1980 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Aug 15, 2024.
scottjlee added the P1 (Issue that should be fixed within a few weeks) and data (Ray Data-related issues) labels and removed the triage label on Aug 15, 2024.