[Data] - Parquet writing/loading with GZIP compression does not work as expected. #47153

Closed
simonsays1980 opened this issue Aug 15, 2024 · 0 comments · Fixed by #47161
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P1: Issue that should be fixed within a few weeks

Comments

@simonsays1980
Collaborator

What happened + What you expected to happen

What happened

I wrote data via write_parquet to disk and used GZIP compression via arrow_open_stream_args={"compression": "gzip"}. I tried to load it later via read_parquet and the same arrow_open_stream_args, but ran into the following error:

Error creating dataset. Could not read schema from '/tmp/cartpole-columns/cartpole-v1/run-000001-00001/0_000000_000000.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '/tmp/cartpole-columns/cartpole-v1/run-000001-00001/0_000000_000000.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

What I expected to happen

That writing and loading of GZIPed Parquet data work in the same way (maybe reading needs some additional arguments).
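
For reference, Parquet stores its compression codec per column chunk, so compression is normally requested through the writer rather than through the output stream. A minimal pyarrow-only sketch of that route (the path here is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3]})
# Ask for GZIP as the Parquet-internal codec at write time ...
pq.write_table(table, "/tmp/example.parquet", compression="gzip")
# ... and read it back without any compression arguments; the codec is
# recorded in the file metadata and applied transparently on read.
restored = pq.read_table("/tmp/example.parquet")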

Versions / Dependencies

Python 3.11.7
Ray Master

Reproduction script

import ray
import ray.data
import numpy as np
import pandas as pd
import os
import pyarrow.parquet as pq

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Create a larger sample dataset
np.random.seed(42)  # For reproducibility
data = {
    'col1': np.random.randint(1, 100, size=10000),
    'col2': np.random.choice(['a', 'b', 'c', 'd', 'e'], size=10000),
    'col3': np.random.random(size=10000)
}

# Convert the data dictionary to a Ray dataset
df = pd.DataFrame(data)
ds = ray.data.from_pandas(df)

# Define directory paths
parquet_dir_compressed = 'output_compressed'
parquet_dir_uncompressed = 'output_uncompressed'

# Write to Parquet with gzip compression
ds.write_parquet(parquet_dir_compressed, arrow_open_stream_args={"compression": "gzip"})

# Write to Parquet without any compression
ds.write_parquet(parquet_dir_uncompressed)

# Function to calculate the total size of a directory
def calculate_directory_size(directory):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if os.path.isfile(fp):
                total_size += os.path.getsize(fp)
    return total_size

# Get the sizes of the Parquet directories
compressed_size = calculate_directory_size(parquet_dir_compressed)
uncompressed_size = calculate_directory_size(parquet_dir_uncompressed)

print(f"Compressed directory size: {compressed_size} bytes")
print(f"Uncompressed directory size: {uncompressed_size} bytes")

# Inspect Parquet metadata
def inspect_parquet_metadata(directory):
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if filepath.endswith('.parquet'):
                parquet_file = pq.ParquetFile(filepath)
                metadata = parquet_file.metadata
                print(f"File: {filepath}")
                print(f"Compression: {metadata.row_group(0).column(0).compression}")
                print(f"Total Rows: {metadata.num_rows}")
                print(f"Row Group Count: {metadata.num_row_groups}")

# Inspect metadata of the compressed and uncompressed Parquet files
print("\nCompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_compressed)

print("\nUncompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_uncompressed)

# Shutdown Ray to clean up resources
ray.shutdown()
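
The script above stops after writing and inspecting the files; the failing load described at the top would look roughly like this (a sketch assuming read_parquet accepts the same arrow_open_stream_args that were passed to write_parquet):

# Attempt to read the gzip-compressed output back with the same
# arrow_open_stream_args; this is the call that raises the
# "Parquet magic bytes not found in footer" error shown above.
ds_back = ray.data.read_parquet(
    parquet_dir_compressed, arrow_open_stream_args={"compression": "gzip"}
)
ds_back.show(3)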

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@simonsays1980 added the bug and triage labels on Aug 15, 2024
@scottjlee added the P1 and data labels and removed the triage label on Aug 15, 2024