[Data] - Parquet writing/loading with GZIP compression does not work as expected. #47153

Closed
simonsays1980 opened this issue Aug 15, 2024 · 0 comments · Fixed by #47161
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P1: Issue that should be fixed within a few weeks

Comments

@simonsays1980
Collaborator

What happened + What you expected to happen

What happened

I wrote data via write_parquet to disk and used GZIP compression via arrow_open_stream_args={"compression": "gzip"}. I tried to load it later via read_parquet and the same arrow_open_stream_args, but ran into the following error:

Error creating dataset. Could not read schema from '/tmp/cartpole-columns/cartpole-v1/run-000001-00001/0_000000_000000.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '/tmp/cartpole-columns/cartpole-v1/run-000001-00001/0_000000_000000.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

What I expected to happen

That writing and loading of GZIPed Parquet data work in the same way (maybe reading needs some additional arguments).
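
For reference, Parquet stores its compression codec per column chunk, so compression is normally requested through the writer rather than through the output stream. A minimal pyarrow-only sketch of that route (the path here is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3]})
# Ask for GZIP as the Parquet-internal codec at write time ...
pq.write_table(table, "/tmp/example.parquet", compression="gzip")
# ... and read it back without any compression arguments; the codec is
# recorded in the file metadata and applied transparently on read.
restored = pq.read_table("/tmp/example.parquet")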

Versions / Dependencies

Python 3.11.7
Ray Master

Reproduction script

import ray
import ray.data
import numpy as np
import pandas as pd
import os
import pyarrow.parquet as pq

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Create a larger sample dataset
np.random.seed(42)  # For reproducibility
data = {
    'col1': np.random.randint(1, 100, size=10000),
    'col2': np.random.choice(['a', 'b', 'c', 'd', 'e'], size=10000),
    'col3': np.random.random(size=10000)
}

# Convert the data dictionary to a Ray dataset
df = pd.DataFrame(data)
ds = ray.data.from_pandas(df)

# Define directory paths
parquet_dir_compressed = 'output_compressed'
parquet_dir_uncompressed = 'output_uncompressed'

# Write to Parquet with gzip compression
ds.write_parquet(parquet_dir_compressed, arrow_open_stream_args={"compression": "gzip"})

# Write to Parquet without any compression
ds.write_parquet(parquet_dir_uncompressed)

# Function to calculate the total size of a directory
def calculate_directory_size(directory):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if os.path.isfile(fp):
                total_size += os.path.getsize(fp)
    return total_size

# Get the sizes of the Parquet directories
compressed_size = calculate_directory_size(parquet_dir_compressed)
uncompressed_size = calculate_directory_size(parquet_dir_uncompressed)

print(f"Compressed directory size: {compressed_size} bytes")
print(f"Uncompressed directory size: {uncompressed_size} bytes")

# Inspect Parquet metadata
def inspect_parquet_metadata(directory):
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if filepath.endswith('.parquet'):
                parquet_file = pq.ParquetFile(filepath)
                metadata = parquet_file.metadata
                print(f"File: {filepath}")
                print(f"Compression: {metadata.row_group(0).column(0).compression}")
                print(f"Total Rows: {metadata.num_rows}")
                print(f"Row Group Count: {metadata.num_row_groups}")

# Inspect metadata of the compressed and uncompressed Parquet files
print("\nCompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_compressed)

print("\nUncompressed Parquet Metadata:")
inspect_parquet_metadata(parquet_dir_uncompressed)

# Shutdown Ray to clean up resources
ray.shutdown()
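
The script above stops after writing and inspecting the files; the failing load described at the top would look roughly like this (a sketch assuming read_parquet accepts the same arrow_open_stream_args that were passed to write_parquet):

# Attempt to read the gzip-compressed output back with the same
# arrow_open_stream_args; this is the call that raises the
# "Parquet magic bytes not found in footer" error shown above.
ds_back = ray.data.read_parquet(
    parquet_dir_compressed, arrow_open_stream_args={"compression": "gzip"}
)
ds_back.show(3)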

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@simonsays1980 added the bug and triage labels on Aug 15, 2024
@scottjlee added the P1 and data labels and removed the triage label on Aug 15, 2024