[BUG] Empty or null list(s) results in scrambled data #120

Closed
chris-branch opened this issue Sep 30, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

@chris-branch

Parquet Viewer Version
2.10.1.1

Where was the parquet file created?
Parquet.NET

Description
The code that parses list/array columns appears to mishandle empty and null values. If a column has a list/array type and some rows hold either an empty list (0 elements) or null in that column, ParquetViewer shows the data mixed up across rows.

All examples below use the following row class:

    internal class TestRow
    {
        public string Column1 { get; set; }
        public List<double> Column2 { get; set; }

        public TestRow(string column1, List<double> column2)
        {
            Column1 = column1;
            Column2 = column2;
        }
    }

Example 1: This has no nulls or empty values and works as expected:

    List<TestRow> data1 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data1, @"sample1.parquet").Wait();

[Screenshot: sample1]

Example 2: This has an empty list in row 1 and results in scrambled data in rows 1-3:

    List<TestRow> data2 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double>()),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data2, @"sample2.parquet").Wait();

[Screenshot: sample2]

Example 3: This has an empty list in row 2 and results in scrambled data in rows 2-3:

    List<TestRow> data3 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double>()),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data3, @"sample3.parquet").Wait();

[Screenshot: sample3]

Sample files
sample_parquets.zip

chris-branch added the bug label Sep 30, 2024
@AndreiYachmeneu

Here is another example:

import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([["dog", "cat"], [], None], type=pa.list_(pa.string()))
tbl = pa.table([arr], names=['animals'])

pq.write_table(tbl, "animals.parquet")
print(pq.read_table("animals.parquet").to_pandas())

[Image: output of the script above]

None displays as [] in ParquetViewer:
[Image: ParquetViewer screenshot]
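
For context on why None and [] should be distinguishable in the file at all, here is a minimal sketch of how a list column is typically flattened into (repetition level, definition level, value) entries. It assumes the standard three-level LIST layout with a nullable list and nullable elements (maximum definition level 3, maximum repetition level 1). The function name `shred_list_column` is made up for illustration and is not part of pyarrow, Parquet.NET, or ParquetViewer.

```python
from typing import List, Optional, Tuple

# One flattened entry: (repetition level, definition level, value or None).
Entry = Tuple[int, int, Optional[str]]


def shred_list_column(rows: List[Optional[List[Optional[str]]]]) -> List[Entry]:
    """Flatten rows of an optional list-of-optional-strings column.

    Assumed definition levels (hypothetical schema, see lead-in):
      0 = the list itself is null
      1 = the list exists but is empty
      2 = an element slot exists but the element is null
      3 = the element is present
    Repetition level 0 starts a new row, 1 continues the current list.
    """
    entries: List[Entry] = []
    for row in rows:
        if row is None:
            entries.append((0, 0, None))      # null list: one entry, no value
        elif len(row) == 0:
            entries.append((0, 1, None))      # empty list: one entry, no value
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1
                if item is None:
                    entries.append((rep, 2, None))
                else:
                    entries.append((rep, 3, item))
    return entries


levels = shred_list_column([["dog", "cat"], [], None])
for entry in levels:
    print(entry)
```

Under these assumptions the empty list and the null list differ only in their definition level (1 versus 0), so a viewer has to read that level to tell them apart.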

mukunku mentioned this issue Dec 23, 2024
@mukunku (Owner) commented Dec 23, 2024

I really appreciate the detailed examples, sample code, and sample files! It made solving this issue much easier. Please try out v3.2.0.0 which should handle null/empty Lists correctly.

It appears that different Parquet writers write this data slightly differently, so I had to adjust the code to accommodate them.
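
As a rough illustration of what handling null/empty lists involves on the reading side, the sketch below rebuilds rows from (repetition level, definition level, value) entries. It is not the ParquetViewer fix, which is not shown in this thread. It assumes the standard three-level LIST layout with a nullable list and nullable elements, and `assemble_list_column` is a hypothetical name.

```python
from typing import List, Optional, Tuple

# One flattened entry: (repetition level, definition level, value or None).
Entry = Tuple[int, int, Optional[str]]


def assemble_list_column(entries: List[Entry]) -> List[Optional[List[Optional[str]]]]:
    """Rebuild rows from flattened entries.

    Assumed definition levels (hypothetical schema, see lead-in):
      0 = null list, 1 = empty list, 2 = null element, 3 = present element.
    Repetition level 0 starts a new row, 1 appends to the current row.
    """
    rows: List[Optional[List[Optional[str]]]] = []
    for rep, definition, value in entries:
        if rep == 0:
            if definition == 0:
                rows.append(None)   # null list still occupies a row
                continue
            rows.append([])         # start a new (possibly empty) list
            if definition == 1:
                continue            # empty list: row exists, no elements
        # definition >= 2 here: an element slot, null or present
        rows[-1].append(value if definition == 3 else None)
    return rows


entries = [(0, 3, "dog"), (1, 3, "cat"), (0, 1, None), (0, 0, None)]
rows = assemble_list_column(entries)
print(rows)
```

The point of the sketch is that null and empty lists each consume one level entry but no element, so a reader that skips those entries would attach later values to the wrong rows. Whether that was the actual cause here is not stated in the thread.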

mukunku closed this as completed Dec 23, 2024