Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Type mismatch silently leads to incorrect de-serialization result #573

Closed
dkotov opened this issue Nov 25, 2024 · 7 comments
Closed
Milestone

Comments

@dkotov
Copy link

dkotov commented Nov 25, 2024

Library Version

5.0.2

OS

Ubuntu Linux 22.04

OS Architecture

64 bit

How to reproduce?

  1. create parquet file with Int64 non-zero values
  2. define class for de-serialization with this property defined with Int32 type
  3. try to de-serialize values from the file into the class objects

Result: de-serialization completes without errors/warnings but values in objects don't match values in the file, e.g. 3 instead of 1.

Failing test

No response

@aloneguid aloneguid added the cannot reproduce Either it's not reproducible in general, or reproducing it will take unpractical amount of time. label Nov 25, 2024
@aloneguid
Copy link
Owner

aloneguid commented Nov 25, 2024

Not reproducible, here is the test proving it:

        class EdgeCaseInt32 {
            public int Id { get; set; }
        }

        [Fact]
        public async Task EdgeCase_rawint64_to_classInt32() {
            var schema = new ParquetSchema(new DataField<long>("Id"));
            using var ms = new MemoryStream();
            using(ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms)) {
                using(ParquetRowGroupWriter rg = writer.CreateRowGroup()) {
                    await rg.WriteColumnAsync(new DataColumn(schema.DataFields[0], new long[] { 1, 2, 3 }));
                }
            }
            ms.Position = 0;

            IList<EdgeCaseInt32> data = await ParquetSerializer.DeserializeAsync<EdgeCaseInt32>(ms);

            Assert.Equal(1, data[0].Id);
            Assert.Equal(2, data[1].Id);
            Assert.Equal(3, data[2].Id);

        }

Feel free to reopen with reproducible test if I didn't understand you correctly.

aloneguid added a commit that referenced this issue Nov 25, 2024
@dkotov
Copy link
Author

dkotov commented Nov 26, 2024

You are right, I did miss one detail: to reproduce this issue one needs a file with the following schema (without logical type/hint):

message spark_schema {
  optional int64 Id;
}

If I get it correctly, the provided test actually replicates the following schema (with logical type/hint):

message root {
  required int64 Id (INTEGER(64,true));
}

That's why it doesn't reproduce the issue.
Unfortunately, I'm not sure how to simulate the first schema with Parquet.Net code - need your help here.

But I attached a sample file with the following content:

parquet-tools cat ./no-logical-type.parquet
# output
[{"Id":1},{"Id":2},{"Id":3}]

And here is an integration test for it:

class EdgeCaseInt32 {
    public int? Id { get; set; }
}

[Fact]
public async Task EdgeCase_fileInt64_to_classInt32() {
    IList<EdgeCaseInt32> data = await ParquetSerializer.DeserializeAsync<EdgeCaseInt32>("no-logical-type.parquet");

    Assert.Equal(1, data[0].Id); // Actual: 1 - SUCCESS
    Assert.Equal(2, data[1].Id); // Actual: 0 - FAILURE
    Assert.Equal(3, data[2].Id); // Actual: 2 - FAILURE
}

@dkotov
Copy link
Author

dkotov commented Nov 26, 2024

@aloneguid please reopen the issue on my behalf as I don't have required permissions to do it. thanks!

@dkotov
Copy link
Author

dkotov commented Dec 3, 2024

@aloneguid please reopen this issue or let me know if provided details are not enough

@aloneguid aloneguid reopened this Dec 4, 2024
@aloneguid aloneguid removed the cannot reproduce Either it's not reproducible in general, or reproducing it will take unpractical amount of time. label Dec 4, 2024
@aloneguid
Copy link
Owner

Checking the sample now, hold on. Floor definitely displays your file correctly.

image

image

@aloneguid aloneguid added this to the 5.0.3 milestone Dec 4, 2024
@aloneguid
Copy link
Owner

Got it, the issue is that your class should be defined with optional Int64 rather than Int32, then deserialization will work:

class EdgeCase {
   public long? Id { get; set; }
}

I have added stricter validation for this use case to v5.0.3 as well.

@dkotov
Copy link
Author

dkotov commented Dec 4, 2024

Great!
Thank you for the fix @aloneguid . Resolving...

@dkotov dkotov closed this as completed Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants