
[BUG]: Can't read Parquet file with fixed_len_byte_array_column data type generated by Python #463

Closed
AndrewDavidLees opened this issue Jan 22, 2024 · 5 comments

@AndrewDavidLees

Library Version

4.23.1 (and earlier)

OS

Windows

OS Architecture

64 bit

How to reproduce?

  1. Created a Parquet file in Python (this could already be the source of the issue). See CreateParquetFile2.py. One column is declared as ('fixed_len_byte_array_column', pa.binary(3)) and populated with 'fixed_len_byte_array_column': [b'abc', b'def', b'ghi', b'jkl', b'mno', b'qrs']. See dataTypesExample.parquet. A sketch reconstructing this script appears after these steps.
  2. Read the file in Parquet.NET using C#. The current exception is:
    "Specified argument was out of the range of valid values."
    at System.ThrowHelper.ThrowArgumentOutOfRangeException()
    at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data)
    at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead)
    at Parquet.File.DataColumnReader.d__12.MoveNext()
    at Parquet.File.DataColumnReader.d__9.MoveNext()
    at Emb.PricingSuite.ProcessingEngine.Implementation.ParquetFileDataReader.d__53.MoveNext() in C:\Work\Radar4\Emb.PricingSuite.PE.Components.Data\src\ConnectionHandling\Parquet\ParquetFileDataReader.cs:line 325
  3. The previous exception in 4.10 complained about a startValue being out of range; I upgraded to see whether it made any difference.
    BugReport.zip
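
For reference, a minimal sketch of the kind of script described in step 1, using pyarrow (the actual CreateParquetFile2.py is in BugReport.zip, so the file name and any extra columns here are assumptions):

    # Approximate reconstruction of CreateParquetFile2.py from the steps above;
    # the real script in BugReport.zip may differ in file name and other columns.
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ('fixed_len_byte_array_column', pa.binary(3)),  # FIXED_LEN_BYTE_ARRAY(3)
    ])
    table = pa.table(
        {'fixed_len_byte_array_column': [b'abc', b'def', b'ghi', b'jkl', b'mno', b'qrs']},
        schema=schema,
    )
    pq.write_table(table, 'dataTypesExample.parquet')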

Failing test

No response

@aloneguid aloneguid self-assigned this Jan 22, 2024
@aloneguid
Owner

Thanks for this. Looks like it's failing on decoding the byte array length, which is encoded as a really large value:

[screenshot: debugger showing the decoded byte array length as a very large value]
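
For context: in Parquet's PLAIN encoding, a BYTE_ARRAY value is stored as a 4-byte little-endian length followed by the payload, while FIXED_LEN_BYTE_ARRAY values carry no per-value length at all; the element width comes from the schema's type_length. A small illustrative sketch (not the parquet-dotnet decoder) of why misreading fixed-length data as length-prefixed produces a huge length:

    # Illustrative only: simplified PLAIN decoders for the two byte-array types.
    import struct

    def decode_plain_byte_array(buf: bytes) -> list:
        # BYTE_ARRAY: 4-byte little-endian length prefix before each value.
        values, pos = [], 0
        while pos < len(buf):
            (length,) = struct.unpack_from('<i', buf, pos)
            pos += 4
            values.append(buf[pos:pos + length])
            pos += length
        return values

    def decode_plain_fixed_len_byte_array(buf: bytes, type_length: int) -> list:
        # FIXED_LEN_BYTE_ARRAY: no prefixes; the width comes from the schema.
        return [buf[i:i + type_length] for i in range(0, len(buf), type_length)]

    raw = b'abcdefghi'  # three FIXED_LEN_BYTE_ARRAY(3) values, back to back
    print(decode_plain_fixed_len_byte_array(raw, 3))  # [b'abc', b'def', b'ghi']
    # Treating the same bytes as length-prefixed reads b'abcd' as the length:
    print(struct.unpack_from('<i', raw, 0)[0])        # 1684234849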

What's worse, it's also failing to read in Apache Spark:

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
	at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1317)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:191)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:269)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:209)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:173)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:138)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:108)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:108)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:78)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:577)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:577)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:557)
	at scala.collection.immutable.Stream.map(Stream.scala:418)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:557)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:549)
	at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
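
The Spark failure looks like a separate problem with the same file: it also contains an INT64 TIMESTAMP(NANOS) column, which Spark's Parquet schema converter rejects. Assuming the Python script writes such a timestamp column (it isn't shown in the reproduction steps, so the column name below is hypothetical), casting it to microseconds before writing should sidestep that error:

    # Hypothetical workaround for the Spark error above: store timestamps as
    # microseconds, which Spark accepts, instead of nanoseconds.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({'timestamp_column': pa.array([0, 1_000_000], type=pa.timestamp('ns'))})
    table = table.cast(pa.schema([('timestamp_column', pa.timestamp('us'))]))
    pq.write_table(table, 'dataTypesExample.parquet')

Alternatively, pyarrow's pq.write_table accepts a coerce_timestamps='us' argument to the same effect.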

@aloneguid
Owner

I think I've fixed it:

[screenshot: the file now decoding successfully after the fix]

@aloneguid aloneguid added this to the 4.23.2 milestone Jan 22, 2024
@aloneguid
Owner

fix released, please give it a go ;)
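
(Assuming the fix shipped with the 4.23.2 milestone tagged above, upgrading should be e.g. dotnet add package Parquet.Net --version 4.23.2.)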

@AndrewDavidLees
Author

Wow, that's amazing, thanks Ivan! Seems to be working perfectly fine now :)

@aloneguid
Owner

I'm glad to hear that everything is working perfectly now! If you found my assistance helpful, a star would be appreciated. Thank you! 😊
