[SPARK-23388][SQL] Support for Parquet Binary DecimalType in VectorizedColumnReader #20580
Conversation
@@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
    */
   protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);
 
-  protected boolean isArray() {
-    return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
Would it be better to minimize this change? I.e. just protected -> public, without changing the place.
Happy to keep it in the previous place, but I noticed that all the public methods in this class are grouped together, so I thought I should move this to keep that consistent.
I noticed this change in the WritableColumnVector class. WritableColumnVector will be a public API in Spark 2.3, so it seems necessary to discuss changing the visibility of a method or field with @cloud-fan.
@@ -444,7 +444,7 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
     // This is where we implement support for the valid type conversions.
     // TODO: implement remaining type conversions
     VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
-    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
+    if (column.isArray()) {
Do we need new test cases for supporting a new type? IIUC, does this PR add support for the array type, too?
Yeah, I believe it will add support for the array type too. Spark actually supported these types previously, but the support was removed in this PR: 9c29c55#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428. I'm just trying to add it back.
@cloud-fan Is there any reason that the above PR removed support for some types, such as binary DecimalType?
As far as we can tell this was an accidental breaking change, as dropping support for this in the vectorized Parquet reader was never called out. We have Parquet datasets with binary columns with logical type DECIMAL that were loadable before the change and have since become unloadable, throwing an exception in the vectorized reader.
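For anyone hitting the same regression, here is a minimal reproduction/workaround sketch (not from this PR): it assumes an existing Parquet file at a hypothetical path whose DECIMAL column uses the BINARY primitive type, and relies on the fact that disabling the vectorized reader bypasses VectorizedColumnReader entirely.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and file path; the file is assumed to contain a
// column written as Parquet BINARY with logical type DECIMAL (e.g. by another tool).
val spark = SparkSession.builder().master("local[1]").appName("binary-decimal-repro").getOrCreate()
val path = "/tmp/binary-decimal.parquet"

// Workaround before this fix: fall back to the non-vectorized Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet(path).show()

// With the vectorized reader enabled (the default), reading this column failed
// before this change and should succeed once the fix is in.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.read.parquet(path).show()
```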
@kiszk - I've changed the implementation to no longer use isArray. As for testing, it might be easiest to check in a parquet file with the binary decimal format and then check that Spark can read it?
If you do add a test, it should probably generate the Parquet file programmatically rather than checking it in; there are some examples in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala. However, given that there weren't tests before and this just fixes a break that was recently introduced, it seems reasonable that creating tests for the vectorized reader can be done in a follow-up?
It was an accident, thanks for the fix! Can we add a test? It's always good to have a test for a bug fix, even if the bug was introduced recently.
I'll see if I can generate a parquet file with the right schema to add for a test, but probably cannot look at this till tomorrow.
ok to test
You don't need to generate the parquet file manually, just write a parquet file using Spark and read it back. We can probably add this test in one of the existing Parquet test suites.
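As a rough illustration of that suggestion, here is a minimal round-trip sketch (not the PR's actual test; the path and names are made up, and whether Spark's writer encodes the column as BINARY or FIXED_LEN_BYTE_ARRAY depends on writer settings). A decimal with precision above 18 is a byte-array-backed decimal on the Catalyst side, which is the case this fix covers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[1]").appName("decimal-roundtrip").getOrCreate()
import spark.implicits._

val path = "/tmp/decimal-roundtrip.parquet" // hypothetical temp path

// Precision 25 > 18, so the column is a byte-array decimal rather than an int/long-backed one.
val df = Seq("12345678901234567890.12", "-0.01")
  .toDF("s")
  .select(col("s").cast(DecimalType(25, 2)).as("d"))
df.write.mode("overwrite").parquet(path)

// Read it back with the vectorized reader enabled (the default) and compare values.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
val readBack = spark.read.parquet(path).collect().map(_.getDecimal(0).toPlainString).toSet
assert(readBack == Set("12345678901234567890.12", "-0.01"))
```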
@@ -444,7 +444,8 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
     // This is where we implement support for the valid type conversions.
     // TODO: implement remaining type conversions
     VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
-    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
+    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType
+        || DecimalType.isByteArrayDecimalType(column.dataType())) {
Also, please correct the indentation.
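For context on the predicate used above: DecimalType.isByteArrayDecimalType is true for decimals too wide to fit in a long (precision above 18), as the small sketch below illustrates (assuming the method is accessible from the calling code; inside Spark's own source tree it is).

```scala
import org.apache.spark.sql.types.DecimalType

// Decimals wider than 18 digits cannot be packed into a long, so they use a
// byte-array representation and go through the byte-array conversion paths
// in the vectorized reader.
DecimalType.isByteArrayDecimalType(DecimalType(25, 2)) // true: byte-array-backed decimal
DecimalType.isByteArrayDecimalType(DecimalType(10, 2)) // false: fits in an int/long-backed decimal
```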
Test build #87335 has finished for PR 20580.
LGTM. Since this is a regression and a blocker for the Spark 2.3 release, I am merging it now. Please submit a follow-up PR to add the tests. Thanks!
[SPARK-23388][SQL] Support for Parquet Binary DecimalType in VectorizedColumnReader

## What changes were proposed in this pull request?

Re-add support for parquet binary DecimalType in VectorizedColumnReader

## How was this patch tested?

Existing test suite

Author: James Thompson <jamesthomp@users.noreply.github.com>

Closes #20580 from jamesthomp/jt/add-back-binary-decimal.

(cherry picked from commit 5bb1141)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
Re-add support for parquet binary DecimalType in VectorizedColumnReader
How was this patch tested?
Existing test suite