Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support to read binary as string in parquet #10399

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kevincmchen
Copy link
Contributor

Summary:

Some other Parquet-producing systems, in particular Impala, will write string-type data as binary into Parquet file. and Velox's ParquetReader does not support binaryAsString. this PR is intended to resolve this issue.

issue resolved: [#10398]

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 4, 2024
Copy link

netlify bot commented Jul 4, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit c796f5d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/668d03bdeec77e00081a70b1

@@ -655,12 +653,17 @@ TypePtr ReaderBase::convertType(
return DOUBLE();
case thrift::Type::type::BYTE_ARRAY:
case thrift::Type::type::FIXED_LEN_BYTE_ARRAY:
if (binaryAsString) {
return VARCHAR();
if (options_.fileSchema()->containsChild(schemaElement.name)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be

auto requestedType = options_.requestedType() ? options_.requestedType(): schema();

@@ -655,12 +653,18 @@ TypePtr ReaderBase::convertType(
return DOUBLE();
case thrift::Type::type::BYTE_ARRAY:
case thrift::Type::type::FIXED_LEN_BYTE_ARRAY:
if (binaryAsString) {
return VARCHAR();
if (options_.fileSchema() != nullPtr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to add requestedType to getParquetColumnInfo to handle nested VARBINARY as well

const TypePtr& ReaderBase::getChildRequestedSchema(
const TypePtr& requestedSchema,
std::string& name) const {
if (!requestedSchema && requestedSchema->isRow() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!requestedSchema) {
  VELOX_CHECK(requestedSchema->isRow() && requestedSchema->asRow().containsChild(name));
} else {

TypePtr ReaderBase::convertType(
const thrift::SchemaElement& schemaElement) const {
const thrift::SchemaElement& schemaElement,
const TypePtr& requestedSchema) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is requestedType not requestedSchema

if (binaryAsString) {
return VARCHAR();
case thrift::Type::type::FIXED_LEN_BYTE_ARRAY: {
if (requestedSchema != nullptr && requestedSchema->isRow() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we don't need these

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants