Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(arrow-csv): support encoding of binary in CSV writer #5782

Merged
merged 1 commit into from
May 20, 2024

Conversation

hiltontj
Copy link
Contributor

Allows for writing binary (Binary, LargeBinary, and FixedSizeBinary) to CSV. Note: FixedSizeBinary was already being supported in this way.

Values are encoded as HEX, by using the default Arrow formatter.

A test was added that accounts for null values when encoding all three binary types in CSV.

Which issue does this PR close?

Closes #3921.

Rationale for this change

Currently, FixedSizeBinary is encoded in CSV (and JSON) as HEX (see #3291 (comment)). Therefore, it seemed reasonable to lift the restriction that was preventing writing Binary and LargeBinary arrays.

What changes are included in this PR?

The condition that checks if the datatype on the array is Binary or LargeBinary was removed, and a test was added to check writing to CSV of the three binary types: Binary, LargeBinary, and FixedSizeBinary. The test also checks for null values.

Are there any user-facing changes?

This will change the behaviour of the writer for users, but should not be a breaking change, since previously, an error was being thrown when attempting to write these data types to CSV.

We may need to document that such types are encoded as HEX in the CSV writer, as they are with the standard Arrow formatter.

Allows for writing binary (Binary, LargeBinary, and FixedSizeBinary) to
CSV. Note: FixedSizeBinary was already being supported in this way.

Values are encoded as HEX, by using the default Arrow formatter.

A test was added that accounts for null values when encoding all three
binary types in CSV.
@github-actions github-actions bot added the arrow Changes to the arrow crate label May 17, 2024
@hiltontj
Copy link
Contributor Author

It is worth noting the discussion here: #3292 where, at the time, I think there was some contention about encoding binary in CSV based on other Arrow implementations (pyarrow).

@hiltontj hiltontj changed the title feat: support encoding of binary in CSV writer feat(arrow-csv): support encoding of binary in CSV writer May 18, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, we can always leave the door open to encoding binary data as base64 (or some other encoding) down the line. Whilst innefficient, hex encoding should avoid the issues in #3292 where CSV doesn't really define how escaping should behave

@tustvold tustvold merged commit c498eb7 into apache:master May 20, 2024
23 checks passed
@hiltontj hiltontj deleted the csv-binary branch May 20, 2024 13:25
@hiltontj
Copy link
Contributor Author

@tustvold - I linked this to the incorrect issue. This should have closed #3291. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants