iceberg: utils for operating on avro #21493

Merged: 3 commits into redpanda-data:dev from iceberg-avro-utils, Jul 18, 2024

@andrwng andrwng commented Jul 17, 2024

Pulls out some utilities that were previously used in tests. We will need to be able to read and write Avro files (this is the format manifests and manifest lists are in).

Also makes some adjustments to be able to pass around the iobuf when using Avro's reader/writer classes, and adds a test demonstrating how this will be done. This test also exercises getting metadata from the Avro header, which is where e.g. the Iceberg schema will be stored in manifests[1].

[1] https://iceberg.apache.org/spec/#manifests
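
As a rough illustration of what "getting metadata from the Avro header" involves, here is a minimal, standalone sketch that decodes the key/value metadata map from an Avro object container file header (magic bytes, metadata map, 16-byte sync marker). It does not use the Avro C++ library or any code from this PR; `read_long`, `read_bytes`, and `read_header_metadata` are hypothetical helper names, and a real caller would presumably go through `DataFileReader` instead.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Decodes one Avro "long": a little-endian base-128 varint, zigzag-encoded.
// Consumes the bytes it reads from `in`.
std::optional<std::int64_t> read_long(std::string_view& in) {
    std::uint64_t raw = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (in.empty()) {
            return std::nullopt;
        }
        const auto b = static_cast<std::uint8_t>(in.front());
        in.remove_prefix(1);
        raw |= static_cast<std::uint64_t>(b & 0x7f) << shift;
        if ((b & 0x80) == 0) {
            // Undo the zigzag encoding.
            return static_cast<std::int64_t>(raw >> 1)
                   ^ -static_cast<std::int64_t>(raw & 1);
        }
    }
    return std::nullopt; // varint too long
}

// Decodes Avro "bytes" or "string": a long length followed by that many bytes.
std::optional<std::string> read_bytes(std::string_view& in) {
    const auto len = read_long(in);
    if (!len || *len < 0 || static_cast<std::uint64_t>(*len) > in.size()) {
        return std::nullopt;
    }
    const auto n = static_cast<std::size_t>(*len);
    std::string out(in.substr(0, n));
    in.remove_prefix(n);
    return out;
}

// Parses the container header and returns its metadata map, or nullopt if
// the input is not a well-formed header.
std::optional<std::map<std::string, std::string>>
read_header_metadata(std::string_view in) {
    constexpr std::string_view magic("Obj\x01", 4);
    if (in.substr(0, magic.size()) != magic) {
        return std::nullopt;
    }
    assert(in.size() >= magic.size());
    in.remove_prefix(magic.size());

    std::map<std::string, std::string> meta;
    while (true) {
        auto count = read_long(in);
        if (!count || *count == std::numeric_limits<std::int64_t>::min()) {
            return std::nullopt;
        }
        if (*count == 0) {
            break; // a zero-count block terminates the map
        }
        if (*count < 0) {
            // A negative count is followed by the block's size in bytes.
            if (!read_long(in)) {
                return std::nullopt;
            }
            count = -*count;
        }
        for (std::int64_t i = 0; i < *count; ++i) {
            auto key = read_bytes(in);
            auto val = key ? read_bytes(in) : std::nullopt;
            if (!val) {
                return std::nullopt;
            }
            meta[*key] = *val;
        }
    }
    // The header ends with a 16-byte sync marker.
    if (in.size() < 16) {
        return std::nullopt;
    }
    return meta;
}
```

For a manifest, the returned map is where an entry such as the Iceberg schema would be looked up.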

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

andrwng added 2 commits July 17, 2024 15:09
A previous commit added a test that defined some Avro serialization
utilities. These will be useful outside of tests, since it's generally
common for other parts of Redpanda to use iobufs as the in-memory buffer
of choice.

When using a DataFile{Writer,Reader}, we'll need to be able to pass the
buffer used by the writer. This is difficult to do with the current
interface of avro_iobuf_ostream, which is std::moved into the writer.

So, this moves the buffer outside of the ostream, which makes it easier
for callers to manage the iobuf.

Note, it doesn't appear the trimming I was doing on release() was needed
after all.
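
The ownership change described in this commit message can be sketched with standard-library stand-ins. This is only an illustration under stated assumptions, not the PR's code: `byte_buffer` stands in for `iobuf`, `borrowed_ostream` for `avro_iobuf_ostream`, and `toy_writer` for a writer that takes ownership of its stream; all three names are hypothetical. The point is that the stream only borrows the buffer, so the caller can still read it after the stream has been moved into the writer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for iobuf: a growable byte buffer owned by the caller.
using byte_buffer = std::vector<std::uint8_t>;

// Hypothetical stand-in for avro_iobuf_ostream. It holds a non-owning pointer
// to the caller's buffer instead of owning a buffer itself.
class borrowed_ostream {
public:
    borrowed_ostream(std::size_t chunk_size, byte_buffer* buf)
      : chunk_size_(chunk_size)
      , buf_(buf) {
        assert(buf_ != nullptr && chunk_size_ > 0);
    }

    // Hands out a writable chunk appended to the end of the buffer.
    bool next(std::uint8_t** data, std::size_t* len) {
        const std::size_t old_size = buf_->size();
        buf_->resize(old_size + chunk_size_);
        *data = buf_->data() + old_size;
        *len = chunk_size_;
        return true;
    }

    // Gives back the unused tail of the most recently returned chunk.
    void backup(std::size_t len) { buf_->resize(buf_->size() - len); }

private:
    std::size_t chunk_size_;
    byte_buffer* buf_;
};

// Hypothetical writer that takes ownership of the stream.
class toy_writer {
public:
    explicit toy_writer(std::unique_ptr<borrowed_ostream> out)
      : out_(std::move(out)) {}

    void write(const std::string& s) {
        std::size_t written = 0;
        while (written < s.size()) {
            std::uint8_t* data = nullptr;
            std::size_t len = 0;
            out_->next(&data, &len);
            const std::size_t n = std::min(len, s.size() - written);
            std::memcpy(data, s.data() + written, n);
            out_->backup(len - n);
            written += n;
        }
    }

private:
    std::unique_ptr<borrowed_ostream> out_;
};

// The caller owns the buffer for the whole exchange.
byte_buffer write_example() {
    byte_buffer buf;
    {
        auto out = std::make_unique<borrowed_ostream>(4, &buf);
        toy_writer writer(std::move(out)); // the stream is gone from this scope
        writer.write("manifest");
    }
    return buf; // the bytes are still reachable through the caller's buffer
}
```

Had the stream owned the buffer, the bytes would have been unreachable once the stream was moved into the writer, which is the difficulty the commit message describes.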

Adds a simple test for serializing manifest files with a DataFileWriter
and reading them back with a DataFileReader. This will ultimately be
what we use to serialize manifests, since the DataFileWriter is what
will write additional metadata[1][2] (like the Iceberg schema).

[1] https://github.com/redpanda-data/avro/blob/1410e79f9df61669c2d52f6d0643e6c35156e615/lang/c%2B%2B/impl/DataFile.cc#L246-L252
[2] https://iceberg.apache.org/spec/#manifests
@andrwng andrwng force-pushed the iceberg-avro-utils branch from af65819 to 011cae1 Compare July 18, 2024 00:04
Comment on lines +32 to +33
iobuf buf;
auto out = std::make_unique<avro_iobuf_ostream>(4096, &buf);

nice

@dotnwat dotnwat merged commit 645911e into redpanda-data:dev Jul 18, 2024
19 checks passed