iceberg: initial data structures for logical data types #21415

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

andrwng
Contributor

@andrwng andrwng commented Jul 15, 2024

Adds initial logical types[1] that will be used to represent an Iceberg schema. These include "primitives" (basic types like int, float, string, etc) as well as complex types (types that are composed of other types, like list, map, struct). Complex types are represented in Iceberg as composed of "nested fields" which are types that are themselves composed of other types (primitives or complex types).

This PR introduces the recursive definition of these types, and initial code to serialize them to and from JSON (Iceberg manifests include table schemas represented as JSON). The implementation of these types is similar in style to the Iceberg Rust library, though we are using std::variant instead of Rust enums, since C++ enums aren't expressive enough to represent complex nested enums with members.
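As a rough illustration of the recursive definition described above, here is a minimal sketch. It is not the code from this PR: the trimmed set of primitives, the `make_field` helper, and the `count_nested_fields` function are hypothetical names chosen for the example. It only assumes the general idea from the description, that complex types hold "nested fields" and that a `std::variant` ties the alternatives together.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, trimmed-down primitives; the Iceberg spec defines many more.
struct boolean_type {};
struct int_type {};
struct string_type {};
using primitive_type = std::variant<boolean_type, int_type, string_type>;

// nested_field is only forward-declared at this point, so complex types
// hold it through a pointer. That indirection lets the definition recurse.
struct nested_field;
using nested_field_ptr = std::unique_ptr<nested_field>;

struct struct_type {
    std::vector<nested_field_ptr> fields;
};
struct list_type {
    nested_field_ptr element_field;
};

using field_type = std::variant<primitive_type, struct_type, list_type>;

struct nested_field {
    int id;
    std::string name;
    bool required;
    field_type type;
};

// Hypothetical convenience helper for building fields.
nested_field_ptr
make_field(int id, std::string name, bool required, field_type type) {
    return std::make_unique<nested_field>(
      nested_field{id, std::move(name), required, std::move(type)});
}

// Counts every nested field reachable from a type, walking the variant.
std::size_t count_nested_fields(const field_type& type) {
    return std::visit(
      [](const auto& t) -> std::size_t {
          using T = std::decay_t<decltype(t)>;
          if constexpr (std::is_same_v<T, struct_type>) {
              std::size_t n = 0;
              for (const auto& f : t.fields) {
                  n += 1 + count_nested_fields(f->type);
              }
              return n;
          } else if constexpr (std::is_same_v<T, list_type>) {
              return 1 + count_nested_fields(t.element_field->type);
          } else {
              return 0; // primitives contain no nested fields
          }
      },
      type);
}
```

A struct with an `int` field and a list-of-strings field would contain three nested fields under this sketch: the two struct members plus the list's element field.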

For reference, here are implementations in other languages:

[1] https://iceberg.apache.org/spec/#schemas-and-data-types

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andrwng andrwng force-pushed the iceberg-datatypes branch from 77d6336 to 66cb6c2 Compare July 16, 2024 01:57
@andrwng andrwng requested review from dotnwat and jcipar July 16, 2024 02:23
@andrwng andrwng marked this pull request as ready for review July 16, 2024 05:59
@vbotbuildovich
Collaborator

vbotbuildovich commented Jul 16, 2024

@andrwng andrwng force-pushed the iceberg-datatypes branch 4 times, most recently from a1460f6 to 6c1b97f Compare July 16, 2024 19:24
jcipar
jcipar previously approved these changes Jul 16, 2024
Contributor

@jcipar jcipar left a comment

Awesome! This looks really good.

I left a couple minor comments, but otherwise this looks ready to merge.

src/v/iceberg/datatypes.cc Outdated Show resolved Hide resolved
}
bool operator==(const nested_field& lhs, const nested_field& rhs) {
    return lhs.id == rhs.id && lhs.required == rhs.required
           && lhs.name == rhs.name && lhs.type == rhs.type;
Contributor

Do we need to require the names be equal? What is this used for?

Contributor Author

For now I'm just using these to ensure my serialization is correct.

You're right, though, that schema type equivalence may have different criteria. I'd argue that if we want equivalence of just the types, it should be a dedicated comparison method rather than operator==.
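To make the distinction in this thread concrete, here is a minimal sketch of keeping a strict `operator==` alongside a separate, looser comparison. The `simple_field` struct, the `primitive` enum, and `types_equivalent` are hypothetical and are not part of this PR. Which members a real equivalence check should ignore is an assumption here (the sketch ignores only the name).

```cpp
#include <string>

// Hypothetical flat stand-in for a nested field with a primitive type.
enum class primitive { boolean, int32, string };

struct simple_field {
    int id;
    std::string name;
    bool required;
    primitive type;
};

// Strict equality: every member must match. Useful for checking that a
// serialization round trip reproduces exactly what went in.
bool operator==(const simple_field& lhs, const simple_field& rhs) {
    return lhs.id == rhs.id && lhs.required == rhs.required
           && lhs.name == rhs.name && lhs.type == rhs.type;
}

// Looser, explicitly named check. ASSUMPTION: equivalence ignores the
// field name; the actual criteria were left open in the review.
bool types_equivalent(const simple_field& lhs, const simple_field& rhs) {
    return lhs.id == rhs.id && lhs.required == rhs.required
           && lhs.type == rhs.type;
}
```

Giving the looser check its own name keeps call sites explicit about which notion of sameness they rely on, which is the design point raised in the reply above.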

src/v/iceberg/datatypes_json.cc Outdated Show resolved Hide resolved
src/v/iceberg/datatypes.cc Show resolved Hide resolved
src/v/iceberg/datatypes.cc Outdated Show resolved Hide resolved
andrwng added 2 commits July 16, 2024 19:45
Adds initial logical types[1] that will be used to represent an Iceberg
schema. These include "primitives" (basic types like int, float, string,
etc) as well as complex types (types that are composed of other types,
like list, map, struct). Complex types are represented in Iceberg as
composed of "nested fields" which are types that are themselves composed
of other types (primitives or complex types).

This PR introduces the recursive definition of these types, with a basic
equality operator and ostream operator (note, this is for JSON
serialization, that will come in a following commit).

The implementation of these types is similar in style to the Iceberg
Rust library, though we are using std::variants instead of Rust enums,
since C++ enums aren't expressive enough to express complex nested enums
with members.

Adds some helpers for operating on the json::Value class, that will be
useful in building JSON parsing for Iceberg metadata types.

Long term it may make sense to move these to some general purpose
module, but since this is only used in Iceberg for now, and to avoid
distractions, the utilities are just added to the Iceberg module.
@andrwng andrwng force-pushed the iceberg-datatypes branch from 70d5400 to 93b1ffd Compare July 17, 2024 02:45
@andrwng andrwng requested a review from dotnwat July 17, 2024 02:46
dotnwat
dotnwat previously approved these changes Jul 17, 2024
Member

@dotnwat dotnwat left a comment

lgtm

src/v/iceberg/datatypes_json.cc Show resolved Hide resolved
src/v/iceberg/datatypes_json.cc Outdated Show resolved Hide resolved
Comment on lines +124 to +126
      .match("binary", binary_type{});
}
if (!v.IsObject()) {
Member

handle the default match case with an exception?

Contributor Author

This already appears to be handled?

BOOST_CHECK_EXCEPTION(
  string_switch<int8_t>("ccc").match("a", 0).match("b", 1).
  operator int8_t(),
  std::runtime_error,
  [](const std::runtime_error& e) {
      // check that the error string includes the string we were searching
      // for as a weak hint to where the error occurred
      if (!std::string(e.what()).ends_with("ccc")) {
          BOOST_TEST_FAIL(
            "Expected error message to end with ccc but was: " << e.what());
      };
      return true;
  });

Member

ahh i thought it just fell through or something. sgtm
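For readers unfamiliar with the behavior discussed in this thread, the following is a minimal sketch of a string switch whose conversion throws when nothing matched. It is not Redpanda's `string_switch`: the class name `string_switch_sketch` and the error message wording are assumptions made for illustration. The only behavior taken from the quoted test is that a `std::runtime_error` is thrown and that its message ends with the unmatched input.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch, not the real string_switch implementation.
template<typename T>
class string_switch_sketch {
public:
    explicit string_switch_sketch(std::string_view input)
      : _input(input) {}

    // Records the value of the first candidate that equals the input.
    string_switch_sketch& match(std::string_view candidate, T value) {
        if (!_result && candidate == _input) {
            _result = std::move(value);
        }
        return *this;
    }

    // Converting to T is the "default case": no match means an exception
    // rather than silently falling through.
    operator T() const {
        if (!_result) {
            throw std::runtime_error(
              "no match for input: " + std::string(_input));
        }
        return *_result;
    }

private:
    std::string_view _input;
    std::optional<T> _result;
};
```

With this sketch, `int v = string_switch_sketch<int>("b").match("a", 0).match("b", 1);` yields 1, while the same chain with input `"ccc"` throws on conversion.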

@emaxerrno
Contributor

Very cool

Adds JSON serialization for newly added Iceberg data types.

A test is added to demonstrate the roundtrip with a type comprised of
multiple nested fields.
@vbotbuildovich
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/51624#0190bf88-4433-4b6d-b663-2861e98c87c9:

"rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition"

@andrwng
Contributor Author

andrwng commented Jul 17, 2024

CI failure: #20224

@dotnwat dotnwat merged commit e539d40 into redpanda-data:dev Jul 17, 2024
16 of 19 checks passed