Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: serde for segment_meta_cstore #9102

Merged
merged 4 commits into from
Mar 10, 2023

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Feb 24, 2023

This PR builds upon #7898
Fixes #9232

Adds segment_meta_cstore::from_iobuf/to_iobuf via serde of the sub-objects.
most of the work is done via serde_fields and serde::envelope, but some manual work is done via serde_write/serde_read, to support std::list and absl::btree_map

std::array is added to the list of containers recognized by serde. serialization is done like other sequence containers, as size,[elements...]. Deserialization first checks the serialized size against std::tuple_size<std_array_t> and throws in case of mismatch. this is done to catch bugs where the size of the std::array is changed in code, but the writer didn't add decoding logic for older binary data. it's not meant to be a recoverable error.

a new unit test checks round-trip serialization/deserialization

a benchmark check serialization/deserialization against a memcpy baseline

test iterations median mad min max allocs tasks inst
cstore_bench.column_store_serialize_baseline 1553 92.628us 64.712ns 92.469us 92.744us 38.000 0.000 0.0
cstore_bench.column_store_serialize_result 217 299.284us 449.493ns 298.768us 299.870us 7644.059 0.000 0.0
cstore_bench.column_store_deserialize_baseline 1043 272.181us 11.634ns 271.952us 272.606us 2507.000 0.000 0.0
cstore_bench.column_store_deserialize_result 215 36.598us 25.693ns 36.521us 36.624us 283.864 0.000 0.0

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

  • none

@andijcr andijcr force-pushed the feat/manifest-compression/serde branch from a345a36 to b4e8c4d Compare February 28, 2023 18:50
@andijcr andijcr changed the title Draft: manifest-compression serde addition cloud_storage: serde for segment_meta_cstore Mar 3, 2023
@andijcr andijcr force-pushed the feat/manifest-compression/serde branch from f75c1d8 to 71ae315 Compare March 3, 2023 11:47
@andijcr andijcr marked this pull request as ready for review March 3, 2023 11:48
@andijcr andijcr added the area/cloud-storage Shadow indexing subsystem label Mar 3, 2023
@andijcr andijcr requested review from Lazin, jcsp and graphcareful March 3, 2023 11:49
@@ -18,6 +18,7 @@
#include <seastar/core/circular_buffer.hh>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde/serde.h: added support for std::array as size,elements (+ throw)

what does + throw mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that it will throw a serde_exception if the serialized size does not match the std::array size

@@ -287,6 +288,9 @@ void write(iobuf& out, T t);
template<typename T>
void write(iobuf& out, std::vector<T> t);

template<typename T, size_t Size>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::size_t is used in the concept, but size_t here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -31,30 +31,33 @@ static constexpr uint32_t FOR_buffer_depth = 16;
*/
struct delta_xor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with a nttp value

what is nttp?

Copy link
Contributor Author

@andijcr andijcr Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acronym for Non-Type Template Parameter. It's a really bad name for a template parameter that is a value and not a type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, of course. thanks

@@ -713,6 +729,20 @@ void read_nested(iobuf_parser& in, T& t, std::size_t const bytes_left_limit) {
for (auto i = 0U; i < size; ++i) {
t.push_back(read_nested<value_type>(in, bytes_left_limit));
}
} else if constexpr (reflection::is_std_array<Type>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the implication of this? Curious i don't know myself but would the compiler generate a new branch of generated code here once per every type of std::array with a different size parameter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but would the compiler generate a new branch of generated code here once per every type of std::array with a different size parameter?

yeh, that's the way i understand it. i think that alone is a good argument for not supporting std::array. OTOH, we probably have very few instances of it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should isntead support std::span<T, std::dynamic_extent>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably ok unless we have any good alternative ideas. the most important thing is that we keep an on-disk format that doesn't constrain us much. and i think we have here: its the same as vector, for exmaple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes std::span seems a good idea. let me try

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://gcc.godbolt.org/z/sf3GoeoGj did a quick test. gcc will produce reasonable code, but clang needs attribute((noinline)).
If it's ok i can use this code.
another option is to only add support for std::span in serde, and i can use serde_write/serde_read to convert std::array to std::span.
let me know if I should go with the first or the second option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i'd vote to keep it as it is

Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: benchmark test for the serialisation will be very helpful going forward, otherwise LGTM

@andijcr
Copy link
Contributor Author

andijcr commented Mar 8, 2023

nit: benchmark test for the serialisation will be very helpful going forward, otherwise LGTM

Since I have to fix conflicts and modify how std::array is serialized, I'll also add a benchmark test

andijcr added 4 commits March 9, 2023 11:15
serialization is done as std::vector,
deserialization will check that size matches the std::tuple_size<std_array_t>,
and will throw if different.
This is done as a safety measure in case of a size change without a change in decoding logic of old binary data.
It's meant to catch bugs during testing, not as a recoverable error.
this in done in a retro-compatible way, so as to not touch preexisting run-time use.
delta_for_encoder can now be used with a runtime version of delta_* algo
or with a nttp value.
at the same time, delta_for_encoder and delta_stream_pos_t are
serde-enabled, and serde is aware if delta_* is nttp or not, reducing
slightly the amount of data used
serde is done naively via serde::envelope. std::list and absl::btree_map
are not serde enabled, they are processed in their containing class via
serde_write/read mirroring what it's done for other sequence/map types
(size,[elements...]/[(key,value)...]).
the nttp version of delta_* algorithm is used to remove the need to
serde it. this has a cascading effect of removing the need for some user
defined constructors.
column_store::hint_vec_t is transformed from a std::tuple to a
std::array since all the elements of the tuple were of type hint_t.
memory and performance wise there should be no difference, but it's
compatible with serde.

a unit test is added that serializes and deserializes a
segment_meta_cstore and checks the elements
@andijcr andijcr force-pushed the feat/manifest-compression/serde branch from 71ae315 to ee99486 Compare March 9, 2023 14:21
@andijcr
Copy link
Contributor Author

andijcr commented Mar 9, 2023

force push: rebase and benchmark for serialization.

@andijcr andijcr requested a review from Lazin March 9, 2023 14:31
@andijcr
Copy link
Contributor Author

andijcr commented Mar 10, 2023

Copy link
Contributor

@Lazin Lazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Lazin Lazin merged commit 27b2dc7 into redpanda-data:dev Mar 10, 2023
@andijcr andijcr deleted the feat/manifest-compression/serde branch March 10, 2023 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud_storage: serialization for segment_meta_cstore
4 participants