Migrate from JSON to Protobuf+Snappy format for index cache #1013
Conversation
Can we discuss first why FlatBuffers and not Protocol Buffers? (: |
Essentially, why add another dependency? |
IMO FlatBuffers should be used, as they allow random reads: you do not have to read everything into memory and parse it before accessing the fields. https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html is a nice comparison from one of the (former?) core developers of protobufs at Google. @bwplotka what's your opinion on this topic? I thought it was already decided |
However, protobufs support map types, so they would integrate into our codebase in a nicer way. Plus, we read and write the cache all at once, so perhaps it would indeed be nicer to stay with protobufs? |
I think it was never decided, at least I'm not aware of it. This task was "move to some flatbuffer/protobuf". Anyway, I don't have a strong opinion, but I would prefer consistency and reusing existing deps if possible. But happy to discuss. I know protobufs well, but I am a newbie in flatbuffers (: Also, I remember @fabxc had this discussion before when designing the TSDB index, and he decided to NOT use any of those - just a raw binary format with varints (: Maybe I can dig out this discussion from somewhere so we can learn from their arguments. (: |
Also... if you look at the index cache, it's actually 1:1 to the TSDB index, just with series and postings removed. ;> So maybe it's as easy as stripping out the TSDB index? Imagine how fast computing this would be. Plus, we could (partially) reuse exactly the same readers/writers. |
It would be great to have benchmarks of the previous vs. the current approach, as right now we have no idea whether it helped. |
Regarding why TSDB uses a custom format:
In hindsight it would've been a good idea to only use a custom format for the hot parts of the index and use flatbuffers for any scaffolding (like the TOC). |
It seems like it will be impossible to use the TSDB format properly for this use-case. Also, the TSDB code serializes all of the data as pure big-endian numbers; AFAICT no compression is supported in the code paths that we are interested in. From the simple test cases that I have made, I can tell that the index cache files do not differ much in size.
Thus, it seems that the best way forward is to reuse protobufs for the index cache files, with Snappy compression on top. Protobufs are fine here because the index cache for us is "all-or-nothing" - we do not care about any of the streaming features that other encoding formats like FlatBuffers provide. Also, we could reuse the protobuf knowledge/code that we already have. |
cc my TSDB index superhero friend @gouthamve (: I need to dive into why reusing the index does not work, but I get why you don't see any size improvement. What if we compressed it? In that case, protobuf + snappy seems like the solution here, but we were just talking with @gouthamve on IRC about some improvements to the index itself; I wonder if there is something we can improve in the index-cache (Table, Symbols, Posting starts). |
I looked through it and I think we should actually try the binary format, due to mmapping (mostly for symbols) and quick load time. Producing such a binary format could also be quicker, as we could do a straight copy of certain elements from the index itself. Still, loading those on demand might be tricky due to the current LabelValue implementation. |
See also the proposal for moving to index-header binary format (#1839). |
#1943 replaces this PR as agreed with @GiedriusS |
Saving the index cache is very, very painful since it does a lot of reflection underneath, balloons RAM consumption, and takes much more time. Switch to a binary format seamlessly by saving new index caches in a binary Protobuf format with Snappy encoding on top. Migrate them over to the TSDB format during compaction.
Changes
Verification
go test
tests that verify conversion between JSON and binary works, and that no data is lost.