
Migrate from JSON to Protobuf+Snappy format for index cache #1013

Closed
wants to merge 21 commits

Conversation

@GiedriusS (Member) commented Apr 5, 2019

Saving the index cache is very, very painful: it does a lot of reflection underneath, balloons RAM consumption, and takes much more time. Switch seamlessly to a binary format by saving new index caches as Protobuf with Snappy encoding on top. Migrate existing JSON index caches over to the new format during compaction.
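As a rough illustration of the intended write path (not the PR's actual code), the sketch below Snappy-compresses a serialized cache payload before writing it to disk. `encoding/json` stands in for the generated Protobuf message's `Marshal`, and the type and file name are made up:

```go
package main

import (
	"encoding/json"
	"log"
	"os"

	"github.com/golang/snappy"
)

// indexCache is a hypothetical, simplified stand-in for the generated Protobuf message.
type indexCache struct {
	Version     int                 `json:"version"`
	Symbols     map[uint32]string   `json:"symbols"`
	LabelValues map[string][]string `json:"label_values"`
}

func main() {
	c := indexCache{
		Version: 2,
		Symbols: map[uint32]string{0: "__name__", 1: "up"},
	}
	raw, err := json.Marshal(&c) // real code would use proto.Marshal on the generated message
	if err != nil {
		log.Fatal(err)
	}
	// Snappy block format: passing nil lets Encode allocate the destination buffer.
	compressed := snappy.Encode(nil, raw)
	if err := os.WriteFile("index.cache.dat", compressed, 0o644); err != nil {
		log.Fatal(err)
	}
}
```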

Changes

  • Index cache operations were broken out into a separate interface
  • Compactor converts JSON index caches into the Protobuf format + Snappy on top (TBD)
  • Benchmarks for the JSON index cache operations vs. Protobuf/Snappy index cache operations (TBD)

Verification

Go tests verify that conversion between JSON and binary works and that no data is lost.
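A sketch of what such a round-trip check could look like (types, package name, and the JSON encoding are stand-ins, not the PR's actual test code):

```go
package storecache_test // hypothetical package name

import (
	"encoding/json"
	"reflect"
	"testing"

	"github.com/golang/snappy"
)

// indexCache is a simplified stand-in for the real cache structure.
type indexCache struct {
	Version int               `json:"version"`
	Symbols map[uint32]string `json:"symbols"`
}

func TestIndexCacheRoundTrip(t *testing.T) {
	orig := indexCache{Version: 2, Symbols: map[uint32]string{0: "__name__", 1: "up"}}

	// Serialize (JSON as a stand-in for Protobuf) and compress.
	raw, err := json.Marshal(&orig)
	if err != nil {
		t.Fatal(err)
	}
	compressed := snappy.Encode(nil, raw)

	// Decompress and deserialize, then check nothing was lost.
	decompressed, err := snappy.Decode(nil, compressed)
	if err != nil {
		t.Fatal(err)
	}
	var got indexCache
	if err := json.Unmarshal(decompressed, &got); err != nil {
		t.Fatal(err)
	}
	if !reflect.DeepEqual(orig, got) {
		t.Fatalf("round-trip mismatch: %+v vs %+v", orig, got)
	}
}
```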

@bwplotka (Member) commented Apr 5, 2019

Can we discuss first why FlatBuffers and not Protobuf? (:

@bwplotka (Member) commented Apr 5, 2019

Essentially, why add another dep?

@GiedriusS (Member Author)

IMO FlatBuffers should be used as it allows random reads: you do not have to read everything into memory and parse it before accessing the fields. https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html is a nice comparison from one of the (former?) core developers of Protobuf at Google. @bwplotka, what's your opinion on this topic? I thought it was already decided.

@GiedriusS (Member Author)

However, Protobuf supports map types, so it would integrate into our codebase more nicely. Plus, we read and write the whole thing at once, so perhaps it would indeed be nicer to stay with Protobuf?
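For illustration, roughly the Go-side shape that map fields would give us (made-up names, not the PR's actual generated code):

```go
// Package cachepb sketches, with hypothetical names, what a map-based
// index-cache message could look like on the Go side after code generation.
package cachepb

// Values holds the sorted values observed for one label name.
type Values struct {
	Values []string
}

// Ranges holds the start/end offsets of a postings entry in the index.
type Ranges struct {
	Start, End int64
}

// IndexCache mirrors the existing JSON cache layout using map fields.
type IndexCache struct {
	Version     int32
	Symbols     map[uint32]string  // symbol reference -> symbol
	LabelValues map[string]*Values // label name -> values
	Postings    map[string]*Ranges // "name=value" -> offsets
}
```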

@bwplotka (Member) commented Apr 6, 2019

I think it was never decided, at least I'm not aware of it. The task was "move to some flatbuffer/protobuf". Anyway, I don't have a strong opinion, but I would prefer consistency and reusing existing deps if possible. But happy to discuss. I know protobufs well, but I am a newbie in flatbuffers (:

Also, I remember @fabxc had this discussion before when designing the TSDB index, and he decided NOT to use any of those - just a raw binary format with varints (: Maybe I can dig out this discussion from somewhere so we can learn from their arguments. (:
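For reference, a tiny self-contained example of the varint point: small integers take fewer bytes than a fixed-width encoding would.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	for _, v := range []uint64{1, 300, 1 << 20} {
		// PutUvarint writes v using 1-10 bytes depending on its magnitude.
		n := binary.PutUvarint(buf, v)
		fmt.Printf("%7d encodes to %d byte(s)\n", v, n)
	}
}
```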

@bwplotka (Member) commented Apr 6, 2019

Also.. if you look at the index cache, it's actually 1:1 to the TSDB index, just with series and postings removed. ;>

So maybe it's as easy as stripping down the TSDB index? Imagine how fast computing this would be. Plus we could (partially) reuse exactly the same readers/writers.

@povilasv (Member) commented Apr 9, 2019

It would be great to have some benchmarks comparing prior vs. current, as right now we have no idea whether it helped.

@fabxc (Collaborator) commented Apr 13, 2019

Regarding why TSDB uses a custom format:

  • Protobuf has no random access, compresses integers reasonably well (but not sequence-aware)
  • Flatbuf has random access, doesn't really compress
  • Custom format allows us to make tradeoffs between compression and random access as we need it

In hindsight it would've been a good idea to only use a custom format for the hot parts of the index and use flatbuffers for any scaffolding (like the TOC).

@GiedriusS GiedriusS changed the title Migrate from JSON to Flatbuf for index cache Migrate from JSON to the TSDB format for index cache Apr 13, 2019
@GiedriusS (Member Author)

It seems like it will be impossible to use the TSDB format properly for this use case, since Writer.WritePostings checks the series offsets and sorts them again before writing out. Thus, the only way to write "proper" postings is to add the series data, which means extra information in the index cache that we absolutely do not need. If we were to use empty index postings, we would lose the Start/End values - they would only differ by 4 due to padding, as these examples show. So the only way out of this situation is to write out the series data, which we do not want to do.

Also, the TSDB code serializes all of the data into pure big-endian numbers; AFAICT no compression is supported in the code paths we are interested in. From the simple test cases that I have made, I can tell that the index cache files do not differ much in size:

$ wc -c index*
     276 index.cache.dat
     397 index.cache.fresh.dat
     415 index.cache.json
    1088 total

Thus, it seems that the best way forward is to reuse Protobuf for the index cache files with Snappy compression. Protobuf is fine here since the index cache for us is "all-or-nothing" - we do not care about any streaming features that other encoding formats like FlatBuffers provide. Also, we could reuse the Protobuf knowledge/code that we already have.

Thoughts, @fabxc @bwplotka? Perhaps I missed something.
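To illustrate the point above about plain big-endian serialization (a stand-alone sketch, not the TSDB writer code): every value occupies its full fixed width regardless of magnitude, so small numbers get no benefit without an extra compression layer such as Snappy.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	for _, v := range []uint64{1, 300, 1 << 20} {
		// Each uint64 takes exactly 8 bytes in a fixed-width big-endian encoding.
		if err := binary.Write(&buf, binary.BigEndian, v); err != nil {
			panic(err)
		}
	}
	fmt.Println("total bytes:", buf.Len()) // 24, vs. 6 with uvarints for the same values
}
```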

@GiedriusS GiedriusS changed the title Migrate from JSON to the TSDB format for index cache Migrate from JSON to Protobuf+Snappy format for index cache May 13, 2019
@bwplotka (Member) commented Jun 6, 2019

cc my TSDB index superhero @gouthamve friend (:

I need to dive into why reusing the index does not work, but I get why you don't see any size improvement. What if we compressed it?

I would think protobuf + snappy is the solution in this case, but we were just talking with @gouthamve on IRC about some improvements to the index itself; I wonder if there is something we can improve in the index cache (Table, Symbols, Posting starts).

@bwplotka (Member)

I looked it through and I think we should actually try the binary format, due to memory mapping (mostly for symbols) and quick load time. Producing such a binary format can also be quicker, as we could just do a straight copy of certain elements from the index itself.

Still, loading those on demand might be tricky due to the current LabelValue implementation.
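A minimal sketch of the memory-mapping idea (Unix-only, with a made-up file name and offsets; not the actual implementation): the file is mapped read-only and entries are read at known offsets, so nothing has to be deserialized up front.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("index-header.bin") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}
	// Map the file read-only; the OS pages data in lazily as it is touched.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	// Example: read a big-endian length prefix at a known (made-up) offset
	// instead of parsing the whole file.
	if len(data) >= 20 {
		fmt.Println("entry length:", binary.BigEndian.Uint32(data[16:20]))
	}
}
```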

@pracucci (Contributor) commented Dec 11, 2019

See also the proposal for moving to index-header binary format (#1839).

@bwplotka (Member) commented Jan 6, 2020

#1943 replaces this PR as agreed with @GiedriusS

@bwplotka bwplotka closed this Jan 6, 2020