perf(decoder): avoid lots of calls to ensure data and field length/offset lookups #144
base: master
Conversation
There is no reason to compute the exact same field lengths and offsets at every iteration of these loops. At high scale, this results in tons of redundant lookups into the `LengthOffsetCache`.
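To illustrate the hoisting, here is a minimal, self-contained sketch; the reader and field types are hypothetical stand-ins, not the library's actual classes:

```java
// Minimal sketch of the hoisting pattern with stand-in types.
interface BitReader {
    int readBits16(int offset);
}

enum Field {
    START_OR_ONLY_VENDOR_ID(16);

    private final int length;

    Field(int length) {
        this.length = length;
    }

    // Stands in for the cached length lookup in the real decoder.
    int getLength(BitReader reader) {
        return length;
    }
}

final class HoistingSketch {
    static void decodeEntries(BitReader bbv, int offset, int numEntries) {
        // Look up the per-field length once, outside the loop, instead of
        // re-querying the cache on every iteration.
        final int vendorIdLength = Field.START_OR_ONLY_VENDOR_ID.getLength(bbv);
        for (int i = 0; i < numEntries; i++) {
            int vendorId = bbv.readBits16(offset);
            offset += vendorIdLength; // constant within the loop
            // ... decode the rest of the entry using vendorId ...
        }
    }
}
```

The decoded values are unchanged; only where the length lookup happens moves.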
Implements `readBits64(int)` with new unit tests, and uses all variants of the bit-reading methods to read into a bit set in larger chunks. Also adds a `resultShift` parameter that makes reading vendor consent much simpler: it just needs to be set to 1 instead of the default 0.
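A rough sketch of the chunked bit-set read with a `resultShift`, assuming an MSB-first `readBits64` and using a hypothetical reader interface rather than the library's own:

```java
import java.util.BitSet;

final class BitSetReadSketch {
    // Hypothetical 64-bit read; assumed to return the bits MSB-first,
    // starting at the given bit offset.
    interface BitReader64 {
        long readBits64(int offset);
    }

    // Reads `length` bits starting at `offset` into a BitSet, offsetting each
    // set index by `resultShift` (e.g. 1, so that 1-based vendor IDs map
    // directly to BitSet indices).
    static BitSet readBitSet(BitReader64 reader, int offset, int length, int resultShift) {
        BitSet result = new BitSet(length + resultShift);
        int i = 0;
        // Consume 64-bit chunks instead of reading one bit at a time.
        for (; i + 64 <= length; i += 64) {
            long chunk = reader.readBits64(offset + i);
            for (int b = 0; b < 64; b++) {
                if (((chunk >>> (63 - b)) & 1L) != 0) {
                    result.set(i + b + resultShift);
                }
            }
        }
        // A full implementation would fall back to smaller reads
        // (32/16/8/1 bits) for the remaining length - i bits; omitted here.
        return result;
    }
}
```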
Implements `readBits32(int)` with new unit tests, and eliminates half the reads for publisher restrictions by reading two fields at a time when possible. Also uses the new bit-set reading implementation for vendor consent. All tests pass.
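And a sketch of the paired read, again with a hypothetical reader interface; it assumes the two 16-bit fields sit in adjacent bit positions:

```java
final class PairedReadSketch {
    // Hypothetical reader interface mirroring the methods discussed above.
    interface BitReader32 {
        int readBits16(int offset);
        int readBits32(int offset);
    }

    // Before: two separate 16-bit reads, each with its own bounds check
    // and offset bookkeeping.
    static int[] readPairSeparately(BitReader32 bbv, int offset) {
        int first = bbv.readBits16(offset);
        int second = bbv.readBits16(offset + 16);
        return new int[] { first, second };
    }

    // After: one 32-bit read, split into its upper and lower 16-bit halves
    // (assuming the two fields are adjacent and MSB-first).
    static int[] readPairCombined(BitReader32 bbv, int offset) {
        int content = bbv.readBits32(offset);
        int first = content >>> 16;     // upper 16 bits
        int second = content & 0xFFFF;  // lower 16 bits
        return new int[] { first, second };
    }
}
```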
Tagging some common contributors to help review: @laktech @imayankmishra @iabmayank @srinivas81
thanks! i'll be able to review in a few days.
I also have another change on top of this where we make sure to use primitive
@laktech As promised, I've made a couple more PRs on top of this to test even further optimizations:
Let me know what you think, when you're ready.
hey, are you able to share the benchmark that you're running?
@laktech It wasn't a proper offline benchmark like with JMH or anything. I basically just tried it with a production-like workload as part of a much larger code path, and viewed some flame graph profiles. If you have any better JMH or other benchmarks set up, please feel free to test this out.
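For reference, a minimal JMH harness for this path might look roughly like the sketch below; the benchmark class, the sample consent string, and the `com.iabtcf.decoder.TCString` import path are assumptions, not part of this PR:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import com.iabtcf.decoder.TCString; // import path assumed

@State(Scope.Benchmark)
public class TCStringDecodeBenchmark {

    // Placeholder: substitute a production-like TC string with vendor consent,
    // vendor legitimate interest, and publisher restrictions populated.
    private final String consentString =
            "COtybn4PA_zT4KjACBENAPCIAEBAAECAAIAAAAAAAAAA";

    @Benchmark
    public int decodeEagerly() {
        // hashCode() forces eager decoding of all sections, which matches the
        // workload described in this thread.
        return TCString.decode(consentString).hashCode();
    }
}
```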
@laktech Any progress in reviewing?
Nice
Hey, any progress on reviewing this?
@laktech Another bump 🙏
sorry i stepped away from this for a few months. i'll be reviewing this once again.
```java
if (isRangeEntry) {
    int endVendorId = bbv.readBits16(offset);
    offset += FieldDefs.START_OR_ONLY_VENDOR_ID.getLength(bbv);
    final int content = bbv.readBits32(offset);
```
there is significant code here to eliminate a call to readBits16. I don't think it's worth it unless you can show a benchmark that it adds value.
Overall, this looks great. I just have some concerns around the two areas I commented above.
I've made three changes (three commits) to reduce the number of calls to `ensureReadable()` and to the `LengthOffsetCache` internal maps. We've seen a 2x speed improvement in `TCString.decode()` and a 2% overall CPU usage improvement in our production workload, which at high scale results in tangible cost savings.

Just to note, our use case doesn't benefit much from lazy decoding, since we need to decode the vendor consent, vendor legitimate interest, and publisher restrictions for almost all requests, and those three seem to dominate the decoding time in our profiling measurements when `TCStringV2.hashCode()` is invoked upon eager decoding.