Add full set of RecyclerPool implementations #1064

Merged · 31 commits · Aug 31, 2023

Conversation

mariofusco
Contributor

This is an adaptation, for branch 2.16, of this pull request opened against master.

As requested I didn't remove any public method, but added a few new ones, deprecating the old methods they are intended to replace. I also improved many tests to make sure that all Closeable resources are properly closed after use. This is necessary to guarantee the correct reuse of the BufferRecycler taken from the pool and, more generally, to adhere to the lifecycle of those resources.

The last outstanding part is an efficient implementation of the object pool itself. If you agree with the general idea of this pull request, I will work on it. /cc @cowtowncoder @pjfanning @franz1981

Note that at the moment this pull request indirectly triggers a couple of test failures, which are actually caused by the bug I reported here.

@mariofusco
Contributor Author

With this commit I tested different object pool implementations in order to benchmark them and pick the one that best fits the Jackson use case. I also temporarily added a dependency on JCTools because I wanted to test a pool specifically designed to perform well under contention and use it at least as a baseline. Moreover, the JCTools pool is the only implementation I found that does not use ThreadLocals internally.

@mariofusco
Contributor Author

mariofusco commented Jul 26, 2023

I made an extensive performance comparison of all the different pool implementations that I introduced with this commit, in order to choose the best fit for Jackson's needs.

For this comparison I used the following benchmark, which serializes to JSON a very simple POJO made of only 3 fields (a Person with firstName, lastName and age), performing this operation in parallel on 10, 100 or 1000 (native or virtual) threads.

    // parallelTasks, runner (executes each task on a platform or virtual thread),
    // item (the Person POJO) and json (the JSON serializer) are benchmark state fields.
    @Benchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public void writePojoMediaItem(Blackhole bh) throws Exception {
        CountDownLatch countDown = new CountDownLatch(parallelTasks);

        // submit one serialization task per parallel task and wait for all of them to finish
        for (int i = 0; i < parallelTasks; i++) {
            runner.accept(() -> {
                bh.consume(write(item, json));
                countDown.countDown();
            });
        }

        try {
            countDown.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    protected final int write(Object value, JSON writer) {
        // write to a no-op stream: we only care about exercising the serializer and its buffers
        NopOutputStream out = new NopOutputStream();
        try {
            writer.write(value, out);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return out.size();
    }
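
The NopOutputStream used above isn't shown here; presumably it is just an OutputStream that discards everything while counting the bytes written. A minimal version might look like the following (an assumption for completeness, not necessarily the benchmark's actual helper):

    import java.io.OutputStream;

    // Assumed shape of the NopOutputStream helper: discards all bytes, only tracks the count.
    public class NopOutputStream extends OutputStream {
        private int size;

        @Override
        public void write(int b) { size++; }

        @Override
        public void write(byte[] b, int off, int len) { size += len; }

        public int size() { return size; }
    }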

I used this smaller Person object instead of the usual MediaItems.stdMediaItem() used in other benchmarks because I wanted to maximize the portion of time the benchmark spends accessing and using the pool, thus making the performance gaps among the different implementations as evident as possible.

In the remaining part of this analysis I will consider only 5 different possibilities:

  1. using no pool at all.
  2. the existing BufferRecyclers.
  3. the pool using JCTools which, as written before, has been introduced only to provide a comparison with a pool specifically designed to perform well under contention, and to serve as a baseline.

plus the 2 best-performing implementations not requiring the introduction of any external dependency:

  1. the one based on a LinkedTransferQueue
  2. the one using a very simple lock-free algorithm based on a CAS on an AtomicReference.

The charts below summarize the performance of these implementations for both virtual and native threads. The Y-axis reports the number of operations performed per millisecond, taking into account that the number of operations in a benchmark loop is equal to the number of parallel tasks used when running that benchmark. In other words, these results are normalized by the number of parallel tasks.

As expected, when running with virtual threads the ThreadLocal-based BufferRecyclers performs particularly badly, even worse than not having any pool at all. Conversely, the lock-free pool performs best regardless of the number of parallel virtual threads used to run the benchmark.

[chart: ops/msec with virtual threads, per pool implementation]

On the other side, when running with traditional native threads, the BufferRecyclers still outperforms all other solutions, with the lock-free algorithm being a good second, not far from the BufferRecyclers especially with a relatively low number of threads. When the number of parallel threads reaches 100 or more, the performance of this lock-free implementation (while remaining acceptable) worsens a bit and drops even below the LinkedTransferQueue solution, very likely because the higher contention increases the probability of a CAS failure.

[chart: ops/msec with native threads, per pool implementation]

Given all these considerations, and keeping in mind that we are introducing this new pool mostly to deal with virtual threads, I decided to keep the lock-free implementation. As suggested by @pjfanning, at least for Jackson 2.x the actual use of this pool will be an opt-in feature, so it can be adopted by projects and frameworks already leveraging virtual threads, while the existing BufferRecyclers will remain the default implementation in use.


@Override
public T borrow() {
int pos = counter.getAndUpdate(i -> i > 0 ? i-1 : i);
@franz1981 franz1981 Jul 26, 2023

Same performance problem as the lock-free variant of the pool: counter is highly contended (they are both stacks, actually - linked or array-based), which would limit scalability.
I would instead use a striped variant of this, e.g.:

add a parameter "concurrency" which represents the number of "slots" of your pool.
Each slot contains a fixed-size partition of the total size (i.e. total size / concurrency).
Making both total size and concurrency powers of 2 is better (more on why later).

You can then use the thread id to decide which slot to hit first (a power-of-2 size allows computing the modulus as threadId & (size - 1)), and you have 2 strategies here:

  • perform the CAS on the counter owning that slot's index, moving to the next counter if it fails, until all concurrency counters have been tried (that means we need to keep track of "where" we borrowed from, or on release we risk unbalancing the pool while searching for some available slot)
  • just keep trying (via a CAS loop) on the same slot's index: thanks to the thread id distribution (with virtual threads in particular) you'll probably be lucky enough to spread the contention just the same

To better deal with the contention, the counters per partition could all be stored in an AtomicLongArray, 8/16 longs apart from each other, to keep each counter separated by 1 or 2 cache lines and avoid false sharing.

Another interesting way to deal with this could be to make the algorithm a bit unbalanced:
i.e. in order to make progress we first need to borrow, right?
Then we could just perform getAndIncrement there...

  • if we overflow (more than the slot capacity) we just switch to a mix of get and compareAndSet (because nothing is available in there) or just allocate
  • the release side instead can check the status of the counter and use get + compareAndSet: if we overflow it means we have exhausted the capacity, and we just need to decrease it by 1 from the max capacity
    This last one is really optional anyway, and it still has to be proven to work well under contention (given that getAndIncrement is proven to be decent under contention only on x86 and a few modern ARM archs)
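
For illustration, here is a minimal sketch of the striping idea described above. It replaces the counter-indexed array partitions with a simple Treiber-style lock-free stack per slot, and omits both the per-slot capacity bound and the padding against false sharing; names such as StripedRecyclerPool are purely illustrative and not part of this PR:

    import java.util.concurrent.atomic.AtomicReferenceArray;
    import java.util.function.Supplier;

    // Illustrative striped pool: one small lock-free stack per slot, slot chosen by thread id.
    public class StripedRecyclerPool<T> {
        private static final class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private final int concurrency;                      // number of slots, assumed power of 2
        private final AtomicReferenceArray<Node<T>> heads;  // head of the stack owned by each slot
        private final Supplier<T> factory;

        public StripedRecyclerPool(int concurrency, Supplier<T> factory) {
            this.concurrency = concurrency;
            this.heads = new AtomicReferenceArray<>(concurrency);
            this.factory = factory;
        }

        private int slot() {
            // power-of-2 concurrency lets us replace the modulus with a bit mask
            return (int) (Thread.currentThread().getId() & (concurrency - 1));
        }

        public T borrow() {
            int s = slot();
            for (;;) {                                      // strategy 2: keep retrying on the same slot
                Node<T> head = heads.get(s);
                if (head == null) {
                    return factory.get();                   // slot empty: allocate instead of searching
                }
                if (heads.compareAndSet(s, head, head.next)) {
                    return head.value;
                }
            }
        }

        public void release(T value) {
            int s = slot();
            Node<T> node = new Node<>(value);
            for (;;) {
                Node<T> head = heads.get(s);
                node.next = head;
                if (heads.compareAndSet(s, head, node)) {
                    return;
                }
            }
        }
    }

A real implementation would also cap each slot at totalSize / concurrency and pad the per-slot heads, as suggested above, so they land on separate cache lines.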


@pjfanning
Member

pjfanning commented Jul 26, 2023

For the feature, could I suggest adding something alongside JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING - maybe JsonFactory.Feature.USE_BUFFER_RECYCLING_POOL (default = false).

@mariofusco could you load test disabling the USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING feature when you are testing the BufferRecyclers case? It would be useful to know the performance of BufferRecyclers with this feature enabled (the default) and with it disabled.
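
For reference, a sketch of what both of these would look like from the user side. USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING already exists in jackson-core, while USE_BUFFER_RECYCLING_POOL is only the name proposed above and does not exist (yet):

    import com.fasterxml.jackson.core.JsonFactory;

    public class RecyclingFeatureExample {
        public static void main(String[] args) {
            // Existing feature: disable ThreadLocal-based recycling, e.g. for the
            // comparison benchmark asked about above.
            JsonFactory noThreadLocal = JsonFactory.builder()
                    .disable(JsonFactory.Feature.USE_THREAD_LOCAL_FOR_BUFFER_RECYCLING)
                    .build();

            // Proposed opt-in (hypothetical): USE_BUFFER_RECYCLING_POOL is only the name
            // suggested in this thread, it is not an existing JsonFactory.Feature.
            // JsonFactory pooled = JsonFactory.builder()
            //         .enable(JsonFactory.Feature.USE_BUFFER_RECYCLING_POOL)
            //         .build();
        }
    }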

@mariofusco
Contributor Author

@cowtowncoder @pjfanning @franz1981 I now consider this pull request completed and ready to be reviewed and merged.

I made the new ObjectPool optional and disabled by default, so this shouldn't break backward compatibility in any way. There is probably still room to improve the performance of the pool implementation; I left in only the 2 most promising ones for now. As I wrote before, the main goal of this pull request is to clearly define the life cycle of all objects involved and make them compatible with the use of a pool different from the old ThreadLocal-based one. The use of the pool is well encapsulated and it defines an interface with only 2 methods, so if we find a better-performing implementation it will be straightforward to replace it. This could eventually be investigated with a subsequent pull request.

Comment on lines 195 to 199
BufferRecycler withPool(ObjectPool<BufferRecycler> pool) {
this._pool = pool;


Do we want to throw an exception in case it is set twice, or will it replace whatever is already there?
Or should we add an assert to make the implementation invariant evident here?

@mariofusco
Contributor Author

@cowtowncoder did you get a chance to look at this? Any comments? If you think it's ok, I'd appreciate it if it could be merged in a reasonable time: I keep getting conflicts with other commits (I will fix the last ones asap) and it's becoming increasingly hard to resolve them.

@mariofusco
Contributor Author

@cowtowncoder sorry to insist and ask again, but I'm literally resolving conflicts on this pull request on a daily basis and it's almost becoming a full-time job. I'd appreciate it if you could at least provide some feedback. /cc @pjfanning @franz1981

@cowtowncoder
Member

@mariofusco Sorry, I haven't had any time to look into this. I am going on vacation, so on the plus side there shouldn't be anything to merge during that time (2 weeks). Reading through this PR is high on my TODO list but it requires quite a bit of focus.

I'll add one more (separate) note on a general approach I think makes sense, for 2.16 timeline -- apologies for not trying to reconcile it with your work so far.

@cowtowncoder
Member

So, to me what makes sense is (and once again, apologies for not figuring out how close this PR is from these ideas) as follows:

  1. Starting point like BufferRecyclers is associated with a single JsonFactory (or in 3.0 TokenStreamFactory). This is overridable
  2. Each new JsonParser/JsonGenerator is allocated a BufferRecycler by a call to BufferRecyclers -- and then char/byte arrays are taken from recycler by parser/generator
  3. Each BufferRecycler needs to be returned to BufferRecyclers on close() of parser/generator; parser/generator should return actual array to BufferRecycler right before this
  4. It may be that access to arrays via BufferRecycler still needs to use atomic operations? I am not 100% sure about this, but this is probably not a performance hot spot since these should always be uncontested

Once this life-cycle works in 2.16; and with one backing implementation (existing ThreadLocal based one), I think there's need to get experience with release, and target alternate implementations for later releases.
This could be 2.17, or, if there are blockers, then master/3.0.

The approach above is based on my strong preference for allowing per-factory pooling as one of the options -- but also allowing global pooling: the latter case is achieved by just using a single BufferRecyclers (with appropriate pooling), the former by separate recycler instances.

@pjfanning
Member

@cowtowncoder I think the PR as is achieves more or less what you have highlighted. The default behaviour remains the ThreadLocal-based BufferRecycler. Users can opt in to use the per-JsonFactory pooled BufferRecycler instead.

@mariofusco
Contributor Author

The main intent of this pull request is to clearly define a lifecycle for JsonParser/JsonGenerator. This is necessary to make sure that a resource, in this case the BufferRecycler, is properly released into the pool from which it has been acquired, without relying on a more forgiving mechanism like the one provided by the ThreadLocal-based pool.

I also made the whole test suite follow this lifecycle. In particular I ran all tests using the DebugPoolDecorator suggested by @pjfanning, printing a line every time a BufferRecycler is acquired from the pool or released into it, and making sure that the 2 methods are invoked exactly the same number of times.

Regarding your points:

1. Starting point like `BufferRecyclers` is associated with a single `JsonFactory` (or in 3.0 `TokenStreamFactory`). This is overridable

At the moment BufferRecyclers isn't associated with anything; it just provides a few static methods and in practice behaves like a singleton. What you wrote is (incidentally) true only because JsonFactory is the only class accessing and using it, in its _getBufferRecycler() method. I followed exactly the same pattern when introducing my BufferRecyclerPool, which has the same intent and behavior as BufferRecyclers, just without using any ThreadLocal. Eventually both the old BufferRecyclers and the new BufferRecyclerPool could easily become proper instance members of JsonFactory, but I believe that this change is out of scope for this pull request and could be implemented in a subsequent commit (I could also take care of this if required).

2. Each new `JsonParser`/`JsonGenerator` is allocated a `BufferRecycler` by a call to `BufferRecyclers` -- and then char/byte arrays are taken from recycler by parser/generator

This is already what happens, or more precisely this is what JsonParser/JsonGenerator do via their IOContext, which is the only class holding a reference to the BufferRecycler.

3. Each `BufferRecycler` needs to be returned to `BufferRecyclers` on `close()` of parser/generator; parser/generator should return actual array to `BufferRecycler` right before this

This is indeed the main point of this pull request: IOContext now also implements AutoCloseable, so when JsonParser/JsonGenerator are closed they also close the underlying IOContext, which in turn releases the BufferRecycler it owns back to the pool.
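
From the user's point of view this lifecycle is just the usual try-with-resources pattern; a minimal illustration using the standard jackson-core streaming API (nothing PR-specific):

    import java.io.StringWriter;
    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonGenerator;

    public class LifecycleExample {
        public static void main(String[] args) throws Exception {
            JsonFactory factory = new JsonFactory();
            StringWriter w = new StringWriter();
            try (JsonGenerator gen = factory.createGenerator(w)) {
                gen.writeStartObject();
                gen.writeStringField("firstName", "Mario");
                gen.writeEndObject();
            } // close() closes the IOContext, which releases its BufferRecycler back to the pool
            System.out.println(w);
        }
    }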

4. It may be that access to arrays via `BufferRecycler` still needs to use atomic operations? I am not 100% sure about this, but this is probably not a performance hot spot since these should always be uncontested

This has not been changed in any way and it's also out of scope for this pull request.

Once this life-cycle works in 2.16; and with one backing implementation (existing ThreadLocal based one), I think there's need to get experience with release, and target alternate implementations for later releases. This could be 2.17, or, if there are blockers, then master/3.0.

The existing ThreadLocal-based implementation is still there and in use by default. My new ThreadLocal-free pool is an opt-in option that has to be explicitly enabled with a new JsonFactory feature. This should ensure the best and safest backward compatibility, while allowing frameworks and tools leveraging virtual threads to do so correctly by simply enabling that feature.

{
_streamReadConstraints = src;
_streamWriteConstraints = swc;
_streamReadConstraints = (src == null) ?
Member

Minor nit, probably related to changes in 2.16 in the meantime -- no null checks should be made, caller should pass default values, never null (probably just need to merge/rebase from 2.16)

@pjfanning
Member

Would this work?

  • BufferRecycler is still updated to be AutoCloseable. The changes in this PR to close the BufferRecycler will be kept.
  • A new BufferRecyclerProvider interface is created (a rough sketch of what it might look like follows below). We can have 2 built-in implementations: one based on the existing ThreadLocal approach and one based on this PR's approach.
  • JsonFactory has a setter that lets you override the BufferRecyclerProvider.
  • No need for a new Feature to control the behaviour.
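
A possible shape for that provider contract, purely illustrative (the interface and its method names are assumptions, not an API that exists in jackson-core):

    import com.fasterxml.jackson.core.util.BufferRecycler;

    // Hypothetical provider contract: one implementation backed by ThreadLocal,
    // one backed by this PR's pool.
    public interface BufferRecyclerProvider {
        BufferRecycler acquireBufferRecycler();

        void releaseBufferRecycler(BufferRecycler recycler);
    }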

@cowtowncoder
Member

Would this work?

* BufferRecycler is still updated to be AutoCloseable. The changes in this PR to close the BufferRecycler will be kept.

* A new BufferRecyclerProvider interface is created. We can have 2 built-in implementations. One based on the existing ThreadLocal approach and one based on this PR's approach.

* JsonFactory has a setter that lets you override the BufferRecyclerProvider.

* No need for a new Feature to control the behaviour.

I started thinking along similar lines, although working from BufferRecyclers -- however, your suggestion of a new provider would be cleaner, basically deprecating BufferRecyclers (but keeping it for legacy support).

I think I can work on a PR for the integration pieces, separate from this PR (to be merged here when ready).

Aside from the mechanics of getting to the BufferRecycler (which the above would tackle), I think this PR looks reasonable.

@cowtowncoder
Member

@mariofusco @pjfanning Ok, so, I created PR #1083 for the suggested integration. I can merge that in 2.16 and master, but wanted to get your feedback first.

@cowtowncoder
Member

@mariofusco Ok, apologies for making tons of tweaks, but I think the result is now something I could merge.

Changes mostly concern the following things:

  1. JDK serializability of ThreadLocalPools: this is only needed in 2.x if ObjectMapper (or in theory JsonFactory) instances are JDK serialized -- I know some frameworks do that (whether sensibly or not), like Hadoop and Spark (when distributing jobs). In Jackson 3.x this is actually a much more efficient thing to do. At any rate, ThreadLocalPool is part of the serializable configuration of JsonFactory (and hence ObjectMapper), so it needs to be supported
  2. Factory methods for pools: I think the distinction between "shared" (global) and "non-shared" (private per-factory) instances needs to be made clear

On (1) there is one thing I haven't figured out yet, and that is keeping the identity of the global LockFree / Deque-based pools.
For now JDK ser/deser will not re-create the Shared instance. It is doable, but for now I didn't have time to make that work.
(For the No-op and ThreadLocal ones it'd be easy to use readResolve() to return the shared instance, but I didn't bother since they are stateless.)

Remaining questions or (minor) open issues from my end are:

  1. Can we get one size-bounded, stateful instance to complete the set?
  2. Would it be possible to have unit tests that actually exercise the pooling aspect (just verifying, without even multi-threading, that BufferRecycler instances are recycled as expected)?

Neither of these is really a blocker; but if we can resolve them, great. Either way I hope to actually, finally, get this merged tomorrow if all goes well.
And then have fun merging it into master (3.0.0)!

import java.io.IOException;
import java.io.OutputStream;

public class BufferRecyclerPoolTest extends BaseTest
Member

It would be great to actually test the recycling aspects: no need to use multi-threading, just create a sequence of operations, starting with a single JsonFactory and a single parser/generator, to allocate buffers, and check that the underlying BufferRecycler is properly acquired/released.

If an accessor needs to be added to the parser/generator implementation for testing, that's fine.

This is a nice-to-have though, and just for basic sanity checking. No need to create complicated schemes.

Contributor Author

Done 5394156

* to compensate, need to re-create proper instance using constructor.
*/
protected Object readResolve() {
return new ConcurrentDequePool();
Member

This is where we'd need to choose between returning SHARED and creating a new one, if we serialized some sort of marker.

Contributor Author

The readResolve() method is typically used to implement the Singleton pattern, where the same object needs to be returned after deserialization. I'd rather return the SHARED instance here, otherwise you're silently duplicating the instances available after deserialization; in other words, JsonFactory instances created before and after deserialization that should share the same SHARED pool would in reality have different pool instances.

More in general, I don't understand the need for a "non-shared" pool instance (at least for these "default" pool implementations provided by us). Can you please clarify in which situation a user may want such a "non-shared" instance? If there isn't any evident need, I'd rather keep the pool constructors private and only expose the shared instance.

Member

In general I believe pretty much everything should be scoped by ObjectMapper / JsonFactory, with almost nothing shared across them. So if and when a Mapper/Factory gets dropped and garbage-collected, its recycler pool would similarly be disposed of. That's the fundamental sentiment, and it is why Jackson has very few static stateful global singletons (many stateless ones of course).
This in turn is because Jackson itself is used as a private dependency by many other libraries and frameworks, where isolation is typically beneficial.
So using globally shared stateful recycler pools goes against this general idea.

Two specific concerns I have:

  1. Does global sharing allow over-retention of BufferRecyclers (and thereby the underlying byte[]/char[] buffers)? Thinking it through, it probably doesn't (since recyclers created for then-dropped Factories will be happily recycled by the remaining ones) -- but here I would want a bounded max size to avoid retaining the peak number of recyclers
  2. Global sharing increases lock-contention compared to per-Factory recyclers; so for busiest cases isolation would help.

I guess (1) really isn't much of a concern specific to globally shared instances, especially in the absence of maximum size limits.

But (2) is something that could become an issue for some users.

So... not quite sure how to go about that. I still feel there is a need to allow both: convenient use of globally shared pools, with convenient access,
but also allowing instances to be constructed by users, for isolation.

Member

On readResolve(): it (or writeReplace()) can also be used to prevent serializing part of the state, such as actual cached contents. Since JDK serialization does not call any constructors on deserialization (nor, I think, field initialization statements? Or does it?), it is necessary to either:

  • clear out the cache contents Map before write (writeReplace), to read back an empty Map (etc.), or
  • mark the Map field transient and use readResolve() to create a properly initialized empty instance (with an empty Map); this second option is sketched below
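
A minimal sketch of that transient-field-plus-readResolve() pattern, illustrative only and not the jackson-core pool code:

    import java.io.Serializable;
    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;

    // Illustrative pool-like holder: the deque contents are never serialized, and
    // deserialization yields a properly re-initialized (empty) instance.
    public class CachingHolder implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: JDK serialization skips this field entirely, so it would be
        // null after plain deserialization
        private transient Deque<Object> cache = new ConcurrentLinkedDeque<>();

        // called by the JDK after deserialization; returns the instance actually used,
        // here a fresh one whose transient cache is properly initialized (returning a
        // shared singleton here instead would preserve identity, as discussed above)
        protected Object readResolve() {
            return new CachingHolder();
        }
    }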

@mariofusco
Contributor Author

Remaining questions or (minor) open issues from my end are:

1. Can we get one size-bound, stateful instance to complete the set

2. Would it be possible to have unit tests that actually exercise pooling aspect (just verify, without even multi-threading, that `BufferRecycler` instances are recycled as expected.

I did both things with my 2 latest commits.

My last outstanding concern is about the non-shared pool versions. At the moment I see them as useful only to test the pools themselves, nothing else, and they are also problematic with serialization. Even if we want to keep them, I also have some doubts about the naming: shared/non-shared doesn't entirely make sense to me (or at least I don't think it would make sense to somebody not knowing much about this internal implementation detail). I'd keep only the shared version, maybe simply calling it INSTANCE, leaving users the possibility to directly invoke one of the pools' constructors if they really want/need to do so.

@cowtowncoder
Member

Remaining questions or (minor) open issues from my end are:

1. Can we get one size-bound, stateful instance to complete the set

2. Would it be possible to have unit tests that actually exercise pooling aspect (just verify, without even multi-threading, that `BufferRecycler` instances are recycled as expected.

I did both things with my 2 latest commits.

My last outstanding concern is about the non-shared pool versions. At the moment I see them as useful only to test the pools themselves, nothing else, and they are also problematic with serialization. Even if we want to keep them, I also have some doubts about the naming: shared/non-shared doesn't entirely make sense to me (or at least I don't think it would make sense to somebody not knowing much about this internal implementation detail). I'd keep only the shared version, maybe simply calling it INSTANCE, leaving users the possibility to directly invoke one of the pools' constructors if they really want/need to do so.

I feel strongly that in this case plain "instance" is not adequate -- I was thinking of global instead of shared, as an alternative.

I agree that convenient access to the globally shared instances, along with allowing construction of non-shared, makes sense.

For the last part, then, the only (?) question is that of JDK serialization properly linking back the shared/global instance, while creating a new instance for non-shared ones.
It is doable by e.g. writing an int or boolean for the type (or even a String). Just some work.

@cowtowncoder
Member

cowtowncoder commented Aug 30, 2023

Added a couple of notes, but I think my thinking now is that:

  1. I am ok with globally shared instances as the main default thing (and am open to calling them whatever: global, shared), as long as there's a way to construct non-shared instances
  2. To support shared/non-shared instances, we do need a bit more work for JDK serialization to retain identities; and for the size-bounded pool impl, we need to retain the size setting too. jackson-databind's LRUMap implements something similar. But in this case, we could just use -1 (or 0) as a marker for the size field to denote shared (not the best explanation, maybe I should implement and show).

With that I think we would be done here.

EDIT: I am working on this (JDK serialization, tests) -- and then should be able to FINALLY merge this thing.
That doesn't mean we could not make further changes, just that the bulk of the work should be done (including the merge to 3.0, which is not going to be fun to do).
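
A small illustration of the marker idea from point 2 above: serialize only an int size, and let readResolve() map a sentinel value back to the shared instance while positive values recreate a private, size-bounded pool. Names and values here are illustrative, not the actual jackson-core implementation:

    import java.io.Serializable;

    // Illustrative size-bounded pool skeleton; only the int marker survives serialization.
    public class BoundedPool implements Serializable {
        private static final long serialVersionUID = 1L;

        private static final int SHARED_MARKER = -1;
        public static final BoundedPool GLOBAL = new BoundedPool(SHARED_MARKER);

        // the only serialized state: either SHARED_MARKER or the configured capacity
        private final int serializedCapacity;

        // actual pooled contents would live in transient fields, omitted here

        private BoundedPool(int capacity) {
            this.serializedCapacity = capacity;
        }

        public static BoundedPool nonShared(int capacity) {
            return new BoundedPool(capacity);
        }

        protected Object readResolve() {
            // link back to the global instance, or rebuild a private pool of the right size
            return (serializedCapacity == SHARED_MARKER) ? GLOBAL : new BoundedPool(serializedCapacity);
        }
    }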

@cowtowncoder
Member

Ok: to expedite things I will go ahead and merge -- this does not mean that aspects, naming etc. could not be changed based on discussions; I absolutely expect some minor tweaking.
But I figured that since I now have time, I can get the merging done.

@cowtowncoder
Member

Ok: I am looking for comments on #1117 -- changing the default pool used by Jackson 2.17 and later.

My strawman argument is that we should use:

  1. per-factory
  2. lock-free
  3. That is: one returned by JsonRecyclerPool.newLockFreePool()

pool as the default. This is mostly to get discussion going; I don't have strong objections to alternatives.

Successfully merging this pull request may close these issues: Allow pluggable buffer recycling via new RecyclerPool extension point