Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache accumulator factory #11358

Merged
merged 3 commits into from
Mar 14, 2022
Merged

Cache accumulator factory #11358

merged 3 commits into from
Mar 14, 2022

Conversation

dain
Copy link
Member

@dain dain commented Mar 8, 2022

Description

Cache generated aggregation accumulator classes. When accumulator factory was split from function management we lost the caching of part of the generated code.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# General
* Fix regression in aggregation performance caused by not reusing generated accumulator factories. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 8, 2022
findepi
findepi previously requested changes Mar 8, 2022
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Add getUnchecked to NonEvictableCache"

I do think we shouldn't add this method.

Comment on lines 33 to 34
// this can not happen because a supplier can not throw a checked exception
throw new RuntimeException("Unexpected checked exception from cache load", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can -> should

(actually "it can happen", if you do some things you shouldn't do)

/**
* A {@link com.google.common.cache.Cache} that does not support eviction.
*/
public interface NonEvictableCache<K, V>
extends Cache<K, V>
{
default V getUnchecked(K key, Supplier<V> loader)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Guava, getUnchecked is available on a LoadingCache only (and only without a loader).

I would prefer to stay in line with Guava APIs, so that we don't need to undo these simplifications, eg when switching back to Guava (should they fix the issue), or refactoring the code.

Thus, while I introduced all the "EvictableCache" related class, I wouldn't yet want to embrace them as "an API".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If, however, we wanted to depart from Guava APIs, i would prefer we have a different signature

V <T extends Exception> get(K key, ThrowingSupplier<T> loader) throws T;

this would erase the (logically useless) distinction between get and getUnchecked, i.e. the caller wouldn't need to make the choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi and I talked about this off line. The problem with an API like this is exceptions can cross threads when there are concurrent calls to a single element. The cache synchronizes and shares the result for all callers, which means the threads will shared the stacks from the one thread that actuallu executed the loader.

@findepi
Copy link
Member

findepi commented Mar 8, 2022

Was the leak only due to the fact we weren't caching things?
i.e. would the generated classes be eventually garbage-collected anyway?

@guyco33
Copy link
Member

guyco33 commented Mar 8, 2022

i.e. would the generated classes be eventually garbage-collected anyway?

There are still other generated classes that are not garbage-collected like:
io.trino.$gen.CursorProcessor
io.trino.$gen.PageFilter
io.trino.$gen.PageProjectionWork

trino-test-372-3-g59c5ca1 (worker) root@$ 
jcmd trino VM.class_hierarchy  | egrep 'io[.]trino[.][$]gen[.].+[0-9]{8,8}[_][0-9]{6,6}[_][0-9]+[/]' | awk -F_ '{arr[$1]++}END{for (a in arr) print a, arr[a]}' | sort

|  |--io.trino.$gen.GroupedGenericLongState 3
|  |--io.trino.$gen.GroupedLongLongState 3
|  |--io.trino.$gen.GroupedLongState 12
|--io.trino.$gen.BlockPositionComparison 1
|--io.trino.$gen.BlockPositionEqual 1
|--io.trino.$gen.BlockPositionHashCode 1
|--io.trino.$gen.CursorProcessor 2540
|--io.trino.$gen.GenericLongStateFactory 3
|--io.trino.$gen.LongLongStateFactory 3
|--io.trino.$gen.LongLongStateSerializer 3
|--io.trino.$gen.LongStateFactory 12
|--io.trino.$gen.LongStateSerializer 12
|--io.trino.$gen.PageFilter 1975
|--io.trino.$gen.PageProjectionWork 4376
|--io.trino.$gen.SingleGenericLongState 3
|--io.trino.$gen.SingleLongLongState 3
|--io.trino.$gen.SingleLongState 12
|--io.trino.$gen.countAccumulator 12
|--io.trino.$gen.countGroupedAccumulator 12
|--io.trino.$gen.maxAccumulator 3
|--io.trino.$gen.maxGroupedAccumulator 3
|--io.trino.$gen.sumAccumulator 3
|--io.trino.$gen.sumGroupedAccumulator 3

@sopel39
Copy link
Member

sopel39 commented Mar 8, 2022

There are still other generated classes that are not garbage-collected like:

io.trino.$gen.CursorProcessor
io.trino.$gen.PageFilter
io.trino.$gen.PageProjectionWork

these are actually cached. Do you say cache is disabled and they are not GCed anyway?

@guyco33
Copy link
Member

guyco33 commented Mar 8, 2022

Do you say cache is disabled and they are not GCed anyway

JVM is up for more than 4 hours and I can still see generated classes from 06:58:11

trino-test-372-3-g59c5ca1 (worker) root@$ 
jcmd trino VM.uptime | tail -1 && jcmd trino VM.class_hierarchy | egrep 'io[.]trino[.][$]gen[.]CursorProcessor' | sort| more
15018.827 s
|--io.trino.$gen.CursorProcessor_20220308_065811_20/0x00005598c03a6140
|--io.trino.$gen.CursorProcessor_20220308_065812_25/0x00007ffa60106940
|--io.trino.$gen.CursorProcessor_20220308_065828_74/0x00007ffa3c10c620

@guyco33
Copy link
Member

guyco33 commented Mar 8, 2022

Was the leak only due to the fact we weren't caching things?

I can approve that io.trino.$gen.*Accumulator_* is not leaking as before

@sopel39
Copy link
Member

sopel39 commented Mar 8, 2022

JVM is up for more than 4 hours and I can still see generated classes from 06:58:11

PageFunctionCompiler cache is not time based, only size based

@guyco33
Copy link
Member

guyco33 commented Mar 8, 2022

PageFunctionCompiler cache is not time based, only size based

Got it. And the default is 10,000 entries

@guyco33
Copy link
Member

guyco33 commented Mar 8, 2022

It seems that we need also time based cache here

trino-test-372-3-g59c5ca1 (worker) root@$ 
jcmd trino VM.class_hierarchy  | egrep 'io[.]trino[.][$]gen[.].+[0-9]{8,8}[_][0-9]{6,6}[_][0-9]+[/]' | awk -F_ '{arr[$1]++}END{for (a in arr) if (arr[a] > 100) {print a, arr[a]}}' | sort

|--io.trino.$gen.CursorProcessor 5029
|--io.trino.$gen.PageFilter 4968
|--io.trino.$gen.PageProjectionWork 29948

@guyco33
Copy link
Member

guyco33 commented Mar 10, 2022

After running a load test for almost 2 days, I approve that this fix did the job.
It solve the memory leak seen in 367+ and fixes the SQL performance degradation.

@dain
Copy link
Member Author

dain commented Mar 10, 2022

JVM is up for more than 4 hours and I can still see generated classes from 06:58:11

PageFunctionCompiler cache is not time based, only size based

@sopel39 we should consider limiting it to an hour like the others

@sopel39
Copy link
Member

sopel39 commented Mar 11, 2022

@sopel39 we should consider limiting it to an hour like the others

Makes sense. I don't think it would hurt. @guyco33 would you like to do so?

@findepi findepi dismissed their stale review March 11, 2022 11:34

for "Add getUnchecked to NonEvictableCache" -- discussed the plan

@findepi
Copy link
Member

findepi commented Mar 11, 2022

Was the leak only due to the fact we weren't caching things? i.e. would the generated classes be eventually garbage-collected anyway?

@dain can you please explain this?

@dain
Copy link
Member Author

dain commented Mar 11, 2022

Was the leak only due to the fact we weren't caching things? i.e. would the generated classes be eventually garbage-collected anyway?

I believe so. Based on @guyco33's testing I'm pretty confident.

@martint
Copy link
Member

martint commented Mar 11, 2022

can you please explain this?

Yes, the "leak" is not truly a leak -- the VM will eventually collect those classes if they are no longer referenced. The performance issue is due to not reusing the generated classes, so the VM is in an eternal warming up state for aggregations.

@dain dain force-pushed the cache-accumulator-factory branch 3 times, most recently from 11d268c to 2c04ca5 Compare March 12, 2022 03:28
@guyco33
Copy link
Member

guyco33 commented Mar 13, 2022

the VM will eventually collect those classes if they are no longer referenced

but it seems that they retained since the reference never gone

dain added 3 commits March 13, 2022 13:06
The uncheckedCacheGet method uses a Supplier instead of a Callable and
therefore can not throw a checked exception, so a checked is not thrown
from uncheckedCacheGet.  This simplifies most callers which do not use
checked exceptions.
Remove LambdaProvider instances from AccumulatorFactory to simplify
caching of the factories.
@dain dain force-pushed the cache-accumulator-factory branch from 2c04ca5 to 081f153 Compare March 13, 2022 20:19
@dain dain merged commit 2bfd417 into trinodb:master Mar 14, 2022
@dain dain deleted the cache-accumulator-factory branch March 14, 2022 18:30
@github-actions github-actions bot added this to the 374 milestone Mar 14, 2022
@rice668
Copy link
Contributor

rice668 commented Dec 6, 2022

the VM will eventually collect those classes if they are no longer referenced

but it seems that they retained since the reference never gone

Maybe you are interesting in this #15237 why reference never gone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Class loader leak since 367
6 participants