Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Compiler cache to soft value reference to reduce metaspace oom #15237

Closed
wants to merge 1 commit into from

Conversation

rice668
Copy link
Contributor

@rice668 rice668 commented Nov 29, 2022

Description

This PR can reduces metaspace oom issues, make some compiler that use guava cache value from strong reference to soft reference without obvious performance regression under 18,110 production queries.

Fixes #15232

image

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 29, 2022
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make Compiler cache to soft value reference to reduce metaspace oom

Trino caches do not use soft references, to avoid "invalidation storm" when system would otherwise OOM.
Instead, the system needs to be configured so that it doesn't OOM at all.

Also, if we really want to change this direction -- first make sure we want, which I personally doubt -- and start using soft references for all the caches, the change should focus on ensuring that all new caches are soft. I.e. please think about how to automatically ensure that. Modernizer may be helpful here, or some other code analysis.

@findepi
Copy link
Member

findepi commented Nov 30, 2022

cc @dain @lukasz-stec

@lukasz-stec
Copy link
Member

Trino caches do not use soft references, to avoid "invalidation storm" when system would otherwise OOM.

Also, there are no guarantees for when soft references are freed other than before OOM. This can cause performance to be very variable.

That said, I see how cache size based on the number of elements can cause issues since the elements have variable lengths.
Generated classes bytecode, which I think is kept in the meta space, can vary between a few hundred bytes to hundreds of kilobytes or more.
One way to mitigate it is to limit the cache by bytes used (e.g. by using com.google.common.cache.Weigher).

@findepi
Copy link
Member

findepi commented Dec 1, 2022

That said, I see how cache size based on the number of elements can cause issues since the elements have variable lengths.

good point

This is also visible in #15232 description where singled out entry is 1/3 in size of the average

One way to mitigate it is to limit the cache by bytes used (e.g. by using com.google.common.cache.Weigher).

let's do that -- unless we know that max entry size * max entry count is still acceptable cache occupancy.

cc @dain for function & expression execution
cc @martint for #14237

@rice668
Copy link
Contributor Author

rice668 commented Dec 2, 2022

Thanks to @findepi @lukasz-stec for discussion and related ideas. Limit the cache by bytes used com.google.common.cache.Weigher seems to be a more acceptable way to solve this problem. I'll revisit this issue in the near future and give some feedback or an update on the PR.

@rice668
Copy link
Contributor Author

rice668 commented Dec 3, 2022

If we use weigher which requires calculate the class size of each entry value in int weight(K key, V value) . It seems that we can not find a suitable way to get it in currently. After discuss with @lukasz-stec , one possible solution is to modify ClassGenerator in airlift that let it return the cached generated classes and its size, etc. In addition to this solution, do we have a better way to get the size of the class ?

@electrum
Copy link
Member

@zhangminglei apologies for the delay in responding.

I think the easiest way is to modify DynamicClassLoader in https://github.com/airlift/bytecode to have a method getDefinedClassesBytecodeSize(). It could track the size in defineClass(). We then add a method in CompilerUtils:

public static long getBytecodeSize(Class<?> clazz)
{
    if (clazz.getClassLoader() instanceof DynamicClassLoader loader) {
        return loader.getDefinedClassesBytecodeSize();
    }
    throw new IllegalArgumentException("Invalid class loader [%s] for %s".formatted(clazz.getClassLoader(), clazz));
}

@findepi
Copy link
Member

findepi commented Apr 13, 2023

    if (clazz.getClassLoader() instanceof DynamicClassLoader loader) {
        return loader.getDefinedClassesBytecodeSize();

do we assume DynamicClassLoader loads one class only?
or all classes loaded by one loader should report same weight?

@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

👋 @zhangminglei - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

Copy link

github-actions bot commented Sep 4, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 4, 2024
@mosabua
Copy link
Member

mosabua commented Sep 5, 2024

Is this still relevant @electrum @findepi @rice668 ?

@electrum
Copy link
Member

electrum commented Sep 5, 2024

We can close this, as the cache weigher approach discussed above is a better direction.

@electrum electrum closed this Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Worker process cause metaspace OOM after the cluster running for a long time
6 participants