Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Class loader leak since 367 #11234

Closed
guyco33 opened this issue Feb 28, 2022 · 6 comments · Fixed by #11358
Closed

Class loader leak since 367 #11234

guyco33 opened this issue Feb 28, 2022 · 6 comments · Fixed by #11358
Assignees
Labels
bug Something isn't working performance

Comments

@guyco33
Copy link
Member

guyco33 commented Feb 28, 2022

I did a stress test (70 queries executed in 2 minutes cycles for 3 hours) on both 366 and 367 and found that class loaders are retained in JVM 5 times more.
This class loader leak might be the reason for a performance degradation that appear after running for a while.

367

Mon Feb 21 15:15:15 UTC 2022

$ jcmd trino VM.metaspace | head -13 | tail -11
Total Usage - 150852 loaders, 175304 classes:
  319889 chunks,     1.32 GB capacity,     1.01 GB ( 76%) used,   304.59 MB ( 23%) free,    25.03 KB ( <1%) waste,    19.52 MB (  1%) overhead, deallocated: 167697 blocks with 43.46 MB

Virtual space:
      1.32 GB reserved,       1.32 GB (>99%) committed

Chunk freelist:

$ jcmd trino VM.metaspace | tail -8 | head -5
MaxMetaspaceSize: unlimited
No class space
Initial GC threshold: 20.80 MB
Current GC threshold: 2.12 GB
CDS: off

$ jcmd trino VM.metaspace show-loaders | grep io.airlift.bytecode.DynamicClassLoader | wc -l
122795

$ jcmd trino VM.metaspace show-loaders | grep jdk.internal.reflect.DelegatingClassLoader | wc -l
22637

$ jcmd trino VM.metaspace show-loaders | grep '<anonymous class>, loaded by' | wc -l
6562

$ jcmd trino VM.metaspace show-loaders | grep 'CLD 0x0' | egrep -v 'io.airlift.bytecode.DynamicClassLoader|jdk.internal.reflect.DelegatingClassLoader|<anonymous class>, loaded by' | wc -l
50

366

Mon Feb 21 15:15:30 UTC 2022

$ jcmd trino VM.metaspace | head -13 | tail -11
Total Usage - 27284 loaders, 51623 classes:
  66539 chunks,   368.38 MB capacity,   292.73 MB ( 79%) used,    71.57 MB ( 19%) free,    17.00 KB ( <1%) waste,     4.06 MB (  1%) overhead, deallocated: 37313 blocks with 12.59 MB


Virtual space:
    402.00 MB reserved,     400.74 MB (>99%) committed



Chunk freelist:


$ jcmd trino VM.metaspace | tail -8 | head -5
MaxMetaspaceSize: unlimited
No class space
Initial GC threshold: 20.80 MB
Current GC threshold: 667.91 MB
CDS: off

$ jcmd trino VM.metaspace show-loaders | grep io.airlift.bytecode.DynamicClassLoader | wc -l
15950

$ jcmd trino VM.metaspace show-loaders | grep jdk.internal.reflect.DelegatingClassLoader | wc -l
6134

$ jcmd trino VM.metaspace show-loaders | grep '<anonymous class>, loaded by' | wc -l
5412

$ jcmd trino VM.metaspace show-loaders | grep 'CLD 0x0' | egrep -v 'io.airlift.bytecode.DynamicClassLoader|jdk.internal.reflect.DelegatingClassLoader|<anonymous class>, loaded by' | wc -l
50

Here is a list of classes that retained in JVM in 367 but not in 366

+-----------------------------------------------------------------------------------------+--------+------+
|class_name                                                                               |versions|counts|
+-----------------------------------------------------------------------------------------+--------+------+
|[Lio.airlift.joni.StackEntry;                                                            |{367}   |{1}   |
|[Lio.airlift.log.RollingFileMessageOutput$CompressionType;                               |{367}   |{17}  |
|[Lio.trino.operator.aggregation.AggregationFunctionAdapter$AggregationParameterKind;     |{367}   |{17}  |
|[Lio.trino.spi.connector.RetryMode;                                                      |{367}   |{17}  |
|[Lsun.security.ec.point.ProjectivePoint$Immutable; (jdk.crypto.ec@11.0.14.1)             |{367}   |{1}   |
|[Lsun.security.ssl.Plaintext; (java.base@11.0.14.1)                                      |{367}   |{1}   |
|com.fasterxml.jackson.core.JsonLocation                                                  |{367}   |{17}  |
|com.fasterxml.jackson.core.io.ContentReference                                           |{367}   |{34}  |
|com.fasterxml.jackson.databind.PropertyNamingStrategies$UpperSnakeCaseStrategy           |{367}   |{17}  |
|com.fasterxml.jackson.databind.deser.std.StdKeyDeserializers$$Lambda$%                   |{367}   |{17}  |
|com.google.common.base.$Proxy%                                                           |{367}   |{68}  |
|com.google.common.collect.$Proxy%                                                        |{367}   |{34}  |
|com.google.common.collect.ImmutableSet$EmptySetBuilderImpl                               |{367}   |{136} |
|com.google.inject.internal.BoundProviderFactory                                          |{367}   |{17}  |
|com.google.inject.internal.LinkedProviderBindingImpl                                     |{367}   |{17}  |
|io.airlift.http.server.HttpRequestEvent                                                  |{367}   |{1}   |
|io.airlift.joni.ByteCodeMachine                                                          |{367}   |{1}   |
|io.airlift.joni.StackEntry                                                               |{367}   |{1}   |
|io.airlift.log.RollingFileMessageOutput$CompressionType                                  |{367}   |{17}  |
|io.trino.$gen.GroupedLongLongState%                                                      |{367}   |{1}   |
|io.trino.$gen.sumGroupedAccumulator%                                                     |{367}   |{3}   |
|io.trino.FeaturesConfig                                                                  |{367}   |{17}  |
|io.trino.array.BooleanBigArray                                                           |{367}   |{1}   |
|io.trino.metadata.BlockEncodingManager                                                   |{367}   |{17}  |
|io.trino.metadata.InternalBlockEncodingSerde$$Lambda$%                                   |{367}   |{34}  |
|io.trino.metadata.TypeRegistry$InternalTypeManager                                       |{367}   |{17}  |
|io.trino.operator.WindowInfo$IndexInfo                                                   |{367}   |{3}   |
|io.trino.operator.aggregation.AggregationFunctionAdapter$$Lambda$%                       |{367}   |{68}  |
|io.trino.operator.aggregation.AggregationFunctionAdapter$AggregationParameterKind        |{367}   |{17}  |
|io.trino.operator.aggregation.AggregationMetadata                                        |{367}   |{17}  |
|io.trino.operator.aggregation.AggregationMetadata$AccumulatorStateDescriptor             |{367}   |{17}  |
|io.trino.operator.aggregation.Aggregator                                                 |{367}   |{8}   |
|io.trino.operator.aggregation.AggregatorFactory                                          |{367}   |{8}   |
|io.trino.operator.aggregation.CompiledAccumulatorFactory                                 |{367}   |{8}   |
|io.trino.operator.aggregation.GroupedAggregator                                          |{367}   |{7}   |
|io.trino.operator.window.MappedWindowFunction                                            |{367}   |{8}   |
|io.trino.operator.window.MappedWindowIndex                                               |{367}   |{8}   |
|io.trino.operator.window.PagesWindowIndex                                                |{367}   |{3}   |
|io.trino.operator.window.ReflectionWindowFunctionSupplier$ConstructorType$%              |{367}   |{68}  |
|io.trino.orc.MemoryOrcDataSource                                                         |{367}   |{1}   |
|io.trino.plugin.hive.metastore.procedure.FlushHiveMetastoreCacheProcedure                |{367}   |{17}  |
|io.trino.plugin.jdbc.procedure.FlushJdbcMetadataCacheProcedure                           |{367}   |{34}  |
|io.trino.plugin.jdbc.procedure.FlushJdbcMetadataCacheProcedure$$FastClassByGuice$$%      |{367}   |{17}  |
|io.trino.spi.connector.RetryMode                                                         |{367}   |{17}  |
|io.trino.sql.PlannerContext                                                              |{367}   |{17}  |
|io.trino.sql.analyzer.StatementAnalyzerFactory                                           |{367}   |{17}  |
|io.trino.sql.planner.NodePartitioningManager$$Lambda$%                                   |{367}   |{17}  |
|java.security.KeyPairGenerator$Delegate (java.base@11.0.14.1)                            |{367}   |{1}   |
|java.util.Spliterators$ArraySpliterator (java.base@11.0.14.1)                            |{367}   |{2}   |
|java.util.regex.Pattern$Pos (java.base@11.0.14.1)                                        |{367}   |{17}  |
|java.util.stream.FindOps$FindSink$OfInt$$Lambda$%                                        |{367}   |{68}  |
|org.apache.hadoop.mapred.FileSplit                                                       |{367}   |{1}   |
|org.apache.hadoop.mapred.TextInputFormat                                                 |{367}   |{1}   |
|org.apache.hadoop.mapreduce.lib.input.FileSplit                                          |{367}   |{1}   |
|org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader                        |{367}   |{1}   |
|org.apache.http.conn.routing.RouteTracker                                                |{367}   |{1}   |
|org.glassfish.jersey.internal.util.collection.MultivaluedStringMap                       |{367}   |{1}   |
|sun.security.ec.ECKeyPairGenerator (jdk.crypto.ec@11.0.14.1)                             |{367}   |{1}   |
|sun.security.ec.ECOperations (jdk.crypto.ec@11.0.14.1)                                   |{367}   |{1}   |
|sun.security.ec.point.ProjectivePoint$Immutable (jdk.crypto.ec@11.0.14.1)                |{367}   |{1}   |
|sun.security.ec.point.ProjectivePoint$Mutable (jdk.crypto.ec@11.0.14.1)                  |{367}   |{1}   |
|sun.security.jca.ProviderList$ServiceList (java.base@11.0.14.1)                          |{367}   |{1}   |
|sun.security.jca.ProviderList$ServiceList$% (java.base@11.0.14.1)                        |{367}   |{1}   |
|sun.security.ssl.Authenticator$MacImpl (java.base@11.0.14.1)                             |{367}   |{1}   |
|sun.security.ssl.Authenticator$SSLNullMac (java.base@11.0.14.1)                          |{367}   |{1}   |
|sun.security.ssl.ClientHandshakeContext (java.base@11.0.14.1)                            |{367}   |{1}   |
|sun.security.ssl.ClientHello$ClientHelloMessage (java.base@11.0.14.1)                    |{367}   |{1}   |
|sun.security.ssl.ECDHKeyExchange$ECDHECredentials (java.base@11.0.14.1)                  |{367}   |{1}   |
|sun.security.ssl.ECDHKeyExchange$ECDHEPossession (java.base@11.0.14.1)                   |{367}   |{1}   |
|sun.security.ssl.HandshakeHash$NonCloneableHash (java.base@11.0.14.1)                    |{367}   |{1}   |
|sun.security.ssl.HandshakeHash$T12HandshakeHash (java.base@11.0.14.1)                    |{367}   |{1}   |
|sun.security.ssl.HandshakeOutStream (java.base@11.0.14.1)                                |{367}   |{1}   |
|sun.security.ssl.KeyShareExtension$CHKeyShareSpec (java.base@11.0.14.1)                  |{367}   |{1}   |
|sun.security.ssl.KeyShareExtension$KeyShareEntry (java.base@11.0.14.1)                   |{367}   |{1}   |
|sun.security.ssl.SSLCipher$NullReadCipherGenerator$NullReadCipher (java.base@11.0.14.1)  |{367}   |{1}   |
|sun.security.ssl.SSLCipher$NullWriteCipherGenerator$NullWriteCipher (java.base@11.0.14.1)|{367}   |{1}   |
|sun.security.ssl.SSLExtensions (java.base@11.0.14.1)                                     |{367}   |{1}   |
|sun.security.ssl.ServerNameExtension$CHServerNamesSpec (java.base@11.0.14.1)             |{367}   |{1}   |
|sun.security.ssl.SignatureAlgorithmsExtension$SignatureSchemesSpec (java.base@11.0.14.1) |{367}   |{1}   |
|sun.security.ssl.SupportedGroupsExtension$SupportedGroupsSpec (java.base@11.0.14.1)      |{367}   |{1}   |
|sun.security.ssl.SupportedVersionsExtension$CHSupportedVersionsSpec (java.base@11.0.14.1)|{367}   |{1}   |
|sun.security.ssl.X509Authentication$X509Credentials (java.base@11.0.14.1)                |{367}   |{1}   |
|sun.security.util.math.intpoly.IntegerPolynomial$ImmutableElement (java.base@11.0.14.1)  |{367}   |{1}   |
|sun.security.util.math.intpoly.IntegerPolynomial$Limb (java.base@11.0.14.1)              |{367}   |{1}   |
|sun.security.util.math.intpoly.IntegerPolynomial$MutableElement (java.base@11.0.14.1)    |{367}   |{1}   |
+-----------------------------------------------------------------------------------------+--------+------+

Issue might relates to #9837 or #10386 (airlift upgrade)

cc @dain @sopel39

@guyco33
Copy link
Member Author

guyco33 commented Mar 1, 2022

It might relates to #10408 and to aa5ce5e (

DynamicClassLoader classLoader = new DynamicClassLoader(AccumulatorCompiler.class.getClassLoader());
)

@guyco33
Copy link
Member Author

guyco33 commented Mar 1, 2022

The situation get worse as the JVM gets bigger.
1M class loaders for 86G JVM (124G OS memory) running the same exact queries with the same load on 371

Tue Mar  1 17:30:42 UTC 2022

$ jcmd trino VM.metaspace | head -13 | tail -11
Total Usage - 1181129 loaders, 1205895 classes:
  2451842 chunks,     9.36 GB capacity,     6.97 GB ( 74%) used,     2.25 GB ( 24%) free,   122.28 KB ( <1%) waste,   149.65 MB (  2%) overhead, deallocated: 1285213 blocks with 329.22 MB


Virtual space:
      9.36 GB reserved,       9.36 GB (>99%) committed

Chunk freelist:

$ jcmd trino VM.metaspace | tail -8 | head -5
MaxMetaspaceSize: unlimited
No class space
Initial GC threshold: 20.80 MB
Current GC threshold: 10.01 GB

@guyco33
Copy link
Member Author

guyco33 commented Mar 7, 2022

Update: Issue still exists in 372 even more significantly.
JVM is quickly filled and queries become significantly slower after running for a while.
Running a worker with c5.4xlarge (30G RAM and 19G JVM) crashes by OS with OOM.
Running a worker with r5.4xlarge (124G RAM and 86G JVM) eliminates the OOM crashes but queries become more and more slower.

@guyco33
Copy link
Member Author

guyco33 commented Mar 7, 2022

@dain
Finally, I found that those dynamic class patterns io.trino.$gen.countAccumulator_20220201_094443_924858 retain in JVM and I think it comes around this change

makeClassName(boundSignature.getName() + WindowAccumulator.class.getSimpleName()),

@dain
Copy link
Member

dain commented Mar 7, 2022

I think we are missing a cache around AccumulatorCompiler.generateAccumulatorFactory. I think this used to be cached inside of the FunctionManager before the aggergation refactoring, where I split the specialized aggregation functions from their "combination" into an accumulator factory. Without this we will generate an accumulator factor for ever aggregation usage in every query.

@guyco33
Copy link
Member Author

guyco33 commented Mar 7, 2022

Collecting jmap stat files to s3 and parsing them with TEXTFILE table wrapped by a view, I used the following query to demonstrate the issue:

SELECT trino_version, count(distinct class_name) gens,
       sum(if(class_name like '%Accumulator%',1,0)) gen_accumulators,
       min(ts) first_ts, max(ts) last_ts
FROM hive.temp.jmap_stats_v 
WHERE regexp_like(class_name,'^io[.]trino[.][$]gen[.].+[0-9]{8,8}[_][0-9]{6,6}[_]') 
GROUP BY 1 ORDER BY 1
+-------------+-----+----------------+---------------------------+---------------------------+
|trino_version|gens |gen_accumulators|first_ts                   |last_ts                    |
+-------------+-----+----------------+---------------------------+---------------------------+
|363          |50   |0               |2022-02-17 10:57:44.000 UTC|2022-02-20 06:26:52.000 UTC|
|364          |7    |0               |2022-02-17 10:57:48.000 UTC|2022-02-17 14:36:10.000 UTC|
|366          |754  |186             |2022-02-21 13:15:52.000 UTC|2022-02-24 09:40:55.000 UTC|
|367          |15600|11707           |2022-02-01 08:50:43.000 UTC|2022-02-24 09:40:48.000 UTC|
|369          |74141|56087           |2022-02-02 07:30:01.000 UTC|2022-02-07 08:45:01.000 UTC|
+-------------+-----+----------------+---------------------------+---------------------------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Development

Successfully merging a pull request may close this issue.

2 participants