
Increased build-time RSS usage in micrometer-prometheus integration test after moving Kubernetes to @ConfigMapping #44216

Closed
zakkak opened this issue Oct 31, 2024 · 27 comments · Fixed by #44448

Comments

@zakkak
Contributor

zakkak commented Oct 31, 2024

Description

Since #43862 the build-time RSS usage of native-image builds has increased in the micrometer-prometheus integration test (note that I haven't observed similar changes in other tests).

Image

The change didn't have any negative effects on the native executable size or on the reachable types/methods; on the contrary, it decreased the total image size by ~400KB.

The purpose of this issue is to raise awareness of this increase and keep track of it.

Note

quarkusio/quarkus-images#282 was also merged on the 24th of October, but I believe the issue is related to a Quarkus change, since it popped up in the upstream oracle/graal nightly builds that do not use the Quarkus builder images.


Note from @gsmet: here is another report, posted by @mocenas, that comes with a reproducer: #44436

Compiling a Quarkus app into a native executable using a Docker image requires a lot more memory since Quarkus started moving to @ConfigMapping.

We have an example where our testing app used to work with xmx=5g for native compilation a month ago, but since #43862 and #44140 it requires at least 9g to compile successfully (otherwise compilation fails with an OutOfMemoryError).

Almost doubling the memory requirement just by moving to @ConfigMapping is alarming.

At this moment it is somewhat manageable, but with more changes planned in #44079 it is hard to say how much memory will be required in the end.


quarkus-bot bot commented Oct 31, 2024

/cc @brunobat (micrometer), @ebullient (micrometer), @geoand (kubernetes), @iocanel (kubernetes), @radcortez (config)

@michalvavrik
Member

#44220 was closed as a duplicate, but our application cannot be built into a native executable at all, while what this issue says is: "The purpose of this issue is to raise awareness of this increase and keep track of it."

Are you sure this is an acceptable solution? If I can no longer build the application at all on my workstation, which is very powerful, is that an appropriate action?

@gsmet
Member

gsmet commented Oct 31, 2024

From what I can see, you have an Xmx build limit:

  • 4.44GB of memory (14.3% of 31.10GB system memory, set via '-Xmx5g')

Try to raise that and see how it goes. Not saying it's not an issue that needs fixing or at least diagnosing.

@zakkak do you have any idea how we could understand what's going on?

@michalvavrik
Member

michalvavrik commented Oct 31, 2024

From what I can see, you have an Xmx build limit:

  • 4.44GB of memory (14.3% of 31.10GB system memory, set via '-Xmx5g')

Try to raise that and see how it goes.

Setting it to 7g did the trick. I'll experiment to see if I can set it even lower. Thank you

@michalvavrik
Member

Setting quarkus.native.native-image-xmx to 6g fails with an OOM.

@zakkak
Contributor Author

zakkak commented Oct 31, 2024

@zakkak do you have any idea how we could understand what's going on?

One would need to attach a Java profiler to native-image (it's a Java application) and see what's causing the additional allocations. Note that this might not be trivial though, as the increased number of allocations could be the result of some new call paths being explored by the analysis.
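
For reference, here is a hedged sketch of how one might capture such a profile, assuming a builder JVM that supports JDK Flight Recorder and a build where the recording file is easy to retrieve (e.g. a non-container build). The flag is forwarded to the JVM running the image builder via native-image's -J prefix:

mvn clean package -DskipTests -Dnative -Dquarkus.native.additional-build-args=-J-XX:StartFlightRecording=filename=native-image-build.jfr

The resulting native-image-build.jfr can then be opened in JDK Mission Control to inspect allocation profiles. If the build runs in a container, the recording ends up inside the build container, so a local build is easier to work with.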

@radcortez
Member

I wonder if we were close to the limit and the change tipped it over. Mappings do add an additional class per root and group.

@michalvavrik
Member

michalvavrik commented Nov 4, 2024

I could have misunderstood this, but in the case of the QE TS many-extensions module it used to work with 5g and now 6g is not enough, so "close" is a fuzzy term; you still may be right. BTW, several other modules were affected as well; I just had to raise the limit from 4g to 5g for a quite simple Funqy serverless native test.

@radcortez
Member

Are all the modules using the kubernetes extension?

@michalvavrik
Member

Yes, all of them use kubernetes or openshift extensions.

@rsvoboda
Member

I experimented on my MBP M3 with xmx set to 8g. Quarkus 3.16, 3.15, and 3.14 can handle the "many extensions" module from https://github.com/quarkus-qe/beefy-scenarios with a peak RSS of ~5GB, while Quarkus main (7a6b0eea9b8) utilizes all the available memory and spends massive time on GC activity.

sdk use java 23.1.5.r21-mandrel
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=999-SNAPSHOT -Dquarkus-plugin.version=999-SNAPSHOT | grep -e "Peak RSS" -e "Finished generating"
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=3.16.2 -Dquarkus-plugin.version=3.16.2 | grep -e "Peak RSS" -e "Finished generating"
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=3.15.1 -Dquarkus-plugin.version=3.15.1 | grep -e "Peak RSS" -e "Finished generating"
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=3.14.4 -Dquarkus-plugin.version=3.14.4 | grep -e "Peak RSS" -e "Finished generating"


Using java version 23.1.5.r21-mandrel in this shell.
                       48.2s (45.7% of total time) in 279 GCs | Peak RSS: 7.47GB | CPU load: 8.25
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 1m 45s.
                        5.2s (8.9% of total time) in 499 GCs | Peak RSS: 5.30GB | CPU load: 7.84
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 58.7s.
                       6.1s (10.1% of total time) in 936 GCs | Peak RSS: 4.93GB | CPU load: 7.81
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 1m 0s.
                        5.7s (9.8% of total time) in 742 GCs | Peak RSS: 4.88GB | CPU load: 7.89
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 58.2s.

@maxandersen @cescoffier, could you please assign somebody to investigate more in detail what causes such memory demand when building a native image?

Set of defined extensions: https://github.com/quarkus-qe/beefy-scenarios/blob/main/003-quarkus-many-extensions/pom.xml

Side note: I also experimented with different versions of Mandrel (23.1.4.r21-mandrel, 23.1.5.r21-mandrel, 24.0.2.r22-mandrel, 24.1.1.r23-mandrel). I noticed a difference between 23.1.4.r21-mandrel and 23.1.5.r21-mandrel, where the .4 version needed more memory and more GC cycles.

@rsvoboda
Member

The thing to investigate more deeply is the kubernetes extension, as hinted in the comments above.

Peak RSS goes down to the ~5 GB level with 999-SNAPSHOT when the kubernetes extension is removed from 003-quarkus-many-extensions/pom.xml:

sdk use java 23.1.5.r21-mandrel
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=999-SNAPSHOT -Dquarkus-plugin.version=999-SNAPSHOT | grep -e "Peak RSS" -e "Finished generating"
mvn clean package -DskipTests -Dnative -Dquarkus.native.native-image-xmx=8g -f '003-quarkus-many-extensions' -Dquarkus.platform.version=3.16.2 -Dquarkus-plugin.version=3.16.2 | grep -e "Peak RSS" -e "Finished generating"

Using java version 23.1.5.r21-mandrel in this shell.
                       7.8s (10.9% of total time) in 1372 GCs | Peak RSS: 4.95GB | CPU load: 7.88
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 1m 11s.
                       7.0s (10.3% of total time) in 1264 GCs | Peak RSS: 4.48GB | CPU load: 7.80
Finished generating '003-quarkus-many-extensions-1.0.0-SNAPSHOT-runner' in 1m 7s.

@gsmet
Member

gsmet commented Nov 12, 2024

I think we will need some help from the Mandrel team here (@zakkak @galderz @Karm), as something in the new code seems to trigger an unwanted behavior in Mandrel/GraalVM, and we will need some Mandrel/GraalVM knowledge to figure out what's going on.

@radcortez
Member

Something to consider is that we never noticed such behavior with other extensions using mappings, and we have already used them for some time. So, it may be something only related to the Kubernetes configuration.

I'm going to try to investigate, but I agree with @gsmet that it would be helpful if someone from Mandrel also helped.

@zakkak
Contributor Author

zakkak commented Nov 12, 2024

@radcortez I am happy to work with you on this. I have it on my TODO list anyway, but I need to find some free time for it. Let's sync on Zulip. Note: anyone else willing to jump in is very much welcome on my side :)

@radcortez
Member

Sounds good :)

@franz1981
Contributor

franz1981 commented Nov 12, 2024

@rsvoboda
https://github.com/quarkus-qe/beefy-scenarios
what is it and what is it used for? I wasn't aware of this.

@rsvoboda
Member

https://github.com/quarkus-qe/beefy-scenarios is test coverage that we built during the RHBQ 1.x era (together with https://github.com/quarkus-qe/quarkus-openshift-test-suite). With RHBQ 2+ we moved to https://github.com/quarkus-qe/quarkus-test-suite (with https://github.com/quarkus-qe/quarkus-test-framework/), which helped us simplify test development for both bare-metal and OpenShift target environments.

https://github.com/quarkus-qe/beefy-scenarios is kept around for continuous community testing with Quarkus main; it's also part of the ecosystem CI / https://status.quarkus.io.

@radcortez
Member

radcortez commented Nov 14, 2024

Ok, I think I've got something:

First, one thing to consider is that changing the config classes is not an issue directly. In fact, in this particular case, the config classes are build-time only and are not even present during the native image build. So, what is the issue?

In our generated startup Config, we match the configuration to determine if it is build time or runtime to warn about possible issues. As you know, build-time configuration may be present in application.properties, but we have to ignore it. If we don't have the config classes because they are in the deployment module, we need to do it another way.

The old config implementation would generate a method similar to this, which either ignores keys or populates the config object:

    private static void siParseKey_quarkus_test_container/* $FF was: siParseKey:quarkus:test:container*/(SmallRyeConfig var0, NameIterator var1) {
        if (!var1.hasNext()) {
            String var2 = var1.getName();
            ((HashSet)unknown).add(var2);
        } else if (!var1.nextSegmentEquals("additional-exposed-ports")) {
            if (!var1.nextSegmentEquals("labels")) {
                if (!var1.nextSegmentEquals("network")) {
                    if (!var1.nextSegmentEquals("volume-mounts")) {
                        String var3 = var1.getName();
                        ((HashSet)unknown).add(var3);
                    } else {
                        var1.next();
                        siParseKey:quarkus:test:container:volume-mounts(var0, var1);
                    }
                } else {
                    var1.next();
                    emptyParseKey(var0, var1);
                }
            } else {
                var1.next();
                siParseKey:quarkus:test:container:labels(var0, var1);
            }
        } else {
            var1.next();
            siParseKey:quarkus:test:container:additional-exposed-ports(var0, var1);
        }
    }

The new implementation instead adds the names to a KeyMap to do the matching, because it is SmallRyeConfig that populates the object; we only need to know whether a build-time property is matched, and ignore it:

        KeyMap var0 = new KeyMap();
        var0.findOrAdd("quarkus.knative.rbac.roles.*.policy-rules.*.resource-names").putRootValue(true);
        var0.findOrAdd("quarkus.knative.init-containers.*.env-vars.*.value").putRootValue(true);
        var0.findOrAdd("quarkus.rest-client.proxy-address").putRootValue(true);
        var0.findOrAdd("quarkus.kubernetes.init-containers.*.mounts.*.name").putRootValue(true);
        var0.findOrAdd("quarkus.kubernetes.app-secret").putRootValue(true);
        var0.findOrAdd("quarkus.kubernetes.prometheus.port").putRootValue(true);
        ...

So, when config is moved from the old implementation to the new one, the population method is deleted, and new entries are added to the KeyMap.

I was already aware that KeyMap had some runtime performance issues. Still, those were not very noticeable because the number of entries was low. With the continuous move to the new implementation, the entries started to grow. Apparently, this also greatly impacts the memory of the image build (which I was not aware of, but due to the large number of Kubernetes config entries, it became very noticeable).
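
For context, here is a rough sketch of why this can get expensive, assuming KeyMap behaves more or less like a trie of maps keyed per dotted-name segment; the class below is illustrative only (it is not the actual SmallRye Config source), though findOrAdd and putRootValue mirror the calls in the generated code above:

    import java.util.HashMap;

    // Illustrative only: a KeyMap-like structure sketched as a nested map,
    // one level per dotted-name segment. "*" is stored as a plain segment key;
    // the wildcard matching logic itself is omitted here.
    class SegmentTrie<V> extends HashMap<String, SegmentTrie<V>> {
        V rootValue;

        SegmentTrie<V> findOrAdd(String dottedName) {
            SegmentTrie<V> node = this;
            for (String segment : dottedName.split("\\.")) {
                // each previously unseen segment allocates another map node
                node = node.computeIfAbsent(segment, s -> new SegmentTrie<>());
            }
            return node;
        }

        void putRootValue(V value) {
            rootValue = value;
        }
    }

Under that assumption, every findOrAdd for a long quarkus.kubernetes.* name allocates a chain of map nodes, one per segment plus wildcard branches, so hundreds of Kubernetes entries turn into a large graph of small objects, which would be consistent with the extra image-build memory observed above.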

I already had a more lightweight replacement implemented in https://github.com/smallrye/smallrye-config/blob/3edebd4b38f85aaaf3c1872845a2add89b6374ad/implementation/src/main/java/io/smallrye/config/PropertyName.java, and replaced the KeyMap match in Quarkus with #44448.

I was able to observe the memory increase by building the native image for the case reported in #44216, and I do see the memory decrease with #44448:

Current:

17.1s (13.0% of total time) in 178 GCs | Peak RSS: 8.05GB | CPU load: 10.12

With #44448:

9.3s (8.6% of total time) in 156 GCs | Peak RSS: 6.62GB | CPU load: 10.20

@zakkak @michalvavrik, can you try that and see if you notice any improvement? Thanks!

@geoand
Contributor

geoand commented Nov 14, 2024

Very nice @radcortez!

@zakkak
Contributor Author

zakkak commented Nov 14, 2024

Great work @radcortez!

I confirm that the memory usage has decreased significantly

Before

Image

After (with #44448)

Image

@zakkak zakkak linked a pull request Nov 14, 2024 that will close this issue
@radcortez
Member

Excellent.

I was now trying to compare with before (#44216) to see if we are at the same level or if there is still some degradation. I do get some minor differences, but maybe that is only noise.

BTW, for additional reference, the implementation of #44448 is something like:

   private static Set generateMappedProperties() {
      HashSet var0 = new HashSet();
      PropertyName var1 = new PropertyName("quarkus.config-tracking.file");
      var0.add(var1);
      PropertyName var2 = new PropertyName("quarkus.tls.*.trust-store.pem.certs");
      var0.add(var2);
      PropertyName var3 = new PropertyName("quarkus.log.socket.filter");
      var0.add(var3);
      PropertyName var4 = new PropertyName("quarkus.vertx.resolver.rotate-servers");
      var0.add(var4);
      PropertyName var5 = new PropertyName("quarkus.vertx.cluster.public-port");
      var0.add(var5);
      PropertyName var6 = new PropertyName("quarkus.log.file.async");
      var0.add(var6);
      PropertyName var7 = new PropertyName("quarkus.log.handler.syslog.*.syslog-type");
      ...

The check just calls contains on the Set to match the name.
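
To illustrate the idea, here is a minimal sketch; it is not the actual io.smallrye.config.PropertyName, and the hashing scheme is a deliberate simplification. The trick is a name type whose equals() treats "*" as matching any single segment and whose hashCode() is wildcard-insensitive, so a plain HashSet.contains with a concrete runtime name can hit a wildcard pattern:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical stand-in for PropertyName, for illustration only.
    final class MappedName {
        private final String[] segments;

        MappedName(String name) {
            this.segments = name.split("\\.");
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MappedName)) {
                return false;
            }
            MappedName other = (MappedName) o;
            if (segments.length != other.segments.length) {
                return false;
            }
            for (int i = 0; i < segments.length; i++) {
                // "*" on either side matches any concrete segment
                if (!segments[i].equals(other.segments[i])
                        && !"*".equals(segments[i]) && !"*".equals(other.segments[i])) {
                    return false;
                }
            }
            return true;
        }

        @Override
        public int hashCode() {
            // deliberately coarse: only the segment count, so wildcard patterns
            // and concrete names collide in the same bucket
            return segments.length;
        }

        public static void main(String[] args) {
            Set<MappedName> mapped = new HashSet<>();
            mapped.add(new MappedName("quarkus.tls.*.trust-store.pem.certs"));
            mapped.add(new MappedName("quarkus.config-tracking.file"));

            System.out.println(mapped.contains(new MappedName("quarkus.tls.my-client.trust-store.pem.certs"))); // true
            System.out.println(mapped.contains(new MappedName("quarkus.tls.my-client.key-store.pem.keys")));    // false
        }
    }

The real PropertyName may hash differently, but the contains-based check in the generated method above presumably relies on exactly this kind of wildcard-aware equality, which is much cheaper to generate and store than the per-segment KeyMap entries.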

@gsmet
Member

gsmet commented Nov 14, 2024

And is the Set stored in the image heap for native images, or do you generate it even for a native image?

@radcortez
Member

It is stored, because we need it for both phases.

I'm still unsure if this is the best approach (I'm still playing with it), because after we start and validate we don't need it anymore.

@rsvoboda
Member

Just for the record, Peak RSS is at the same level as 3.16.2 with the #44448 PR for the case mentioned in #44216 (comment). @michalvavrik, no need to do additional verification.

@zakkak
Contributor Author

zakkak commented Nov 18, 2024

FTR the metrics are back to the original values more or less:

Image

@michalvavrik
Member

That's fantastic. Thanks everyone

@gsmet gsmet modified the milestones: 3.18 - main, 3.17.0 Nov 19, 2024