
Smoketest setup can OOM #869

Closed
andrewazores opened this issue Mar 18, 2022 · 9 comments · Fixed by #888
Assignees: jan-law
Labels: chore (Refactor, rename, cleanup, etc.), good first issue (Good for newcomers), question (Further information is requested)

Comments

andrewazores (Member) commented Mar 18, 2022

I created some recordings with the web UI, then tested some queries with curl. The backend keeps terminating after the second or third graphql query. Are there any other logs I can check to find out what happened?

POST /api/beta/graphql HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 621
Content-Type: application/json
Host: localhost:8181
User-Agent: HTTPie/3.0.2

{
    "query": "query {\n    environmentNodes(filter: { name: \"JDP\" }) {\n        name\n        nodeType\n        descendantTargets {\n            recordings {\n                active(filter: { continuous: true, templateLabel: \"template=Cryostat,type=TARGET\" }) {\n                    name\n                    state\n                    metadata {\n                        labels\n                    }\n                }\n                archived {\n                    name\n                    metadata {\n                        labels\n                    }\n                }\n            }\n        }\n    }\n}\n"
}



http: error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) while doing a POST request to URL: http://localhost:8181/api/beta/graphql
INFO: (10.0.2.100:51816): GET /api/v1/targets/service%3Ajmx%3Armi%3A%2F%2F%2Fjndi%2Frmi%3A%2F%2Fcryostat%3A9091%2Fjmxrmi/recordings 200 21ms
Mar 18, 2022 5:47:30 PM io.cryostat.core.log.Logger info
INFO: (10.0.2.100:51810): GET /api/v1/recordings 200 28ms
Mar 18, 2022 5:47:40 PM io.cryostat.core.log.Logger info
INFO: Removing cached connection for service:jmx:rmi:///jndi/rmi://cryostat:9091/jmxrmi: EXPIRED
Mar 18, 2022 5:47:40 PM io.cryostat.core.log.Logger info
INFO: Connection for service:jmx:rmi:///jndi/rmi://cryostat:9091/jmxrmi closed
Mar 18, 2022 5:48:45 PM io.cryostat.core.log.Logger info
INFO: Creating connection for service:jmx:rmi:///jndi/rmi://cryostat:10000/jmxrmi
+ cleanup
+ podman pod kill cryostat
cab957e3c614d9e0bafd13451c105f400dc7ffc19cf78aec08744ce478ec993a
+ podman pod rm cryostat
cab957e3c614d9e0bafd13451c105f400dc7ffc19cf78aec08744ce478ec993a

Originally posted by @jan-law in https://github.com/cryostatio/cryostat/pull/868#discussion_r830231817

andrewazores added the chore and good first issue labels and removed the chore label Mar 18, 2022
andrewazores (Member Author) commented:

  1. It would be nice to do some investigation into what's actually taking up memory within the Cryostat container and see whether there are any resources we're holding open too long, caches we aren't clearing and therefore leaking memory into, etc. Cryostat itself can be used for this, since we can get all kinds of data about it through JFR profiling.
  2. The 512MB limit is quite small for a JVM anyway, so it's probably reasonable to bump the limit in run.sh from 512MB to 768MB or 1GB (a sketch of what that might look like follows this list).
  3. The pod is named cryostat and the container within it just gets a generated name. It would be clearer and more convenient if they had fixed names like cryostat-pod and cryostat.
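
A minimal sketch of points 2 and 3, assuming run.sh creates the pod and container with podman; the image reference and published port below are illustrative, only the fixed names and the --memory value are the point:

    # give the pod and container fixed names instead of relying on a generated container name
    podman pod create --name cryostat-pod --publish 8181:8181

    # bump the container memory limit from 512m to 768m
    podman run --name cryostat --pod cryostat-pod --memory 768m quay.io/cryostat/cryostat:latest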

andrewazores added the question and chore labels Mar 18, 2022
andrewazores (Member Author) commented:

  1. I did some profiling investigation myself and there is no immediately visible cause for high memory usage - no Old Object Samples indicating things being held onto, no unusually large or frequent allocations. I have been unable to reproduce the OOM or whatever failure it is that @jan-law is experiencing.
  2. At initial startup, when all we have done is fire up the JVM and Vert.x, we're already sitting at ~340MB. After a couple of warmup requests and what appears to be a GC cycle I observe it closer to ~430MB, but often (as expected of the GC behaviour) it appears to sit just barely below the 536MB memory limit (512MiB expressed in base-10) and only reclaim what memory is needed as it goes. No major GC was recorded in the JFR profiling and no big STW pause. Regardless, a 512MiB container limit seems to leave fairly little overhead; one way to watch this is sketched below.
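
For anyone trying to reproduce this, one way to watch the container-level usage and the JVM heap side by side (a sketch; it assumes jcmd is available inside the container image, and <container>/<pid> are placeholders since the smoketest container currently gets a generated name):

    # container memory usage vs. the configured limit
    podman stats --no-stream

    # JVM-level view of the heap from inside the container
    # (running `jcmd` with no arguments inside the container lists JVM pids)
    podman exec -it <container> jcmd <pid> GC.heap_info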

jan-law self-assigned this Mar 30, 2022

jan-law (Contributor) commented Apr 1, 2022

it appears to be sitting just barely below the 536MB (base10 vs base2) memory limit and only reclaiming what memory is needed as it goes.

Revisiting this, I also haven't been able to reproduce another OOM since I opened this issue. The memory usage is still very close to the 536MB limit though, so I can increase the limit to 768MB.

After I started two batches of recordings and archived them at ~4:43:30, my profiling recording shows Cryostat's heap slowly increasing even though I made no further API calls. Is this expected behaviour due to the additional recording information being stored in Cryostat?

[screenshot: JFR Memory chart showing Cryostat's heap slowly increasing after the archiving at ~4:43:30]

andrewazores (Member Author) commented:

That shouldn't really cause any real increase in memory consumption. The graph looks a little suspect with such a smooth, consistent upward slope, and there is no data on the other charts either - is there just missing data overall between those points?

Either way, the increased memory usage over time could be allocations from background tasks that Cryostat is processing, like periodic rule archivers or JDP discovery. If there is still enough free memory available to the process, then you won't see that heap usage drop and the memory reclaimed until the garbage collector decides to run.
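
One way to tell a lazily-collected heap from a genuine leak is to request a collection and see whether occupancy falls back down (a sketch; <container> and <pid> are placeholders, and jcmd is assumed to be available in the container image):

    podman exec -it <container> jcmd <pid> GC.heap_info   # occupancy before
    podman exec -it <container> jcmd <pid> GC.run         # request a full GC
    podman exec -it <container> jcmd <pid> GC.heap_info   # if the slow rise was just garbage, this should drop back down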

jan-law (Contributor) commented Apr 1, 2022

is there just missing data overall between those points?

Either way, the increased memory usage over time could be allocations occurring for background tasks that Cryostat is processing

Yeah, there are no other data points on that slope. The rest of the Memory charts show normal behaviour too.

andrewazores (Member Author) commented:

Okay. Sounds like something for us to keep an eye on at least. It would also be good if we can spend a bit more time doing some heap analysis before release and make sure we aren't leaking memory or doing anything silly that generates a lot of garbage.
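
For that kind of heap analysis, one option is to capture a heap dump from the running container and open it in a heap analyzer such as Eclipse MAT or VisualVM (a sketch; <container>, <pid>, and the dump path are placeholders):

    # write an hprof heap dump inside the container, then copy it out for analysis
    podman exec -it <container> jcmd <pid> GC.heap_dump /tmp/cryostat-heap.hprof
    podman cp <container>:/tmp/cryostat-heap.hprof .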

jan-law (Contributor) commented Apr 7, 2022

I started a profiling recording for ~1 hour after making a few requests with GraphQL and auto rules. Besides the spikes from my requests, the heap usage stays at ~48 MiB. Heap usage looks normal to me.

Most of the allocations are coming from byte[], Object[] and char[]. Stack traces show that most of these were caused by the vert.x-eventloop-thread-3 reading data from JMX connections. Allocations were also caused by parsing the XML template when starting a recording, loading classes when opening the web client, and starting up GraphQL. None of the Old Object Samples show any objects that might be alive for longer than they should be.
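
For reference, the same allocation and Old Object Sample events can also be inspected from the command line with the jfr tool shipped with recent JDKs (a sketch; recording.jfr is a placeholder for the downloaded profiling recording):

    # overview of event counts in the recording
    jfr summary recording.jfr

    # allocation samples and old-object samples, with stack traces
    jfr print --events jdk.ObjectAllocationInNewTLAB,jdk.ObjectAllocationOutsideTLAB recording.jfr
    jfr print --events jdk.OldObjectSample recording.jfr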

When comparing memory usage with ParallelGC vs G1, I'm seeing similar CPU and memory usage with similar GC frequency and pause times. Not sure if there are any other incentives for switching to ParallelGC.
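
For anyone repeating that comparison, the collectors are selected with standard HotSpot flags, and unified GC logging makes the frequency and pause times easy to compare. How Cryostat is actually launched in the container is not shown here, so the java invocation below is only illustrative and "..." stands in for the rest of the launch command:

    # run once with each collector and compare the GC logs
    java -XX:+UseG1GC       -Xlog:gc*:file=/tmp/gc-g1.log       ...
    java -XX:+UseParallelGC -Xlog:gc*:file=/tmp/gc-parallel.log ...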

Sometimes I get this automated analysis result for TLABs. Is there anything we can do about this?

    The thread performing the most allocation is likely 'vert.x-eventloop-thread-3'. This is the most common allocation path for that class: null. Many allocations performed by the same thread might indicate a problem in a multi-threaded program. Look at the aggregated stack traces for the thread with the highest allocation rate. See if the allocation rate can be brought down, or balanced among the active threads.

jan-law (Contributor) commented Apr 12, 2022

Revisiting this, I think you can trigger an OOM if you open the web client on localhost:9000, start an automated rule that periodically archives recordings, then use a GraphQL query to start a recording on all targets at the moment when the auto rule is in the middle of archiving several recordings. Is it worth increasing the container memory limit to 768M for this scenario?

andrewazores (Member Author) commented:

Probably. That isn't a very far-fetched scenario in a real deployment that actually sees some use. If that sequence can take it down then I'm sure there are many other similar sequences that can, too, with similar low traffic levels.
