
OutOfMemoryError when NASA validate v3.5.2 is executed through a library for a batch of products #979

Closed
daniel-rincon-garcia opened this issue Aug 23, 2024 · 8 comments · Fixed by #1012


daniel-rincon-garcia commented Aug 23, 2024

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When NASA validate v3.5.2 is executed through a JAR library for a batch of products in a Java 17 project, I noticed that the heap memory is not released after each product-validation iteration, so once the library has validated a number of products an OutOfMemoryError is raised.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.catalina.core.StandardServer.startPeriodicLifecycleEvent(StandardServer.java:937)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space

I've seen a similar issue (#826), but this one seems a bit different.

🕵️ Expected behavior

I'd expect the memory to be released after each product-validation iteration, or at least before the heap limit is reached, so that the out-of-memory exception is avoided.

📜 To Reproduce

  1. Create a Java 17 project that uses NASA validate v3.5.2 as a JAR project library.
  2. Create a ValidateLauncher object and a String array to be used as the arguments for the ValidateLauncher object:
final ValidateLauncher launcher = new ValidateLauncher();
String[] arguments = {
        "-C", "pds4-validate-catalog.xml",
        "-e", "xml",
        "-E", String.valueOf(Long.MAX_VALUE),
        "-s", "json",
        "--add-context-products", "local_context_products.json",
        "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
        "-r", "/tmp/PDS4_VALIDATION_REPORTS/pds_validate_report_bc_mtm_mcam_1724407870771_b296d73d-9536-4ac3-baee-9b40faf4fab3.json"
};

pds4-validate-catalog.zip
local_context_products.json
cam_raw_sc_cam3_20210812t041402_61_f__t0005.zip
  3. Call launcher.processMain(arguments); and release the object:

            launcher.flushValidators();
            CrossLabelFileAreaReferenceChecker.reset();
  4. Repeat steps 2 and 3 several times and check the heap memory usage (you can use the VisualVM tool to analyze it); a consolidated sketch of this loop follows below.
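The steps above can be combined into a single loop. Here is a minimal sketch, assuming the validate v3.5.2 JAR is on the classpath; the import paths, the iteration count, and the simplified report path are assumptions to adapt to your own project:

// Consolidated reproduction sketch for steps 1-4. The import paths below are
// assumptions — check the package names in your copy of the validate JAR, as
// they can vary between releases.
import gov.nasa.pds.validate.ValidateLauncher;                          // assumed package
import gov.nasa.pds.tools.validate.CrossLabelFileAreaReferenceChecker;  // assumed package

public class ValidateLeakRepro {
    public static void main(String[] argv) throws Exception {
        String[] arguments = {
                "-C", "pds4-validate-catalog.xml",
                "-e", "xml",
                "-E", String.valueOf(Long.MAX_VALUE),
                "-s", "json",
                "--add-context-products", "local_context_products.json",
                "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
                "-r", "/tmp/PDS4_VALIDATION_REPORTS/report.json"   // report path simplified
        };
        for (int i = 0; i < 50; i++) {                   // repeat and watch the heap in VisualVM
            final ValidateLauncher launcher = new ValidateLauncher();
            launcher.processMain(arguments);
            launcher.flushValidators();                  // attempted cleanup, per the thread
            CrossLabelFileAreaReferenceChecker.reset();  // attempted cleanup, per the thread
        }
    }
}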

🖥 Environment Info

  • Version of this software v3.5.2
  • Operating System: Ubuntu 20.04.6 LTS
  • Kernel: Linux 5.4.0-193-generic
  • Architecture: x86-64
  • Java version: 17

📚 Version of Software Used

Version 3.5.2

🩺 Test Data / Additional context

VisualVM v1.4.3 was used to analyze the heap memory and, as the capture below shows, the heap memory is not released:

[VisualVM screenshot: heap usage climbing across iterations without being released]

Finally, an out-of-memory exception is raised.

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

@jordanpadams
Member

@daniel-rincon-garcia thanks for profiling this for us! We will take a look; there must be some objects being left open.

@jordanpadams
Member

@daniel-rincon-garcia in the interim, I would look to increase your heap size to support your validation needs. Will let you know when we have a fix ready.
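When raising the heap, it can help to confirm how much memory the JVM actually received. Below is a minimal sketch using the standard Runtime API; the -Xmx figure in the comment is only an illustrative value, not a recommendation:

// Launch with a larger heap, e.g.:  java -Xmx8g -cp <classpath> HeapCheck
// (8g is an illustrative figure; size it to your validation batch.)
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();   // heap ceiling set by -Xmx
        System.out.printf("Max heap available to this JVM: %.1f GiB%n",
                maxBytes / (1024.0 * 1024.0 * 1024.0));
    }
}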

@jordanpadams
Member

@al-niessner Registry / API work takes precedence, but this is top priority for validate

@jordanpadams jordanpadams removed their assignment Aug 24, 2024
@al-niessner
Contributor

@jordanpadams

Yes, validate's memory leak is well known, and the rest of this reply is not pleasant to read either. validate is meant (in its current design and implementation) to be restarted for each batch of items; in other words, it leaks memory, so exiting the process is the best cleanup. One user has 2+ million labels and had to increase the heap size to 64 GB or more to get the job done. Making the heap large enough to accommodate the full job, or spawning validate as a new Java process, are unsatisfying but the shortest and most robust paths to a solution.

The regression/unit test makes a minor attempt to clean up behind itself (the cleanup calls you list), but it is only good enough to run through the suite of tests; in other words, those calls are not perfect and are no substitute for shutting validate down and restarting it. Part of the issue is that some third-party units, like Schematron, have their own permanent memory (which leaks across labels because it is never released) that is outside the control of validate. Given the number of labels that some users process, any of these leaks will eventually become dominant and noticeable.

The predominant memory leak comes from validate's underlying philosophy. It assumes that all the labels it is launched on belong to a cohort that needs to be cross-analyzed. There are lots of caches, like the Schematron cache, used to prevent long operations, such as downloading the Schematron, from dominating the runtime. validate itself does the same with content from the label: the lidvid, and the URL to fetch it, are retained. These then get stored in forever-lists that are never cleaned up, because if more labels are found those cross-links are thought to be necessary. Therefore, validate does not reclaim them until all labels have been found, which by definition happens when validate exits. Not a good design for your use case, but it is the one that exists, and it is completely unambiguous about when to reclaim all memory.
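To make the "spawning validate as a new Java process" option concrete, here is a minimal sketch built on the standard ProcessBuilder API; the JAR path and argument list are placeholders, and the executable-JAR invocation is an assumption about how your validate distribution is packaged:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FreshJvmValidate {
    // Runs validate in a child JVM, so every cache it builds up
    // (Schematron, lidvid cross-links, ...) dies with the process.
    static int runValidate(String validateJar, List<String> validateArgs)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(validateJar);          // placeholder path to the validate JAR
        cmd.addAll(validateArgs);      // e.g. "-t", "<label>.xml", "-r", "<report>.json"
        Process child = new ProcessBuilder(cmd)
                .inheritIO()           // forward validate's output to this console
                .start();
        return child.waitFor();        // exit code from the child JVM
    }
}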

@daniel-rincon-garcia
Author

daniel-rincon-garcia commented Aug 26, 2024

Thank you both very much for the detailed explanation of the problem and for the suggestion to increase the heap size. For the moment, we restart the process each time to avoid the memory leak.

@msbentley

msbentley commented Aug 26, 2024

I wonder if some combination of switches could help in our case? In particular I would note that due to the "operational" way in which we are archiving we:

  • never go online to read labels (always local and referenced by a catalogue file)
  • don't care about referential integrity checks etc. (we are only validating individual products, we are not doing collection or bundle validation).

It sounds like our use case could in principle avoid many of the issues above, provided we use the right switches, and provided those switches actually avoid those issues at the software level?

Edit: The only thing I can see right now is --skip-context-reference-check since by default the rule is already pds4.label, so perhaps there is not much to gain here?

@al-niessner
Contributor

Restart is the more robust approach, and it is also the preferred one. If you are trying to clear all caches for memory performance, then restarting adds no appreciable startup overhead compared to clearing memory in-process: you are going to reload everything from the network or disk either way.

Whether you download from the network or read from disk, the third-party libraries always cache. Reading from disk will always save you time because the bus is faster, but it makes no difference in memory usage.

The switch that skips reference checking does exactly that and only that. It does not prevent the collection of the information used for reference checking, because that information is also needed for reporting and for aggregate labels. Again, it may save you some time but will not have much, if any, impact on memory usage.

@jordanpadams
Copy link
Member

@msbentley @daniel-rincon-garcia we will work to improve our documentation here as well, to note that it is recommended to batch validate executions to avoid this out-of-memory issue. The size of each batch depends on the amount of memory you are able to allocate to the JVM.
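To illustrate the batching recommendation, here is a sketch that splits a list of labels into fixed-size chunks and runs one fresh validate process per chunk; the batch size, the multi-target use of -t, and the runValidate helper from the earlier sketch are assumptions to tune against your JVM heap:

import java.util.ArrayList;
import java.util.List;

public class BatchedValidation {
    static void validateInBatches(String validateJar, List<String> labels, int batchSize)
            throws Exception {
        for (int start = 0; start < labels.size(); start += batchSize) {
            List<String> chunk =
                    labels.subList(start, Math.min(start + batchSize, labels.size()));
            List<String> args = new ArrayList<>();
            args.add("-t");            // assumes -t accepts multiple targets
            args.addAll(chunk);
            // One child JVM per chunk: its memory is reclaimed when it exits.
            FreshJvmValidate.runValidate(validateJar, args);
        }
    }
}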
