
OutOfMemoryError when NASA validate v3.5.2 is executed through a library for a batch of products #979

Closed
daniel-rincon-garcia opened this issue Aug 23, 2024 · 8 comments · Fixed by #1012


daniel-rincon-garcia commented Aug 23, 2024

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When NASA validate v3.5.2 is executed through a JAR library for a batch of products in a Java 17 project, I noticed that the heap memory is not released after each product-validation iteration, so once the library has validated a number of products an OutOfMemoryError is raised.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.catalina.core.StandardServer.startPeriodicLifecycleEvent(StandardServer.java:937)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space

I've seen a similar issue (#826), but this one seems a bit different.

🕵️ Expected behavior

I'd expect the memory to be released after each product-validation iteration, or at least before the heap limit is reached, so that the out-of-memory exception is avoided.

📜 To Reproduce

  1. Create a Java 17 project that uses NASA validate v3.5.2 as a JAR project library.
  2. Create a ValidateLauncher object and a String array to be used as the arguments for the ValidateLauncher object:
final ValidateLauncher launcher = new ValidateLauncher();
String[] arguments = {
        "-C", "pds4-validate-catalog.xml",
        "-e", "xml",
        "-E", String.valueOf(Long.MAX_VALUE),
        "-s", "json",
        "--add-context-products", "local_context_products.json",
        "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
        "-r", "/tmp/PDS4_VALIDATION_REPORTS/pds_validate_report_bc_mtm_mcam_1724407870771_b296d73d-9536-4ac3-baee-9b40faf4fab3.json"
};

pds4-validate-catalog.zip
local_context_products.json
cam_raw_sc_cam3_20210812t041402_61_f__t0005.zip
  3. Call launcher.processMain(arguments); and release the object:

            launcher.flushValidators();
            CrossLabelFileAreaReferenceChecker.reset();
  4. Repeat steps 2 and 3 several times and check the heap memory usage (you can use the VisualVM tool to analyze it); a consolidated sketch of this loop follows below.
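The steps above can be combined into a single loop. Here is a minimal sketch, assuming the validate v3.5.2 JAR is on the classpath; the import paths, the iteration count, and the simplified report path are assumptions to adapt to your own project:

// Consolidated reproduction sketch for steps 1-4. The import paths below are
// assumptions — check the package names in your copy of the validate JAR, as
// they can vary between releases.
import gov.nasa.pds.validate.ValidateLauncher;                          // assumed package
import gov.nasa.pds.tools.validate.CrossLabelFileAreaReferenceChecker;  // assumed package

public class ValidateLeakRepro {
    public static void main(String[] argv) throws Exception {
        String[] arguments = {
                "-C", "pds4-validate-catalog.xml",
                "-e", "xml",
                "-E", String.valueOf(Long.MAX_VALUE),
                "-s", "json",
                "--add-context-products", "local_context_products.json",
                "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
                "-r", "/tmp/PDS4_VALIDATION_REPORTS/report.json"   // report path simplified
        };
        for (int i = 0; i < 50; i++) {                   // repeat and watch the heap in VisualVM
            final ValidateLauncher launcher = new ValidateLauncher();
            launcher.processMain(arguments);
            launcher.flushValidators();                  // attempted cleanup, per the thread
            CrossLabelFileAreaReferenceChecker.reset();  // attempted cleanup, per the thread
        }
    }
}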

🖥 Environment Info

  • Version of this software v3.5.2
  • Operating System: Ubuntu 20.04.6 LTS
  • Kernel: Linux 5.4.0-193-generic
  • Architecture: x86-64
  • Java version: 17

📚 Version of Software Used

Version 3.5.2

🩺 Test Data / Additional context

VisualVM v1.4.3 was used to analyze the heap memory and, as the capture below shows, the heap memory is not released:

[VisualVM screenshot: heap usage climbing across iterations without being released]

Finally, an out-of-memory exception is raised.

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

@jordanpadams
Member

@daniel-rincon-garcia thanks for profiling this for us! We will take a look; there must be some objects being left open.

@jordanpadams
Member

@daniel-rincon-garcia in the interim, I would look to increase your heap size to support your validation needs. Will let you know when we have a fix ready.
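When raising the heap, it can help to confirm how much memory the JVM actually received. Below is a minimal sketch using the standard Runtime API; the -Xmx figure in the comment is only an illustrative value, not a recommendation:

// Launch with a larger heap, e.g.:  java -Xmx8g -cp <classpath> HeapCheck
// (8g is an illustrative figure; size it to your validation batch.)
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();   // heap ceiling set by -Xmx
        System.out.printf("Max heap available to this JVM: %.1f GiB%n",
                maxBytes / (1024.0 * 1024.0 * 1024.0));
    }
}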

@jordanpadams
Member

@al-niessner Registry / API work takes precedence, but this is top priority for validate

@jordanpadams jordanpadams removed their assignment Aug 24, 2024
@al-niessner
Contributor

@jordanpadams

Yes, validate's memory leak is well known, and the rest of this reply is not pleasant to read either. validate is meant (in its current design and implementation) to be restarted for each batch of items; in other words, it leaks memory, so exiting the process is the best cleanup. One user has 2+ million labels and had to increase the heap size to 64 GB or more to get the job done. Making the heap large enough to accommodate the full job, or spawning validate as a new Java process, are unsatisfying but the shortest and most robust paths to a solution.

The regression/unit test makes a minor attempt to clean up behind itself (the cleanup calls you list), but it is only good enough to run through the suite of tests; in other words, those calls are not perfect and are no substitute for shutting validate down and restarting it. Part of the issue is that some third-party units, like Schematron, have their own permanent memory (which leaks across labels because it is never released) that is outside the control of validate. Given the number of labels that some users process, any of these leaks will eventually become dominant and noticeable.

The predominant memory leak comes from validate's underlying philosophy. It assumes that all the labels it is launched on belong to a cohort that needs to be cross-analyzed. There are lots of caches, like the Schematron cache, used to prevent long operations, such as downloading the Schematron, from dominating the runtime. validate itself does the same with content from the label: the lidvid, and the URL to fetch it, are retained. These then get stored in forever-lists that are never cleaned up, because if more labels are found those cross-links are thought to be necessary. Therefore, validate does not reclaim them until all labels have been found, which by definition happens when validate exits. Not a good design for your use case, but it is the one that exists, and it is completely unambiguous about when to reclaim all memory.
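To make the "spawning validate as a new Java process" option concrete, here is a minimal sketch built on the standard ProcessBuilder API; the JAR path and argument list are placeholders, and the executable-JAR invocation is an assumption about how your validate distribution is packaged:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FreshJvmValidate {
    // Runs validate in a child JVM, so every cache it builds up
    // (Schematron, lidvid cross-links, ...) dies with the process.
    static int runValidate(String validateJar, List<String> validateArgs)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(validateJar);          // placeholder path to the validate JAR
        cmd.addAll(validateArgs);      // e.g. "-t", "<label>.xml", "-r", "<report>.json"
        Process child = new ProcessBuilder(cmd)
                .inheritIO()           // forward validate's output to this console
                .start();
        return child.waitFor();        // exit code from the child JVM
    }
}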

@daniel-rincon-garcia
Author

daniel-rincon-garcia commented Aug 26, 2024

Thank you both very much for the detailed explanation of the problem and for the suggestion to increase the heap size. For the moment, we restart the process each time to avoid the memory leak.

@msbentley

msbentley commented Aug 26, 2024

I wonder if some combination of switches could help in our case? In particular I would note that due to the "operational" way in which we are archiving we:

  • never go online to read labels (always local and referenced by a catalogue file)
  • don't care about referential integrity checks etc. (we are only validating individual products, we are not doing collection or bundle validation).

It sounds like our use case could in principle avoid many of the issues above, provided we use the right switches, and provided those switches actually avoid those issues at the software level?

Edit: The only thing I can see right now is --skip-context-reference-check since by default the rule is already pds4.label, so perhaps there is not much to gain here?

@al-niessner
Contributor

Restart is the more robust approach, and it is also the preferred one. If you are trying to clear all caches for memory performance, then restarting adds no appreciable startup overhead compared to clearing memory in-process: you are going to reload everything from the network or disk either way.

Whether you download from the network or read from disk, the third-party libraries always cache. Reading from disk will always save you time because the bus is faster, but it makes no difference in memory usage.

The switch that skips reference checking does exactly that and only that. It does not prevent the collection of the information used for reference checking, because that information is also needed for reporting and for aggregate labels. Again, it may save you some time but will not have much, if any, impact on memory usage.

@jordanpadams
Copy link
Member

@msbentley @daniel-rincon-garcia we will work to improve our documentation here as well, to note that it is recommended to batch validate executions to avoid this out-of-memory issue. The size of each batch depends on the amount of memory you are able to allocate to the JVM.
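To illustrate the batching recommendation, here is a sketch that splits a list of labels into fixed-size chunks and runs one fresh validate process per chunk; the batch size, the multi-target use of -t, and the runValidate helper from the earlier sketch are assumptions to tune against your JVM heap:

import java.util.ArrayList;
import java.util.List;

public class BatchedValidation {
    static void validateInBatches(String validateJar, List<String> labels, int batchSize)
            throws Exception {
        for (int start = 0; start < labels.size(); start += batchSize) {
            List<String> chunk =
                    labels.subList(start, Math.min(start + batchSize, labels.size()));
            List<String> args = new ArrayList<>();
            args.add("-t");            // assumes -t accepts multiple targets
            args.addAll(chunk);
            // One child JVM per chunk: its memory is reclaimed when it exits.
            FreshJvmValidate.runValidate(validateJar, args);
        }
    }
}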
