Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent errors with CWL Viewer #424

Closed
kinow opened this issue Jun 12, 2022 · 14 comments · Fixed by #426
Closed

Intermittent errors with CWL Viewer #424

kinow opened this issue Jun 12, 2022 · 14 comments · Fixed by #426

Comments

@kinow
Copy link
Member

kinow commented Jun 12, 2022

Description

Last week @mr-c reported CWL Viewer was not working, showing an error message when trying to import a new workflow. We fixed it after cleaning the temporary directory and restarting some services in AWS.

Today I was reviewing the user guide (for common-workflow-language/user_guide#219) and in the chapter 22 there are links to some CWL Viewer workflows. Both were broken.

I opened view.commonwl.org, clicked "Explore" and then clicked on the first workflow I found. This was broken as well.

Expected Behavior

Either a more useful error message (e.g. no disk space, DB instance is busy, etc), or no error.

Actual Behavior

Intermittent errors in the CWL Viewer UI.

Possible Fix

  1. Review if we are not doing something inefficient in CWL Viewer regarding DB transactions (I noticed in the logs numerous transactions for each step or input, I think)
  2. Check if we can add a finally call that removes any unnecessary files from the temporary directories
  3. Add some indicator of what's wrong with the server (something like a Prometheus endpoint?)

Steps to Reproduce

  1. Open view.commonwl.org and click on Explore
  2. Choose a workflow
  3. You may have to wait a while, or import some Workflows… it's not clear what is causing this issue exactly

Context

Possibly related to the PostgreSQL migration? Not sure if it happened before, if we got new users, or some new workflow that triggered it. I'll work with the hypothesis, at first, that it is somehow related to either PostgreSQL or to the dependencies upgraded.

Your Environment

  • Version used: 1.4.5
  • Environment name and version (e.g. Chrome 39, Java 17): Firefox Ubuntu LTS
@kinow
Copy link
Member Author

kinow commented Jun 13, 2022

@kinow kinow changed the title Fix intermitent errors with CWL Viewer Fix intermittent errors with CWL Viewer Jun 13, 2022
@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Reading @stain's comment in the linked issue above gave me a clue where to start looking at these temporary files.

Looking at the source code of JGit, the Git instance that we create in multiple parts of the code is a Java Autocloseable.

Which means that we can call .close() on it, or even better, with a try-with-resources that auto-closes it.

That tells JGit to free up the resources of that repository. So eventually the JVM must garbage-collect the unused objects/closed objects, which should result in file descriptors being freed too, and files deleted from the file system.

@kinow kinow changed the title Fix intermittent errors with CWL Viewer Intermittent errors with CWL Viewer Jun 14, 2022
@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Created a draft pull request to start adding the fixes for this issue. Not sure if I will disable the cron job that clears directories. To be honest, I might just update the pull request #344 since the current job is cheap and could work as a fallback option (i.e. doesn't hurt to have it running, IMO).

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Note too, that closing a Git repository should free up the resources in the JVM. But as with anything related to garbage collection in Java, there are no 100% promises that the objects will be freed, nor that the files will be deleted, nor that it will work with every JVM implementation. So, again, doesn't hurt to have a cron job deleting older files, that may not have been deleted for some reason 🤷‍♂️

@swzCuroverse
Copy link

@kinow you should talk to @cure as well since I am pretty sure he knows about these issues and has been periodically triaging them.

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Already did so a couple days ago, but will mention again in today's meeting. Not sure if I will have a working branch this week, so the new devops will probably have to deploy this new version as one of their first tasks I think. Thanks @swzCuroverse !

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Note to self: a good way to test the temporary files issues is to start Jena + PostgreSQL, and then start the Java application using a different temporary directory.

mkdir -pv /tmp/cwlviewer/
java -Djava.io.tmpdir=/tmp/cwlviewer/ ....

The tomcat folders are used while the Spring application is running (at least one is left behind, but empty). Then import a workflow, and observe the temporary files and folders created. Stop the server, start again, and keep analyzing the behavior of the system and comparing with the source code…

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

And using the technique above I can confirm the pull request linked, even though it is closing the JGit repository, is still leaving the git repository behind 🤔

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

For the compile.cwl workflow linked in the main page, here's what is created in the temporary directory (ignore the two tomcat folders).

image

2 folders, 4 files 🤓

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

Huh, interesting. I modified the CWLToolRunner, to delete the Git repository directory once it has finished running (success or not).

Then I did a test where I imported compile.cwl. Success. Then I related the RO bundle, image files, and refreshed the page. The Git repository directory was there again now 😥

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

The ROBundleService cloned the repository again if the RO Bundle file had been deleted (e.g. cron job, manually), but also didn't delete the repository once it was done.

It looks like the RO bundle ZIP file is streamed to the user in a download, but then left behind in the server. I will take a look if there is a way to delete it (.deleteOnExit?). But there is also a RO Bundle empty directory left behind that can be deleted (just need to find out where it's created). Getting there…

@kinow
Copy link
Member Author

kinow commented Jun 14, 2022

What's left behind now in the temporary directory:

image

@kinow
Copy link
Member Author

kinow commented Jun 16, 2022

I implemented the solution suggested in this stackoverflow answer, and it resulted in a strange endless loop after trying to import the compile.cwl 😵

The RO bundle files are generated when you import the workflow, and it looks like it is necessary to generate the download link displayed in the UI, and also when the user clicks to download it. Either one of the methods in the WorkflowController will need to be left as-is, or we need to change the UI to create the link even if the RO bundle file does not exist.

Back to the whiteboard to think how to fix it…

@kinow
Copy link
Member Author

kinow commented Jun 20, 2022

I think I finished understanding the main flow from the moment the user clicks the submit button in the HTML landing page, up to the moment we have the graphs & bundle data ready to serve to the user, also noting where temporary files are created.

@mr-c the Wiki of CWL Viewer appears to be a good place for these diagrams, WDYT? My intention was to document it to myself and to any other developer that may be interested in modifying or understanding how we parse and create the data to serve in CWL Viewer.

cwlviewer_flowchat_01

cwlviewer_flowchat_02

cwlviewer_flowchat_03

Current workflow

There are three main tasks executed when the user submits a workflow. The first is a synchronous task, holding up the user request. That task will parse and validate the workflow, try to return an existing Workflow if any, create a QueueWorkflow in the database, and then trigger the next task.

This second task will try to re-use the git repository from the first task. It may or may not re-use the repository. If the second task starts while the first task still has the lock (on the semaphore) of the git repository, then a new clone is created. This is an asynchronous task, handled by Java Spring in a separate Java thread, and its main objective is to parse the workflow (using Jena/RDF), populate the workflow object from the database and save it. Finally, it will trigger the third task, also an async Java Spring task.

Lastly, we have the Research Objects bundle task. This will use Taverna API to create a bundle. It will again try to re-use the git repository if not busy, but will also create the images if not existent, and add the git files and graphs to the bundle.

The bundle directory is left on the temporary directory, along with the git repository(ies). When a user opens the workflow page, an Ajax call will be executed every 5 seconds, triggering a new RO Bundle creation if it has been removed, which also creates more temporary files.

We have a cron job that is actually managed by Spring (not in the diagrams). This cron job deletes old QueuedWorkflows from the PostgreSQL database.

Proposed changes

  • Remove git repositories in the catch block of exceptions, i.e. if any exception happens, then delete the git repository and let the process clone a new repository (could happen with corrupt git/network issues?)
  • Remove bundle directory and any git repository by the end of the last task, whether it succeeds or not
  • Update the Cron job as per Clear tmp dir #344

The last item would be, in theory, the fix for the temporary directory issues, and also for this intermittent errors issue. 🤞

Will update my pull request this week to try and close these issues.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants