Running comprehensive suite of WE2E tests overwhelms some platforms #462

Closed
mkavulich opened this issue Nov 10, 2022 · 1 comment · Fixed by #637
mkavulich commented Nov 10, 2022

Expected behavior

All comprehensive tests should run

Current behavior

The large number of cron tasks that must execute simultaneously (57 for the current comprehensive suite), together with the large number of submitted jobs (several hundred), causes problems on certain platforms. The known issues are detailed in the next section.

Machines affected

Cheyenne

The large number of cron tasks needing to be executed simultaneously (57 for the current comprehensive suite) strains the capability of the node(s) handling cron jobs on the Cheyenne machine. CISL has asked us specifically to limit our use of cron in this way.

Hera

The large number of individual tests means that on platforms with a per-user job submission limit (300 on Hera), that limit can be exceeded. Exceeding the limit does not itself create errors, but any additional jobs submitted are rejected, which rocoto counts as a failure and does not handle well.

Steps To Reproduce

Run the comprehensive suite of WE2E tests on the platforms described above.

Detailed Description of Fix

There are a few potential fixes for these issues individually, but the best fix would be to create a Python script that sets up each test in the comprehensive suite, generates all of them, and tracks and manages the active jobs using rocoto, rather than creating a crontab entry for each test. This script would be intended to run continuously until all WE2E tasks are complete, and could be run from the command line or submitted to a compute node. Working this way bypasses cron entirely, which avoids the cron-related issues, and makes it relatively easy to implement any throttling needed to stay under limits on the number of submitted jobs.
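As a rough illustration of the throttling idea, here is a minimal sketch of such a driver loop. Everything in it is an assumption made for illustration: the `We2eTest` class, the `jobs_needed` estimate, and the `advance` callback are hypothetical names, not part of the existing scripts. In a real driver the callback would presumably advance the test's rocoto workflow and report whether it has finished.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class We2eTest:
    """Hypothetical record for one WE2E test (illustrative only)."""
    name: str
    jobs_needed: int          # assumed estimate of the test's peak queued jobs
    status: str = "pending"   # pending | running | complete


def select_tests_to_start(tests: List[We2eTest], job_limit: int) -> List[str]:
    """Mark pending tests as running while the job budget allows it."""
    in_use = sum(t.jobs_needed for t in tests if t.status == "running")
    started = []
    for t in tests:
        if t.status != "pending":
            continue
        # Note: a test whose estimate alone exceeds job_limit never starts.
        if in_use + t.jobs_needed <= job_limit:
            t.status = "running"
            in_use += t.jobs_needed
            started.append(t.name)
    return started


def run_suite(tests: List[We2eTest], job_limit: int,
              advance: Callable[[We2eTest], bool],
              sleep_seconds: float = 0.0, max_cycles: int = 1000) -> int:
    """Loop until every test is complete; return the number of cycles used.

    `advance` is a caller-supplied callback that moves one running test
    forward and returns True once that test has finished.
    """
    for cycle in range(max_cycles):
        select_tests_to_start(tests, job_limit)
        for t in tests:
            if t.status == "running" and advance(t):
                t.status = "complete"
        if all(t.status == "complete" for t in tests):
            return cycle + 1
        time.sleep(sleep_seconds)
    raise RuntimeError("suite did not finish within max_cycles")
```

Because a single long-running process does the polling, no crontab entries are needed, and the `job_limit` argument can be set below a platform's per-user limit (for example, below 300 on Hera).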

Possible Implementation

My proposal for this implementation would be to create a script that initially has just the basic functionality described above, which could be created quickly and used to run the comprehensive test suite in the short term. Once implemented, the script could be fleshed out to include all the necessary functionality of run_WE2E_tests.sh, eventually superseding it completely.

Bonuses of a re-write in Python include:

  • Works nicely with the existing efforts to python-ize the app
  • Can easily take advantage of existing Python tools
  • Makes it much easier to add useful functionality, such as a test summary file detailing the status of each test

Specific requirements:

  • Should be able to run from login node or batch queue
  • Must be able to stop specific tests without killing all tests
  • (Desired but not required) Should be able to resume an interrupted test suite
@mkavulich mkavulich added the bug Something isn't working label Nov 10, 2022
@mkavulich mkavulich self-assigned this Nov 10, 2022
@mkavulich
Collaborator Author

#466 represents a good interim solution for the Cheyenne crontab issue. In the coming days I will be working on a re-write of run_WE2E_tests.sh in Python as a more complete solution to these issues.

natalie-perlin pushed a commit to natalie-perlin/ufs-srweather-app that referenced this issue Jun 2, 2024
…/framework submodule pointer update for ufs-community#462 (#1654)

* update FV3 submodule and .gitmodules for testing of 20230313_combo

* turn off cpld_control_p8_faster cheyenne