Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect all test results and handle cancelled tests properly #58

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

sfc-gh-kmakino
Copy link
Contributor

This PR addresses 3 issues:

  • When multiple tests start simultaneously, started can go beyond max_runs. In this scenario, the agent should wait for all tests to complete, rather than stop at max_runs and ignore the still running jobs
  • When agents die or considered to be dead due to not heart beating, we should ignore the cancelled tests as we don't know the results
  • When enough agents are running to serve available ensembles, other agents should timeout

@ammolitor
Copy link
Contributor

Given that the unit tests failed (in the Github Actions build), I think this needs another look.

@sfc-gh-kmakino
Copy link
Contributor Author

@ammolitor Can you tell what failed? build.sh works totally fine locally here.

@sfc-gh-kmakino
Copy link
Contributor Author

@ammolitor CI passed. It would great if you can take another quick look. Thanks!

@sfc-gh-kmakino
Copy link
Contributor Author

Now I realized the scaler needs to be aware of this change. Converting this as a draft for now.

@sfc-gh-kmakino sfc-gh-kmakino marked this pull request as draft October 13, 2021 15:52
# TODO(qhoang) let's try this but there must be a better way
# When an agent is cancelled, it has already incremented the __started__ counter
# but will never get to increment the __ended__ counter
_decrement(tr, ensemble_id, "started")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants