-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to run a test on the same node as a previous test #103
Comments
That is a good idea @white238 . We have a new person @MishaZakharchanka joining the ATS project, it may be something I can have him look at in a bit. These types of new features may be something we want to implement in 'flux' first (even if running flux under an salloc) so we don't have to figure it out for slurm and lsf and flux. |
This is an outstanding issue. It may be time to revisit this again with the flux module to see if we can make progress here |
@white238 We are finally revisiting this issue and we have a promising direction with a flux option. This will mean that running on the same node will be limited to flux and slurm (or flux under slurm) at least for now. So far we have thought of 3 implementations of this feature but are open to discussing other ideas.
Let us know if you have a preference for any of these approaches or had a different idea for how this might work. |
I don't personally care which specific nodes they run on... but the pattern is described above but basically I need my checker (which in this example is a I don't know what an ATS group is. |
Think you can accomplish this without having to do convoluted things like running flux inside slurm/etc too -> see the Also believe our test suite uses the test groups to deal with this too, though i think we do still have a few cases of this being a problem (haven't been able to track down said rumor yet though). |
|
So do you have an example Jeremy of what that looks like with slurm. Slurm has issues, so didn't want to spend a lot of time forcing slurm to do the right thing. But if this works we could look at it. That is we don't want to use absolute hardward nodes for instance, too burdensome. But if we come up with a syntax for the interface, we can start with flux, then see if ATS could use slurm directly as well. But due to manpower issues, if we get flux working then just using 'flux under salloc' means less maintenance in the future. |
Chris, if you can drop an example of what your ATS test lines look like for your 'parent' example, we can play with that. #ATS:t1=atswrapper(deck=SELF, level=10, np=10 ) Where t11 depends on t1 finishing first, then runs a silodiff on output created from t1.. But not sure what your syntax is Chris. For the above, we would have to demarcate both t1 and t11 somehow, as we need to put t1 on a specific node then put t11 on that same node, rather than query after the fact whic node t1 ran on. And we don't want to generally assign tests to nodes but rather allow flux or slurm to do the scheduling. But for this use case, we make an exception. |
for the simplest solution, which maps to flux usage, we could do somthing like so, where we specify for both t1 and t11 to run on node 0 (relative to the allocation of nodes). But that does put the burden on the user to specify the node, but could likely be used by both flux and slurm (via nodelist) if that works with slurm, and ATS maps '0' to a specify hardware node in the allocation. #ATS:t1=atswrapper(deck=SELF, level=10, np=10, node_rank=0) So can we do something more abstract: #ATS:t1=atswrapper(deck=SELF, level=10, np=10, same_node=A) Where A just is a string which is used to mean 'i don't care which node, but put all tests with the same_node=A on the same node. Then ATS assigns A to some node in the allocation as it deems fit. Or we could say But that will be more complicated on install, as now ATS has to put in logic to determine who is the parent of a job and what node they were assigned. I mean it's doable, but more work and more room for bad logic. Just some thoughts. |
It would be very useful to add the ability to run a follow up command if the test passed that runs on the same node. For example, I have a test that outputs files that I want to check the answers in. Unfortunately, if I do the following:
The
foo_checker.py
test runs on a completely different node fromfoo
and the file is not there yet. So my team (and others I have spoken to) have to addsleep(60)
to the start of our checker.This could be avoided if they ran on the same node. Something like:
or:
Edit another idea after talking to @dawson6:
The text was updated successfully, but these errors were encountered: