Add ability to run a test on the same node as a previous test #103

white238 · 2023-02-14T21:08:50Z

It would be very useful to add the ability to run a follow up command if the test passed that runs on the same node. For example, I have a test that outputs files that I want to check the answers in. Unfortunately, if I do the following:

foo_test = test(executable="foo") // outputs some file filled with answers
testif(foo_test, executable="foo_checker.py") // checks file output by foo for correctness

The foo_checker.py test runs on a completely different node from foo and the file is not there yet. So my team (and others I have spoken to) have to add sleep(60) to the start of our checker.

This could be avoided if they ran on the same node. Something like:

test(executable="foo", checker="foo_checker.py") // runs executable and then runs checker only if foo returns non-zero

or:

foo_test = test(executable="foo")
testif(foo_test, executable="foo_checker.py", same_node=True)

Edit another idea after talking to @dawson6:

// command_group maybe a better name than wrapper
my_wrapper = wrapper(executable=foo)
my_wrapper.add(executable=checker)
test(my_wrapper)

The text was updated successfully, but these errors were encountered:

dawson6 · 2023-02-14T21:14:13Z

That is a good idea @white238 . We have a new person @MishaZakharchanka joining the ATS project, it may be something I can have him look at in a bit. These types of new features may be something we want to implement in 'flux' first (even if running flux under an salloc) so we don't have to figure it out for slurm and lsf and flux.

dawson6 · 2023-08-16T15:07:00Z

This is an outstanding issue. It may be time to revisit this again with the flux module to see if we can make progress here

MishaZakharchanka · 2023-08-17T23:35:20Z

@white238 We are finally revisiting this issue and we have a promising direction with a flux option. This will mean that running on the same node will be limited to flux and slurm (or flux under slurm) at least for now.

So far we have thought of 3 implementations of this feature but are open to discussing other ideas.

Option to run a test on node = X.
Option to run a test on the same node as test X.
Option to run an ATS group on the same node.

Let us know if you have a preference for any of these approaches or had a different idea for how this might work.

white238 · 2023-08-17T23:48:44Z

I don't personally care which specific nodes they run on... but the pattern is described above but basically I need my checker (which in this example is a testif to run on the same node as the previous test. Maybe a parameter called same_node_as_parent or something?

I don't know what an ATS group is.

jwhite242 · 2023-08-18T19:35:54Z

Think you can accomplish this without having to do convoluted things like running flux inside slurm/etc too -> see the --nodelist= option you can pass to srun.

Also believe our test suite uses the test groups to deal with this too, though i think we do still have a few cases of this being a problem (haven't been able to track down said rumor yet though).

dawson6 · 2023-08-18T19:48:27Z

   -w, --nodelist={<node_name_list>|<filename>}
          Request  a specific list of hosts.  The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements.
          The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or a filename.  The host list  will  be
          assumed  to  be a filename if it contains a "/" character.  If you specify a minimum node or processor count larger than can be satisfied by the sup-
          plied host list, additional resources will be allocated on other nodes as needed.  Rather than repeating a host name multiple times, an asterisk  and
          a repetition count may be appended to a host name. For example "host1,host1" and "host1*2" are equivalent. If the number of tasks is given and a list
          of requested nodes is also given, the number of nodes used from that list will be reduced to match that of the number of tasks if the number of nodes
          in the list is greater than the number of tasks. This option applies to job and step allocations.

dawson6 · 2023-08-18T19:51:00Z

So do you have an example Jeremy of what that looks like with slurm. Slurm has issues, so didn't want to spend a lot of time forcing slurm to do the right thing. But if this works we could look at it.

That is we don't want to use absolute hardward nodes for instance, too burdensome. But if we come up with a syntax for the interface, we can start with flux, then see if ATS could use slurm directly as well. But due to manpower issues, if we get flux working then just using 'flux under salloc' means less maintenance in the future.

dawson6 · 2023-08-18T19:55:44Z

Chris, if you can drop an example of what your ATS test lines look like for your 'parent' example, we can play with that.
That is, we have examples like so:

#ATS:t1=atswrapper(deck=SELF, level=10, np=10 )
#ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, )

Where t11 depends on t1 finishing first, then runs a silodiff on output created from t1.. But not sure what your syntax is Chris.

For the above, we would have to demarcate both t1 and t11 somehow, as we need to put t1 on a specific node then put t11 on that same node, rather than query after the fact whic node t1 ran on. And we don't want to generally assign tests to nodes but rather allow flux or slurm to do the scheduling. But for this use case, we make an exception.

dawson6 · 2023-08-18T20:02:17Z

for the simplest solution, which maps to flux usage, we could do somthing like so, where we specify for both t1 and t11 to run on node 0 (relative to the allocation of nodes). But that does put the burden on the user to specify the node, but could likely be used by both flux and slurm (via nodelist) if that works with slurm, and ATS maps '0' to a specify hardware node in the allocation.

#ATS:t1=atswrapper(deck=SELF, level=10, np=10, node_rank=0)
#ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, node_rank=0)

So can we do something more abstract:

#ATS:t1=atswrapper(deck=SELF, level=10, np=10, same_node=A)
#ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, same_node=A)

Where A just is a string which is used to mean 'i don't care which node, but put all tests with the same_node=A on the same node. Then ATS assigns A to some node in the allocation as it deems fit.

Or we could say
#ATS:t1=atswrapper(deck=SELF, level=10, np=10, parent_node)
#ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, same_node_as_parent)

But that will be more complicated on install, as now ATS has to put in logic to determine who is the parent of a job and what node they were assigned. I mean it's doable, but more work and more room for bad logic.

Just some thoughts.

dawson6 pinned this issue Aug 16, 2023

dawson6 added the good first issue Good for newcomers label Aug 16, 2023

MishaZakharchanka mentioned this issue Sep 8, 2023

Same node option #152

Merged

MishaZakharchanka closed this as completed in #152 Sep 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to run a test on the same node as a previous test #103

Add ability to run a test on the same node as a previous test #103

white238 commented Feb 14, 2023 •

edited

Loading

dawson6 commented Feb 14, 2023

dawson6 commented Aug 16, 2023

MishaZakharchanka commented Aug 17, 2023

white238 commented Aug 17, 2023 •

edited

Loading

jwhite242 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

Add ability to run a test on the same node as a previous test #103

Add ability to run a test on the same node as a previous test #103

Comments

white238 commented Feb 14, 2023 • edited Loading

dawson6 commented Feb 14, 2023

dawson6 commented Aug 16, 2023

MishaZakharchanka commented Aug 17, 2023

white238 commented Aug 17, 2023 • edited Loading

jwhite242 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

dawson6 commented Aug 18, 2023

white238 commented Feb 14, 2023 •

edited

Loading

white238 commented Aug 17, 2023 •

edited

Loading