test timeouts for netCDF 4.9.0 #15959
so it times out. Were you perhaps running this from inside an interactive allocation, i.e. something like:

```bash
srun -n 20 bash
eb netCDF--etc...eb
```

If so, be aware that this will not work and will typically fail for the same reason.
Thanks Mikael. You are absolutely right that I was running this on a node where I'd opened a session using srun.
However, I have now retried this by connecting to the same node using a simple SSH session, such that all 32 of the node's CPUs were available to me, and I got the same error - a timeout on test 161. I attach copies of the new logs. Should I have done anything special in setting up my SSH session? Or is there an srun option that would allow the use of MPI within an interactive session running under it?
Regards
Ed
Just to be doubly sure, can you write a batch script that simply submits this job to the queue? i.e.

```bash
#!/usr/bin/env bash
#SBATCH -n 20
module load EasyBuild
eb netCDF-4.9.0-gompi-2022a.eb -r
```
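The script can then be handed to the scheduler in the usual way (the file name here is just an illustration):

```bash
sbatch build_netcdf_test.sh
```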
Regarding the other questions:
- if this happens to be a node where you are currently running a job, and you SSH into it with the pam_slurm_adopt PAM module active, then it makes no difference: Slurm still considers it busy.
- there are surely flags, but it's not very convenient to try to pass those down. You could instead clear the Slurm environment before running eb, e.g. by putting the following in a file and sourcing it:

```bash
unset SLURMD_NODENAME
unset SLURM_CLUSTER_NAME
unset SLURM_CONF
unset SLURM_CPUS_ON_NODE
unset SLURM_DISTRIBUTION
unset SLURM_GTIDS
unset SLURM_JOBID
unset SLURM_JOB_ACCOUNT
unset SLURM_JOB_CPUS_PER_NODE
unset SLURM_JOB_GID
unset SLURM_JOB_ID
unset SLURM_JOB_NAME
unset SLURM_JOB_NODELIST
unset SLURM_JOB_NUM_NODES
unset SLURM_JOB_PARTITION
unset SLURM_JOB_QOS
unset SLURM_JOB_RESERVATION
unset SLURM_JOB_UID
unset SLURM_JOB_USER
unset SLURM_LAUNCH_NODE_IPADDR
unset SLURM_LOCALID
unset SLURM_NNODES
unset SLURM_NODEID
unset SLURM_NODELIST
unset SLURM_NPROCS
unset SLURM_NTASKS
unset SLURM_PRIO_PROCESS
unset SLURM_PROCID
unset SLURM_PTY_PORT
unset SLURM_PTY_WIN_COL
unset SLURM_PTY_WIN_ROW
unset SLURM_SRUN_COMM_HOST
unset SLURM_SRUN_COMM_PORT
unset SLURM_STEPID
unset SLURM_STEP_GPUS
unset SLURM_STEP_ID
unset SLURM_STEP_LAUNCHER_PORT
unset SLURM_STEP_NODELIST
unset SLURM_STEP_NUM_NODES
unset SLURM_STEP_NUM_TASKS
unset SLURM_STEP_TASKS_PER_NODE
unset SLURM_SUBMIT_DIR
unset SLURM_SUBMIT_HOST
unset SLURM_TASKS_PER_NODE
unset SLURM_TASK_PID
unset SLURM_TOPOLOGY_ADDR
unset SLURM_TOPOLOGY_ADDR_PATTERN
unset SLURM_UMASK
unset SLURM_WORKING_CLUSTER
```
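If typing all of those out is a chore, here is a more compact sketch (assuming a bash shell) that unsets every exported SLURM* variable in one go:

```bash
# Extract the names of all exported SLURM* variables and unset each one
for v in $(env | grep -o '^SLURM[A-Za-z0-9_]*'); do
  unset "$v"
done
```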
Thanks Mikael. I'm afraid nothing seems to make any difference: I have tried your sbatch script, and I have even tried turning the Slurm daemon off on our build machine, but it all gives the same result: a failure at test 161.
The user that is attempting this build has no SLURM* environment variables set: 'env | grep -i slurm' returns nothing.
It's best if you answer comments on GitHub instead of directly by email; your client adds a bunch of extra stuff that ends up here, making things very hard to read (I edited those out of your comments). I think you need to dig deeper here and try to run the test commands manually, look at what is happening with the calls (is there any load at all?), perhaps strace the test. If you think there is just something wrong with this one test (I don't know if there are other MPI tests that pass... maybe?) then you can also just skip the test step.
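As a concrete sketch of that kind of digging (assuming the EasyBuild build directory mentioned in the next comment, and that test 161 is the one timing out):

```bash
# From the CMake build directory that EasyBuild used
cd /home/apps/eb/build/netCDF/4.9.0/gompi-2022a/easybuild_obj

# Re-run only test number 161 and show its output
ctest -I 161,161 --output-on-failure

# Trace it to see where it hangs (follow children, timestamp each syscall)
strace -f -tt -o /tmp/test161.strace ctest -I 161,161
```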
Noted (and sorry) re the response method. I've created that .env file and sourced it, and I've found a directory /home/apps/eb/build/netCDF/4.9.0/gompi-2022a. This contains no files and two subdirectories, easybuild_obj and netcdf-c-4.9.0. I'm not clear what you're asking me to do from here (I have no idea how to run the test commands manually).
Yes, the build and the source directory respectively.
Well, I'm just suggesting the steps you can take in order to dig down and debug for yourself why this one test fails.
Hmm, I'm pretty lost here. The log file shows: ... before the start of the list of the 227 tests. I can see there's a Makefile, and it contains two "special rules" for "target test". One of these specifies a ".../ctest" command with arguments, which also appears in the log file. Are you saying I should try to run that command manually? If I do, won't I simply get the same result that EasyBuild does? I am not sure about getting strace output. If I managed this, wouldn't it be voluminous for those 227 tests? And wouldn't searching for a timeout error in it be like looking for a needle in a haystack?

I think you are saying mine is the only EasyBuild site that has reported this error, and that other sites have built this with no problem. If that's the case, it suggests the problem is with our site, and I do understand that you can't spend your time trying to help find such a problem. But it does present us with an issue, given that this is a dependency for R, which is a very important package for us. We may have to look for alternative ways of obtaining R.

I do also understand that the EasyBuild maintainers can't be responsible for maintaining easyconfigs submitted by the community. It is a shame that EasyBuild does not seem to have any mechanism for ensuring that the submitter provides some level of support.
Well, that's how I start tackling most software problems: start by reproducing it. Then continue to dig deeper, and deeper, and deeper. Like checking whether there were any other tests that ran MPI and somehow worked, or seeing what actually happens on the system while the failing test is running (full load? waiting for MPI init?). Is there any stderr/stdout from the failing test?
No, just the exact command (whatever ctest actually runs).
So far, yes (as far as I've seen). Whoever merges PRs makes sure there is at least one successful build report from EB, though most of the common ones are probably seen at 10+ different sites before they're shipped.
Then no one would volunteer to submit anything, so we certainly won't ever hold anyone to that (how could we? no one is getting paid here). It's all just a collective best effort where everyone submits patches, fixes, and improvements.
OK, thanks. My first attempt to run "make test" manually did in fact produce slightly more information than appeared in the EasyBuild log file: it suggested I rerun with some additional arguments, "--rerun-failed --output-on-failure". Once I'd worked out where to put these (it was in the Makefile), I reran the "make test". Unfortunately, this produced no further useful information: it seems to be a timeout in the "Parallel Performance Test for NASA". I did attach strace to one of the PIDs participating in this - one with no further children. All I got was the following repeated over and over again:

Only the penultimate number (28208 in the extract above) varied between repetitions. I wouldn't know where to begin in making any use of that. This brings me back to my point about maintainers: it might mean something to someone who (a) is a developer and (b) knows this product. I am neither of those things myself. If you've got any further suggestions I'd be interested to hear them, but I sense we've probably got as far as we can with this. I shall have to seek alternatives to EasyBuild. But I'd be reluctant to close this issue, because it's not been resolved.
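For reference, with a CMake-generated Makefile those extra ctest arguments can normally be passed without editing the Makefile, via the ARGS make variable (a sketch, not what was literally run here):

```bash
make test ARGS="--rerun-failed --output-on-failure"
```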
Sorry, I didn't address a couple of the points you made:
I have noticed that this build will succeed for me if I comment out the following line in the easyconfig:

`runtest = 'test'`

I have no way of telling what this line does: "runtest" does not appear in the Makefile, but it does appear in literally hundreds of other files in the build directory. This is a new line in this easyconfig compared to the previous version. @branfosj I see you committed this easyconfig. Is it possible to tell who submitted it to you, so that we may pick their brains on "runtest"?
I have discovered that 'runtest' is just an easyconfig keyword, and "runtest = 'test'" just says to run 'make test' after 'make'. I guess its absence from previous versions of this easyconfig means we weren't running 'make test' for them. I feel back at square one again. @branfosj probably best if you ignore my earlier question. |
Well, yes, not running the test suite for a piece of software will ignore the issue, which might not be affecting you in practice, depending on what is failing. If all the numerical math were wrong in NumPy, I certainly wouldn't want to install that module without a closer look, but I've skipped failing tests on PyTorch when they only occurred on nodes with more than 6 GPUs (and it was a true bug that I reported upstream, with no easy fix or workaround). So, as I said earlier, ignoring failing tests is an option.
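For completeness, a sketch of what skipping the test step could look like: either remove or comment out the `runtest` line in the easyconfig (as discovered above), or, assuming your EasyBuild version provides the option, skip it from the command line:

```bash
# Assumes a recent enough EasyBuild that supports --skip-test-step;
# otherwise, drop the runtest line from the easyconfig instead.
eb netCDF-4.9.0-gompi-2022a.eb -r --skip-test-step
```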
Why? This seems very useful. I had a look and it's just a very simple shell script:

```
@MPIEXEC@ -n 4 ./tst_nc4perf
```

My guesses are:
I can only speak for myself, but there is nothing magic about being a maintainer. We are basically all just users ourselves who were foolish enough to also volunteer our spare time to drive the project and review contributions to the best of our ability.
Indeed; one will need to learn it, or pay someone else who knows it. As noted in the other PR linked above, it seems likely that the same issue occurred on the test build for the iimpi variant on the generoso test cluster. I don't have access there, but some other maintainer might find time to investigate where that one failed.
Thanks @Micket, that's really helpful. I reran the failing test with the timeout you suggested. The test now ran to completion - but, in taking over 4 hours, it did need essentially the full scale factor of your suggested timeout. Here is what I got:

I see that the Makefile invokes ctest with ${ARGS}, so I tried passing the timeout that way. Is it possible to get that --timeout option to ctest passed via the easyconfig? I can't see anything in the easyconfig options that would allow this. Of course, this raises the question of why this test is taking 100x as long on my platform as on yours. I shall try using /dev/shm next.
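For reference, the kind of commands this amounts to (a sketch; the exact invocations used here weren't captured, and EasyBuild's `--buildpath` option is assumed for the /dev/shm experiment):

```bash
# Pass a much larger per-test timeout down to ctest
make test ARGS="--timeout=100000"

# Rebuild with the build directory on a RAM-backed filesystem
eb netCDF-4.9.0-gompi-2022a.eb -r --buildpath /dev/shm/$USER/eb-build
```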
Parallel filesystems can be very picky about what IO patterns are used. Perhaps this benchmark is inducing some locking behavior, or perhaps it just creates the worst-case scenario for your particular filesystem (I don't know which filesystem you have). I'd also be interested in seeing whether it's just this test, or whether any of the others are also slow. We can consider patching the specific NASA performance benchmark out; it might not be broadly applicable. edit: accidentally double-commented since my browser hadn't refreshed
Please test #16050
@Micket I'm running this rebuild myself now. This looks very promising - the testing for iteration #0 took 0:18:35 instead of the 4:37:23 previously. I think the testing for iteration #1 may take a similar time.
And it's completed. Testing for the second iteration took 0:18:47. I believe you'll find a test report on the PR. Just to confirm, this build was run without $ARGS set in my environment (previously, it had needed to contain the value '--timeout=100000'). I think we could close this issue, if you're happy.
It'll be auto-closed when the PR merges; thanks for the test report.
Just for reference, I also hit the same issue with a slightly modified toolchain ("just" updating GCC to 12.1.0). |
Not out of the woods yet here; I'm seeing test timeouts for:
To add to this (it may or may not be relevant): I am seeing similar timeouts with:
(I tried raising this by e-mailing easybuild-request@lists.ugent.be, but I received no responses.)
I am getting a consistent build error when attempting to build netCDF-4.9.0-gompi-2022a.eb. This is a dependency for the R 4.2.1 build. I have tried this using both 10 CPUs and 20 CPUs on the build system, with the same result, which is a failure in the MPI tests. I attach copies of the build logs. Can anyone shed any light on what's going wrong here?
easybuild-FMwVd0.log
easybuild-netCDF-4.9.0-20220802.094732.cKaNQ.log
easybuild-netCDF-4.9.0-20220802.094732.cKaNQ_test_report.md