[Bug]: Problem when writing to csv from N>1 nodes #1011

Closed
ClaudiaComito opened this issue Aug 20, 2022 · 4 comments

@ClaudiaComito
Contributor

What happened?

test_io.test_save_csv hangs when the processes are distributed over several nodes. It works fine when all processes are on the same node.

Code snippet triggering the error

# On HDFML
salloc --account=haf --nodes=2 --time=00:30:00 --gres=gpu:4
ml GCC OpenMPI PyTorch torchvision mpi4py HDF5 netCDF
srun -N 2 --ntasks-per-node=1 python -m unittest -vf heat/core/tests/test_io.py

Error message or erroneous outcome

test_save_csv (heat.core.tests.test_io.TestIO) ... test_save_csv (heat.core.tests.test_io.TestIO) ...

Version

1.2.x

Python version

3.9

PyTorch version

1.11

@bhagemeier
Member

Hi Claudia, this surprises me, as it has of course been tested successfully in the past. I'll look into it.

@ClaudiaComito
Contributor Author

This persists after bug fix #1058. Would you like to look into this, @mtar or (when you're back) @JuanPedroGHM?

@mtar
Collaborator

mtar commented Jan 13, 2023

The IO unit tests use temporary files. Have you set one of the TMPDIR, TEMP or TMP environment variables to a directory that can be accessed by all nodes?

@ClaudiaComito
Contributor Author

Thanks @mtar, that was it.
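
For anyone hitting the same hang: the three variables named above (TMPDIR, TEMP, TMP) are the ones Python's standard tempfile module consults, in that order, before falling back to platform defaults such as /tmp, which is usually node-local on a cluster. The sketch below assumes the IO tests resolve their temporary directory through tempfile (the thread does not confirm this). The helper name tempdir_for and the directory used in the demo are hypothetical, for illustration only.

```python
import os
import tempfile


def tempdir_for(tmpdir_value):
    """Return the directory `tempfile` picks when TMPDIR is set to `tmpdir_value`.

    If `tmpdir_value` does not exist or is not writable, `tempfile` silently
    falls through to the next candidate (TEMP, TMP, then platform defaults).
    """
    old_env = os.environ.get("TMPDIR")
    old_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir_value
        # gettempdir() caches its answer; clear the cache to force a fresh lookup.
        tempfile.tempdir = None
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the calling process is unaffected.
        tempfile.tempdir = old_cache
        if old_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_env


if __name__ == "__main__":
    # Hypothetical stand-in for a directory on a filesystem shared by all nodes.
    shared = tempfile.mkdtemp()
    print(tempdir_for(shared))
```

In a multi-node run, the variable would need to be exported before srun so that every rank inherits it, and it should point at a path that all nodes can read and write.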
