v0.6.1 mask creation fails #58

Closed

azazellochg opened this issue Oct 23, 2023 · 23 comments

@azazellochg

azazellochg commented Oct 23, 2023

Hi @thorstenwagner
I'm checking the tutorial with the latest version. All steps are working except mask creation.

source /public/EM/Scipion/conda.rc && conda activate tomotwin-0.6.1 && tomotwin_tools.py embedding_mask -i tomo.mrc -o ../extra/ fails on a single RTX 2080 Ti GPU with:

Traceback (most recent call last):
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/bin/tomotwin_tools.py", line 33, in <module>
    sys.exit(load_entry_point('tomotwin-cryoet', 'console_scripts', 'tomotwin_tools.py')())
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/bin/tomotwin_tools.py", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/public/EM/Scipion/scipion-dev/software/em/tomotwin-0.6.1/tomotwin/tools_main.py", line 390, in <module>
    from tomotwin.modules.tools.umap import UmapTool
  File "/public/EM/Scipion/scipion-dev/software/em/tomotwin-0.6.1/tomotwin/modules/tools/umap.py", line 7, in <module>
    import cuml
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/__init__.py", line 17, in <module>
    from cuml.internals.base import Base, UniversalBase
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/__init__.py", line 17, in <module>
    from cuml.internals.base_helpers import BaseMetaClass, _tags_class_and_instance
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/base_helpers.py", line 20, in <module>
    from cuml.internals.api_decorators import (
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 24, in <module>
    from cuml.internals import input_utils as iu
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/input_utils.py", line 19, in <module>
    from cuml.internals.array import CumlArray
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/array.py", line 22, in <module>
    from cuml.internals.global_settings import GlobalSettings
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/global_settings.py", line 19, in <module>
    from cuml.internals.available_devices import is_cuda_available
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/available_devices.py", line 17, in <module>
    from cuml.internals.safe_imports import gpu_only_import_from, UnavailableError
  File "/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/safe_imports.py", line 21, in <module>
    from cuml.internals import logger
ImportError: /public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/lib/python3.10/site-packages/cuml/internals/../../../.././libcublas.so.11: undefined symbol: cublasLt_for_cublas_HSS, version libcublasLt.so.11

I'm happy to provide more information if required.

@thorstenwagner
Collaborator

Hi @azazellochg

Thanks for reporting the issue.

I made a fresh setup of TomoTwin on a GPU box with a 2080 Ti and everything runs fine. I think the problem is another tool that it tries to import, and that import fails.

Can you try to calculate a UMAP? According to the exception, that should also fail.

I quickly googled the issue and it might be related to path problems.

What's in your $PATH?

Here is what mine looks like:

/opt/user_software/miniconda3_envs/tomotwin/bin:/opt/user_software/miniconda3/condabin:/opt/user_software/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

What cuml version got installed?

pip freeze | grep cuml gives me 23.4.1
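
You could also test the cublas loading in isolation; a minimal sketch (generic Python, assuming the tomotwin env is active) that should reproduce the same undefined-symbol error if the wrong CUDA libraries are being picked up:

import ctypes

# In a broken setup this raises the same undefined-symbol OSError as in
# the traceback above; in a healthy env it loads silently.
ctypes.CDLL("libcublas.so.11")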

Best,
Thorsten

@azazellochg
Author

gsharov@hex:~$ source ~/rc/conda.rc
gsharov@hex:~$ conda activate tomotwin-0.6.1
(tomotwin-0.6.1) gsharov@hex:~$ echo $PATH
/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/bin:/public/EM/Scipion/miniconda3/condabin:/net/flash/flash/gsharov/cryosparc/cryosparc_worker/bin:/net/flash/flash/gsharov/cryosparc/cryosparc_master/bin:/net/nfs1/public/EM/CUDA/cuda-11.4nvvm/bin:/net/nfs1/public/EM/CUDA/cuda-11.4/bin:/public/gcc/10_2_0/bin:/net/nfs1/public/EM/OpenMPI/openmpi-2.0.1/build/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
(tomotwin-0.6.1) gsharov@hex:~$ pip list | grep cuml
cuml                    23.4.1

I'm re-running embedding now to do umaps after that.

@azazellochg
Author

There's also a warning which is probably not related to this problem:

[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).

@thorstenwagner
Collaborator

There's also a warning which is probably not related to this problem:

[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).

Does it still run on multiple GPUs?

@azazellochg
Author

There's also a warning which is probably not related to this problem:

[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:29368 (errno: 97 - Address family not supported by protocol).

Does it still run on multiple GPUs?

Seems so:
Calculate embeddings (1): 5%|▌ | 510/9929 [05:09<1:06:54, 2.35it/s]
Calculate embeddings (0): 6%|▌ | 564/9929 [05:08<1:21:14, 1.92it/s]

@thorstenwagner
Collaborator

gsharov@hex:~$ source ~/rc/conda.rc
gsharov@hex:~$ conda activate tomotwin-0.6.1
(tomotwin-0.6.1) gsharov@hex:~$ echo $PATH
/public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/bin:/public/EM/Scipion/miniconda3/condabin:/net/flash/flash/gsharov/cryosparc/cryosparc_worker/bin:/net/flash/flash/gsharov/cryosparc/cryosparc_master/bin:/net/nfs1/public/EM/CUDA/cuda-11.4nvvm/bin:/net/nfs1/public/EM/CUDA/cuda-11.4/bin:/public/gcc/10_2_0/bin:/net/nfs1/public/EM/OpenMPI/openmpi-2.0.1/build/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
(tomotwin-0.6.1) gsharov@hex:~$ pip list | grep cuml
cuml                    23.4.1

I'm re-running embedding now to do umaps after that.

I see CUDA libs in the path, which I could imagine cause that problem.

I guess cuml expects CUDA 11.8.

@azazellochg
Author

You might be right. I'm using cuda-11.4 libs but installed TomoTwin with cudatoolkit 11.8. Let me try with 11.8.
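
For reference, a quick generic check (not TomoTwin-specific) of which CUDA toolkit the env's PyTorch build expects:

import torch

print(torch.version.cuda)         # CUDA toolkit torch was built against
print(torch.cuda.is_available())  # basic runtime sanity check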

@azazellochg
Author

Alright, masking works now with 11.8! I guess I should have tried that before opening the issue... :)

@thorstenwagner
Collaborator

That's fine :-) It's good to see all sorts of errors when it comes to debugging :-)

btw, I'm giving a talk soon at LMB. Will we see each other?

Best,
Thorsten

@azazellochg
Author

Yep, I'll be here!

The embedding has finished (on 4 GPUs) but it is just hanging now...
Calculate embeddings (0): 100%|██████████| 4965/4965 [45:37<00:00, 3.01s/it]

If I login to the node I see:

|    0   N/A  N/A    403291      C   .../envs/tomotwin-0.6.1/bin/python3.10     7778MiB |
|    0   N/A  N/A    403292      C   .../envs/tomotwin-0.6.1/bin/python3.10      154MiB |
|    0   N/A  N/A    403293      C   .../envs/tomotwin-0.6.1/bin/python3.10      154MiB |
|    0   N/A  N/A    403294      C   .../envs/tomotwin-0.6.1/bin/python3.10      154MiB |
|    1   N/A  N/A    403292      C   .../envs/tomotwin-0.6.1/bin/python3.10     8006MiB |
|    2   N/A  N/A    403293      C   .../envs/tomotwin-0.6.1/bin/python3.10     8006MiB |
|    3   N/A  N/A    403294      C   .../envs/tomotwin-0.6.1/bin/python3.10     8006MiB |

And also 137 processes like:

gsharov   403291 41.1  0.5 35370556 2008796 ?    Sl   12:29  20:41 /public/EM/Scipion/miniconda3/envs/tomotwin-0.6.1/bin/python3.10 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=22, pipe_handle=24) --multiprocessing-fork

The machine has only 64 cores (HT)...

@thorstenwagner
Collaborator

The distributed data loader from PyTorch opens quite a few processes, so I would say it's normal and shouldn't give you any problems. I've had it running on machines with fewer than 64 cores.
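
Roughly what happens under the hood (a toy sketch, not TomoTwin's actual code): one process is spawned per GPU, and each of those starts its own pool of data-loader workers, so the process count multiplies quickly:

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def worker(rank, world_size):
    # each spawned rank owns a DataLoader, and every loader forks
    # num_workers extra processes on top of the rank process itself
    data = TensorDataset(torch.zeros(64, 1))
    for _ in DataLoader(data, batch_size=8, num_workers=4):
        pass

if __name__ == "__main__":
    # 2 ranks x (1 rank process + 4 loader workers) = 10 processes
    mp.spawn(worker, args=(2,), nprocs=2)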

Can you check if it writes the embedding file? Is the file increasing in size?

@azazellochg
Author

There's no file, the output folder is empty...

@thorstenwagner
Collaborator

Hmm :-/ Are the processes still busy (htop)?

@azazellochg
Author

Yes, the same. But GPU utilization has changed:
|    0   N/A  N/A    403294      C   .../envs/tomotwin-0.6.1/bin/python3.10      154MiB |
|    3   N/A  N/A    403294      C   .../envs/tomotwin-0.6.1/bin/python3.10     8006MiB |

I'll wait a bit more.

@azazellochg
Author

I've killed it. Now re-running embedding on a single GPU, which worked for the reference-based tutorial. I think the embedding command is the same...

@thorstenwagner
Collaborator

thorstenwagner commented Oct 24, 2023 via email

@azazellochg
Author

With -d 0 it finished correctly on 4 GPUs.

@thorstenwagner
Collaborator

thorstenwagner commented Oct 27, 2023

I've found a bug that might be related to it. I will let you know when a fix is available.

@thorstenwagner
Collaborator

The fix is now available in the current development release:
https://pypi.org/project/tomotwin-cryoet/0.7.0b1/

@azazellochg
Author

I've installed 0.7.0. Now I'm getting more errors:

Calculate embeddings (0):   0%|          | 0/46560 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/bin/tomotwin_tools.py", line 33, in <module>
    sys.exit(load_entry_point('tomotwin-cryoet', 'console_scripts', 'tomotwin_tools.py')())
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/tools_main.py", line 431, in _main_
    tool.run(args)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/modules/tools/embedding_mask.py", line 605, in run
    mask = self.median_mode(tomo_pth=args.input,
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/modules/tools/embedding_mask.py", line 532, in median_mode
    embed.start(conf)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/embed_main.py", line 618, in start
    run_distr(config, world_size)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/embed_main.py", line 604, in run_distr
    mp.spawn(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/embed_main.py", line 584, in run
    embed_tomogram(tomo, embedor, conf, window_size, mask)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/embed_main.py", line 512, in embed_tomogram
    embeddings = sliding_window_embedding(tomo=tomo, boxer=boxer, embedor=embedor)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/embed_main.py", line 409, in sliding_window_embedding
    embeddings = embedor.embed(volume_data=boxes)
  File "/home/gsharov/soft/scipion3/software/em/tomotwin-0.7.0/tomotwin/modules/inference/embedor.py", line 600, in embed
    subvolume = self.model.forward(subvolume).type(torch.HalfTensor)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 487, in catch_errors
    return hijacked_callback(frame, cache_entry, hooks, frame_state)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2162, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 833, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 436, in compile_fn
    submod_compiler.run(*example_inputs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/fx/interpreter.py", line 138, in run
    self.env[node] = self.run_node(node)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 417, in run_node
    compiled_submod_real = self.compile_submod(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 361, in compile_submod
    self.compiler(input_mod, args),
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/__init__.py", line 1568, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 961, in compile_fx
    return compile_fx(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
    return aot_autograd(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1573, in aot_dispatch_base
    compiled_fw = compiler(fw_module, flat_args)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
    return inner_compile(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/debug.py", line 228, in inner
    return fn(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
    return old_func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
    compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
    return self.compile_to_module().call
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
    mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_gsharov/cn/ccnvvztrhf335j4jmcl5p37x4gzlopfwwvpos4yaoxpbld7ivbop.py", line 452, in <module>
    async_compile.wait(globals())
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1418, in wait
    scope[key] = result.result()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1277, in result
    self.future.result()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/gsharov/soft/miniconda3/envs/tomotwin-0.7.0/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
FileNotFoundError: [Errno 2] No such file or directory: 'ldconfig'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


Traceback (most recent call last):
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/protocol/protocol.py", line 203, in run
    self._run()
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/protocol/protocol.py", line 254, in _run
    resultFiles = self._runFunc()
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/protocol/protocol.py", line 250, in _runFunc
    return self._func(*self._args)
  File "/home/gsharov/soft/scipion-em-plugins/scipion-em-tomotwin/tomotwin/protocols/protocol_create_masks.py", line 111, in createMaskStep
    self.runJob(self.getProgram("tomotwin_tools.py"), " ".join(args),
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/protocol/protocol.py", line 1505, in runJob
    self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/protocol/executor.py", line 65, in runJob
    process.runJob(log, programName, params,
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/utils/process.py", line 56, in runJob
    return runCommand(command, env, cwd)
  File "/home/gsharov/soft/scipion3/scipion-pyworkflow/pyworkflow/utils/process.py", line 71, in runCommand
    check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr,
  File "/home/gsharov/soft/miniconda3/envs/scipion3/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' eval "$(/home/gsharov/soft/miniconda3/bin/conda shell.bash hook)"&& . /etc/profile.d/lmod.sh && module load cuda/11.8 && conda activate tomotwin-0.7.0 && CUDA_VISIBLE_DEVICES=0 tomotwin_tools.py embedding_mask median -m /home/gsharov/soft/scipion3/software/em/tomotwin_model-092023/tomotwin_model_p120_092023_loss.pth -i emd_10439.mrc -o ../extra/' returned non-zero exit status 1.
Protocol failed: Command ' eval "$(/home/gsharov/soft/miniconda3/bin/conda shell.bash hook)"&& . /etc/profile.d/lmod.sh && module load cuda/11.8 && conda activate tomotwin-0.7.0 && CUDA_VISIBLE_DEVICES=0 tomotwin_tools.py embedding_mask median -m /home/gsharov/soft/scipion3/software/em/tomotwin_model-092023/tomotwin_model_p120_092023_loss.pth -i emd_10439.mrc -o ../extra/' returned non-zero exit status 1.

@azazellochg
Author

azazellochg commented Nov 6, 2023

If I set these extra debugging flags, I get more output: run.stderr.txt. The same error happens with both 0.7.0 and 0.6.1.
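
As a stopgap, the eager fallback that the error message itself suggests can be enabled before the compile step (it silences the compile failure rather than fixing it):

import torch._dynamo

# fall back to eager execution whenever the torch.compile backend fails
torch._dynamo.config.suppress_errors = True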

@azazellochg
Author

Looks like you have already solved it in the past :D https://discuss.pytorch.org/t/dynamo-exceptions-with-distributeddataprallel-compile/186768

@thorstenwagner
Collaborator

Interesting, this is the third machine where I've encountered this issue. Looks like I should add a check for whether ldconfig is available.
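
Something like this, perhaps (a hypothetical sketch of such a check, not the actual fix): the traceback shows the inductor backend failing when it shells out to ldconfig, so compilation could simply be skipped when ldconfig is not on the PATH:

import shutil

import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the TomoTwin network

# torch.compile's inductor backend invokes ldconfig while probing for
# libraries, so fall back to the eager model when it is unavailable
if shutil.which("ldconfig") is not None:
    model = torch.compile(model)

print(model(torch.zeros(1, 8)).shape)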
