v0.6.1 mask creation fails #58
Hi @azazellochg, thanks for reporting the issue. I made a fresh setup of TomoTwin on a GPU box with a 2080 Ti and everything runs fine. I think the problem is another tool that it tries to import but fails. Can you try to calculate a UMAP? That should also fail, according to the exception. I quickly googled the issue and it might be related to path problems. What's within yours? Here is how mine looks like:
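The UMAP step I mean is roughly this (a sketch; the embedding file and output directory are placeholders, not paths from this thread):
tomotwin_tools.py umap -i <tomo_embeddings.temb> -o <clustering_output_dir>/
Best, |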
I'm re-running embedding now to do umaps after that. |
There's also a warning which is probably not related to this problem:
|
Does it still run on multiple GPUs? |
Seems so: |
I see CUDA libs in the path, which I could imagine cause that problem. cuml, I guess, expects CUDA 11.8. |
You might be right. I'm using cuda-11.4 libs but have installed TomoTwin with cudatoolkit 11.8. Let me try with 11.8. A quick way to check which CUDA the environment actually picks up (a sketch, assuming a conda env with PyTorch installed):
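echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i cuda
python -c "import torch; print(torch.version.cuda)"
|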
Alright, masking works now with 11.8! I guess I should have tried it before opening the issue.. :) |
That's fine :-) It's good to see all sorts of errors when it comes to debugging :-) Btw, I'm giving a talk soon at LMB. Will we see each other? Best, |
Yep, I'll be here! The embedding has finished (on 4 GPUs) but it is just hanging now... If I log in to the node I see:
And also 137 processes like:
The machine has only 64 cores (HT)... |
Those processes come from the distributed data loader. Can you check if it writes the embedding file? Is the file increasing in size? You could watch it with something like this (the output directory is a placeholder):
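watch -n 10 ls -lh <embedding_output_dir>
|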
There's no file, the output folder is empty... |
Hmm :-/ Are the processes still busy (htop)? |
Yes, the same. But GPU utilization has changed: I'll wait a bit more. |
I've killed it. Now re-running embedding on a single GPU, which worked for the reference-based tutorial. I think the embedding command is the same... |
Not sure how to reproduce it.
I also ran it on our GPU box with 4x 2080 Ti.
Instead of using a single GPU you can also use the old data loader with the parameter -d 0, appended to the usual embedding call, roughly like this (the model/tomogram paths and batch size are placeholders, not values from this thread):
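CUDA_VISIBLE_DEVICES=0,1,2,3 tomotwin_embed.py tomogram -m <model.pth> -v <tomo.mrc> -b <batch_size> -o <out_dir>/ -d 0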
|
With -d 0 it finished correctly on 4x GPUs. |
I've found a bug that might be related to it. I will let you know when a fix is available. |
The fix is now available in the current development release: |
I've installed 0.7.0. Now I'm getting more errors:
|
If I set these extra debugging flags, I get more output: run.stderr.txt. The same error happens with both 0.7.0 and 0.6.1. |
Looks like you have already solved it in the past :D https://discuss.pytorch.org/t/dynamo-exceptions-with-distributeddataprallel-compile/186768 |
Interesting, this is the third machine where I have encountered this issue. Looks like I should add a check whether ldconfig is available, roughly the equivalent of:
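command -v ldconfig >/dev/null || echo "ldconfig not found on PATH"
|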
Hi @thorstenwagner
I'm checking the tutorial with the latest version. All steps are working except mask creation.
source /public/EM/Scipion/conda.rc && conda activate tomotwin-0.6.1 && tomotwin_tools.py embedding_mask -i tomo.mrc -o ../extra/
fails on a single RTX 2080 Ti GPU with:
I'm happy to provide more information if required.