Linking error #62
hi @Tissot11, apologies for the late reply. btw, the shock setup should work in the new dev branch (it will be released as 1.1.0 shortly). regarding your issue -- CUDA actually has very limited support for Intel compilers, and you're essentially trying to use the latest CUDA with the latest Intel, which is a recipe for incompatibility issues. i would try gcc instead; starting from version 9, all CUDA versions work perfectly with it.
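For reference, a minimal sketch of the gcc-based build suggested here. The module names and versions are assumptions and will differ per cluster; the cmake flags are the ones used elsewhere in this thread:

```shell
# Module names/versions are hypothetical -- check `module avail` on your cluster.
module purge
module load gcc/11.2 openmpi/4.1 hdf5/1.14 cuda/12.4

# Force g++ as the host compiler so nvcc does not fall back to the Intel compiler.
cmake -B build \
      -D CMAKE_CXX_COMPILER=g++ \
      -D pgen=srpic/langmuir -D mpi=ON \
      -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_VOLTA70=ON
cmake --build build -j 8
```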
Hi @haykh, thanks for replying. I compiled with Intel because on one HPC machine they didn't have HDF5 either as a standalone module or as a module compiled with GNU and OpenMPI. Anyway, I now manage to compile on another machine without any problems.
However, at runtime I get errors. I presume it requires setting some variables in the submit script. I have asked the technical team, but since they might take time to reply, I also attach the error files here. Could you please have a look and tell me what the problem could be, or whether you have encountered this sort of issue before? I also paste the content of the job script below. I'm extremely happy to hear about the shock setup in the upcoming version of Entity!
@Tissot11 could you try without OpenMP? I previously had issues with that, since CUDA gets confused when multiple CPU threads are running. Regardless, you won't gain much from OpenMP anyway. Also, you might want to enable
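A sketch of the configure line with the OpenMP backend disabled (same flags as in the original report, minus OpenMP):

```shell
# Omit Kokkos_ENABLE_OPENMP (or set it to OFF explicitly) so only the CUDA
# backend and a single host thread per rank are used.
cmake -B build -D pgen=srpic/langmuir -D mpi=ON \
      -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ENABLE_OPENMP=OFF
cmake --build build -j 8
```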
Ok. So I used the following build
I now encountered an issue about the HDF5 root not being set. I tried
But this doesn't help either. Any suggestions for this? I attach the build log and the CMakeError.log output. I looked up the page on configuring environment modules, but it is not clear to me whether it means I should build my own modules or use the basic modules provided by the admins.
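One common way to resolve this class of error is to pass the HDF5 location to CMake explicitly via the standard `HDF5_ROOT` hint. This is a sketch; the install prefix below is a placeholder, usually discoverable via `module show <hdf5-module>`:

```shell
# HDF5_ROOT is the hint variable read by CMake's FindHDF5 module.
export HDF5_ROOT=/path/to/hdf5/install/prefix

cmake -B build -D pgen=srpic/langmuir -D mpi=ON \
      -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_VOLTA70=ON
```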
That's odd; OpenMP has nothing to do with HDF5. Did you clean the
I always re-download Entity for compiling. Without the production flag on, it compiles fine. But at runtime, I see the same error.
Does it make any difference whether Entity is compiled on a node with a GPU present? At the moment, I have compiled it on the login node, where no physical GPU is present.
It should not matter, as long as (a) you specified the correct
I'll try to make a minimal compilable code to test errors like this (also for the future).
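One quick sanity check for a build done on a GPU-less login node (assuming the NVIDIA driver is recent enough to report `compute_cap`) is to confirm on a compute node that the GPU matches the `Kokkos_ARCH_*` flag used at configure time; `VOLTA70` corresponds to a V100, i.e. compute capability 7.0:

```shell
# Run on a GPU node (e.g. inside an interactive job):
nvidia-smi --query-gpu=name,compute_cap --format=csv
```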
Yeah, I also suspect the technical team should provide some tips on how they have configured everything on the cluster. The documentation is the bare minimum, with no examples offered for these GPU + MPI jobs. I am essentially trying whatever I can come up with. I'm sorry, I couldn't understand whether the environment modules page is meant for building your own modules or for using the existing ones on a cluster. In principle, the tool you provide is quite useful for understanding and avoiding these errors at installation and runtime.
It turns out that I could compile and run Entity; the log is attached (langmuir.log). Below is the content of the submit script:
Yes, in theory on a well-maintained cluster you shouldn't have to compile anything yourself. In practice, oftentimes the MPI you use is not compiled with a version of GCC compatible with CUDA (because few people run multi-node GPU jobs). So I try to give an option to compile your own libraries and use them instead of relying on the cluster admins. You can even go as far as compiling your own CUDA toolkit (see the section on conda) with a proper gcc (also downloaded through conda). But there is only so far this can take you: if for whatever reason they have an outdated glibc, or they don't give you access to the UCX library, then you sort of have to rely on whatever modules they provide.

Regarding this last issue, a few comments: in the meantime, i'll try to make a minimal code example which should be able to test whether the problem has anything to do with Entity itself (at this point, it's unlikely).

PS. A general comment, just for the future: you're using 16 GPUs for just 14M particles (so fewer than 1M particles per GPU). That's horribly inefficient. You can/should fit at least 100x more (i.e., increase the box resolution or ppc, or take fewer GPUs); otherwise GPU utilization will be very sub-optimal.
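A hedged sketch of the conda route mentioned above; the package and channel names are assumptions (check what the conda-forge channel currently ships), and versions should be pinned to match the cluster's driver:

```shell
# Illustrative only -- exact package names depend on the channel and version.
conda create -n entity-build -c conda-forge \
      gxx_linux-64 cudatoolkit openmpi hdf5
conda activate entity-build
```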
I did everything you suggested, and I can confirm the other code works. It turns out that the problem is with
Now I need to visualize the results of this test Weibel run. Afterwards, I will try the shock setup I'm actually interested in. Thanks for your tip! I was trying to test
wow, thanks @Tissot11! this is actually very helpful. i'll have a look |
Hi,
I seem to be able to configure and build fine. However, I get an error at link time at the end of the build (see the screenshot). What is the likely cause?
I used the following commands for building:
```shell
module load compiler/intel/2023.1.0 mpi/impi/2021.11 lib/hdf5/1.14.4-intel-2023.1.0 devel/cuda/12.4
cmake -B build -D pgen=srpic/langmuir -D mpi=ON -D Kokkos_ENABLE_CUDA=ON -D Kokkos_ARCH_VOLTA70=ON -D Kokkos_ENABLE_OPENMP=ON
cmake --build build -j 8
```