
Newer Nvidia GPUs with more than 4GB memory, report max 4GB memory under CUDA #1773

Closed
Ageless93 opened this issue Jan 28, 2017 · 40 comments · Fixed by #4757


@Ageless93
Contributor

As reported on the BOINC forums, for newer Nvidia GPUs with more than 4 gigabytes of video memory, BOINC reports a maximum of 4096 MB under CUDA. OpenCL detects the memory correctly.

Checking https://github.com/BOINC/boinc/blob/master/client/gpu_nvidia.cpp, line 195, total memory is read into an int. An int is 32-bit, so the maximum amount of memory it can express is 4 GB. I think it needs a long long int here (referencing http://en.cppreference.com/w/cpp/language/types) to be able to report more than 4 GB of memory (64-bit).
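
For illustration, a minimal standalone check of where the 4096 MB ceiling comes from (nothing BOINC-specific, just the arithmetic of a 32-bit quantity):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // The largest span a 32-bit quantity can describe is 2^32 bytes,
        // which is exactly 4096 MB -- hence the reported ceiling.
        uint64_t ceiling = 1ull << 32;   // 4294967296 bytes
        printf("%llu MB\n", (unsigned long long)(ceiling / (1024 * 1024)));   // prints: 4096 MB
        return 0;
    }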

@AenBleidd
Member

AenBleidd commented Jan 28, 2017

I think the 'int' is the result of the function call - it is the status code. The function returns the 'max memory value' through its first argument, which is a size_t.
Are you using 64-bit or 32-bit BOINC?

@ChristianBeer
Member

This seems to be a "feature" of the CUDA SDK. Older versions always report the memory as a 32-bit integer. Exhibit 1, Exhibit 2.

I guess one has to experiment to find which CUDA SDK reports the correct amount and build the client against that. The client itself uses a size_t to retrieve the value, which then gets converted into a double. If we compile on a 64-bit machine, the size_t should be big enough.
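
For illustration, a minimal sketch of that retrieval pattern with a stand-in for the driver call (the function and variable names are made up, not the actual client code); a double represents integers exactly up to 2^53, so the widening conversion itself is safe:

    #include <cstddef>
    #include <cstdio>

    // Stand-in for the resolved driver entry point (illustrative only).
    static int fake_cuDeviceTotalMem(size_t* bytes, int /*device*/) {
        *bytes = 8ull * 1024 * 1024 * 1024;   // pretend the card has 8 GB
        return 0;                             // success
    }

    int main() {
        size_t global_mem = 0;                 // 64 bits wide on a 64-bit build
        if (fake_cuDeviceTotalMem(&global_mem, 0) == 0) {
            double total = (double)global_mem; // widened to double for reporting
            printf("%.0f MB\n", total / (1024.0 * 1024.0));   // prints: 8192 MB
        }
        return 0;
    }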

@ChristianBeer added this to the Client/Manager 8.0 milestone Apr 12, 2017
@AenBleidd
Member

@RichardHaselgrove, you've been working on similar stuff while fixing #2706.
Is the issue described here still present?

@RichardHaselgrove
Contributor

And even longer than that. We had a problem when the first 2GB cards came out, with garbage sizes reported for a while. Berkeley couldn't run to the latest hardware, so Rom remoted into a machine I'd just built to debug and create 26f9e38.

The 4 GB reporting limit still applies (and has been reported again in the context of the RTX 2080 release), but fixing it would involve delving into the NVidia SDKs to find a usable 64-bit implementation that could co-exist with our legacy 32-bit code without losing our support for older cards.

@Ageless93
Contributor Author

All right, I got a code fix in for this problem. The problem lies in the use of cuDeviceTotalMem, which returns a 32-bit value. Using cuDeviceTotalMem_v2 fixes this for GPUs with more than 4 GB. It needs further testing with older GPUs and on a 32-bit OS.

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..73ba787 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,19 @@ void* cudalib = NULL;
     p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
     p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
     p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
-    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+    if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+        p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
     p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
-    p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
-    p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+    if (sizeof(void *)<8 ||
+        !(p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate_v2")) ||
+        !(p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy_v2"))) {
+        p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
+        p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+    }
     p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
     p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
-    p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+    if (sizeof(size_t)<8 || !(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))
+        p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
 #endif

     if (!p_cuDriverGetVersion) {

@Ageless93
Contributor Author

And the corresponding explanation:

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib, probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different API, so doing that in the boinc client would have required huge changes.

Then I snooped through the symbol names contained in the libcuda.so that cuDeviceTotalMem comes from. There was a mysterious cuDeviceTotalMem_v2, so I tried what would happen if I made Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running on a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, it will just show the wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem, because this isn't a 'cheat' but a legit bug fix.

The library has the old 32-bit version in it to stay binary compatible with code compiled against the old version of the library. Any new 64-bit code normally linked with the library would be compiled using 64-bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function, and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid a dependency on the Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way': the running code finds the library file, extracts individual symbol names from it and casts them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely on their end.
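
To make the 'manual linking' part concrete, here is a small self-contained probe (a sketch, not BOINC code; Linux-only, compile with -ldl, and it assumes the driver's libcuda.so.1 is installed). It only resolves the symbols the same way the patch does - it doesn't call them, since that would also require cuInit and a device handle:

    #include <dlfcn.h>
    #include <cstddef>
    #include <cstdio>

    typedef int (*total_mem_fn)(size_t*, int);   // same cast the client uses

    int main() {
        void* cudalib = dlopen("libcuda.so.1", RTLD_NOW);
        if (!cudalib) {
            fprintf(stderr, "no CUDA driver library: %s\n", dlerror());
            return 1;
        }

        // Prefer the 64-bit-safe entry point; fall back to the legacy one.
        const char* which = "cuDeviceTotalMem_v2";
        total_mem_fn total_mem = (total_mem_fn) dlsym(cudalib, which);
        if (!total_mem) {
            which = "cuDeviceTotalMem";
            total_mem = (total_mem_fn) dlsym(cudalib, which);
        }
        printf("resolved: %s\n", total_mem ? which : "(neither symbol found)");

        dlclose(cudalib);
        return 0;
    }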

Also my version is a hack that may fail or even crash if you compile the code in a 32-bit environment. My code uses the old version only when the new version doesn't exist in the library, but to make it compatible with a 32-bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression that is always true on 32-bit systems, so the compiler will optimize the entire if test away, leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit from doing this same thing too.
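
A tiny standalone example of the constant-expression point above (purely illustrative) - the condition is known at compile time, so only one branch survives in the generated code:

    #include <cstddef>
    #include <cstdio>

    int main() {
        // sizeof(size_t) is a compile-time constant, so the compiler folds
        // this test away and emits only the branch that matches the target.
        if (sizeof(size_t) < 8) {
            puts("32-bit build: only the legacy cuDeviceTotalMem symbol is looked up");
        } else {
            puts("64-bit build: try cuDeviceTotalMem_v2 first, fall back if absent");
        }
        return 0;
    }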

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants for cuCtxCreate and cuCtxDestroy, and for them I do the test once and replace both with _v2 or neither of them, because I don't think it is a smart idea to create a context with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64-bit, but I didn't modify them because Boinc isn't actually using them - just testing their existence. The function signatures for them in this boinc code are actually plain wrong, so if it tried to actually use them, it would probably crash or do something unintended.

@Ville-Saari

The above explanation may be a bit confusing, as it was concatenated together from multiple different posts I wrote on the Seti@Home GPUUG team forum.

@AenBleidd
Member

@Ville-Saari, if it's your code above, could you please make a proper PR?
Also, @davidpanderson has said that there will be no x86 client anymore, so all the 32-bit checks could be removed, I believe.

@RichardHaselgrove
Contributor

I'm working on converting this code for both Windows and Linux. The original patch is giving me compiler warnings on VS 2013, but it works. If anybody wants to re-code it in accordance with BOINC house style, please feel free.

The theory is simple: if _v2 versions of cuDeviceTotalMem and cuMemGetInfo exist in the active library, use them; otherwise, keep the originals. I think the _v2 versions became available around 2012, with CUDA 4.
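
If anyone does re-code it in house style, one option (just a sketch; the helper name is invented, not existing BOINC code) is to factor the probe-and-fall-back step into a small helper so each pointer assignment stays on one line:

    #include <dlfcn.h>

    // Hypothetical helper: resolve "<name>_v2" if present, otherwise the legacy "<name>".
    static void* dlsym_v2_or_legacy(void* lib, const char* name_v2, const char* name) {
        void* sym = dlsym(lib, name_v2);
        return sym ? sym : dlsym(lib, name);
    }

    // Usage, mirroring the patch above:
    //   p_cuDeviceTotalMem = (int(*)(size_t*, int))
    //       dlsym_v2_or_legacy(cudalib, "cuDeviceTotalMem_v2", "cuDeviceTotalMem");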

@Ville-Saari

Those 32-bit checks won't actually produce any code, as they are constant expressions: the compiler removes the test and compiles only the branch that matches the architecture it is compiling for.

I think using the 32-bit versions of the calls (the ones without _v2 in their name) when running on a 64-bit system may even be dangerous. The fact that the total memory query works at all with the 32-bit call relies on the code running on a little-endian system and on the variable being explicitly initialized to 0, so that the result ends up in the less significant half of the 64-bit integer. That function is meant to receive a pointer to a 32-bit value.
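
A standalone illustration of that hazard (not BOINC code) - the 32-bit result only lands in the right place because the 64-bit variable was zero-initialized and the machine is little-endian:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        uint64_t total = 0;                        // must be zero-initialized
        uint32_t written_by_driver = 0xC0000000u;  // e.g. ~3 GB, fits in 32 bits

        // The legacy call writes 32 bits through the pointer it is given.
        // On a little-endian machine those bytes are the low half of 'total',
        // so the value comes out right; on big-endian it would be garbage.
        memcpy(&total, &written_by_driver, sizeof(written_by_driver));

        printf("%llu bytes\n", (unsigned long long)total);  // 3221225472 on little-endian
        return 0;
    }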

@RichardHaselgrove
Contributor

Yes, we had that problem when 4GB GPUs first came on to the market in 2012. @romw had to diagnose it for himself to make 2882d5b - we've been OK since then.

@CharlieFenton
Contributor

I added a workaround for this problem on Macs in 2013. See commits 631e236 and 4d74c5a (26 and 27 June 2013). I hope you have found a better way to do this.

@KeithMyers

I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and the problem is solved.

Why hasn't anybody simply merged the code?

@AenBleidd
Member

> I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and the problem is solved.
>
> Why hasn't anybody simply merged the code?

Probably because nobody has created a proper PR for this.

@KeithMyers

Why can't you do that? You are a developer. Forget about Ville doing it. He left BOINC once Seti finished. No sign of him since then.

@AenBleidd
Member

Actually, he replied back two years ago: #1773 (comment)
Personally, I don't want to commit code without the permission of its author(s).

@KeithMyers

So since that is never going to happen - i.e. he left no contact information to get ahold of him - this issue will never be fixed?
So how to get around not having permission from the author? None?
He gave permission to use his code for all GPUUG team members. Does that suffice?

@AenBleidd
Member

AenBleidd commented Jun 3, 2022

> He gave permission to use his code for all GPUUG team members. Does that suffice?

If he gave his permission and you have proof of it, I can use that code and create a PR.

> So how to get around not having permission from the author? None?

Usually you have to reimplement/fix the functionality without using the original code. Otherwise it's a violation of intellectual property rights.

@KeithMyers

So I need to find an explicit permission message from Ville somewhere in all the threads he posted.
I don't think he ever did that. He just created our Pandora client for the team members and implicitly gave permission to use his code.

@KeithMyers

Vitalii, these forum messages are the closest I can come to explicit permission for Ville's code fix.

Hey Ville,

Can you post (or PM to me) the sections of code that you changed to implement the memory fix? I was talking to Richard about it, and he would like the information to make an official report on github to try to get it implemented to the official BOINC code. This is something that I think is beneficial to the community as a whole.

Thanks!

Here you are:

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..b0689c9 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,15 @@ void* cudalib = NULL;
     p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
     p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
     p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
-    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+    if (!(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+       p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
     p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
     p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
     p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
     p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
     p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
-    p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+    if (!(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))       
+       p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
 #endif
 
     if (!p_cuDriverGetVersion) {

And Juan's reply right after that.

Richard now has the code for the memory fix since Ville agrees to share.

@KeithMyers

OK, you only got the executables. So how did the code snippet end up here on #1773? I see Ville posted an explanation here of the problem and its solution, so I can assume he posted the code snippet.

Why doesn't that satisfy the requirements?

@KeithMyers

OK, it looks like Jord got the code from Ville via those posts.

@RichardHaselgrove
Contributor

Checked my PMs. Yes, Juan declined to send me code (collective decision of GPUUG), but that was to do with GPU spoofing to break the 'max tasks per GPU' cache limit.

I was sent code to address this current issue, but that was by Ian&Steve C. - and very late in the lifetime of the SETI@Home project, when we were all getting very tired and irritable. The associated discussion thread in the forum was eventually hidden.

@Ageless93
Contributor Author

Late last night I saw I was being mentioned, and I must say I didn't even remember any of this. So I looked around and found that I had got a PM at Seti from Juan BFP with the code, followed by one with a long explanation. I'll add that to this thread. It may explain the problems you guys are talking about in the other ticket.

@Ageless93
Contributor Author

As requested, this is the related info from the developer of the fix.

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib, probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different API, so doing that in the boinc client would have required huge changes.

Then I snooped through the symbol names contained in the libcuda.so that cuDeviceTotalMem comes from. There was a mysterious cuDeviceTotalMem_v2, so I tried what would happen if I made Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running on a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, it will just show the wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem, because this isn't a 'cheat' but a legit bug fix.

The library has the old 32-bit version in it to stay binary compatible with code compiled against the old version of the library. Any new 64-bit code normally linked with the library would be compiled using 64-bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function, and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid a dependency on the Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way': the running code finds the library file, extracts individual symbol names from it and casts them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely on their end.

Also my version is a hack that may fail or even crash if you compile the code in a 32-bit environment. My code uses the old version only when the new version doesn't exist in the library, but to make it compatible with a 32-bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression that is always true on 32-bit systems, so the compiler will optimize the entire if test away, leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit from doing this same thing too.

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants for cuCtxCreate and cuCtxDestroy, and for them I do the test once and replace both with _v2 or neither of them, because I don't think it is a smart idea to create a context with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64-bit, but I didn't modify them because Boinc isn't actually using them - just testing their existence. The function signatures for them in this boinc code are actually plain wrong, so if it tried to actually use them, it would probably crash or do something unintended.

@Ageless93
Contributor Author

Found one more private message from Juan, which I read as permission to use the code:

We are happy to share our findings with the rest of the community.
There are some others enhancements added to our client, some are very interesting.
We could talk about them in the future.

@RichardHaselgrove
Contributor

Well, at one level, we don't need any permission from anyone. The calls we're using are all documented in the CUDA Toolkit Documentation (currently at v11.7.0) - at section 6.5, Device Management, of the CUDA Driver API.

@Ageless93
Contributor Author

It's what Vitalii was asking for:

> He gave permission to use his code for all GPUUG team members. Does that suffice?
>
> If he gave his permission and you have proof of it, I can use that code and create a PR.

@RichardHaselgrove
Contributor

I think David - with #4757 - has somewhat pre-empted that request.

For the record, the majority of the _v2 variants aren't explicitly covered in the toolkit documentation, but are listed in https://forums.developer.nvidia.com/t/driverapi-problem-with-cudevicetotalmem/25585 (January 2012).

@AenBleidd
Member

@KeithMyers and @Ageless93 posted two different diffs. Both of them look similar to what David did in #4757, but the last one is not working properly, as far as I can see from the comments.
I didn't check either of the diffs posted above in this thread, but I have a feeling that they are just old versions, and not exactly what was used when building your customized client.
So it would be nice if @Ville-Saari replied in this thread and either created a proper PR or at least shared the latest diff that was used to build your customized client.

@Ageless93
Contributor Author

> So it would be nice if @Ville-Saari replied in this thread and either created a proper PR or at least shared the latest diff that was used to build your customized client.

Yes, but I don't think it was his (Ville's) code, which is why he never made the PR. Going by what Juan posted to me (which I have quoted above for the second time - I see I first quoted it on 21 March 2020), I think it was Juan's code.

@AenBleidd
Member

@Ageless93, ok, then it would be nice to reach out to Juan somehow and kindly ask them to share their work.

@davidpanderson
Contributor

Is there a case where #4757 isn't working?

@AenBleidd
Member

@davidpanderson, according to this report it doesn't work on linux: #4757 (comment)

@IanSteveC

It was always Ville's code; he wrote it. Juan was just acting as a liaison until Ville reached out himself in here.

Both Juan and Ville have been MIA and unreachable for over a year. That's why neither of them has been here to contribute comments. We already tried reaching both of them.
