
Newer Nvidia GPUs with more than 4GB memory, report max 4GB memory under CUDA #1773

Closed
Ageless93 opened this issue Jan 28, 2017 · 40 comments · Fixed by #4757


@Ageless93
Contributor

As reported on the BOINC forums, for newer Nvidia GPUs with more than 4 gigabytes of video memory, BOINC reports a maximum of 4096 MB under CUDA. OpenCL detects the memory correctly.

Checking https://github.com/BOINC/boinc/blob/master/client/gpu_nvidia.cpp, line 195, total memory is read into an int. An int is 32-bit, so the maximum amount of memory it can express is 4 GB. I think it needs a long long int here (referencing http://en.cppreference.com/w/cpp/language/types) to be able to report more than 4 GB of memory (64-bit).
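
For illustration, a minimal standalone check of where the 4096 MB ceiling comes from (nothing BOINC-specific, just the arithmetic of a 32-bit quantity):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // The largest span a 32-bit quantity can describe is 2^32 bytes,
        // which is exactly 4096 MB -- hence the reported ceiling.
        uint64_t ceiling = 1ull << 32;   // 4294967296 bytes
        printf("%llu MB\n", (unsigned long long)(ceiling / (1024 * 1024)));   // prints: 4096 MB
        return 0;
    }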

@AenBleidd
Member

AenBleidd commented Jan 28, 2017

I think the 'int' is the result of the function call - it is the status code. The function returns the 'max memory value' through its first argument, which is a size_t.
Are you using 64-bit or 32-bit BOINC?

@ChristianBeer
Member

This seems to be a "feature" of the CUDA SDK. Older versions always report the memory as a 32-bit integer. Exhibit 1, Exhibit 2.

I guess one has to experiment to find which CUDA SDK reports the correct amount and build the client against that. The client itself uses a size_t to retrieve the value, which then gets converted into a double. If we compile on a 64-bit machine, the size_t should be big enough.
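
For illustration, a minimal sketch of that retrieval pattern with a stand-in for the driver call (the function and variable names are made up, not the actual client code); a double represents integers exactly up to 2^53, so the widening conversion itself is safe:

    #include <cstddef>
    #include <cstdio>

    // Stand-in for the resolved driver entry point (illustrative only).
    static int fake_cuDeviceTotalMem(size_t* bytes, int /*device*/) {
        *bytes = 8ull * 1024 * 1024 * 1024;   // pretend the card has 8 GB
        return 0;                             // success
    }

    int main() {
        size_t global_mem = 0;                 // 64 bits wide on a 64-bit build
        if (fake_cuDeviceTotalMem(&global_mem, 0) == 0) {
            double total = (double)global_mem; // widened to double for reporting
            printf("%.0f MB\n", total / (1024.0 * 1024.0));   // prints: 8192 MB
        }
        return 0;
    }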

@ChristianBeer added this to the Client/Manager 8.0 milestone Apr 12, 2017
@AenBleidd
Member

@RichardHaselgrove, you've been working on similar stuff while fixing #2706.
Is the issue described here still present?

@RichardHaselgrove
Contributor

And even longer than that. We had a problem when the first 2GB cards came out, with garbage sizes reported for a while. Berkeley couldn't run to the latest hardware, so Rom remoted into a machine I'd just built to debug and create 26f9e38.

The 4 GB reporting limit still applies (and has been reported again in the context of the RTX 2080 release), but fixing it would involve delving into the NVidia SDKs to find a usable 64-bit implementation that could co-exist with our legacy 32-bit code without losing our support for older cards.

@Ageless93
Contributor Author

All right, I got a code fix in for this problem. The problem lies in the use of cuDeviceTotalMem, which returns a 32-bit value. Using cuDeviceTotalMem_v2 fixes this for GPUs with more than 4 GB. It needs further testing with older GPUs and on a 32-bit OS.

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..73ba787 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,19 @@ void* cudalib = NULL;
     p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
     p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
     p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
-    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+    if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+        p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
     p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
-    p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
-    p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+    if (sizeof(void *)<8 ||
+        !(p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate_v2")) ||
+        !(p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy_v2"))) {
+        p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
+        p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+    }
     p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
     p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
-    p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+    if (sizeof(size_t)<8 || !(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))
+        p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
 #endif

     if (!p_cuDriverGetVersion) {

@Ageless93
Contributor Author

And the corresponding explanation:

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib, probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different API, so doing that in the boinc client would have required huge changes.

Then I snooped through the symbol names contained in the libcuda.so that cuDeviceTotalMem comes from. There was a mysterious cuDeviceTotalMem_v2, so I tried what would happen if I made Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running on a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, it will just show the wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem, because this isn't a 'cheat' but a legit bug fix.

The library has the old 32-bit version in it to stay binary compatible with code compiled against the old version of the library. Any new 64-bit code normally linked with the library would be compiled using 64-bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function, and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid a dependency on the Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way': the running code finds the library file, extracts individual symbol names from it and casts them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely on their end.
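
To make the 'manual linking' part concrete, here is a small self-contained probe (a sketch, not BOINC code; Linux-only, compile with -ldl, and it assumes the driver's libcuda.so.1 is installed). It only resolves the symbols the same way the patch does - it doesn't call them, since that would also require cuInit and a device handle:

    #include <dlfcn.h>
    #include <cstddef>
    #include <cstdio>

    typedef int (*total_mem_fn)(size_t*, int);   // same cast the client uses

    int main() {
        void* cudalib = dlopen("libcuda.so.1", RTLD_NOW);
        if (!cudalib) {
            fprintf(stderr, "no CUDA driver library: %s\n", dlerror());
            return 1;
        }

        // Prefer the 64-bit-safe entry point; fall back to the legacy one.
        const char* which = "cuDeviceTotalMem_v2";
        total_mem_fn total_mem = (total_mem_fn) dlsym(cudalib, which);
        if (!total_mem) {
            which = "cuDeviceTotalMem";
            total_mem = (total_mem_fn) dlsym(cudalib, which);
        }
        printf("resolved: %s\n", total_mem ? which : "(neither symbol found)");

        dlclose(cudalib);
        return 0;
    }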

Also my version is a hack that may fail or even crash if you compile the code in a 32-bit environment. My code uses the old version only when the new version doesn't exist in the library, but to make it compatible with a 32-bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression that is always true on 32-bit systems, so the compiler will optimize the entire if test away, leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit from doing this same thing too.
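
A tiny standalone example of the constant-expression point above (purely illustrative) - the condition is known at compile time, so only one branch survives in the generated code:

    #include <cstddef>
    #include <cstdio>

    int main() {
        // sizeof(size_t) is a compile-time constant, so the compiler folds
        // this test away and emits only the branch that matches the target.
        if (sizeof(size_t) < 8) {
            puts("32-bit build: only the legacy cuDeviceTotalMem symbol is looked up");
        } else {
            puts("64-bit build: try cuDeviceTotalMem_v2 first, fall back if absent");
        }
        return 0;
    }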

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants for cuCtxCreate and cuCtxDestroy, and for them I do the test once and replace both with _v2 or neither of them, because I don't think it is a smart idea to create a context with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64-bit, but I didn't modify them because Boinc isn't actually using them - just testing their existence. The function signatures for them in this boinc code are actually plain wrong, so if it tried to actually use them, it would probably crash or do something unintended.

@Ville-Saari

The above explanation may be a bit confusing, as it was concatenated together from multiple different posts I wrote on the Seti@Home GPUUG team forum.

@AenBleidd
Member

@Ville-Saari, if it's your code above, could you please make a proper PR?
Also, @davidpanderson has said that there will be no x86 client anymore, so all the 32-bit checks could be removed, I believe.

@RichardHaselgrove
Contributor

I'm working on converting this code for both Windows and Linux. The original patch is giving me compiler warnings on VS 2013, but it works. If anybody wants to re-code it in accordance with BOINC house style, please feel free.

The theory is simple: if _v2 versions of cuDeviceTotalMem and cuMemGetInfo exist in the active library, use them; otherwise, keep the originals. I think the _v2 versions became available around 2012, with CUDA 4.
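
If anyone does re-code it in house style, one option (just a sketch; the helper name is invented, not existing BOINC code) is to factor the probe-and-fall-back step into a small helper so each pointer assignment stays on one line:

    #include <dlfcn.h>

    // Hypothetical helper: resolve "<name>_v2" if present, otherwise the legacy "<name>".
    static void* dlsym_v2_or_legacy(void* lib, const char* name_v2, const char* name) {
        void* sym = dlsym(lib, name_v2);
        return sym ? sym : dlsym(lib, name);
    }

    // Usage, mirroring the patch above:
    //   p_cuDeviceTotalMem = (int(*)(size_t*, int))
    //       dlsym_v2_or_legacy(cudalib, "cuDeviceTotalMem_v2", "cuDeviceTotalMem");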

@Ville-Saari

Those 32-bit checks won't actually produce any code, as they are constant expressions: the compiler removes the test and compiles only the branch that matches the architecture it is compiling for.

I think using the 32-bit versions of the calls (the ones without _v2 in their name) when running on a 64-bit system may even be dangerous. The fact that the total memory query works at all with the 32-bit call relies on the code running on a little-endian system and on the variable being explicitly initialized to 0, so that the result ends up in the less significant half of the 64-bit integer. That function is meant to receive a pointer to a 32-bit value.
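
A standalone illustration of that hazard (not BOINC code) - the 32-bit result only lands in the right place because the 64-bit variable was zero-initialized and the machine is little-endian:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        uint64_t total = 0;                        // must be zero-initialized
        uint32_t written_by_driver = 0xC0000000u;  // e.g. ~3 GB, fits in 32 bits

        // The legacy call writes 32 bits through the pointer it is given.
        // On a little-endian machine those bytes are the low half of 'total',
        // so the value comes out right; on big-endian it would be garbage.
        memcpy(&total, &written_by_driver, sizeof(written_by_driver));

        printf("%llu bytes\n", (unsigned long long)total);  // 3221225472 on little-endian
        return 0;
    }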

@RichardHaselgrove
Contributor

Yes, we had that problem when 4GB GPUs first came on to the market in 2012. @romw had to diagnose it for himself to make 2882d5b - we've been OK since then.

@CharlieFenton
Contributor

I added a workaround for this problem on Macs in 2013. See commits 631e236 and 4d74c5a (26 and 27 June 2013). I hope you have found a better way to do this.

@KeithMyers

I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and the problem is solved.

Why hasn't anybody simply merged the code?

@AenBleidd
Member

> I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and the problem is solved.
>
> Why hasn't anybody simply merged the code?

Probably because nobody has created a proper PR for this.

@KeithMyers

Why can't you do that? You are a developer. Forget about Ville doing it. He left BOINC once Seti finished. No sign of him since then.

@AenBleidd
Member

Actually, he replied back two years ago: #1773 (comment)
Personally, I don't want to commit code without the permission of its author(s).

@KeithMyers

So since that is never going to happen - i.e. he left no contact information to get ahold of him - this issue will never be fixed?
So how to get around not having permission from the author? None?
He gave permission to use his code for all GPUUG team members. Does that suffice?

@AenBleidd
Member

AenBleidd commented Jun 3, 2022

> He gave permission to use his code for all GPUUG team members. Does that suffice?

If he gave his permission and you have proof of it, I can use that code and create a PR.

> So how to get around not having permission from the author? None?

Usually you have to reimplement/fix the functionality without using the original code. Otherwise it's a violation of intellectual property rights.

@KeithMyers

So I need to find an explicit permission message from Ville somewhere in all the threads he posted.
I don't think he ever did that. He just created our Pandora client for the team members and implicitly gave permission to use his code.

@KeithMyers

Vitalii, these forum messages are the closest I can come to explicit permission for Ville's code fix.

Hey Ville,

Can you post (or PM to me) the sections of code that you changed to implement the memory fix? I was talking to Richard about it, and he would like the information to make an official report on github to try to get it implemented to the official BOINC code. This is something that I think is beneficial to the community as a whole.

Thanks!

Here you are:

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..b0689c9 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,15 @@ void* cudalib = NULL;
     p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
     p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
     p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
-    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+    if (!(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+       p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
     p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
     p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
     p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
     p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
     p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
-    p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+    if (!(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))       
+       p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
 #endif
 
     if (!p_cuDriverGetVersion) {

And Juan's reply right after that.

Richard now has the code for the memory fix since Ville agrees to share.

@KeithMyers

OK, you only got the executables. So how did the code snippet end up here on #1773? I see Ville posted an explanation here of the problem and its solution, so I can assume he posted the code snippet.

Why doesn't that satisfy the requirements?

@KeithMyers

OK, it looks like Jord got the code from Ville via those posts.

@RichardHaselgrove
Contributor

Checked my PMs. Yes, Juan declined to send me code (collective decision of GPUUG), but that was to do with GPU spoofing to break the 'max tasks per GPU' cache limit.

I was sent code to address this current issue, but that was by Ian&Steve C. - and very late in the lifetime of the SETI@Home project, when we were all getting very tired and irritable. The associated discussion thread in the forum was eventually hidden.

@Ageless93
Contributor Author

Late last night I saw I was being mentioned, and I must say I didn't even remember any of this. So I looked around and found that I had got a PM at Seti from Juan BFP with the code, followed by one with a long explanation. I'll add that to this thread. It may explain the problems you guys are talking about in the other ticket.

@Ageless93
Contributor Author

As requested, this is the related info from the developer of the fix.

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib, probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different API, so doing that in the boinc client would have required huge changes.

Then I snooped through the symbol names contained in the libcuda.so that cuDeviceTotalMem comes from. There was a mysterious cuDeviceTotalMem_v2, so I tried what would happen if I made Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running on a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, it will just show the wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem, because this isn't a 'cheat' but a legit bug fix.

The library has the old 32-bit version in it to stay binary compatible with code compiled against the old version of the library. Any new 64-bit code normally linked with the library would be compiled using 64-bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function, and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid a dependency on the Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way': the running code finds the library file, extracts individual symbol names from it and casts them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely on their end.

Also my version is a hack that may fail or even crash if you compile the code in a 32-bit environment. My code uses the old version only when the new version doesn't exist in the library, but to make it compatible with a 32-bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression that is always true on 32-bit systems, so the compiler will optimize the entire if test away, leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit from doing this same thing too.

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants for cuCtxCreate and cuCtxDestroy, and for them I do the test once and replace both with _v2 or neither of them, because I don't think it is a smart idea to create a context with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64-bit, but I didn't modify them because Boinc isn't actually using them - just testing their existence. The function signatures for them in this boinc code are actually plain wrong, so if it tried to actually use them, it would probably crash or do something unintended.

@Ageless93
Contributor Author

Found one more private message from Juan, which I read as permission to use the code:

We are happy to share our findings with the rest of the community.
There are some others enhancements added to our client, some are very interesting.
We could talk about them in the future.

@RichardHaselgrove
Contributor

Well, at one level, we don't need any permission from anyone. The calls we're using are all documented in the CUDA Toolkit Documentation (currently at v11.7.0) - at section 6.5, Device Management, of the CUDA Driver API.

@Ageless93
Contributor Author

It's what Vitalii was asking for:

> He gave permission to use his code for all GPUUG team members. Does that suffice?
>
> If he gave his permission and you have proof of it, I can use that code and create a PR.

@RichardHaselgrove
Contributor

I think David - with #4757 - has somewhat pre-empted that request.

For the record, the majority of the _v2 variants aren't explicitly covered in the toolkit documentation, but are listed in https://forums.developer.nvidia.com/t/driverapi-problem-with-cudevicetotalmem/25585 (January 2012).

@AenBleidd
Member

@KeithMyers and @Ageless93 posted two different diffs. Both of them look similar to what David did in #4757, but the last one is not working properly, as far as I can see from the comments.
I didn't check either of the diffs posted above in this thread, but I have a feeling that they are just old versions, and not exactly what was used when building your customized client.
So it would be nice if @Ville-Saari replied in this thread and either created a proper PR or at least shared the latest diff that was used to build your customized client.

@Ageless93
Contributor Author

> So it would be nice if @Ville-Saari replied in this thread and either created a proper PR or at least shared the latest diff that was used to build your customized client.

Yes, but I don't think it was his (Ville's) code, which is why he never made the PR. Going by what Juan posted to me (which I have quoted above for the second time - I see I first quoted it on 21 March 2020), I think it was Juan's code.

@AenBleidd
Member

@Ageless93, ok, then it would be nice to reach out to Juan somehow and kindly ask them to share their work.

@davidpanderson
Contributor

Is there a case where #4757 isn't working?

@AenBleidd
Member

@davidpanderson, according to this report it doesn't work on linux: #4757 (comment)

@IanSteveC

It was always Ville's code; he wrote it. Juan was just acting as a liaison until Ville reached out himself in here.

Both Juan and Ville have been MIA and unreachable for over a year. That's why neither of them has been here to contribute comments. We already tried reaching both of them.
