[Question] What is the expected discrepancy between simulated and actually computed values? #261
I think you're right. I suspect @qwopqwop200 focuses mostly on perplexity evaluation to confirm correctness.
Either way, I'm not an expert, but I believe fully deterministic results (to the decimal) are generally impossible on the GPU, because the order of operations affects the floating-point error. The GPU launches hundreds or thousands of jobs (threads, warps, etc.) and chooses how to schedule them. Combine that with dynamic throttling and there's no way to tell in what order they'll complete and accumulate into the output elements. (You could add code to synchronise it at a performance cost, which might be worthwhile if you're really in the weeds examining precision problems.)
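To make the point concrete, here's a tiny illustration (my own toy example, nothing to do with the kernels here) of how the accumulation order alone can change a floating-point result:

```python
import torch

# Floating-point addition is not associative, so the accumulation order
# chosen by the GPU scheduler can change the low-order bits of a result.
# This shows the same effect on the CPU by summing in two different orders.
torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

sum_forward = x.sum()                                             # one accumulation order
sum_chunked = torch.stack([c.sum() for c in x.chunk(97)]).sum()   # another order

print(sum_forward.item(), sum_chunked.item())
print('difference:', (sum_forward - sum_chunked).abs().item())    # typically non-zero
```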
I don't know on the
So it looks like if your error is 1e-5 you're actually doing quite well? Maybe I misunderstood what you're measuring. I used this code:

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    full_model_output = layer(vec)
    quant_model_output = qlayer(vec)
    print('Full model output:', full_model_output)
    print('Quantized model output:', quant_model_output)

    # Compute Mean Absolute Error (MAE)
    mae = F.l1_loss(quant_model_output, full_model_output)
    print('Mean Absolute Error (MAE):', mae)

    # Compute Root Mean Square Error (RMSE)
    mse = F.mse_loss(quant_model_output, full_model_output)
    rmse = torch.sqrt(mse)
    print('Root Mean Square Error (RMSE):', rmse)
```

Incidentally, I believe RMSE is a better way to characterise the error here. Since neural networks have activation functions and normalisation, small amounts of absolute error have little impact (in theory).
I understand using the perplexity check is the ultimate way to test it. The problem is that I can't even load the 7B models without quantization: they need 7*4 = 28 GB in FP32, and I only have 16 GB of main memory and 8 GB of VRAM. It could also take days to finish on a modest GPU (I think my GPU is about 1/40 the speed of an RTX 4090). It's also a good idea to check the individual components. I found that some of the last commits in the old_cuda branch introduce a lot of error even when the kernels remain the same, so these inaccuracies are introduced in the Python code. Knowing the source of the inaccuracies helps to optimize the code.
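As a rough sanity check on those numbers (my own back-of-the-envelope sketch, counting only the weights and ignoring activations, context cache and framework overhead):

```python
# Rough weight-storage cost for a 7B-parameter model at different precisions.
params = 7e9

for name, bytes_per_param in [('FP32', 4), ('FP16', 2), ('4-bit', 0.5)]:
    print(f'{name}: ~{params * bytes_per_param / 1e9:.1f} GB')

# FP32: ~28.0 GB, FP16: ~14.0 GB, 4-bit: ~3.5 GB
```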
Thanks for confirming it.
Isn't it generated using random? (i.e. a pseudorandom number generator)
Ugh! I didn't know that. But given the small size of the kernels, I guess these errors should be much smaller than the ones introduced by the approximation.
I'll check using the same metrics you used. Thanks for the information! It gives me a known reference.
Most probably; I think I saw some comments about using RMSE.
I added MAE and RMSE metrics, here is what I get for the different cases:
I'm using the old-cuda code, but not the last commit: I'm using this repo, which is basically a fork of this repo. When I tried the last commits from old-cuda, something didn't work in Ooba Gooba, which is why I kept the changes WapaMario used.

Now the values: the first part is the timing; I added the speed gain relative to FP32. I labelled the "normal" implementation "N bits FP32" and the "faster" implementation "FP16". I uploaded binary wheels here; they just contain the dynamic lib with the bindings. I'll see if I can get a traceback of what is failing with the last commits from old-cuda.

P.S. PyTorch 1.13.1 is the last version I can get working on my Radeon board. All 2.x versions I tried generate memory faults. I think WapaMario also had this problem.
Right, to clarify I was testing in FP16. So since "faster" is your FP16 run, your MAE error is about 200% and the RMSE error about 150%.
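To be explicit about what I mean by a percentage error, here's a quick sketch of mine (reusing the full_model_output and quant_model_output tensors from the snippet above), expressing the errors relative to the magnitude of the full-precision output:

```python
import torch
import torch.nn.functional as F

# Relative MAE and RMSE, scaled by the size of the reference output.
mae = F.l1_loss(quant_model_output, full_model_output)
rmse = torch.sqrt(F.mse_loss(quant_model_output, full_model_output))

ref_mean_abs = full_model_output.abs().mean()
ref_rms = torch.sqrt(full_model_output.pow(2).mean())

print(f'relative MAE:  {(100 * mae / ref_mean_abs).item():.1f}%')
print(f'relative RMSE: {(100 * rmse / ref_rms).item():.1f}%')
```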
I don't disagree, but it sounds like you have successfully tested this component at least? Your error is high but at least in the right ballpark. Maybe it's the best you can get if FP16 support is poor on your hardware.

You might be able to use mixed precision: when doing the matmuls, upcast everything to FP32 before multiplying, accumulate in FP32, and only downcast back to FP16 when saving the output. This should have no impact on VRAM usage, since you're working in on-chip memory, yet it should vastly improve precision. It may not impact performance either, because we tend to be memory-bandwidth limited, reading weights from global memory into shared memory. Although if your card can't compute and transfer data at the same time, you'll see a performance hit.

Anyhow, check your PPL! You should be able to get the quantised model running on 8 GB of VRAM, and you can do the reference run for your perplexity numbers on CPU (although you don't have enough main memory either, so it's going to be slow as hell, but hey, you only need to do it once). Or just look up the PPL numbers from the README.
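To illustrate the numeric effect of that mixed-precision idea, here's a PyTorch-level sketch of mine (the real change would live inside the kernel itself; this just shows upcast-multiply-accumulate-downcast):

```python
import torch

def matmul_fp16_mixed(x_fp16: torch.Tensor, w_fp16: torch.Tensor) -> torch.Tensor:
    """Multiply two FP16 tensors with FP32 accumulation, then downcast
    only the final output back to FP16. Illustration only."""
    out_fp32 = x_fp16.float() @ w_fp16.float()   # upcast inputs, accumulate in FP32
    return out_fp32.half()                        # downcast the result at the end

# Usage sketch: comparing against a pure-FP16 matmul shows the precision gap.
# x = torch.randn(64, 4096, dtype=torch.float16)
# w = torch.randn(4096, 4096, dtype=torch.float16)
# ref = x.float() @ w.float()
# err_fp16  = ((x @ w).float() - ref).abs().mean()
# err_mixed = (matmul_fp16_mixed(x, w).float() - ref).abs().mean()
```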
First: thanks for this implementation, I'm using it to load 7B models on my 8 GiB GPU using Ooba Gooba (which fails to report how much memory it used, I had to patch the code, and also fails to mention that you need more than 1 GiB of extra VRAM for the default 2k-token context).
Now more context:
So I'm not using CUDA, and from what I see test_kernel.py isn't really verifying kernel correctness, just checking that it doesn't crash. Am I missing something?
My board is a Radeon RX5500XT and it isn't officially supported by ROCm. I know the FP16 implementation has some issues, so it wasn't a surprise that the faster versions weren't really performing faster and that their error is much bigger.
But what I want to know is whether the error of the regular versions is in the expected range. Note that I'm 100% new to PyTorch. I failed to force deterministic values, and I only computed absolute errors. I found a discrepancy that is usually under 1e-5 and sometimes a little bit over. The faster version was much worse.
Is this normal? How much error should I expect? Can test_kernel.py really verify the results?
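What I would expect a verification to look like is something along these lines (my own rough sketch, not code from test_kernel.py; the tolerance and input size are placeholders, and `layer` / `qlayer` are the reference float layer and its quantized counterpart as in the snippet earlier in the thread):

```python
import torch

torch.manual_seed(0)            # reproducible pseudo-random input
vec = torch.randn(1, 4096)      # placeholder input size

with torch.no_grad():
    ref = layer(vec)
    out = qlayer(vec)

# Fail if the outputs diverge beyond a tolerance chosen for the expected
# quantization error, instead of only checking that nothing crashes.
max_abs_err = (out.float() - ref.float()).abs().max().item()
assert max_abs_err < 1e-3, f'max abs error too high: {max_abs_err}'
print('OK, max abs error:', max_abs_err)
```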
BTW: I uploaded pre-compiled wheels for Python 3.7, 3.8, 3.9 and 3.10 that are usable for PyTorch 1.13.1 (2.x isn't working) compiled with ROCm 5.2 (the official PyTorch release for ROCm). They can be found here
A note for the authors: consider enabling GitHub Discussions; this post isn't a real issue and should have been posted as a discussion.