[wasm] perf pipeline clang crashes on PERFTIGER200 #88148
Comments
Tagging subscribers to 'arch-wasm': @lewing Issue Details: We are seeing frequent crashes like this on 1/2 out of 30 wasmaot runs, and all specifically on PERFTIGER200. https://dev.azure.com/dnceng/internal/_build/results?buildId=2210817&view=results:
https://dev.azure.com/dnceng/internal/_build/results?buildId=2210656&view=results:
|
Looks like bad RAM or a bad CPU, since the crashes are in two different parts of the backend and are not asserts. doRAUW is a pretty simple method that does a small number of dereferences, and most of them are guarded with null checks. |
yeah :/ @LoopedBard3 we have another one :D |
There are more instances like the ones above, but I found one that is a little different:
|
@radical is this still happening? |
Yep, every day. |
@LoopedBard3 can we get this machine checked out? |
Yup, working on it now. |
It will be offlined shortly! |
dotnet/dnceng#434. @DrewScoggins we should probably wait to add this machine back until after it is upgraded to the new queue to see if an update fixes the issues we seem to consistently have with it. We can also do some test runs to get better logs. |
Do we have any theories on why we only see these kinds of errors in clang, and we don't see anything like this in other run types? |
I took a look at the data on runs on this machine, and we have not had any errors that look like memory corruption or failure on any runs except the AOT runs. Is it just that LLVM is particularly hard on memory and a good way of finding bad sticks? |
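The check described above (are the crashes concentrated on one machine and one run type?) can be sketched roughly as follows. This is only an illustration, not the query that was actually run: the record shape `(machine, run_type, crashed)`, the function names, and the `min_runs` and `factor` thresholds are all assumptions made for the example.

```python
from collections import defaultdict


def failure_rates(runs):
    """Tally crashes per (machine, run_type).

    `runs` is an iterable of (machine, run_type, crashed) tuples; this
    record shape is hypothetical, not the perflab's real schema.
    Returns {(machine, run_type): (failures, total, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for machine, run_type, crashed in runs:
        entry = counts[(machine, run_type)]
        if crashed:
            entry[0] += 1
        entry[1] += 1
    return {key: (f, t, f / t) for key, (f, t) in counts.items()}


def suspicious(rates, min_runs=5, factor=3.0):
    """Flag keys whose failure rate is at least `factor` times the pooled rate.

    `min_runs` and `factor` are arbitrary illustrative thresholds.
    """
    total_failures = sum(f for f, _, _ in rates.values())
    total_runs = sum(t for _, t, _ in rates.values())
    pooled = total_failures / total_runs if total_runs else 0.0
    return sorted(
        key
        for key, (f, t, rate) in rates.items()
        if t >= min_runs and pooled > 0 and rate >= factor * pooled
    )
```

Calling `suspicious(failure_rates(records))` on a set of run records returns the machine and run-type pairs that fail far more often than the fleet as a whole. A pair that stands out only for AOT runs would be consistent with the observation above, though it would not by itself distinguish bad hardware from a workload that simply stresses memory harder.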
I'm seeing the same kinda issues on PERFTIGER172 now. |
cc @sblom |
This is happening every day. |
PerfTiger172 is offlined: dotnet/dnceng#559 |
Both PerfTiger200 and PerfTiger172 appear to have been added back into the queue with the queue upgrade based on workitem data in data explorer. Since they don't seem to be causing issues in the pipeline anymore, we will leave them in until they start causing issues again. |
PERFTIGER172:
Another:
On PERFTIGER204:
These were on |
PERFTIGER204 seems to be the newest member of the club :(
|
Latest PerfTiger172 error (taking back out of queue as it is the only thing stopping green wasm AOT):
|
If we are seeing this behavior again, we should get one of these machines set up so that we can investigate what is going on and confirm what the issue is. |
Seeing the same issues on PERFTIGER197 now. (20230928.2, wasmaot.partition17) |
The latest errors we are seeing with this on PerfTiger172 are:
which is occurring during the building of the job:
Is the binlog something that will be enough to debug this? I can also try to keep all of the BDN artifact stuff and get a run where we upload all of the artifacts if it fails since this seems to be unreliably hit. If anyone has any ideas for what logs may be helpful, please let me know. |
PERFTIGER172 has been having frequent failures like the ones discussed in this issue. |
We already publish the binlog from the build, and in some cases other parts of the build too. |
I set up a quick test run (https://dev.azure.com/dnceng/internal/_build/results?buildId=2306325&view=results) that will hopefully get us a full upload of all of the BDN artifacts outlined here: https://benchmarkdotnet.org/articles/guides/troubleshooting.html#:~:text=How%20to%20troubleshoot%20the%20build%20process%3A%201%20Run,Run%20the%20script%2C%20read%20the%20error%20message.%20. |
This comment was marked as off-topic.
This issue is specifically about failures on some particular machines in the perflab. The log above is an oom - |
In that case my edit of the filter was misleading here, reverted. |
We are seeing frequent crashes like this on 1/2 out of 30 wasmaot runs, and all specifically on PERFTIGER200. I'm not sure if they are real clang issues or not. Two examples:
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=544187
cc @kg @vargaz