PayloadGroup0 is timing out #38284
Comments
I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.
Spot-checked those logs. Looks like a few of these are #38156. I'm currently investigating that failure; however, it seems the frequency of hitting it has increased recently. If need be, we can temporarily disable this test, but I think it might be an actual regression and not just flakiness. I am also noticing several varying failures of the Crossgen2 Determinism tests. Some of these are AVs in managed code, like the Socket failures I mention below:
I also saw the binder event test timeouts that @safern linked. There also appear to be some AVs coming out of the
This is affecting tests that use sockets, like the Binder event tests and the eventpipe tests.
CC @mangod9 for the R2R test failures.
@trylek, have you noticed any AVs like the one noted above within your recent cg2 CI runs? It appears it's probably unrelated to cg2, but rather some corruption similar to the socket AV.
Given the hit count right now, we should disable the test unless a fix is imminent. Even if this is a regression, we should still disable the test; in its current state it's just serving to mask other failures in the system.
Sounds good; PR incoming.
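For reference, one common way a flaky test gets skipped in this repo's xunit-based test projects is the ActiveIssue attribute from Microsoft.DotNet.XUnitExtensions, which ties the skip to the tracking issue so the test is easy to re-enable later. Whether the PayloadGroup0 exclusion actually went through this attribute (rather than, say, the coreclr test exclusion lists) isn't stated here, and the test class and method below are made up purely for illustration.

```csharp
using Xunit;

public class BinderEventTests
{
    // Hypothetical test; the skip-with-a-tracking-link pattern is the point here.
    // ActiveIssueAttribute (from Microsoft.DotNet.XUnitExtensions) lives in the
    // Xunit namespace and skips the test until the attribute is removed, keeping
    // a link back to the tracking issue.
    [Fact]
    [ActiveIssue("https://github.com/dotnet/runtime/issues/38284")]
    public void EventsAreEmittedForBindFailures()
    {
        // ... test body elided ...
    }
}
```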
@mangod9 - I didn't hit it in the Crossgen2 runs, but I have seen the PayloadGroup0 timeouts pretty regularly in my recent PR runs.
Tagging subscribers to this area: @ViktorHofer
Just hit this; judging by the logs, it seems that it hung:
@safern, can you link to the AzDO run that created that console log? That looks new/different, considering it is on Windows.
Sure, this is the one: https://dev.azure.com/dnceng/public/_build/results?buildId=715290&view=results. Note that this happened on the second attempt.
Unfortunately, it looks like there aren't logs from the tests themselves, so I can't triage it without more info. It looks like there are a couple of the Binder tests in there that display similar symptoms, i.e., console output prematurely ending after the test wrapper starts (CC @jashook). Also, @jaredpar, I'm curious: not all of the
Yes, we should have issues tracking the different failures here. The reason for creating this uber issue initially is that they were all bucketing into it and there were no discernible smaller failure groups. If the follow-up investigation finds them, we should file bugs to track them.
Curious, did we already file follow-up issues?
I haven't checked up on the status of the non-timeout failures in the list, e.g., Binder, base services, the BCL AVs, etc. I've been investigating the workitem timeout as time permits but haven't discovered anything. #38167 seems to show the same behavior.

I'm not sure what is going on yet, but I'm not 100% convinced that this is a test failure. A test timeout in this payload should cause the test wrapper to fail the test and print errors to the log immediately via the xunit runner as the test times out; the per-test timeout inside the test wrapper is 10 minutes on a regular run. Without any additional information, this log suggests that the Helix workitem itself timed out while running tests, which isn't necessarily the underlying test timing out. It's possible that an underlying test timeout happens when the workitem times out: for example, if the workitem timeout is 30 minutes and tests 1-50 took 25 minutes collectively, then a test 51 that hangs for more than 5 minutes would cause the workitem to time out before the per-test timeout fires.

I tried the other day to query the Helix data to get some insight for this specific payload, but I wasn't able to get anything meaningful. I'll try querying some more after the P8 snap. One thing that would help with this and any future time-related issues is to always log a timestamp in the output for everything, including the xunit runner.

I'm approaching this from the perspective that a test is timing out, since that would be easier to diagnose, but that can be hard to prove when trying to repro locally, due to resource differences with CI and a lack of evidence that the test is actually timing out.
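To make the overlap between the two timeouts concrete, here is a small sketch of the arithmetic described above. Only the 10-minute per-test timeout comes from the actual wrapper configuration; the 30-minute workitem budget and the 25 minutes of elapsed test time are the hypothetical figures from that example, and the timestamped output just illustrates the kind of logging suggested.

```csharp
using System;

class TimeoutOverlapSketch
{
    static void Main()
    {
        TimeSpan perTestTimeout  = TimeSpan.FromMinutes(10); // wrapper's per-test timeout on a regular run
        TimeSpan workItemTimeout = TimeSpan.FromMinutes(30); // assumed Helix workitem budget
        TimeSpan elapsedSoFar    = TimeSpan.FromMinutes(25); // tests 1-50 in the example above

        // Budget the workitem has left when the hung test starts.
        TimeSpan workItemBudgetLeft = workItemTimeout - elapsedSoFar; // 5 minutes
        bool workItemDiesFirst = workItemBudgetLeft < perTestTimeout; // true: Helix kills the workitem first

        // Timestamp every line, as suggested above, so gaps in the output are visible.
        Console.WriteLine($"[{DateTime.UtcNow:O}] Budget left for the hung test: {workItemBudgetLeft}");
        Console.WriteLine($"[{DateTime.UtcNow:O}] " + (workItemDiesFirst
            ? "The workitem times out before the wrapper can report the hung test."
            : "The wrapper reports the hung test before the workitem times out."));
    }
}
```

Under those numbers the wrapper never gets a chance to print a per-test timeout error, which would explain console output that simply stops after the wrapper starts.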
Closing, as there are no recent occurrences of this timing out. There are recent occurrences of it failing, but those are normal failures, not timeouts.
This test group is failing in a lot of different runs right now; most commonly, it is timing out.
Console Log Summary
Builds
Configurations
Helix Logs