NativeAOT legs timing out in CI #102239
Comments
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries |
Clicking through, the problem is always the same: the product build finishes in 20 minutes and sends 5 work items to Helix (each of which takes less than a minute to run). We then wait 100 minutes for these to finish, and then we time out. Two more hours later, the Helix work items finally get scheduled and finish. Digging into the Helix logs, it always looks something like this:
We could increase the timeout to 5 hours but that feels excessive. |
@markwilkie Could you comment on what "Delay" means here? Is there something holding up the run? |
@chcosta - any thoughts as to what 'Delay' means here? |
looking |
I haven't had much time to dig into this yet |
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas |
@chcosta any update here? |
Sadly, no. I had time to dig in a little, but only got as far as confirming what @MichalStrehovsky was seeing. I couldn't find any additional insight into what caused the Delay, only that it represents the amount of time between queue time and start time. |
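To make the definition of Delay concrete, here is a minimal sketch that computes it as start time minus queue time and checks it against the 120-minute leg timeout. The timestamps and the names `queue_delay`, `BUILD_TIME`, and `LEG_TIMEOUT` are hypothetical; they only mirror the timeline described earlier in the thread (20-minute build, 100 minutes of waiting, then about 2 more hours before the work items start) and are not taken from Helix data or the Helix API.

```python
from datetime import datetime, timedelta

# Hypothetical values mirroring the timeline described in this thread.
BUILD_TIME = timedelta(minutes=20)    # product build before work items are sent
LEG_TIMEOUT = timedelta(minutes=120)  # timeout quoted in the error message


def queue_delay(queued: datetime, started: datetime) -> timedelta:
    """Delay as described above: time between queue time and start time."""
    return started - queued


# Assumed timestamps: work items are queued right after the build, then sit
# in the queue for 100 minutes plus roughly 2 more hours before starting.
queued = datetime(2024, 5, 15, 0, 20)
started = queued + timedelta(minutes=100) + timedelta(hours=2)

delay = queue_delay(queued, started)

# The leg times out when build time plus queue delay exceeds the timeout,
# even though each work item runs for less than a minute once it starts.
leg_times_out = BUILD_TIME + delay > LEG_TIMEOUT
```

Under these assumed numbers the delay alone is longer than the whole leg timeout, which is consistent with the work items finishing only after the leg has already been cancelled.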
In https://dev.azure.com/dnceng-public/public/_build/results?buildId=705793&view=logs&j=ddb4415b-4613-5bce-e937-0da25336f8b9&t=d2b408ad-4ef2-5ad9-4f1a-57f7ca85d7e0 , I see all the helix jobs completing super fast, but it still times out. Could this be an issue with test results relay or something post run? If you look at the helix job list, each one shows nothing abnormal. https://helix.dot.net/api/jobs/448572db-1838-4ee6-b27c-fb612c6cf3b9/workitems?api-version=2019-06-17 |
A lot of these are running on the ubuntu.2204.amd64.open.rt queue which looks like it got backed up yesterday: https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/queues/queue-monitor?orgId=1&var-QueueName=ubuntu.2204.amd64.open.rt&var-UntrackedQueues=%22osx%22,%20%22perf%22,%20%22arm%22,%20%22arcade%22,%20%22xaml%22,%20%22appcompat%22&from=1718694597557&to=1718784903311 |
Yeah, I'm still confused about why AOT is being hit more frequently. |
Sven figured it out. It's because the timeout for the Native AOT tests is 120 minutes, which is different from many other tests (libraries have 180, for example). This issue specifically catches Native AOT tests because the message includes the timeout. The real cause of all of this is queue overload; it has nothing to do with Native AOT. |
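The effect described above can be illustrated with a small sketch. It assumes that known-issue matching is a plain substring check of the issue's error message against a leg's log, which is an assumption about the tooling rather than something stated in this thread. The log lines and leg names below are hypothetical; only the 120-minute message and the 180-minute libraries timeout come from the issue.

```python
# Error message recorded in this issue's known issue validation.
KNOWN_ISSUE_MESSAGE = "ran longer than the maximum time of 120 minutes"


def matches_known_issue(log_line: str, message: str = KNOWN_ISSUE_MESSAGE) -> bool:
    """Assumed matching rule: the message must appear verbatim in the log."""
    return message in log_line


# Hypothetical timeout log lines for two legs hit by the same queue overload.
timed_out_legs = {
    "linux-x64 Debug NativeAOT": "The job ran longer than the maximum time of 120 minutes",
    "linux-x64 Debug Libraries": "The job ran longer than the maximum time of 180 minutes",
}

matched = [leg for leg, line in timed_out_legs.items() if matches_known_issue(line)]
```

Both legs time out for the same reason, but only the one whose timeout happens to be 120 minutes matches, so the issue appears to single out Native AOT. Under the same assumption, a message without the number (for example "ran longer than the maximum time of") would match both legs.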
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=675463
Build error leg or test failing: Build / linux-x64 Debug NativeAOT
Pull request: #102176
Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=675463
Error message validated:
[ran longer than the maximum time of 120 minutes]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 5/15/2024 3:00:04 AM UTC
Report
Summary