Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable duration of tests running in ADO #12392

Closed
sandreenko opened this issue Apr 1, 2019 · 16 comments
Closed

Unstable duration of tests running in ADO #12392

sandreenko opened this issue Apr 1, 2019 · 16 comments

Comments

@sandreenko
Copy link
Contributor

We hit timeouts with GCStress on tests like baseservices/threading/regressions/beta2/437017 or JIT/HardwareIntrinsics/X86/Avx2/Avx2_ro from time to time. The timeout for GCStress is set to 1 hour per test. Kusto results show that sometimes these tests execution takes 20 minutes, sometimes 40, some fail with 1 hour timeout.

Duratation for beta2_437044_437044_sh Ubuntu.1604 gcstress0x3:

807.2075189
3600.1316769
1395.8818172
3558.8521005
794.5199284
778.0971284
1800.0648784
984.1676776
1586.0055026
1431.6268684
1117.4029117

The similar picture is for other tests, for example beta2_437044 on x64 WIndows.10 minopts:

Duration
10.8403088
9.7603915
4.0987427
23.8235332
10.2850002
7.8984973
23.5484026
29.2190543
4.540873
4.8469147
8.6422568
7.5901681
15.2560697
35.0298974
7.9805417
30.2128801
6.3597494
12.1808179
9.9471955
10.4480389
7.0961536
24.1057954
12.214705
3.0317356
31.2346013
7.5773318
8.5577859
17.8248298
6.1346953

So max is 35, min is 3, that doesn't look right.

(queue for this:

database('engineeringdata').TestResults
| where Method contains 'beta2_437044' 
| join kind=inner (database('engineeringdata').Jobs) on JobId
| where QueueName contains "Windows.10" and Properties contains "minopts" and Properties contains "x64"

`)

We have seen a similar problem with Jenkins but it was because we used XUnit ParallelRun=all and it was creating more threads than CPU cores and it was fixed by changing it to XUnit ParallelRun=collections (#22746). With ADO our understanding was that we run only one assembly on one machine so there were no difference between all and collections.

@sandreenko
Copy link
Contributor Author

cc @echesakov @dotnet/jit-contrib

@echesakov
Copy link
Contributor

@sandreenko
Copy link
Contributor Author

I have checked some simple tests without threads/GC and they show the same instability. I think this should be in 3.0.

@echesakov
Copy link
Contributor

regressions\beta2\437017 frequently times out on Ubuntu.1404.Arm32
This query

TestResults
| where Method == '_regressions_beta2_437017_437017_sh' and Result == 'Fail' and Duration > 600
| join kind=inner (Jobs) on JobId
| extend Configuration=extract('"configuration":"([^"]+)"', 1, Properties)

shows 24 records. 23 of them are Checked-no_tiered_compilation and one is Checked-jitstress1

@CarolEidt
Copy link
Contributor

_tailcall_Desktop__il_relthread_race__il_relthread_race_sh frequently times out in multiple configurations (Checked, Checked-no_tiered_compilation, Checked-gcstress0xc, and Checked-jitstress2_jitstressregs2

@sandreenko sandreenko self-assigned this May 13, 2019
@sandreenko
Copy link
Contributor Author

beta2_437044 is just an unstable threading test. I will look at other failures soon.

@sandreenko
Copy link
Contributor Author

More data about tests\src\JIT\Methodical\tailcall\Desktop_il_relthread-race
All numbers are in seconds for the default mode (Checked without complus vars):

arm32 Linux shows min = 2.2, max = 7, factor 3;
x64 Linux shows min = 1.1, max = 28, factor 25;
arm64 Linux shows min = 1.7, max = 600(timeout), factor 350.

x64 Windows min = 2, max = 256, factor 128;
x86 Windows min = 2, max = 350, factor 175;

x64 OSX min = 2, max = 32, factor 16;

the test uses 4 threads and one mutex.

@sandreenko
Copy link
Contributor Author

If you ask Kusto a query like this:

database('engineeringdata').Jobs | where Source contains "ci/dotnet/coreclr" | project JobId, Properties
| join hint.strategy = broadcast (database('engineeringdata').TestResults 
| where Result == 'Pass' and Traits contains "TestGroup=JIT.jit64" // choose any group that you like
| project JobId, Method, Duration) on JobId
| project Method, Properties, Duration 
| summarize min(Duration), max(Duration), max_min_factor = max(Duration)/min(Duration), count() by Method, Properties
| order by max_min_factor
| project Method, max_min_factor, Properties, min_Duration, max_Duration, count_
| limit 100

it will show you data like this:

Method max_min_factor Properties min_Duration max_Duration count_
_regress_ddb_87766_ddb87766_ddb87766_sh 1714.98374634029 {"architecture":"x64","operatingSystem":"Ubuntu.1804.Amd64.Open","configuration":"Checked"} 0.0066262 11.3638253 167
_opt_cse_HugeField2_HugeField2_sh 602.574396857634 {"operatingSystem":"Ubuntu.1804.Amd64.Open","configuration":"Checked-gcstress0x3","architecture":"x64"} 0.0089614 5.3999102 3
_regress_ddb_87766_ddb87766_ddb87766_sh 499.630942461785 {"operatingSystem":"Debian.9.Amd64.Open","configuration":"Checked-jitminopts","architecture":"x64"} 0.0128419 6.4162106 6
_opt_cse_HugeField2_HugeField2_sh 483.194188782788 {"architecture":"x64","configuration":"Checked-gcstress0xc","operatingSystem":"Ubuntu.1604.Amd64.Open"} 0.0110063 5.3181802 2
_opt_cse_HugeArray_HugeArray_sh 338.851130223 {"architecture":"x64","configuration":"Checked-gcstress0xc_jitstress2","operatingSystem":"Ubuntu.1804.Amd64.Open"} 0.0144529 4.8973815 2
_opt_cse_hugeexpr1_hugeexpr1_cmd 283.013152020376 {"operatingSystem":"Windows.10.Amd64.Open","architecture":"x86","configuration":"Checked-jitminopts"} 0.0546532 15.4675744 12
_opt_cse_HugeField2_HugeField2_sh 268.162398105807 {"operatingSystem":"Debian.9.Amd64.Open","architecture":"x64","configuration":"Checked-gcstress0xc"} 0.0110654 2.9673242 4
_regress_ddb_87766_ddb87766_ddb87766_sh 241.511345881109 {"operatingSystem":"Debian.9.Amd64.Open","architecture":"x64","configuration":"Checked-jitstress2"} 0.0083202 2.0094227 13

that means that we have a problem not with a particular set of tests or platform or test mode but it is a general issue that affects the whole testing.

@briansull
Copy link
Contributor

That query indicates that a very few set of "huge tests" have widely varying executions times.
Which can happen when an O/S has to resort to paging the virtual memory out to disk.

_regress_ddb
_opt_cse_HugeField2
_opt_cse_HugeArray
_opt_cse_hugeexpr1

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@xiangzhai
Copy link
Contributor

:mips-interest

@sandreenko
Copy link
Contributor Author

I am going to reenable the disabled tests and see if the issue still reproduces. There were many infrastructure changes since it was open, so it could have been fixed already.

In the past, I was trying to make a change to allocate 2 cores per running test in helix but it did not work well (https://github.com/dotnet/coreclr/pull/24749/files). On some machines I was getting an incorrect number of cores (like 2 cores VM on 64 cores physical arm machine was giving me 32, instead of 1), on others, it still was hitting timeouts, but less frequently, because the tests that hit them did not allocate 2 cores, they were allocating dozen each. The big downside of that change was that on average a test run was taking longer (like ~20% slower).

@sandreenko
Copy link
Contributor Author

As I see from the recent GCStress runs it is still an issue: JIT/HardwareIntrinsics/Arm/AdvSimd/AdvSimd_r/AdvSimd_r.sh
and similar that were not disabled are still failing with timeouts (example)

@echesakov
Copy link
Contributor

As I see from the recent GCStress runs it is still an issue: JIT/HardwareIntrinsics/Arm/AdvSimd/AdvSimd_r/AdvSimd_r.sh

@sandreenko I am working on splitting those up #37282

@sandreenko
Copy link
Contributor Author

@sandreenko I am working on splitting those up #37282

Could you please try to reenabled tests from https://github.com/dotnet/runtime/pull/37530/files and close this issue in your PR?

@echesakov
Copy link
Contributor

Could you please try to reenabled tests from https://github.com/dotnet/runtime/pull/37530/files and close this issue in your PR?

Sure

@jashook
Copy link
Contributor

jashook commented Jul 20, 2020

Addressed

@jashook jashook closed this as completed Jul 20, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants