Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert JIT\Directed tests to merged test groups #83256

Merged
merged 30 commits into from
Mar 24, 2023

Conversation

markples
Copy link
Member

(based on #81969)

@ghost
Copy link

ghost commented Mar 10, 2023

Tagging subscribers to this area: @dotnet/area-system-reflection-metadata
See info in area-owners.md if you want to be subscribed.

Issue Details

(based on #81969)

Author: markples
Assignees: markples
Labels:

area-System.Reflection.Metadata

Milestone: -

@markples markples added test-enhancement Improvements of test source code and removed area-System.Reflection.Metadata labels Mar 10, 2023
@markples
Copy link
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@BrianBohe
Copy link
Member

LGTM! Let see if that typo was the only problem

@BrianBohe
Copy link
Member

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@markples
Copy link
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@BrianBohe BrianBohe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to split HardwareIntrinsics_ro? Would that avoid these timeouts? I don't see issues related to this pr

@BrianBohe
Copy link
Member

Does it make sense to split HardwareIntrinsics_ro? Would that avoid these timeouts? I don't see issues related to this pr

What do you think about this @trylek ?

@trylek
Copy link
Member

trylek commented Mar 11, 2023

What is the number of tests in the group and its duration in checked mode (without any stress modes)?

@markples
Copy link
Member Author

It looks like it has 5834. I see a successful x64 run that took about 5 minutes. Even the timeouts are only showing 5-10 minutes in the test group. This includes timestamps that appear to be after the test script is terminated, so I don't think it's getting stuck. (These tests are generated, etc., so there are a lot of them... @davidwrighton added various things like the test striping mechanism just to handle them at all.. so maybe there's more to do there.)

There's a comment:

<!-- For Vector512, we only have a very small pool of machines with acceleration support, so they are always outerloop -->

Could this be a case where queue time is counting against a timeout value?

@markples
Copy link
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@trylek
Copy link
Member

trylek commented Mar 11, 2023

Hmm, both these values seem far beyond what I'd consider reasonable. Please remember that in GC / JIT stress modes the tests are typically two to three orders of magnitude slower and Helix items taking 5000 minutes are less fun than they seem. Under JIT/Methodical, I tried to keep the group size around 400 tests, we can certainly be more aggressive for tests that are really tiny but running the entire Pri1 test set as a single Helix item is obviously not desirable for multiple reasons.

@davidwrighton
Copy link
Member

@trylek, the tests are setup so that under striping, we run only about 600 or so tests , which results in them finishing in a vaguely reasonable timeframe under stress. If we're seeing intermittent failures here, then yep, we've got a real product bug to look into, and what I saw here looked like an intermittent failure that should be investigated.

@markples
Copy link
Member Author

Thanks @davidwrighton for pointing out the details on striping - I now see the msbuild variable is called NumberOfStripesToUseInStress so my reporting of a normal run was inaccurate.

I am curious if you can share your reasoning for suspecting a product issue here. Looking at one of the failures (AzDO build https://dev.azure.com/dnceng-public/public/_build/results?buildId=200759&view=results), I see

top of log:

BEGIN EXECUTION
"C:\h\w\9D430947\p\watchdog.exe" 300 "C:\h\w\9D430947\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false"  HardwareIntrinsics_ro.dll 
22:21:46.449 Running test: global::JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.Program.Abs_Vector128_Double()
Supported ISAs:

bottom of log:

22:26:48.580 Passed test: global::JIT.HardwareIntrinsics.General._Vector128_1.Program.op_AdditionByte()
22:26:48.581 Running test: global::JIT.HardwareIntrinsics.General._Vector128_1.Program.op_AdditionDouble()
Beginning scenario: RunBasicScenario_UnsafeRead
SUCCESS: The process "corerun.exe" with PID 5812 has been terminated.
2023-03-10T22:26:49.815Z	INFO   	run.py	run(48)	main	Beginning reading of test results.

Start @ 22:21:46.449
Start failing test @ 22:26:48.581
Results message @ 22:26:49.815

The failing test is only running for just over a second before the overall 5 minute timeout hits.

Recent changes have greatly bumped up this value, so if you are seeing a product issue, then it may be hidden now.

@markples
Copy link
Member Author

The failures we've seen so far have been seen elsewhere, so I will resolve the conflicts and update this for merging. @trylek, Brian and I have both made changes/reviewed this, so please let me know if there's anything that you'd like to look at more closely. Thanks!

Copy link
Member

@trylek trylek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, thanks Mark!

@markples markples marked this pull request as ready for review March 23, 2023 17:29
@SamMonoRT
Copy link
Member

@markples @JulieLeeMSFT @BrianBohe - Looking at https://dev.azure.com/dnceng-public/public/_build/results?buildId=226494&view=logs&j=e45de9b4-b3b3-54f9-2ea5-8e56201c788d&t=92db4d17-27b4-5f87-01bd-8cef7a796420 failure,

  1. For such large extensive changes to test infrastructure, given same tests are run for both runtimes, please kick off "/azp run runtime-extra-platforms" which should trigger some more lanes, one of which is linux-arm64 Release AllSubsets_Mono_LLVMFullAot_RuntimeTests llvmfullaot' extra-platforms lane. During that, the LLVM AOT cross-compile CoreCLR tests job runs, fails on a couple tests, hence the Send to Helix job doesn't collect any logs. Mono doesn't seem to have any aot-llvm arm64 runtime test coverage due to this.

  2. Example of that is https://dev.azure.com/dnceng-public/public/_build/results?buildId=226494&view=logs&j=e45de9b4-b3b3-54f9-2ea5-8e56201c788d&t=92db4d17-27b4-5f87-01bd-8cef7a796420

  3. The two test failing in above are : Regression_1.dll:

  • Assertion at /__w/1/s/src/mono/mono/mini/aot-compiler.c:3956, condition `image_index < MAX_IMAGE_INDEX' not met
    GitHub_27678.dll:
  • Assertion at /__w/1/s/src/mono/mono/mini/memory-access.c:120, condition `size < MAX_INLINE_COPY_SIZE' not met
  1. A PR was merged to fix one of those : [mono][aot] Fix an assert in the aot compiler. #84343

  2. For the second one, GitHub_27678.dll, it seems it is ignoring the condition in the issues.target file to exclude for Mono runtime. But on deeper look in the failure logs, it shows file at .../coreclr/linux.arm64.Release/JIT/Regression/Regression_3/GitHub_27678.dll, while as in the issues.target file, it seems to be ignoring from the JitBlue directory as in: ....JIT/Regression/JitBlue/GitHub_27678

You should work to disable the test correctly for Mono. Also @trylek I want to understand motivation behind the coreclr cross-compiler tests running under a Mono AOT-LLVM arm64 lane -- shouldn't these be two different lanes ?

@markples
Copy link
Member Author

markples commented Apr 5, 2023

@SamMonoRT Thank you for the info and investigation. I'll add runtime-extra-platforms.

The naming in issues.targets will work the same before and after merging unless certain internal test changes are made (I believe that splitting a single Main into multiple [Fact]s can cause this). However, merged tests cause this AOT step to happen during Regression_3 for all of the tests inside the group before we check for specific tests. I opened #84380, which effectively pulls GitHub_27678 into a separate executable so that the exclusion happens in time.

@markples
Copy link
Member Author

markples commented Apr 6, 2023

@SamMonoRT I just wanted to double-check that you weren't waiting for me on anything else. I think these two were the full impact (and both builds fixed so I abandoned my PR - one still fails at runtime but issue.targets handles that).

@ghost ghost locked as resolved and limited conversation to collaborators May 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
test-enhancement Improvements of test source code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants