Enable SuperPMI collection of CoreCLR test run #74961
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
The SuperPMI collection scripts were initially written to do PMI-based collections, and crossgen2 and benchmarks were added later. This change cleans up the scripts to only do what is required for a particular collection type. E.g., don't clone and build jitutils to get pmi.dll for crossgen2 collections, where it is not needed. Additional documentation is added and things are renamed to be more clear, if possible. Also, don't pass PMI-specific arguments to superpmi.py for crossgen2 collections (where they are ignored, but might be confusing to readers). In addition, the superpmi_collect_setup.py script gets more argument validation. This isn't terribly necessary since it's only called in one place in the CI scripts, but it is useful for documentation, and helps when you are calling the script manually as part of testing changes. I added a `-payload_directory` argument that specifies where the correlation and work item payloads should be placed, instead of assuming they should be in "well-known" directories in the source tree. Once again, this is useful for testing.
Instead of using our existing SuperPMI collection scripts, which do the test partitioning themselves for PMI and crossgen2 collections, we use the existing (and very complex) test partitioning and running scripting, and add steps before the run to enable SuperPMI collection and steps after it to process the results. The process is as follows:
1. The top-level superpmi-collect.yml file has a new task, which invokes the test run scripting /eng/pipelines/common/templates/runtimes/run-test-job.yml with a new argument: `SuperPmiCollect: true`.
2. run-test-job.yml then adds a new dependency and various variable definitions, and invokes the send-to-helix-step.yml file with the new SuperPmiCollect variable. It also adds a number of post-Helix steps to (a) merge the MCH files from all the Helix jobs, (b) upload the merged MCH result to Azure Storage, and (c) upload SuperPMI log files to artifacts.
3. The Helix invocation ends up at helixpublishwitharcade.proj, which runs a set of tests. It adds pre-commands to set environment variables to trigger SuperPMI collection and post-commands to merge all the generated MC files into a single MCH file. The environment variable `spmi_enable_collection` is set to trigger the collection.
4. The batch/bash wrapper scripts key on this variable and set up the SuperPMI collection variables. The idea is to set the SuperPMI collection variables as late as possible, namely, immediately before corerun.exe is invoked on the test. Specifically, we don't set the variables earlier, before dotnet.exe and xunit are invoked.
Note: this collection process doesn't currently handle the new "merged" tests.
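To make step 4 concrete, here is a minimal Python sketch of the late variable setup, assuming the usual SuperPMI shim-collector environment variables (`SuperPMIShimLogPath`, `SuperPMIShimPath`, `COMPlus_JitName`). The real per-test wrappers are generated batch/bash scripts, so this is only an illustration of the idea, not the actual code, and the `spmi_jit_path` variable used below is hypothetical:

```python
import os
import subprocess
import sys

def run_test_under_collection(corerun, test_assembly, extra_args=()):
    """Launch corerun for one test, enabling SuperPMI collection only if requested."""
    env = os.environ.copy()

    # The Helix pre-command sets spmi_enable_collection; the wrapper keys on it and
    # sets the collection variables as late as possible, right before corerun runs.
    if env.get("spmi_enable_collection"):
        collect_dir = env.get("spmi_collect_dir", os.getcwd())
        os.makedirs(collect_dir, exist_ok=True)
        # Assumed SuperPMI shim-collector variables (illustrative, not verbatim from
        # the generated wrapper scripts):
        env["SuperPMIShimLogPath"] = collect_dir                          # where .mc files are written
        env["SuperPMIShimPath"] = env.get("spmi_jit_path", "clrjit.dll")  # the real JIT the shim forwards to
        env["COMPlus_JitName"] = "superpmi-shim-collector.dll"            # load the collector shim as the JIT

    # dotnet.exe/xunit were launched earlier without these variables, so only the
    # test's own jitting under corerun is collected.
    return subprocess.call([corerun, test_assembly, *extra_args], env=env)

if __name__ == "__main__":
    sys.exit(run_test_under_collection(sys.argv[1], sys.argv[2], sys.argv[3:]))
```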
1. Create a new 'superpmi' machine group for the internal pipeline that only has one Helix queue per architecture instead of multiple.
2. Copy superpmi.py and its dependents to the correlation payload superpmi_scripts sub-directory.
3. Always make output directories.
4. Set the MchFileTag value in run-test-job.yml.
1. Fix merged partitions to use "call" in front of batch file invocations so post-commands get executed.
2. Hack spmi_collection_partition to use zero instead of HELIX_WORKITEM_ID so we know what the filename will be when it gets downloaded to the 'helixresults' directory on the AzDO machine.
3. Use DownloadFilesFromResults to download the mch/mct/log files to the AzDO machine for further processing/merging. (We don't really need the mct file since we'll merge and regenerate it.)
Because of hack (2), each partition will generate the same filename, so they'll probably just overwrite each other on AzDO and we'll only get one. It's hard to see how we can pass a unique partition ID all the way down to the SPMI HelixPostCommand that can also be used in the DownloadFilesFromResults line.
1. Fix 'du' to only run on non-Windows.
2. Make 'du' run always, even after failure.
3. Make artifacts upload run always, even if merge-mch fails. (I'm seeing MCT creation fail but not MCH merging; maybe I can more easily repro if the MCH gets uploaded?)
4. Stop uploading MCT files from Helix machines (a new MCT will get created with the merged MCH). (Optionally, we should just not create the MCT on the Helix machines.)
Truncate any function signature that is too large (> 63KB).
We're going to merge and regenerate the overall TOC anyway, so no need to create it and download it to the AzDO machine.
If we don't run any tests when collecting, a zero-length MCH file will be created. Helix won't upload zero-length files, which leads to an error in the AzDO script when the HelixWorkItem.DownloadFilesFromResults item pointing to the expected .MCH file is not found. To work around this, convert zero-length files to special sentinel files during superpmi collection. Then, before merging files using merge-mch, delete these files. The `--ci` option must be passed to both superpmi.py "collect" and "merge-mch" commands to get this behavior.
Must have dropped it in an integration
Remove temporary code (except all other jobs are still commented out)
This change subsumes #74598
Latest test collection: https://dev.azure.com/dnceng/internal/_build/results?buildId=1980977&view=results
@kunalspathak @dotnet/jit-contrib PTAL. fyi @jkoritzinsky @trylek @hoyosjs: this shouldn't affect normal test runs, but FYI on this work.
ping
Will review later today. |
The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Should we stop collecting the coreclr_tests PMI collection since you are already collecting a tier0/tier1 coreclr tests run?
The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Who is doing it?
There are a few test crashes doing a collection.
How does the collection behave with those crashes? Will it ignore them and continue collecting?
@@ -45,7 +52,9 @@ jobs:
- ${{ if eq(parameters.platform, 'Linux_arm') }}:
- ${{ if eq(variables['System.TeamProject'], 'public') }}:
- (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- ${{ if eq(variables['System.TeamProject'], 'internal') }}:
- ${{ if and(eq(variables['System.TeamProject'], 'internal'), in(parameters.jobParameters.helixQueueGroup, 'superpmi')) }}:
With changes in this file, do we need to remove setting the `helix_queue` variable in `superpmi_collect_setup.py`?
No: these changes are used by the coreclr_tests run collection, but the Helix queue setup in `superpmi_collect_setup.py` is still used by the existing collection jobs.
Ideally, we would always use the `helix-queues-setup.yml` file to determine the Helix queues to use, so there is less chance that an update to the queues or Docker containers misses the `superpmi_collect_setup.py` case.
Ideally, we would always use the helix-queues-setup.yml file to determine the Helix queues to use, so there is less chance that an update to the queues or Docker containers misses the superpmi_collect_setup.py case.
How can we fix that?
Not sure; there's a lot of YML template magic going on. We possibly need helix-queues-setup.yml to pass something through to superpmi_collect_setup.py?
We can consider it. PMI is slightly different, but maybe the difference doesn't matter enough to keep doing PMI collections? E.g.,
I assume the question is: why are the "run" collections 2x the size of the "PMI" collections? Most likely because the "run" collections have both tiering enabled and tiering disabled runs.
The test run stops because JIT asserts kill the xunit run. However, everything collected up to that point contributes to the final .MCH file.
I meant to ask that question for ^ this.
So, who is updating the Docker containers? Me: #75167 (still iterating) |
Yes, because not only are we doing an extra collection, but it also runs the superpmi-diff and superpmi-replay jobs. At least when I do a local superpmi-diff run, I mostly check asp, benchmarks, libraries-pmi, and sometimes crossgen2. Likewise, in reviewing someone's PR, I would just check the diffs in those few collections. I would suggest we stop collecting the coreclr_tests PMI collection to concentrate on the small subset of collections that really matters.
As a follow-up, please remove the coreclr_tests PMI collection.
With #74961, we have a collection of a run of CoreCLR tests. That makes the PMI collection of the tests mostly duplicative.
This change enables all-platform SuperPMI collection of a run of the CoreCLR tests (not just a PMI or crossgen2 compilation
of the tests).
The collection process works as follows. Compared to the other SuperPMI collection jobs, which manage the partitioning of Helix work items and driving of the collection process completely, this job uses the existing (and very complex) CoreCLR test process, but adds only as much additional logic as needed to enable SuperPMI collection. In particular, the normal test run job `/eng/pipelines/common/templates/runtimes/run-test-job.yml` is used, just passing the additional parameter `SuperPmiCollect: true`. `run-test-job.yml` then defines a few extra dependencies and a few extra variables, passes `SuperPmiCollect: true` down to the `send-to-helix-step.yml` template that runs tests, and then, after tests are complete, collects the per-Helix-job generated MCH files, merges them, generates the merged .MCT file, uploads the generated files to Azure Storage, and uploads appropriate log files.
Tests are run by Helix using a configuration set by `helixpublishwitharcade.proj`. New Helix pre-commands are added to set an environment variable `spmi_enable_collection` if `SuperPmiCollect` is `true`, as well as a `spmi_collect_dir` variable for where the individual SuperPMI generated .MC files should go. New Helix post-commands are added to merge all the .MC files generated by the executed tests in the Helix partition, and put the resulting .MCH file in the upload directory where it will get uploaded to artifact storage.
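For illustration, here is a small Python sketch of what such a post-command amounts to: sweep the partition's collection directory and merge any .MC files into one .MCH with the `mcs` tool. The directory layout and the exact `mcs -merge` invocation are assumptions, not the literal pre/post commands added by this change:

```python
import glob
import os
import subprocess

def merge_partition_mc_files(mcs_path, collect_dir, output_mch):
    """Merge all .mc files produced by one Helix partition into a single .mch file."""
    mc_files = glob.glob(os.path.join(collect_dir, "**", "*.mc"), recursive=True)
    if not mc_files:
        # Mirror the behavior described below: even with nothing to merge, a
        # (zero-length) .mch is produced so downstream steps have a file to find.
        open(output_mch, "wb").close()
        return 0
    # 'mcs -merge <output> <pattern> -recursive' combines individual method contexts
    # into one file (invocation shown as an assumption about the tool's CLI).
    return subprocess.call([mcs_path, "-merge", output_mch,
                            os.path.join(collect_dir, "*.mc"), "-recursive"])
```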
The per-test generated wrapper scripts have been augmented to check for a `spmi_enable_collection` variable and, if it is set, the appropriate variables are set that enable SuperPMI collection.
The result of all of this is a single merged .MCH file for each platform, uploaded to the usual Azure Storage location where
superpmi.py will find it.
Currently, we are running the `outerloop` test group. That is defined by `run-test-job.yml` to mean the `normal` and `no_tiered_compilation` scenarios of every test. So, the collection contains both Tier-0 and Tier-1 / tiering disabled (fully optimized) code. The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
There are a few rough edges currently:
- Some machines are missing the `pip` tool, so AzDO upload is failing. This requires updating the Docker containers used for these runs.
- There are a few test crashes doing a collection. They might be related to the `COMPlus_EnableExtraSuperPmiQueries` variable being set, but need to be investigated.
This change also includes some clean-up to the existing SuperPMI collection scripts:
The SuperPMI collection scripts were initially written to do PMI-based collections, and crossgen2 and benchmarks were added later. This change cleans up the scripts to only do what is required for a particular collection type. E.g., don't clone and build jitutils to get pmi.dll for crossgen2 collections, where it is not needed. Additional documentation is added and things are renamed to be more clear, if possible. Also, don't pass PMI-specific arguments to superpmi.py for crossgen2 collections (where they are ignored, but might be confusing to readers).
In addition, the superpmi_collect_setup.py script gets more argument validation. This isn't terribly necessary since it's only called in one place in the CI scripts, but it is useful for documentation, and helps when you are calling the script manually as part of testing changes. I added a `-payload_directory` argument that specifies where the correlation and work item payloads should be placed, instead of assuming they should be in "well-known" directories in the source tree. Once again, this is useful for testing.
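As a rough sketch of the kind of argument validation described, here is a minimal Python example; apart from `-payload_directory`, the option names, accepted collection types, and default directory are illustrative rather than the script's actual interface:

```python
import argparse
import os

parser = argparse.ArgumentParser(
    description="Illustrative sketch of superpmi_collect_setup.py-style argument validation.")
parser.add_argument("-collection_type", required=True,
                    choices=["pmi", "crossgen2", "run"],  # hypothetical set of types for this sketch
                    help="Kind of collection being set up")
parser.add_argument("-payload_directory", default=None,
                    help="Where to place the correlation and work item payloads, instead of "
                         "assuming well-known directories in the source tree")

def validate_args(args):
    # Fail early with a readable message instead of deep inside the setup logic.
    if args.payload_directory is None:
        args.payload_directory = os.path.join(os.getcwd(), "payload")  # hypothetical default
    args.payload_directory = os.path.abspath(args.payload_directory)
    os.makedirs(args.payload_directory, exist_ok=True)
    return args

if __name__ == "__main__":
    args = validate_args(parser.parse_args())
    print(f"collection_type={args.collection_type}, payload_directory={args.payload_directory}")
```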
There is an unfortunate hack that has been added to handle a problem with zero-length .MCH files. Due to the way Helix test partitioning works, we sometimes send a test assembly to a Helix machine and no tests from that assembly are run, most likely due to test exclusions. In this case, we generate a zero-length .MCH file on the test machine (the SuperPMI merge process generates this even if there are no .MC files to merge). Helix, by design, does not upload zero-length files to artifact storage. The Helix work item specifies, using the `<DownloadFilesFromResults>` property, the names of all files we want to download to the AzDO machine from artifact storage after the Helix test run is complete. This property doesn't allow for optional downloads, and will fail the job if a file doesn't exist. A zero-length file will thus cause the job to fail because it won't be found in artifact storage. The solution here is that after merging the .MCH file on the Helix machine, if it is zero-length, we replace it with a "sentinel" file with the exact contents "ZEROLENGTH". On the AzDO machine, before merging MCH files, we delete any file with the contents "ZEROLENGTH". These behaviors are added to superpmi.py commands under the new `--ci` argument.
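A minimal Python sketch of the two halves of this workaround, as they might appear behind the `--ci` flag; the function names and file handling here are illustrative rather than the actual superpmi.py implementation:

```python
import os

SENTINEL_CONTENTS = b"ZEROLENGTH"

def replace_zero_length_mch_with_sentinel(mch_file):
    """Helix side: Helix won't upload a zero-length file, which would later break
    <DownloadFilesFromResults>, so give the file recognizable placeholder contents."""
    if os.path.getsize(mch_file) == 0:
        with open(mch_file, "wb") as f:
            f.write(SENTINEL_CONTENTS)

def delete_sentinel_files(mch_files):
    """AzDO side: before merge-mch, drop any downloaded file that is just the sentinel."""
    kept = []
    for path in mch_files:
        with open(path, "rb") as f:
            # Read one extra byte so a file that merely starts with the sentinel
            # text is not mistaken for an actual sentinel file.
            if f.read(len(SENTINEL_CONTENTS) + 1) == SENTINEL_CONTENTS:
                os.remove(path)
                continue
        kept.append(path)
    return kept
```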
Miscellaneous notes on the implementation:
- Batch file invocations in the merged test partitions need to use `call`, or else the Helix post-commands don't get executed.
- `mcs -strip` is augmented to simply copy an MCH file, without removing anything, if the strip 'range' is empty. This makes it easier to implement the "clean" phase without needing a separate step to determine if an `mcs -strip` command is necessary or not.
- `mcs -toc` is fixed to handle extremely long method full names. In particular, we will truncate them to fit in our pre-existing static buffer. Before, `sprintf_s` would crash or show a modal CRT dialog box if the name was too long. I found a test case where the full method name was 190KB due to a huge number of arguments.