Enable SuperPMI collection of CoreCLR test run #74961

Merged

Conversation

BruceForstall
Member

This change enables all-platform SuperPMI collection of a run of the CoreCLR tests (not just a PMI or crossgen2 compilation
of the tests).

The collection process works as follows. Unlike the other SuperPMI collection jobs, which manage the partitioning of
Helix work items and drive the collection process entirely themselves, this job uses the existing (and very complex) CoreCLR test
process, adding only the minimal additional logic needed to enable SuperPMI collection. In particular, the normal test run job,
/eng/pipelines/common/templates/runtimes/run-test-job.yml, is used, just passing the additional parameter
SuperPmiCollect: true.

run-test-job.yml then defines a few extra dependencies and variables and passes SuperPmiCollect: true down to the
send-to-helix-step.yml template that runs the tests. After the tests complete, it collects the .MCH files generated by each Helix job,
merges them, generates the merged .MCT file, uploads the generated files to Azure Storage, and uploads the appropriate log files.

Tests are run by Helix using a configuration set by helixpublishwitharcade.proj. New Helix pre-commands are added to set an
environment variable, spmi_enable_collection, if SuperPmiCollect is true, as well as a spmi_collect_dir variable specifying where the
individual SuperPMI-generated .MC files should go. New Helix post-commands are added to merge all the .MC files generated
by the executed tests in the Helix partition and to put the resulting .MCH file in the upload directory, from which it is uploaded to
artifact storage.
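
For concreteness, here is a minimal sketch of that post-command merge step in Python. It assumes the standard HELIX_WORKITEM_UPLOAD_ROOT upload directory and superpmi.py's merge-mch command; the real post-command is a shell one-liner, and the output file name here is illustrative.

```python
import os
import subprocess

collect_dir = os.environ["spmi_collect_dir"]           # where the tests wrote their .MC files
upload_dir = os.environ["HELIX_WORKITEM_UPLOAD_ROOT"]  # Helix uploads this directory to artifact storage

# Merge all per-method .MC files in the partition into one .MCH in the upload directory.
subprocess.run(
    ["python3", "superpmi.py", "merge-mch", "--ci",
     "-output_mch_path", os.path.join(upload_dir, "coreclr_tests.run.mch"),  # illustrative name
     "-pattern", os.path.join(collect_dir, "*.mc")],
    check=True)
```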

The per-test generated wrapper scripts have been augmented to check for the spmi_enable_collection variable and, if it is
set, to set the variables that enable SuperPMI collection.
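
As a rough illustration of what those variables are: the real wrapper scripts are generated batch/bash, so this Python sketch only shows the variables involved. The SuperPMIShim* and COMPlus_JitName names follow the SuperPMI documentation; the Linux x64 library names are an assumption for the example.

```python
import os

if os.environ.get("spmi_enable_collection"):
    core_root = os.environ["CORE_ROOT"]
    # Load the collector shim as "the JIT"...
    os.environ["COMPlus_JitName"] = "libsuperpmi-shim-collector.so"
    # ...tell the shim where the real JIT lives...
    os.environ["SuperPMIShimPath"] = os.path.join(core_root, "libclrjit.so")
    # ...and where to write the per-method .MC files.
    os.environ["SuperPMIShimLogPath"] = os.environ["spmi_collect_dir"]
```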

The result of all of this is a single merged .MCH file for each platform, uploaded to the usual Azure Storage location where
superpmi.py will find it.

Currently, we are running the outerloop test group. That is defined by run-test-job.yml to mean the normal and no_tiered_compilation scenarios of every test. So, the collection contains both Tier-0 and Tier-1 / tiering disabled (fully optimized) code.
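
Roughly, the two scenarios differ only in tiering configuration. A hedged sketch, following the usual COMPlus_* convention for CoreCLR test scenarios rather than code from this PR:

```python
# Scenario -> extra environment variables for the test run (illustrative).
SCENARIOS = {
    "normal": {},                                                 # default tiering: Tier-0, then Tier-1 when hot
    "no_tiered_compilation": {"COMPlus_TieredCompilation": "0"},  # fully optimized code only
}
```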

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

There are a few rough edges currently:

  1. The Linux-arm and Linux-arm64 AzDO Docker containers don't include the Python pip tool, so AzDO upload is failing.
    This requires updating the Docker containers used for these runs.
  2. There are a few test crashes while doing a collection. These look related to the COMPlus_EnableExtraSuperPmiQueries
    variable being set, but they still need to be investigated.

This change also includes some clean-up to the existing SuperPMI collection scripts:

The SuperPMI collection scripts were initially written to do PMI-based collections, and crossgen2 and benchmarks were added
later. This change cleans up the scripts to only do what is required for a particular collection type. E.g., don't clone and
build jitutils to get pmi.dll for crossgen2 collections, where it is not needed. Additional documentation is added, and things are
renamed for clarity where possible. Also, PMI-specific arguments are no longer passed to superpmi.py for crossgen2 collections
(where they are ignored, but might confuse readers).

In addition, the superpmi_collect_setup.py script gets more argument validation. This isn't strictly necessary, since it's
only called in one place in the CI scripts, but it serves as documentation and helps when calling the script manually
while testing changes. I added a -payload_directory argument that specifies where the
correlation and work item payloads should be placed, instead of assuming they belong in "well-known" directories
in the source tree. Once again, this is useful for testing.
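
A sketch of the flavor of validation added, assuming argparse and the single-dash argument style the script already uses; the specific check is illustrative, not the PR's exact code:

```python
import argparse
import os

parser = argparse.ArgumentParser(description="Set up a SuperPMI collection job")
parser.add_argument("-payload_directory",
                    help="directory in which to place the correlation and work item payloads")
args = parser.parse_args()

# Fail early with a clear message instead of a confusing error later in the job.
if args.payload_directory is not None and os.path.isfile(args.payload_directory):
    parser.error("-payload_directory must name a directory, not an existing file")
```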

There is an unfortunate hack added to handle a problem with zero-length .MCH files. Due to the way Helix
test partitioning works, we sometimes send a test assembly to a Helix machine and no tests from that assembly are run,
most likely due to test exclusions. In this case, we generate a zero-length .MCH file on the test machine (the SuperPMI merge
process generates one even if there are no .MC files to merge). Helix, by design, does not upload zero-length files to
artifact storage. The Helix work item specifies, using the <DownloadFilesFromResults> property,
the names of all files we want to download to the AzDO machine from artifact storage after the Helix test run
is complete. This property doesn't allow for optional downloads, and fails the job if a file doesn't exist.
A zero-length file thus causes the job to fail, because it won't be found in
artifact storage. The solution is that after merging the .MCH file on the Helix machine, if it is zero-length, we replace it
with a "sentinel" file with the exact contents "ZEROLENGTH". On the AzDO machine, before merging MCH files, we delete
any file with the contents "ZEROLENGTH". These behaviors are added to superpmi.py commands behind the new --ci
argument.
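
A minimal sketch of that --ci behavior, with illustrative function names (the sentinel contents come straight from the PR description):

```python
import os

SENTINEL = b"ZEROLENGTH"

def sentinel_replace_if_empty(mch_path):
    """On the Helix machine: give an empty merged .MCH recognizable contents
    so Helix uploads it and DownloadFilesFromResults can find it."""
    if os.path.getsize(mch_path) == 0:
        with open(mch_path, "wb") as f:
            f.write(SENTINEL)

def remove_sentinel_files(directory):
    """On the AzDO machine, before merge-mch: delete sentinel files so they
    don't participate in the merge."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) == len(SENTINEL):
            with open(path, "rb") as f:
                if f.read() == SENTINEL:
                    os.remove(path)
```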

Miscellaneous notes on the implementation:

  1. A bug was fixed in the invocation of merged wrapper tests on Windows: the batch script must be prefixed with call, or else the Helix post-commands don't get executed.
  2. The SuperPMI collection pipeline runs in the 'internal' instance, so it uses the Helix queues defined for the internal instance. Typically, for the public instance with a PR test, only a single queue per architecture is defined. For 'internal', multiple queues are often defined, to get greater OS coverage. I defined a new 'internal' machine subset for 'superpmi' that is exactly one machine per platform, since we don't want to collect each platform more than once.
  3. All the post-Helix SuperPMI processing steps are set to always run, even if previous steps (such as the test run job) failed. This is because we want to collect and publish as much data as possible, no matter what failures are encountered.
  4. mcs -strip is augmented to simply copy an MCH file, without removing anything, if the strip 'range' is empty. This makes it easier to implement the "clean" phase without needing a separate step to determine whether an mcs -strip command is necessary.
  5. A bug was fixed in mcs -toc to handle extremely long method full names: we now truncate them to fit in our pre-existing static buffer (see the sketch after this list). Before, sprintf_s would crash or show a modal CRT dialog box if the name was too long. I found a test case where the full method name was 190KB due to a huge number of arguments.
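
The truncation idea from note 5, expressed in Python for illustration (the real fix is in the C++ mcs tool; the 63KB limit comes from the commit message below, and the function name is hypothetical):

```python
MAX_TOC_NAME = 63 * 1024  # truncate method signatures larger than 63KB

def fit_method_name(full_name: str) -> str:
    # Truncate rather than overflow the pre-sized static buffer, which
    # previously made sprintf_s crash or show a modal CRT dialog.
    return full_name[:MAX_TOC_NAME]
```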

@ghost ghost assigned BruceForstall Sep 1, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 1, 2022
@ghost

ghost commented Sep 1, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


Instead of using our existing SuperPMI collection scripts, which do
test partitioning for PMI and crossgen2 collections, we use the existing
(and very complex) test partitioning and running scripting, and add
steps beforehand to enable SuperPMI collection and steps afterward to process
the results.

The process is as follows:
1. The top-level superpmi-collect.yml file has a new task, which invokes
the test run scripting /eng/pipelines/common/templates/runtimes/run-test-job.yml
with a new argument: `SuperPmiCollect: true`
2. run-test-job.yml then adds a new dependency and various variable definitions,
and invokes the send-to-helix-step.yml file with the new SuperPmiCollect variable.
It also adds a number of post-Helix steps to (a) merge the MCH files from all the
Helix jobs, (b) upload the merged MCH result to Azure Storage, and (c) upload SuperPMI
log files to artifacts.
3. The Helix invocation ends up at helixpublishwitharcade.proj, which runs a set of
tests. It adds pre-commands to set environment variables to trigger SuperPMI collection
and post-commands to merge all the generated MC files to a single MCH file.
The environment variable `spmi_enable_collection` is set to trigger the collection.
4. The batch/bash wrapper scripts key on this variable and set up the SuperPMI
collection variables. The idea is to set the SuperPMI collection variables as late
as possible, namely, immediately before corerun.exe is invoked on the test. Specifically,
we don't set the variables earlier, before dotnet.exe and xunit are invoked.

Note: this collection process doesn't currently handle the new "merged" tests.
1. Create new 'superpmi' machine group for internal pipeline that only has one
Helix queue per architecture instead of multiple.
2. Copy superpmi.py and dependents to correlation payload superpmi_scripts
sub-directory.
3. Always make output directories
4. Set MchFileTag value in run-test-job.yml.
1. Fix merged partitions to use "call" in front of batch file invocations so
post commands get executed.
2. Hack spmi_collection_partition to use zero instead of HELIX_WORKITEM_ID
so we know what the filename will be when it gets downloaded to the 'helixresults'
directory on the AzDO machine.
3. Use DownloadFilesFromResults to download mch/mct/log file to AzDO machine for
further processing/merging.
(don't really need mct file since we'll merge and regenerate it)

Because of hack (2), each partition will generate the same filename so probably they'll
just overwrite each other on AzDO and we'll only get one. It's hard to see how we
can pass a unique partition ID all the way down to the SPMI HelixPostCommand that can
also be used in the DownloadFilesFromResults line.
1. Fix 'du' to only run on non-Windows
2. Make 'du' run always, even after failure
3. Make artifacts upload run always, even if merge-mch fails
(I'm seeing MCT creation fail but not MCH merging; maybe I can more easily
repro if the MCH gets uploaded?)
4. Stop uploading MCT files from Helix machines (a new MCT will get created
with the merged MCH). (Optionally, we should just not create the MCT on the
Helix machines.)
Truncate any function signature that is too large (> 63KB).
We're going to merge and regenerate the overall TOC anyway,
so no need to create it and download it to the AzDO machine.
If we don't run any tests when collecting, a zero-length MCH file will be
created. Helix won't upload zero-length files, which leads to an error in the
AzDO script when the HelixWorkItem.DownloadFilesFromResults item pointing to
the expected .MCH file is not found.

To work around this, convert zero-length files to special sentinel files during
superpmi collection. Then, before merging files using merge-mch, delete these files.

The `--ci` option must be passed to both superpmi.py "collect" and "merge-mch" commands
to get this behavior.
Must have dropped it in an integration
Remove temporary code

(except all other jobs are still commented out)
@BruceForstall force-pushed the EnableSpmiCollectionOfCoreClrTests branch from 35b3d5e to 68b8628 on September 1, 2022, 20:32
@BruceForstall
Member Author

This change subsumes #74598

@BruceForstall
Member Author

mcs -jitflags on a sample win-x64 collection yields:

Individual Flag Appearances

   78839   18.77%  DEBUG_CODE
  416284   99.08%  DEBUG_INFO
     529    0.13%  MIN_OPT
     395    0.09%  OSR
     990    0.24%  PROF_ENTERLEAVE
     476    0.11%  PROF_NO_PINVOKE_INLINE
  420136  100.00%  SKIP_VERIFICATION
    3852    0.92%  IL_STUB
     200    0.05%  BBINSTR
  249929   59.49%  BBOPT
   37319    8.88%  FRAMED
    2546    0.61%  PUBLISH_SECRET_PARAM
    1044    0.25%  REVERSE_PINVOKE
     912    0.22%  TRACK_TRANSITIONS
  169678   40.39%  TIER0
   17389    4.14%  TIER1
       5    0.00%  HAS_METHOD_PROFILE
     128    0.03%  HAS_DYNAMIC_PROFILE
    4309    1.03%  HAS_STATIC_PROFILE
    3100    0.74%  HAS_LIKELY_CLASS
     112    0.03%  HAS_CLASS_PROFILE
    2510    0.60%  HAS_EDGE_PROFILE
    4437    1.06%  HAS_PGO

@BruceForstall
Member Author

@kunalspathak @dotnet/jit-contrib PTAL

fyi @jkoritzinsky @trylek @hoyosjs: this shouldn't affect normal test runs, but flagging this work for your awareness.

@BruceForstall
Member Author

ping

@kunalspathak
Member

ping

Will review later today.

@kunalspathak (Member) left a comment

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

Should we stop collecting the coreclr_tests PMI collection, since you are already collecting a Tier-0/Tier-1 CoreCLR tests run?

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

Who is doing it?

There are a few test crashes doing a collection.

How does the collection handle those crashes? Does it ignore them and continue collecting?

@@ -45,7 +52,9 @@ jobs:
- ${{ if eq(parameters.platform, 'Linux_arm') }}:
- ${{ if eq(variables['System.TeamProject'], 'public') }}:
- (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- ${{ if eq(variables['System.TeamProject'], 'internal') }}:
- ${{ if and(eq(variables['System.TeamProject'], 'internal'), in(parameters.jobParameters.helixQueueGroup, 'superpmi')) }}:
Member

With changes in this file, do we need to remove setting the helix_queue variable in superpmi_collect_setup.py?

Member Author

No: these changes are used by the coreclr_tests run collection, but the Helix queue setup in superpmi_collect_setup.py is still used by the existing collection jobs.

Ideally, we would always use the helix-queues-setup.yml file to determine which Helix queues to use, so that updates to the queues or Docker containers are less likely to miss the superpmi_collect_setup.py case.

Member

Ideally, we would always use the helix-queues-setup.yml file to determine which Helix queues to use, so that updates to the queues or Docker containers are less likely to miss the superpmi_collect_setup.py case.

How can we fix that?

Member Author

Not sure; there's a lot of YML template magic going on. We possibly need helix-queues-setup.yml to pass something through to superpmi_collect_setup.py?

@BruceForstall
Member Author

Should we stop collecting the coreclr_tests PMI collection, since you are already collecting a Tier-0/Tier-1 CoreCLR tests run?

We can consider it. PMI is slightly different, but maybe the difference doesn't matter enough to keep doing PMI collections? E.g.,

  1. It JITs every function, not just those that are run.
  2. It possibly generates more generic instantiations, though it's debatable whether the "extra" instantiations, if any, are useful beyond the instantiations that the run actually uses.

| The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Who is doing it?

I assume the question is: why are the "run" collections 2x the size of the "PMI" collections? Most likely because the "run" collections have both tiering enabled and tiering disabled runs.

How does the collection handle those crashes? Does it ignore them and continue collecting?

The test run stops, because JIT asserts kill the xunit run. However, everything collected up to that point contributes to the final .MCH file.

@kunalspathak
Member

| The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Who is doing it?

I assume the question is: why are the "run" collections 2x the size of the "PMI" collections? Most likely because the "run" collections have both tiering enabled and tiering disabled runs.

This requires updating the Docker containers used for these runs.

I meant to ask that question for ^ this.

@BruceForstall
Member Author

I meant to ask that question for ^ this.

So, who is updating the Docker containers? Me: #75167 (still iterating)

@kunalspathak
Member

the difference doesn't matter enough to keep doing PMI collections

Yes, because not only are we doing an extra collection, but it also runs the superpmi-diff and superpmi-replay jobs. At least when I do a local superpmi-diff run, I mostly check asp, benchmarks, libraries-pmi, and sometimes crossgen2. Likewise, when reviewing someone's PR, I would just check the diffs in those few collections. I would suggest stopping the coreclr_tests PMI collection to concentrate on the small subset of collections that really matters.

@kunalspathak (Member) left a comment

As a follow-up, please remove the coreclr_tests PMI collection.

@BruceForstall BruceForstall merged commit ae767d9 into dotnet:main Sep 7, 2022
@BruceForstall BruceForstall deleted the EnableSpmiCollectionOfCoreClrTests branch September 7, 2022 19:05
BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Sep 7, 2022
With dotnet#74961, we have a collection
of a run of CoreCLR tests. That makes the PMI collection of the tests
mostly duplicative.
BruceForstall added a commit that referenced this pull request Sep 7, 2022
With #74961, we have a collection
of a run of CoreCLR tests. That makes the PMI collection of the tests
mostly duplicative.
@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022