Enable SuperPMI collection of CoreCLR test run #74961

Merged

Conversation

BruceForstall
Member

This change enables all-platform SuperPMI collection of a run of the CoreCLR tests (not just a PMI or crossgen2 compilation
of the tests).

The collection process works as follows. Unlike the other SuperPMI collection jobs, which manage the partitioning of
Helix work items and drive the collection process entirely themselves, this job uses the existing (and very complex) CoreCLR test
process, adding only the minimal additional logic needed to enable SuperPMI collection. In particular, the normal test run job,
/eng/pipelines/common/templates/runtimes/run-test-job.yml, is used, just passing the additional parameter
SuperPmiCollect: true.

run-test-job.yml then defines a few extra dependencies and variables and passes SuperPmiCollect: true down to the
send-to-helix-step.yml template that runs the tests. After the tests complete, it collects the .MCH files generated by each Helix job,
merges them, generates the merged .MCT file, uploads the generated files to Azure Storage, and uploads the appropriate log files.

Tests are run by Helix using a configuration set by helixpublishwitharcade.proj. New Helix pre-commands are added to set an
environment variable, spmi_enable_collection, if SuperPmiCollect is true, as well as a spmi_collect_dir variable specifying where the
individual SuperPMI-generated .MC files should go. New Helix post-commands are added to merge all the .MC files generated
by the executed tests in the Helix partition and to put the resulting .MCH file in the upload directory, from which it is uploaded to
artifact storage.
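
For concreteness, here is a minimal sketch of that post-command merge step in Python. It assumes the standard HELIX_WORKITEM_UPLOAD_ROOT upload directory and superpmi.py's merge-mch command; the real post-command is a shell one-liner, and the output file name here is illustrative.

```python
import os
import subprocess

collect_dir = os.environ["spmi_collect_dir"]           # where the tests wrote their .MC files
upload_dir = os.environ["HELIX_WORKITEM_UPLOAD_ROOT"]  # Helix uploads this directory to artifact storage

# Merge all per-method .MC files in the partition into one .MCH in the upload directory.
subprocess.run(
    ["python3", "superpmi.py", "merge-mch", "--ci",
     "-output_mch_path", os.path.join(upload_dir, "coreclr_tests.run.mch"),  # illustrative name
     "-pattern", os.path.join(collect_dir, "*.mc")],
    check=True)
```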

The per-test generated wrapper scripts have been augmented to check for the spmi_enable_collection variable and, if it is
set, to set the variables that enable SuperPMI collection.
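
As a rough illustration of what those variables are: the real wrapper scripts are generated batch/bash, so this Python sketch only shows the variables involved. The SuperPMIShim* and COMPlus_JitName names follow the SuperPMI documentation; the Linux x64 library names are an assumption for the example.

```python
import os

if os.environ.get("spmi_enable_collection"):
    core_root = os.environ["CORE_ROOT"]
    # Load the collector shim as "the JIT"...
    os.environ["COMPlus_JitName"] = "libsuperpmi-shim-collector.so"
    # ...tell the shim where the real JIT lives...
    os.environ["SuperPMIShimPath"] = os.path.join(core_root, "libclrjit.so")
    # ...and where to write the per-method .MC files.
    os.environ["SuperPMIShimLogPath"] = os.environ["spmi_collect_dir"]
```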

The result of all of this is a single merged .MCH file for each platform, uploaded to the usual Azure Storage location where
superpmi.py will find it.

Currently, we are running the outerloop test group. That is defined by run-test-job.yml to mean the normal and no_tiered_compilation scenarios of every test. So, the collection contains both Tier-0 and Tier-1 / tiering disabled (fully optimized) code.
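
Roughly, the two scenarios differ only in tiering configuration. A hedged sketch, following the usual COMPlus_* convention for CoreCLR test scenarios rather than code from this PR:

```python
# Scenario -> extra environment variables for the test run (illustrative).
SCENARIOS = {
    "normal": {},                                                 # default tiering: Tier-0, then Tier-1 when hot
    "no_tiered_compilation": {"COMPlus_TieredCompilation": "0"},  # fully optimized code only
}
```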

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

There are a few rough edges currently:

  1. The Linux-arm and Linux-arm64 AzDO Docker containers don't include the Python pip tool, so AzDO upload is failing.
    This requires updating the Docker containers used for these runs.
  2. There are a few test crashes while doing a collection. These look related to the COMPlus_EnableExtraSuperPmiQueries
    variable being set, but they still need to be investigated.

This change also includes some clean-up to the existing SuperPMI collection scripts:

The SuperPMI collection scripts were initially written to do PMI-based collections, and crossgen2 and benchmarks were added
later. This change cleans up the scripts to only do what is required for a particular collection type. E.g., don't clone and
build jitutils to get pmi.dll for crossgen2 collections, where it is not needed. Additional documentation is added, and things are
renamed for clarity where possible. Also, PMI-specific arguments are no longer passed to superpmi.py for crossgen2 collections
(where they are ignored, but might confuse readers).

In addition, the superpmi_collect_setup.py script gets more argument validation. This isn't strictly necessary, since it's
only called in one place in the CI scripts, but it serves as documentation and helps when calling the script manually
while testing changes. I added a -payload_directory argument that specifies where the
correlation and work item payloads should be placed, instead of assuming they belong in "well-known" directories
in the source tree. Once again, this is useful for testing.
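
A sketch of the flavor of validation added, assuming argparse and the single-dash argument style the script already uses; the specific check is illustrative, not the PR's exact code:

```python
import argparse
import os

parser = argparse.ArgumentParser(description="Set up a SuperPMI collection job")
parser.add_argument("-payload_directory",
                    help="directory in which to place the correlation and work item payloads")
args = parser.parse_args()

# Fail early with a clear message instead of a confusing error later in the job.
if args.payload_directory is not None and os.path.isfile(args.payload_directory):
    parser.error("-payload_directory must name a directory, not an existing file")
```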

There is an unfortunate hack added to handle a problem with zero-length .MCH files. Due to the way Helix
test partitioning works, we sometimes send a test assembly to a Helix machine and no tests from that assembly are run,
most likely due to test exclusions. In this case, we generate a zero-length .MCH file on the test machine (the SuperPMI merge
process generates one even if there are no .MC files to merge). Helix, by design, does not upload zero-length files to
artifact storage. The Helix work item specifies, using the <DownloadFilesFromResults> property,
the names of all files we want to download to the AzDO machine from artifact storage after the Helix test run
is complete. This property doesn't allow for optional downloads, and fails the job if a file doesn't exist.
A zero-length file thus causes the job to fail, because it won't be found in
artifact storage. The solution is that after merging the .MCH file on the Helix machine, if it is zero-length, we replace it
with a "sentinel" file with the exact contents "ZEROLENGTH". On the AzDO machine, before merging MCH files, we delete
any file with the contents "ZEROLENGTH". These behaviors are added to superpmi.py commands behind the new --ci
argument.
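
A minimal sketch of that --ci behavior, with illustrative function names (the sentinel contents come straight from the PR description):

```python
import os

SENTINEL = b"ZEROLENGTH"

def sentinel_replace_if_empty(mch_path):
    """On the Helix machine: give an empty merged .MCH recognizable contents
    so Helix uploads it and DownloadFilesFromResults can find it."""
    if os.path.getsize(mch_path) == 0:
        with open(mch_path, "wb") as f:
            f.write(SENTINEL)

def remove_sentinel_files(directory):
    """On the AzDO machine, before merge-mch: delete sentinel files so they
    don't participate in the merge."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) == len(SENTINEL):
            with open(path, "rb") as f:
                if f.read() == SENTINEL:
                    os.remove(path)
```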

Miscellaneous notes on the implementation:

  1. A bug was fixed in the invocation of merged wrapper tests on Windows: the batch script must be prefixed with call, or else the Helix post-commands don't get executed.
  2. The SuperPMI collection pipeline runs in the 'internal' instance, so it uses the Helix queues defined for the internal instance. Typically, for the public instance with a PR test, only a single queue per architecture is defined. For 'internal', multiple queues are often defined, to get greater OS coverage. I defined a new 'internal' machine subset for 'superpmi' that is exactly one machine per platform, since we don't want to collect each platform more than once.
  3. All the post-Helix SuperPMI processing steps are set to always run, even if previous steps (such as the test run job) failed. This is because we want to collect and publish as much data as possible, no matter what failures are encountered.
  4. mcs -strip is augmented to simply copy an MCH file, without removing anything, if the strip 'range' is empty. This makes it easier to implement the "clean" phase without needing a separate step to determine whether an mcs -strip command is necessary.
  5. A bug was fixed in mcs -toc to handle extremely long method full names: we now truncate them to fit in our pre-existing static buffer (see the sketch after this list). Before, sprintf_s would crash or show a modal CRT dialog box if the name was too long. I found a test case where the full method name was 190KB due to a huge number of arguments.
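
The truncation idea from note 5, expressed in Python for illustration (the real fix is in the C++ mcs tool; the 63KB limit comes from the commit message below, and the function name is hypothetical):

```python
MAX_TOC_NAME = 63 * 1024  # truncate method signatures larger than 63KB

def fit_method_name(full_name: str) -> str:
    # Truncate rather than overflow the pre-sized static buffer, which
    # previously made sprintf_s crash or show a modal CRT dialog.
    return full_name[:MAX_TOC_NAME]
```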

@ghost ghost assigned BruceForstall Sep 1, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 1, 2022
@ghost

ghost commented Sep 1, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


Instead of using our existing SuperPMI collection scripts, which do
test partitioning for PMI and crossgen2 collections, we use the existing
(and very complex) test partitioning and running scripting, and add
steps beforehand to enable SuperPMI collection and steps afterward to process
the results.

The process is as follows:
1. The top-level superpmi-collect.yml file has a new task, which invokes
the test run scripting /eng/pipelines/common/templates/runtimes/run-test-job.yml
with a new argument: `SuperPmiCollect: true`
2. run-test-job.yml then adds a new dependency and various variable definitions,
and invokes the send-to-helix-step.yml file with the new SuperPmiCollect variable.
It also adds a number of post-Helix steps to (a) merge the MCH files from all the
Helix jobs, (b) upload the merged MCH result to Azure Storage, and (c) upload SuperPMI
log files to artifacts.
3. The Helix invocation ends up at helixpublishwitharcade.proj, which runs a set of
tests. It adds pre-commands to set environment variables to trigger SuperPMI collection
and post-commands to merge all the generated MC files to a single MCH file.
The environment variable `spmi_enable_collection` is set to trigger the collection.
4. The batch/bash wrapper scripts key on this variable and set up the SuperPMI
collection variables. The idea is to set the SuperPMI collection variables as late
as possible, namely, immediately before corerun.exe is invoked on the test. Specifically,
we don't set the variables earlier, before dotnet.exe and xunit are invoked.

Note: this collection process doesn't currently handle the new "merged" tests.
1. Create new 'superpmi' machine group for internal pipeline that only has one
Helix queue per architecture instead of multiple.
2. Copy superpmi.py and dependents to correlation payload superpmi_scripts
sub-directory.
3. Always make output directories
4. Set MchFileTag value in run-test-job.yml.
1. Fix merged partitions to use "call" in front of batch file invocations so
post commands get executed.
2. Hack spmi_collection_partition to use zero instead of HELIX_WORKITEM_ID
so we know what the filename will be when it gets downloaded to the 'helixresults'
directory on the AzDO machine.
3. Use DownloadFilesFromResults to download mch/mct/log file to AzDO machine for
further processing/merging.
(don't really need mct file since we'll merge and regenerate it)

Because of hack (2), each partition will generate the same filename so probably they'll
just overwrite each other on AzDO and we'll only get one. It's hard to see how we
can pass a unique partition ID all the way down to the SPMI HelixPostCommand that can
also be used in the DownloadFilesFromResults line.
1. Fix 'du' to only run on non-Windows
2. Make 'du' run always, even after failure
3. Make artifacts upload run always, even if merge-mch fails
(I'm seeing MCT creation fail but not MCH merging; maybe I can more easily
repro if the MCH gets uploaded?)
4. Stop uploading MCT files from Helix machines (a new MCT will get created
with the merged MCH). (Optionally, we should just not create the MCT on the
Helix machines.)
Truncate any function signature that is too large (> 63KB).
We're going to merge and regenerate the overall TOC anyway,
so no need to create it and download it to the AzDO machine.
If we don't run any tests when collecting, a zero-length MCH file will be
created. Helix won't upload zero-length files, which leads to an error in the
AzDO script when the HelixWorkItem.DownloadFilesFromResults item pointing to
the expected .MCH file is not found.

To work around this, convert zero-length files to special sentinel files during
superpmi collection. Then, before merging files using merge-mch, delete these files.

The `--ci` option must be passed to both superpmi.py "collect" and "merge-mch" commands
to get this behavior.
Must have dropped it in an integration
Remove temporary code

(except all other jobs are still commented out)
@BruceForstall force-pushed the EnableSpmiCollectionOfCoreClrTests branch from 35b3d5e to 68b8628 on September 1, 2022, 20:32
@BruceForstall
Member Author

This change subsumes #74598

@BruceForstall
Member Author

mcs -jitflags on a sample win-x64 collection yields:

Individual Flag Appearances

   78839   18.77%  DEBUG_CODE
  416284   99.08%  DEBUG_INFO
     529    0.13%  MIN_OPT
     395    0.09%  OSR
     990    0.24%  PROF_ENTERLEAVE
     476    0.11%  PROF_NO_PINVOKE_INLINE
  420136  100.00%  SKIP_VERIFICATION
    3852    0.92%  IL_STUB
     200    0.05%  BBINSTR
  249929   59.49%  BBOPT
   37319    8.88%  FRAMED
    2546    0.61%  PUBLISH_SECRET_PARAM
    1044    0.25%  REVERSE_PINVOKE
     912    0.22%  TRACK_TRANSITIONS
  169678   40.39%  TIER0
   17389    4.14%  TIER1
       5    0.00%  HAS_METHOD_PROFILE
     128    0.03%  HAS_DYNAMIC_PROFILE
    4309    1.03%  HAS_STATIC_PROFILE
    3100    0.74%  HAS_LIKELY_CLASS
     112    0.03%  HAS_CLASS_PROFILE
    2510    0.60%  HAS_EDGE_PROFILE
    4437    1.06%  HAS_PGO

@BruceForstall
Member Author

@kunalspathak @dotnet/jit-contrib PTAL

fyi @jkoritzinsky @trylek @hoyosjs: this shouldn't affect normal test runs, but flagging this work for your awareness.

@BruceForstall
Member Author

ping

@kunalspathak
Member

ping

Will review later today.

@kunalspathak (Member) left a comment

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

Should we stop collecting the coreclr_tests PMI collection, since you are already collecting a Tier-0/Tier-1 CoreCLR tests run?

The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.

Who is doing it?

There are a few test crashes doing a collection.

How does the collection handle those crashes? Does it ignore them and continue collecting?

@@ -45,7 +52,9 @@ jobs:
- ${{ if eq(parameters.platform, 'Linux_arm') }}:
- ${{ if eq(variables['System.TeamProject'], 'public') }}:
- (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- ${{ if eq(variables['System.TeamProject'], 'internal') }}:
- ${{ if and(eq(variables['System.TeamProject'], 'internal'), in(parameters.jobParameters.helixQueueGroup, 'superpmi')) }}:
Member

With changes in this file, do we need to remove setting the helix_queue variable in superpmi_collect_setup.py?

Member Author

No: these changes are used by the coreclr_tests run collection, but the Helix queue setup in superpmi_collect_setup.py is still used by the existing collection jobs.

Ideally, we would always use the helix-queues-setup.yml file to determine which Helix queues to use, so that updates to the queues or Docker containers are less likely to miss the superpmi_collect_setup.py case.

Member

Ideally, we would always use the helix-queues-setup.yml file to determine which Helix queues to use, so that updates to the queues or Docker containers are less likely to miss the superpmi_collect_setup.py case.

How can we fix that?

Member Author

Not sure; there's a lot of YML template magic going on. We possibly need helix-queues-setup.yml to pass something through to superpmi_collect_setup.py?

@BruceForstall
Member Author

Should we stop collecting the coreclr_tests PMI collection, since you are already collecting a Tier-0/Tier-1 CoreCLR tests run?

We can consider it. PMI is slightly different, but maybe the difference doesn't matter enough to keep doing PMI collections? E.g.,

  1. It JITs every function, not just those that are run.
  2. It possibly generates more generic instantiations, though it's debatable whether the "extra" instantiations, if any, are useful beyond the instantiations that the run actually uses.

| The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Who is doing it?

I assume the question is: why are the "run" collections 2x the size of the "PMI" collections? Most likely because the "run" collections have both tiering enabled and tiering disabled runs.

How does the collection handle those crashes? Does it ignore them and continue collecting?

The test run stops, because JIT asserts kill the xunit run. However, everything collected up to that point contributes to the final .MCH file.

@kunalspathak
Member

| The collections are around 6GB, which is more than 2x the size of the coreclr_tests PMI collection.
Who is doing it?

I assume the question is: why are the "run" collections 2x the size of the "PMI" collections? Most likely because the "run" collections have both tiering enabled and tiering disabled runs.

This requires updating the Docker containers used for these runs.

I meant to ask that question for ^ this.

@BruceForstall
Member Author

I meant to ask that question for ^ this.

So, who is updating the Docker containers? Me: #75167 (still iterating)

@kunalspathak
Member

the difference doesn't matter enough to keep doing PMI collections

Yes, because not only are we doing an extra collection, but it also runs the superpmi-diff and superpmi-replay jobs. At least when I do a local superpmi-diff run, I mostly check asp, benchmarks, libraries-pmi, and sometimes crossgen2. Likewise, when reviewing someone's PR, I would just check the diffs in those few collections. I would suggest stopping the coreclr_tests PMI collection to concentrate on the small subset of collections that really matters.

@kunalspathak (Member) left a comment

As a follow-up, please remove the coreclr_tests PMI collection.

@BruceForstall BruceForstall merged commit ae767d9 into dotnet:main Sep 7, 2022
@BruceForstall BruceForstall deleted the EnableSpmiCollectionOfCoreClrTests branch September 7, 2022 19:05
BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Sep 7, 2022
With dotnet#74961, we have a collection
of a run of CoreCLR tests. That makes the PMI collection of the tests
mostly duplicative.
BruceForstall added a commit that referenced this pull request Sep 7, 2022
With #74961, we have a collection
of a run of CoreCLR tests. That makes the PMI collection of the tests
mostly duplicative.
@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022