Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: resolve issues with OSR and PGO #47942

Closed
Tracked by #74873
AndyAyersMS opened this issue Feb 6, 2021 · 5 comments · Fixed by #80481
Closed
Tracked by #74873

JIT: resolve issues with OSR and PGO #47942

AndyAyersMS opened this issue Feb 6, 2021 · 5 comments · Fixed by #80481
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

OSR and PGO don't play nicely together.

See #46882 (comment) and follow on comments.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 6, 2021
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Feb 6, 2021
@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Feb 6, 2021
This was referenced Feb 6, 2021
@AndyAyersMS
Copy link
Member Author

There's a workaround in place (use block profiling with OSR) that's good enough for now.

So, moving to Future.

@AndyAyersMS
Copy link
Member Author

The interaction of OSR and PGO is a bit more involved than I'd been thinking. There are a few sub-problems:

  1. The original method's PGO data will likely be incomplete. In particular there may not be any profile data "after" the patchpoint so large portions of the method may appear cold when the OSR method is being jitted.
  2. With sparse profiles we may even lose the ability to infer counts in the parts of the original method that were profiled, unless we model the potential impact of the OSR flow.
  3. This problem will persist if we ever rejit the method at Tier1, since any execution of the original method that would have accumulated profile data will be redirected to the OSR method.
  4. With sparse instrumentation the OSR method's flow graph needs to match the original method, initially, so we can determine the right placement of counts.

It seems plausible that we can fix 2, 3, and 4 by modelling OSR flow in the original method, deferring OSR flow updates until after the profile model has been built (which is happening as part of #59784) and then adding instrumentation to the root part of the OSR method (this is a bit novel, but seems doable, we just use the schema binding we get from reading the PGO data in the OSR method to emit "redundant" probes in the OSR method itself). The downside here is that the OSR method has some extra overhead, but maybe that's acceptable.

But fixing 1 seems hard...unless we have the OSR method return back to the original method , once we've exited the part of the OSR method flowgraph that is post-dominated by the patchpoint. Note there could be many such points, and each one might need a custom "fixup" action to restore the IL state in the original method from the live state in the OSR method; plus we might need to do lifetime extensions to ensure that we don't dead code something in the OSR method that's still needed back in the original.

@AndyAyersMS
Copy link
Member Author

Rough plan of attack.

  1. Keep using "dense" profiles for Tier0 methods with patchpoints.
  2. Don't use dynamic PGO data to optimize OSR methods (static PGO ok).
  3. Add ability to fetch schema and live data from VM. OSR method does this, verifies schema matches, and then instruments the parts of the code it will generate. Thus both the Tier0 and OSR methods will contribute to overall profile production.

So OSR methods won't be competitive with their Tier1 counterparts; this should be OK. But they should be much faster than Tier0 and perhaps that's good enough.

@AndyAyersMS
Copy link
Member Author

Working on the abovem ran into a complication. It's not possible to get the OSR method to produce the exact same schema as the Tier0 method.

They produce the same flow graph initially (since it's based on method IL). But they do different selective importation, and currently block and class instrumentation is gated on this:

bool ShouldProcess(BasicBlock* block) override
{
return ((block->bbFlags & (BBF_INTERNAL | BBF_IMPORTED)) == BBF_IMPORTED);
}

bool ShouldProcess(BasicBlock* block) override
{
return ((block->bbFlags & (BBF_INTERNAL | BBF_IMPORTED)) == BBF_IMPORTED);
}

We could fix this easily enough for counts by "over allocating" the schema for both Tier0 and OSR (basically pretending like we're going to import the entire method) because count probes don't depend on the IR in the block. The downside here is using more schema/data memory then necessary.

But for class probes the probe structure depends on the IR so if we import differently we can't hope to match class probe schema entries.

Some alternatives to explore:

  • update the runtime to recognize when one schema (the one we propose to use for the OSR method) is a subset of another (the Tier0 method), and properly update the offset info for the subset based on the superset. This should not be too hard, provided the runtime doesn't reorder the schema entries that the jit creates at Tier0.
  • do something similar, but on the jit side -- fetch the Tier0 schema/data and then verify each place we'd put a probe is already covered.

@AndyAyersMS
Copy link
Member Author

One other issue:

  • we may lose some PGO coverage for inlinees as we're not adding probes for them to the OSR method

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 11, 2023
Enable edge based profiles for OSR, partial compilation, and optimized plus
instrumented cases.

For OSR this requires deferring flow graph modifications until after we have
built the initial probe list, so that the initial list reflects the entirety
of the method. This set of candidate edge probes is thus the same no matter how
the method is compiled. A given compile may schematize a subset of these probes
and materialize a subset of what gets schematized; this is tolerated by the PGO
mechanism provided that the initial instrumented jitting produces a schema which
is a superset of the schema produced by any subsequent instrumented rejitting.
This is normally the case.

Partial compilation may still need some work to ensure full schematization but
it is currently off by default. Will address this subsequently.

For optimized compiles we give the EfficientEdgeCountInstrumentor the same kind
of probe relocation abilities that we have in the BlockCountInstrumentor. In
particular we need to move probes that might appear in return blocks that follow
implicit tail call blocks, since those return blocks must remain empty.

The details on how we do this are a bit different but the idea is the same: we
create duplicate copies of any probe that was going to appear in the return block
and instead instrument each pred. If the pred reached the return via a critical
edge, we split the edge and put the probe there. This analysis relies on cheap
preds, so to ensure we can use them we move all the critial edge splitting so it
happens before we need the cheap pred lists.

The ability to do block profiling is retained but will no longer be used without
special config settings.

There were also a few bug fixes in the spanning tree visitor. It must visit a
superset of the blocks we end up importing and was missing visits in some cases.

This should improve jit time and code quality for instrumented code.

Fixes dotnet#47942.
Fixes dotnet#66101.
Contributes to dotnet#74873.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 11, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 13, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant