Search native code in all R2R images in version bubble #68607

y-yamshchikov · 2022-04-27T17:06:38Z

This PR fixes part of #44948 and fixes #46160.

This code simply traverses through assemblies in the application
domain. For each assembly (module) it realizes is it Ready To Run and is
it in the same bubble with (does it deliberately bubbling the) module
from which generic function originates. If so, it makes request for code
is Ready To Run and hopes there is some in the module. If the request
succeeds it proceeds with found pointer to the bare native code.

Generic instantiations like System.Collections.Concurrent.ConcurrentDictionary`2[System.Int32,System.__Canon] contained within same version bubble are either a bug in crossgen2 or should be taken care of by having PGO data.

Now such methods use their Ready To Run code.

We have got significant performance gain on startup: 7% average on our representative set on Tizen.

This PR worked out notices in PRs below:
#47269
about linear search through the set of RangeSections. In this new PR we propose storing of RangeSections in sorted array (with number of optimizations inspired by prior linked list based solution).
#57277

about lockfree requirements

We have extensively tested this PR on armel/Tizen platform so in this case we are confident in reliability and profitability of the solution.

We have also implemented high coverage unit-testing model for RangeSection optimization and used it in single-processor single-thread load, single-processor multi-thread load and multi-processor multi-thread load with all combination of simultaneous add and/or read and/or delete (with and without two simultaneous readers and/or two simultaneous adders and/or two simultaneous deleters). Order of adding has had switches between ascending, descending and random modes. All these switches have been tested under debug and release setups of the algorithm. Strictly, we have performed full-factor experiment under selected options in the unit-test. The test can be accessed by the following link: https://github.com/y-yamshchikov/rs-model. The unit-test had been performed for linux-x64.
Main concept of the test is 1-to-1 copying of the code of RangeSection storage with decoupling from the CoreCLR by the least significant line (as far as it is possible in the case it is not in a separate file), so we can be sure an error-prone surface is smallest.
Of course, we have performed priority1 tests and performance tests after all.

In this PR (vs #57277) we have implemented new synchronization mechanism for RangeSection storage. It grants lock-free (but not wait-free) access and Log(n) complexity. Primitives have the same names: ReaderLockHolder and WriterLockHolder, but theirs behavior and encapsulation slightly differ.

Dear colleagues @jkotas @alpencolt, please take a look.

dotnet-issue-labeler · 2022-04-27T17:06:43Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

danmoseley · 2022-05-05T02:55:38Z

@davidwrighton

davidwrighton

I really like the idea of how this is supposed to work, but I do not believe the locking is not correct. In any case, if the locking is correct, the set of comments describing the locking has not been updated, and is certainly incorrect. @jkotas do you agree?

davidwrighton · 2022-05-05T19:07:35Z

src/coreclr/vm/proftoeeinterfaceimpl.cpp

@@ -1927,13 +1927,6 @@ HRESULT GetFunctionInfoInternal(LPCBYTE ip, EECodeInfo * pCodeInfo)

    if (ShouldAvoidHostCalls())
    {
-        ExecutionManager::ReaderLockHolder rlh(NoHostCalls);


This ReaderLockHolder logic needs to stay in place. GetFunctionInfoInternal from the profiler api may be called while other threads are suspended via OS thread suspension apis, and may be used when alloc/free/new/delete are not permitted to be used. However, the technique you're using for locking looks like it will always successfully acquire, so you don't need to restore the Acquired() api, and call below.

In the following perspective
#68607 (comment)
#68607 (comment)
it is possibble to stop writers here and pass NoHostCalls to the codeman by call below
//TODO

davidwrighton · 2022-05-05T19:09:07Z

src/coreclr/vm/codeman.cpp

-// Returns whether the reader lock is acquired
+    if (count == 1)// h->count == 0
+    {
+        //we are here in relatively rare case


If we are in a NoHostCalls situation, we should not delete anything.

OK, I can leave this code only in wlh destructor (as writers are about to need allocate or delete something). Now this code is both in rlh and wlh destructors.
//TODO

There's no reason to do that. Just disable the code when running in NoHostCalls.

davidwrighton · 2022-05-05T19:12:34Z

src/coreclr/vm/codeman.cpp

@@ -4475,7 +4493,6 @@ BOOL ExecutionManager::IsManagedCodeWithLock(PCODE currentPC)
        GC_NOTRIGGER;
    } CONTRACTL_END;

-    ReaderLockHolder rlh;


Why is it ok to remove this lock?

I see that GetRangeSection will unconditionally take the lock, but it will return a RangeSection, that itself isn't protected from unload. Please restore the original structure of how the locks are created.

As I can understand your comment, the function of rlh there is not for protecting the RS list from possibly inconsistence but instead for protecting an RS using in IsManagedCodeWorker from deletion.

In my proposal, as described in my answer to you at #68607 (comment) rlh and wlh are used only for maintain RS storage consistency, but the schema leaves us an opportunity to restore the RS instance protecting lock structure.
//TODO

davidwrighton · 2022-05-05T19:13:00Z

src/coreclr/vm/codeman.cpp

@@ -4496,7 +4513,6 @@ BOOL ExecutionManager::IsManagedCode(PCODE currentPC, HostCallPreference hostCal
    }

    ReaderLockHolder rlh(hostCallPreference);
-    if (!rlh.Acquired())


By removing the if statement, we are unconditionally going down the failure path, why are you doing this?

As soon as the schema described in my answer below (#68607 (comment)) is correct, there are no reader's locks at all. rlh is needed only for counting references and make proper swaps of reader's array and writer's array. There is no case the "lock" is not acquired: it is always "acquired" either instantly or a few cycles later.

No, that isn't what I'm saying. In this case, line 4515 unconditionally takes the lock, and then as the if statement is removed, the new code path will unconditionally return FALSE, and set *pfFailedReaderLock to TRUE.

I certainly have to remove all the code under the if body, not only if statement.
//TODO

davidwrighton · 2022-05-05T19:17:20Z

src/coreclr/vm/codeman.cpp

-    RangeSection *pCurr = pHead;
-    RangeSection *pLast = NULL;
-
+    ReaderLockHolder rlh;


We need whatever lock we are using to protect not only the rangesection lists, but also the RangeSection instances themselves. This is not an appropriate place to take the lock.

If we become agree on
#68607 (comment)
in perspective of
#68607 (comment)
this annotation becomes resolved.

davidwrighton · 2022-05-05T19:26:14Z

src/coreclr/vm/codeman.cpp

+            iHigh = iMid;
+            iMid = (iHigh + iLow)/2;
+        }
+        else if (addr >= array[iMid+1].LowAddress)


The code here is doing a binary search of the RangeSection list, but I don't see how its protected from a separate thread writing to the sorted list. In particular, if there is an Add going on, which may be performing a memcpy of the list to insert something in the middle, the structure of the reader lock as written will allow a reader to enter, and read while the list is in an inconsistent state.

Separate writing thread uses the copy of an array. Schema is as follows:

Readers can simultaneously perform any number of search runs over reader's copy of an array (I call it reader's array or reader's header (as the meta of an array resides in a header structure). There is only fast interlocked reference counting of how many readers are running. Once a reader leaves the reader array there is fast interlocked decrement of the reference counter. Reading is asymptotically log2 N by the nature of an algorithm or unit by the nature of last_used_index optimization (depending on the load model).

Writers can perform O(N) (due to memcpy of the whole array) simultaneously with any number of readers traversing the reader's array. Writers use the copy of the reader's array (I call it writer's array of writer's header). There are only one writer at a time using writer's array, but O(N) writing is NOT preventing readers from traversing reader's array. Once a writer finishes preparing proper writer's array with all necessary changes, the WriterLockHolder (wlh) goes to it's destructor and performs the substitution of writer's array header pointer to the reader's one. Till now, all old readers still run over the old reader's array and decrement old reader's array reference counter, and all new readers will run over the new reader's array incrementing new reader's reference counter on the start. Once the latest old-reader leaves old-reader-array, rlh destructor returns reader's array header pointer to the writer's slot, and the swap of arrays completes (the wlh destructor can perform same action if there are no old-readers when it substitutes reader's <- writer's header pointers).

So we have almost seamless reader's copy and some writing work in a background.

Ah, I missed that. That highlights the concern I have with the comments, etc. ALL of those need to be updated to accurately describe the new model. Also, recent experience has shown that its much safer to use explicit volatile load/store methods instead, so the particular memory operations can easily be seen and examined.

jkotas · 2022-05-05T19:42:43Z

the set of comments describing the locking has not been updated, and is certainly incorrect. @jkotas do you agree?

I agree. I have double-checked some of the comments made by @davidwrighton and I agree that they highlight real problems.

davidwrighton · 2022-05-09T20:25:02Z

src/coreclr/vm/codeman.h

-    static Volatile<LONG>   m_dwReaderCount;
-    static Volatile<LONG>   m_dwWriterLock;
+    static volatile RangeSectionHandleHeader * m_RangeSectionHandleReaderHeader;
+    static volatile RangeSectionHandleHeader * m_RangeSectionHandleWriterHeader;


Plain volatile is not safe to use within our codebase. It has subtly different meanings on various compilers. Please either use the Volatile type, or use explicit VolatileLoad/VolatileStore apis.

davidwrighton · 2022-05-09T20:25:42Z

// release the writer and yield to wait for the readers to be done.

Comments like this throughout the codebase need to be updated with this change.

Refers to: src/coreclr/vm/codeman.cpp:651 in 5056171. [](commit_id = 5056171, deletion_comment = False)

davidwrighton · 2022-05-09T20:37:12Z

src/coreclr/vm/codeman.cpp

+    int count;
+    RangeSectionHandleHeader *old_rh = (RangeSectionHandleHeader*)m_RangeSectionHandleReaderHeader;
+    h->count = 1; //EM's unit
+    m_RangeSectionHandleReaderHeader = h;


This needs to be a VolatileStore with a barrier.

davidwrighton · 2022-06-03T19:58:52Z

@y-yamshchikov Please let me know when you are ready for me to review this again.

y-yamshchikov · 2022-06-14T16:00:14Z

/azp run "runtime (CoreCLR Product Build Linux x86 checked)"

azure-pipelines · 2022-06-14T16:00:24Z

Commenter does not have sufficient privileges for PR 68607 in repo dotnet/runtime

y-yamshchikov · 2022-06-14T16:03:31Z

@y-yamshchikov Please let me know when you are ready for me to review this again.

@davidwrighton please take a look
you can find exhaustive comment on top of the vm/codeman.h document

alpencolt · 2022-06-22T09:24:33Z

cc @HJLeee @wscho77

davidwrighton

This is not a viable locking strategy.

The ReaderLock isn't safe to take unless the ForbidDeleteLock is held. (The process of entering into the ReaderLock, reads a value, that may be deleted before any other operation occurs)
The ForbidDeletion lock is simply a reader/writer lock. If held, it makes the read side of the system safe, but should use standard reader/writer locks if that is what is wanted.
There appear to be code paths which rely solely on a ReaderLock to protect them. However, by the description, while the reader lock is held, in theory the writer is allowed to replace the current copy of the reader data structure. If this happens twice while the reader is operating, then the reader may be using a copy of the reader array which can be freed (as the ReaderLockHolder only locks the reader array present at holder creation, but other code such as GetRangeSection will read the current reader array, and not the one that was locked.
The ShouldAvoidHostCalls pathway in GetFunctionInfoInternal is not wait-free. That pathway relies on the ability to perform its operations when the rest of the runtime is suspended at any point of code, including while holding locks. The logic you've written does avoid host calls, but it does so by introducing spinning, which is not an improvement, and may deadlock if the other threads are suspended in the wrong spot.

y-yamshchikov · 2022-07-18T11:15:45Z

@davidwrighton

ForbidDeletion lock is designed to protect all deletions of any element from the storage, so it should be held if one can expect deletion. Maybe you mean WriterLockHolder could delete something ReaderLockHolder is trying to read on acquiring, but it surely will not: the reader (and writer) arrays are separated in two parts: one with meta and one with an array itself. Metas (one for reader and one for writer) are never deleted. rlh works only with meta on acquiring.
With ForbidDeletion lock we could block readers only if we want to delete (which seems to be the case on AppDomain unloading) and let readers seamlessly flow in case of ordinary (if it is actually more ordinary to add an RS then to delete) adding. In the proposal where we need to sync only on just a very moment of interchanging pointers to metas (which in turn point to arrays) it seems costly and unnecessary to yield.
Maybe I had not highlighted this in the description. It won't happen twice. Second (call it so) writer will wait till oldest reader leaves and puts it's pointer to writer's slot. As it happens, Second writer becomes new First, and so on. I understand it is hard to pick from code review and it was the main reason I have developed huge unit test mentioned in topic. The payload model for the test was not so subtle to capture issues like you mentioned in (4), but very intensive to detect errors you suspect in (3). If I'm still not clear, please tell me and I'll made an attempt to prove it formally.
OK, it is trivial to return logic like it was making one-shot FDH version (tries one time to enter and returns status).

mangod9 · 2022-08-29T16:12:36Z

Checking in on the current state of this PR. Is this still being considered?

davidwrighton · 2022-08-29T20:01:25Z

I've been busy with getting .NET 7 out the door, and I'm now going to take another deep look at this now that we've forked for .NET 8.

davidwrighton · 2022-10-13T01:37:37Z

I've taken a closer look at this, and while I think the concept is good, the locking mechanism is complex enough that even if it is correct (and I'm not entirely convinced of that, but I don't actually see incorrect behavior anywhere I look) there is no practical way to maintain confidence that it remains correct over time. In addition, while I'm confident the performance is good for the scenarios that your team sees, we have some customers that have many thousands of these range sections, and it is plausible that this scheme will have outsize performance problems in those scenarios. As such, I'd like to investigate different approaches to this that don't require such a monumentally complex to analyze locking scheme. In the next month or so I will try to put together a different approach to the problem that doesn't require such complex analysis. My current thought is to mimic some form of a page table, which will allow us to implement most of the complexity in terms of atomic operations which do not require complex locks.

davidwrighton · 2022-12-12T18:36:39Z

Its taken a while longer than I hoped, but the work to fix the controversial portion of this change is well in progress, and shows excellent characteristics. I'll be closing this for now, but once #79021 is merged, please take another look at your scenario and see what more changes you'd like to have, and we'll review them.

ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 27, 2022

am11 added the area-crossgen2-coreclr label Apr 27, 2022

y-yamshchikov force-pushed the opt-r2r-rs branch 3 times, most recently from 03e04a3 to 5056171 Compare May 4, 2022 11:18

davidwrighton requested changes May 5, 2022

View reviewed changes

ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label May 5, 2022

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label May 6, 2022

davidwrighton reviewed May 9, 2022

View reviewed changes

Probing R2R methods code in caller modules

ac77f07

y-yamshchikov force-pushed the opt-r2r-rs branch 5 times, most recently from 472e200 to ac3d401 Compare June 3, 2022 17:27

Range Sections now stored in sorted array

25e5c2e

y-yamshchikov force-pushed the opt-r2r-rs branch from ac3d401 to 25e5c2e Compare June 6, 2022 10:54

y-yamshchikov force-pushed the opt-r2r-rs branch from 264dd57 to 000a4df Compare June 14, 2022 09:39

RS deletion correction and comments on a whole schema

eab49b9

y-yamshchikov force-pushed the opt-r2r-rs branch from 000a4df to eab49b9 Compare June 14, 2022 13:36

y-yamshchikov requested a review from davidwrighton June 15, 2022 11:43

davidwrighton requested changes Jul 2, 2022

View reviewed changes

ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 2, 2022

JulieLeeMSFT assigned y-yamshchikov Jul 11, 2022

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 18, 2022

davidwrighton self-assigned this Oct 13, 2022

davidwrighton mentioned this pull request Dec 10, 2022

Fix the multicore performance of mapping from an instruction pointer to a MethodDesc #79021

Merged

davidwrighton closed this Dec 12, 2022

ghost locked as resolved and limited conversation to collaborators Jan 11, 2023

Search native code in all R2R images in version bubble #68607

Search native code in all R2R images in version bubble #68607

Conversation

y-yamshchikov commented Apr 27, 2022 • edited Loading

dotnet-issue-labeler bot commented Apr 27, 2022

danmoseley commented May 5, 2022

davidwrighton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

y-yamshchikov May 6, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

y-yamshchikov May 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas commented May 5, 2022

Choose a reason for hiding this comment

davidwrighton commented May 9, 2022

Choose a reason for hiding this comment

davidwrighton commented Jun 3, 2022

y-yamshchikov commented Jun 14, 2022

azure-pipelines bot commented Jun 14, 2022

y-yamshchikov commented Jun 14, 2022

alpencolt commented Jun 22, 2022

davidwrighton left a comment

Choose a reason for hiding this comment

y-yamshchikov commented Jul 18, 2022

mangod9 commented Aug 29, 2022

davidwrighton commented Aug 29, 2022

davidwrighton commented Oct 13, 2022

davidwrighton commented Dec 12, 2022

y-yamshchikov commented Apr 27, 2022 •

edited

Loading

y-yamshchikov May 6, 2022 •

edited

Loading

y-yamshchikov May 11, 2022 •

edited

Loading