-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How best to implement Linq's iterators. #15289
Comments
Option 6 ruled out, as I can create too many cases where code that handles an exception would become code that fails to do so. It might be worth doing, but the impact is probably worth considering as a separate issue. |
/cc @jaredpar |
I think option 6 is not an option. It is an observable change for sideeffecting GetEnumerator implementations. That is not extremely common, but still possible for GetEnumerator to open a file or grab a lock. We would not want to change when that happens. |
Yes, I already ruled that one out. I haven't had a chance to race the others yet, because it needs a time while my computer is neither turned off nor in such heavy use that could upset the measures, and it tends to be one of the other, but a holiday soon will be a fine time to have comparisons running away while I do other things. |
I tried to make some synthetic benchmarks measuring just overhead of iterator strategies between 1) ThreadID 2)Interlocked 3)enumerator/enumerable not merged. Time-wise the results are very close with ThreadID approach being faster than Interlocked and separate enumerator being yet slightly faster. That is under ideal conditions for combined enumerable/enumerator - when enumerable is iterated just once. The main concern is that in a most trivial case (enumerable wraps a single array) separate iterator results in 14%-23% (32bit, 64bit) theoretical increase in allocated bytes compared to current implementation. That is because while we would have same fields and drop ThreadID, we would also have two objects allocated instead of one each having a header (2 native ptrs). So basically, we will save on synchronization calls, but add some memory management costs (in amortized sense - including GC). These expenses are very hard to compare since they tend to scale differently between scenarios/hardware/platforms. For example on a busy manycore box Interlocked would be more expensive, even more so on ARM. So far I think the "separate enumerator" is the more promising of all because
However I do not yet feel convinced enough, to make changes. |
Ok, we have to get this to some conclusion. Assuming that
I think the above rules out "just separate Enumerable and Enumerator classes". Now we have to decide between CurrentManagedThreadId and InterlockedExchange On one hand we could save an int field, on the other hand CurrentManagedThreadId seems to be very fast since it is just a fetch from the TLS, while the cost of Interlocked is less clear. Some CPU may have pretty heavy implementations. Considering that we do not have a very strong case against CurrentManagedThreadId, I will consider keeping existing implementation as acceptable and close this. If we find other reasons, substantial enough to tip the scales, we can revisit. |
Note: The "async streams" feature is running into a similar question. Based on the discussion above, I will stick with the current pattern (if state machine is stopped and if it came from current thread, then we'll use |
The current approach with
System.Linq.Enumerator
's enumerables and their enumerators copies the approach for compiler-generated iterators; the same instance that serves as the enumerable serves as the result for the first call toGetEnumerator()
, so avoiding an extra allocation.There is a race inherent to this approach; if two threads called
GetEnumerator()
at the same time they could end up with the same object when they should have separate instances. This is resolved by capturingEnvironment.CurrentManagedThreadId
on construction and only usingthis
as the result ofGetEnumerator()
if it matches a second call.This relatively expensive call is of no value if:
GetEnumerator()
is never called (many optimisations skip it in various ways).GetEnumerator()
is repeatedly called (after the first call).GetEnumerator()
is only called once, on another thread (e.g. if a query is constructed before anawait
and enumerated after it).[WIP] Avoid race in GetEnumerator with Interlocked rather than ThreadID check corefx#3313 changes the approach to changing the state with
Interlocked.CompareExchange()
to resolve the race in a different way. This means no call in case 1 and 3 above, but is more expensive in case 2. It's about even in the case of oneGetEnumerator()
at least as some tests with Win64 DNX suggest (CompareExchange
being about twice the case ofCurrentManagedThreadId
and being called half as often).In the discussion on that pull request, @VSadov points out that the optimisation of avoid an allocation may not be as valuable as it once was in any case. Some very limited experimentation suggests that indeed just creating a new object on every call to
GetEnumerator()
is a clear win.There are a few possible variants here. We can have the same classes that server as both enumerable and enumerator, or separate classes (some reduction in state held, some increase in jitted code). If separate classes are used, internal calls to
GetEnumerator()
can be moved to this object's construction, further reducing state held in some cases, but being an observable change if that call throws an exception.A completely different possibility is to use the current approach, but avoid calling CurrentManagedThreadId is some cases.
Some experimentation is in order.
Some possible approaches:
CurrentManagedThreadId
call, https://github.com/JonHanna/corefx/tree/fewer_threadid_callsInterlocked.CompareExchange()
: https://github.com/JonHanna/corefx/tree/defer_obtaining_threadidGetEnumerator()
: https://github.com/JonHanna/corefx/tree/fresh_enumeratorsUse separate enumerators, shiftingGetEnumerator()
calls (observable change) https://github.com/JonHanna/corefx/tree/separate_enumeratorsI'm planning to do some comparisons with these variants, but I'm opening this issue ahead of that for suggestions as to situations the comparisons must cover.
In particular:
Is 6 (with the observable change) completely out of the question?
What tests would one expect avoiding an allocation to do better than not avoiding it? (First brief experiments suggest it's always a lose, but I'm wary of removing what was clearly intended as an optimisation if I can't see where it succeeded in optimising, though maybe it's just a matter of older CLR versions).
The text was updated successfully, but these errors were encountered: