Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mlir] Optimize ThreadLocalCache by removing atomic bottleneck (attempt #3) #93315

Merged
merged 1 commit into from
Jun 21, 2024

Commits on Jun 21, 2024

  1. [mlir] Optimize ThreadLocalCache by removing atomic bottleneck

    The ThreadLocalCache implementation is used by the MLIRContext (among
    other things) to try to manage thread contention in the StorageUniquers.
    There is a bunch of fancy shared pointer/weak pointer setups that
    basically keeps everything alive across threads at the right time, but a
    huge bottleneck is the `weak_ptr::lock` call inside the `::get` method.
    
    This is because the `lock` method has to hit the atomic refcount several
    times, and this is bottlenecking performance across many threads.
    However, all this is doing is checking whether the storage is
    initialized. Importantly, when the `PerThreadInstance` goes out of
    scope, it does not remove all of its associated entries from the
    thread-local hash map (it contains dangling `PerThreadInstance *` keys).
    The `weak_ptr` also allows the thread local cache to synchronize with
    the `PerThreadInstance`'s destruction:
    
    1. if `ThreadLocalCache` destructs, the `weak_ptr`s that reference its
       contained values are immediately invalidated
    2. if `CacheType` destructs within a thread, any entries still live are
       removed from the owning `PerThreadInstance`, and it locks the
       `weak_ptr` first to ensure it's kept alive long enough for the
       removal.
    
    This PR changes the TLC entries to contain a `shared_ptr<ValueT*>` and a
    `weak_ptr<PerInstanceState>`. It gives the `PerInstanceState` entries a
    `weak_ptr<ValueT*>` on top of the `unique_ptr<ValueT>`. This enables
    `ThreadLocalCache::get` to check if the value is initialized by
    dereferencing the `shared_ptr<ValueT*>` and check if the contained
    pointer is null. When `PerInstanceState` destructs, the values inside
    the TLC are written to nullptr. The TLC uses the
    `weak_ptr<PerInstanceState>` to satisfy (2).
    
    (1) is no longer the case. When `ThreadLocalCache` begins destruction,
    the `weak_ptr<PerInstanceState>` are invalidated, but not the
    `shared_ptr<ValueT*>`. This is OK: because the overall object is being
    destroyed, `::get` cannot get called and because the
    `shared_ptr<PerInstanceState>` finishes destruction before freeing the
    pointer, it cannot get reallocated to another `ThreadLocalCache` during
    destruction. I.e. the values inside the TLC associated with a
    `PerInstanceState` cannot be read during destruction. The most important
    thing is to make sure destruction of the TLC doesn't race with the
    destructor of `PerInstanceState`. Because `PerInstanceState` carries
    `weak_ptr` references into the TLC, we guarantee to not have any
    use-after-frees.
    Mogball committed Jun 21, 2024
    Configuration menu
    Copy the full SHA
    899b32c View commit details
    Browse the repository at this point in the history