Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in MemoryPool #10400

Closed
wants to merge 1 commit into from

Conversation

lingbin
Copy link
Contributor

@lingbin lingbin commented Jul 4, 2024

Fixes #10342

MemoryPool::poolMutex_ is used to protect MemoryPool::children_ in
the following 4 member methods:
addLeafChild()/addAggregateChild()/getChildCount()/visitChildren.

In other words, to avoid deadlock:

  1. First of all, we can't call the above 4 methods while holding
    poolMutex_.
  2. Furthermore, we cannot call methods that use these 4 methods while
    holding poolMutex_;
  3. Finally, there cannot be any direct or indirect calls between
    these 4 methods. (This is the root cause of this problem)
    .
addAggregateChild() ------ will acquire 'poolMutex_'
--> toString() [When duplicate names appear, used to print error messages]
    --> usedBytes()
        --> visitChildren() ------ [When reservedBytes() != 0] will acquire 'poolMutex_' again, **DEADLOCK!!**

So, addAggregateChild() indirectly calls visitChildren() [When
reservedBytes() != 0], causing a deadlock.

To solve this problem, the easiest way is to simplify the error message
printed when duplicate names appear. After all, we only need to know
the name of the parent pool, just like many other places in this class do.

`MemoryPool::poolMutex_` is used to protect `MemoryPool::children_` in
the following 4 member methods:
`addLeafChild()`/`addAggregateChild()`/`getChildCount()`/`visitChildren`.

In other words, to avoid deadlock:
1. First of all, we can't call the above 4 methods while holding
   `poolMutex_`.
2. Furthermore, we cannot call methods that use these 4 methods while
   holding `poolMutex_`;
3. Finally, **there cannot be any direct or indirect calls between
   these 4 methods. (This is the root cause of this problem)**.

```cpp
addAggregateChild() ------ will acquire 'poolMutex_'
--> toString() [When duplicate names appear, used to print error messages]
    --> usedBytes()
        --> visitChildren() ------ [When reservedBytes() != 0] will acquire 'poolMutex_' again, **DEADLOCK!!**
```

So, `addAggregateChild()` indirectly calls `visitChildren()` [When
`reservedBytes() != 0`], causing a deadlock.

To solve this problem, the easiest way is to simplify the error message
printed when duplicate names appear.  After all, we only need to know
the name of the parent pool, just like many other places in this class do.
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 4, 2024
Copy link

netlify bot commented Jul 4, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 9c39dc5
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6686b217602ebb000809266f

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lingbin thanks for the fix % nit!

void* buff = childOne->allocate(kChunkSize);
// Add child when 'reservedBytes != 0', in which case 'usedBytes()' will call
// 'visitChildren()'.
ASSERT_THROW(root->addAggregateChild("child_one"), VeloxRuntimeError);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ASSERT_THROW/VELOX_ASSERT_THROW/

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng merged this pull request in 311e911.

Copy link

Conbench analyzed the 1 benchmark run on commit 311e911d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dead lock when adding child memory pool with already existing name
3 participants