Use after free in Michael-Scott queue? #238

ghost · 2018-11-27T14:58:42Z

Queues that suffer from the problem:

MsQueue
SegQueue
channel::unbounded() (see Potential subtle unsafe memory reclaim #237, thanks to @Pslydhh)
Queue in crossbeam-epoch

Consider the following scenario in MsQueue...

Suppose the current epoch is E. The queue is empty, meaning head and tail are pointing to the sentinel node (let's call it node0).
In other words, head = node0, tail = node0, node0.next = null.

The following happens next...

Thread T1 gets pinned in epoch E and pushes a new value into the queue by allocating a new node node1 and then it does node0.next.cas(null, node1). The next operation it needs to do is tail.cas(node0, node1), but the thread gets yielded first so tail is still node0.

The queue is now: head = node0, tail = node0, node0.next = node1, node1.next = null.

Thread T2 gets pinned in epoch E and pops a value from the queue with head.cas(node0, node0.next). The CAS succeeds and head becomes node1. Then we call defer_destroy(node0), which means node0 becomes garbage marked with epoch E. The thread then exits.

The queue is now: head = node1, tail = node0, node0.next = node1, node1.next = null.

The global epoch gets advanced to E+1.

Thread T3 gets pinned in epoch E+1 and attempts to push a new value into the queue. It allocates a new node called node2, loads the tail pointer and gets node0. Then, it would attempt to do node0.next.cas(null, node2), but first it yields...

Thread T1 resumes, sets tail to node1, and exits.

The queue is now: head = node1, tail = node1, node1.next = null.

The global epoch is advanced to E+2. Garbage from epoch E is collected and so node0 gets deallocated.

Thread T3 resumes and attempts to continue with node0.next.cas(null, node2), but node0 has been destroyed. This is use-after-free!

To solve this problem, I propose we just relax the invariant used by defer.

Currently, the invariant is: We may use defer to destroy an object only if it will become inaccessible when all active guards get dropped.

I propose we change it to: We may use defer to destroy an object only if it will become inaccessible when all active guards and future guards overlapping with them get dropped.

In other words, instead of saying "call defer if the object is inaccessible" we'd say "call defer if it is inaccessible or if it will become inaccessible before all active guards get dropped".

The PR fixing this would simply change the global_epoch - garbage_epoch >= 2 invariant to global_epoch - garbage_epoch >= 3.

cc @Pslydhh @jeehoonkang

The text was updated successfully, but these errors were encountered:

Pslydhh · 2018-11-28T05:50:48Z

It seems like based on the invariant:

If a thread accesses an unlinked object, it accesses it at most once.
So in the next epoch, it can't.

This looks right to me, now we no need to modify the source code to avoid such errors.

blitzerr · 2019-08-04T06:21:02Z

@stjepang If its okay, I would like to work on this

jeehoonkang · 2019-08-04T06:30:44Z

I think this issue is related to #221, especially "Using small epoch numbers" in #221 (comment) . This can also solve the problem of epoch wraparound. I'll write down more details soon.

blitzerr · 2019-08-04T14:59:54Z

Thanks a lot @jeehoonkang. It will be much appreciated as I am new to rust and to crossbeam. I am very interested in concurrency and crossbeam. Thanks again.

jeehoonkang · 2019-08-09T05:03:51Z

@stjepang Thank you for reporting this. It seems much deeper than I first expected... I'd like to spend a little bit more thinking on this issue...

blitzerr · 2019-08-21T15:02:00Z

Any updates @jeehoonkang ? Anything I can start looking at ?

…

On Thu, Aug 8, 2019, 22:03 Jeehoon Kang ***@***.***> wrote: @stjepang <https://github.com/stjepang> Thank you for reporting this. It seems much deeper than I first expected... I'd like to spend a little bit more thinking on this issue... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#238?email_source=notifications&email_token=AG3EF4LODVPKUVNQQ75CMGTQDT3DXA5CNFSM4GGWVPYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD35THHI#issuecomment-519779229>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AG3EF4O3RWARAYKS5QKV54LQDT3DXANCNFSM4GGWVPYA> .

blitzerr · 2019-08-27T19:36:23Z

ping @jeehoonkang @stjepang any mentoring notes on the issue ?

tomtomjhj · 2019-08-28T15:24:47Z

Hi @blitzerr. I discussed this problem with @jeehoonkang and here is the summary.

TL;DR;

It's enough to tag the block with the local epoch in which it was defer_destroy'd. Using the global epoch is fine too, but it's an overkill.
If the garbage is tagged with E, it can be free'd in E+3.
We've already implemented above changes as a part of our recent project, along with many other things. We are planning to merge some of them into crossbeam upstream. The paper summarizes hazard pointer and EBR (with E+3 reclamation scheme as described above) and proposes a new scheme that integrates HP and EBR.

Details

I'll use some figures and notations from our paper.
It's basically a re-statement of the example @stjepang gave.

Requirement for retirement (`defer_destroy`) under "free when `current_epoch - garbage_local_epoch >= 3` scheme"

If a block b is retired in epoch E, then b should be unreachable in epoch >= E+2. That is, b should be unlinked from the shared memory (all references to b are removed) before the global epoch advances to E+2.

According to the figure 5 in the paper, use-after-free cannot occur if this condition is met. In fact, this requirement is just a direct description of the figure.

Note that Requirement 2.1 from the paper implies this requirement. Requirement 2.1 can be used for proving safety of some algorithm under EBR. However, Requirement 2.1 doesn't apply to MSQueue as you can see in the example by stjepang. (Queue::pop_internal fails to completely unlink the head from the shared memory if self.head and self.tail coincide.) So we have to show that the more general requirement presented above holds for MSQueue.

MSQueue satisfies the requirement

The only interesting situation is the setting where defer_destroy is called on head (snapshot of self.head in pop) when self.head and self.tail coincide, because this is the only case that doesn't satisfy Requirement 2.1.
Since it seems that the example stjepang gave is the only possible case, I'll use the notations from it.

Claim: All references to node0 in shared memory are removed before the global epoch advances to E+2.

In order for the global epoch to advance to E+2, T1 should quit epoch E.
In order for T1 to quit E, T1 should finish executing let _ = self.tail.compare_and_set(onto, new, Release, guard); in push_internal. The result doesn't matter since both cases imply that self.tail no longer points to node0.

Remark

I'm sorry for the late reply.
I think there might be a general way to characterize an algorithm that is compatible with EBR with "free when current_epoch - garbage_local_epoch >= n" scheme.
Please let us know if you have any questions or opinions about this issue and our paper!

blitzerr · 2019-09-13T03:30:06Z

Thanks a lot @tomtomjhj for the link to the paper. I will read it and get back to you if I have questions, which I am sure I will have many :)

tomtomjhj · 2019-11-20T15:36:16Z

@stjepang We actually found out that free at E+3 is still buggy because of a similar scenario in which the thread 2 first pins at E and sleeps until the thread 1 pins at E+1 and gets yielded just before the pushing CAS as in your scenario. So in this scenario the actual sequence of modifications of the queue by thread 1 and 2 is identical but the epoch in which the "real unlinking" happens at E+1 instead of E. In this case, a thread that pins at E+2 may access the retired node0 via tail. So even if deallocation of node0 is deferred to E+3, there is no happens-before relation between this access and deallocation.

The real problem is that the implementation doesn't faithfully follow the original MSQueue algorithm. The original pop function advances tail pointer in cases like this scenario so that defer_destroy is called only for nodes that are actually unlinked. This is what other implementations do, e.g. libcds.
Another approach is to advance the tail pointer if the head CAS succeeds.

twmb · 2020-01-31T19:53:17Z

To be clear, it sounds like the four queues mentioned in the open of this issue are all effectively potentially unsafe. If that is true, I think it's worth it to disclose this information in the documentation for these queues. My understanding was that SegQueue was safe for production, but I'm not so sure now that I know there is a potential use after free issue with it.

cynecx · 2020-01-31T21:26:57Z

Afaict, crossbeam-channel and crossbeam-queue aren't affected by this because they don't depend on crossbeam-epoch anymore, no?

twmb · 2020-01-31T22:50:00Z

@cynecx thanks for pointing that out, I think that's true. It doesn't look like anything in this repo is dependent on crossbeam_epoch anymore! It may be worth it to change the title of this issue.

cynecx · 2020-02-01T17:53:21Z

@twmb Well, actually there is, it's crossbeam-deque ;).

pop() must completely unlink the popped node from the data structure before it calls defer_destroy() to prevent use-after-free. closes crossbeam-rs#238

466: fix use-after-free in crossbeam-epoch/sync/queue r=jeehoonkang a=tomtomjhj `pop()` must completely unlink the popped node from the shared memory before it calls `defer_destroy()` to prevent use-after-free. This implementation is based on the variation by Doherty et al. where the `head == tail` check is done after a successful CAS, which can be slightly more efficient than the original MSQueue. closes #238 Co-authored-by: Jaehwang Jerry Jung <tomtomjhj@gmail.com>

Shnatsel · 2020-02-15T11:34:44Z

Use-after-free could be an exploitable security vulnerability. Please make a semver-compatible release with a fix and file a security advisory so that dependent crates would know to upgrade to a fixed version.

416: Require three epoch advancements for deallocation r=jeehoonkang a=tomtomjhj This patch implements the fix discussed in #238 (comment). * tag the Bag with the local epoch E * free at E+3 Co-authored-by: Jeehoon Kang <jeehoon.kang@sf.snu.ac.kr>

pop() must completely unlink the popped node from the shared memory before it calls defer_destroy() to prevent use-after-free. This implementation is based on the variation by Doherty et al. where the `head == tail` check is done after a successful CAS, which can be slightly more efficient than the original version. closes crossbeam-rs#238

ghost added bug crossbeam-channel crossbeam-epoch crossbeam labels Nov 27, 2018

jeehoonkang mentioned this issue Aug 9, 2019

crossbeam ebr + michael scott queue의 버그 확인 kaist-cp/smr-benchmark#24

Closed

tomtomjhj mentioned this issue Sep 2, 2019

requirements 더 일반화 kaist-cp/smr-benchmark#28

Closed

tomtomjhj mentioned this issue Sep 9, 2019

Require three epoch advancements for deallocation #416

Merged

tomtomjhj added a commit to tomtomjhj/crossbeam that referenced this issue Feb 3, 2020

fix use-after-free in queue

6381337

pop() must completely unlink the popped node from the data structure before it calls defer_destroy() to prevent use-after-free. closes crossbeam-rs#238

tomtomjhj mentioned this issue Feb 3, 2020

fix use-after-free in crossbeam-epoch/sync/queue #466

Merged

bors bot closed this as completed in 2618830 Feb 10, 2020

ming535 mentioned this issue Mar 11, 2022

[Question] epoch + 3 or epoch + 2? kaist-cp/cs431#597

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use after free in Michael-Scott queue? #238

Use after free in Michael-Scott queue? #238

ghost commented Nov 27, 2018 •

edited by ghost

Loading

Pslydhh commented Nov 28, 2018

blitzerr commented Aug 4, 2019

jeehoonkang commented Aug 4, 2019

blitzerr commented Aug 4, 2019

jeehoonkang commented Aug 9, 2019

blitzerr commented Aug 21, 2019 via email

blitzerr commented Aug 27, 2019

tomtomjhj commented Aug 28, 2019 •

edited

Loading

blitzerr commented Sep 13, 2019

tomtomjhj commented Nov 20, 2019 •

edited

Loading

twmb commented Jan 31, 2020

cynecx commented Jan 31, 2020 •

edited

Loading

twmb commented Jan 31, 2020

cynecx commented Feb 1, 2020

Shnatsel commented Feb 15, 2020

Use after free in Michael-Scott queue? #238

Use after free in Michael-Scott queue? #238

Comments

ghost commented Nov 27, 2018 • edited by ghost Loading

Pslydhh commented Nov 28, 2018

blitzerr commented Aug 4, 2019

jeehoonkang commented Aug 4, 2019

blitzerr commented Aug 4, 2019

jeehoonkang commented Aug 9, 2019

blitzerr commented Aug 21, 2019 via email

blitzerr commented Aug 27, 2019

tomtomjhj commented Aug 28, 2019 • edited Loading

TL;DR;

Details

Requirement for retirement (defer_destroy) under "free when current_epoch - garbage_local_epoch >= 3 scheme"

MSQueue satisfies the requirement

Remark

blitzerr commented Sep 13, 2019

tomtomjhj commented Nov 20, 2019 • edited Loading

twmb commented Jan 31, 2020

cynecx commented Jan 31, 2020 • edited Loading

twmb commented Jan 31, 2020

cynecx commented Feb 1, 2020

Shnatsel commented Feb 15, 2020

ghost commented Nov 27, 2018 •

edited by ghost

Loading

tomtomjhj commented Aug 28, 2019 •

edited

Loading

Requirement for retirement (`defer_destroy`) under "free when `current_epoch - garbage_local_epoch >= 3` scheme"

tomtomjhj commented Nov 20, 2019 •

edited

Loading

cynecx commented Jan 31, 2020 •

edited

Loading