Deadlock in scheduler.h #231
Comments
An update: I will provide an example here showing this deadlock. I found that this race condition becomes more likely as the number of threads increases. In this case, syscalls are invoked frequently through the join-leave implementation, i.e. syscallLeave(). As the thread count goes up, the number of scheduled sleep syscalls grows dramatically, so even one more thread makes a sleep-syscall-related race condition much more likely. For example, single-thread moses produces 6 sleep events out of 19 scheduled events, while 4-thread moses produces 1182 scheduled events, 1166 of which are sleep syscalls. The same trend holds for other apps, such as a key-value store app.
In the Scheduler …
The previous ThreadFini() does not handle SLEEPING threads in finish(). This change lets many-thread simulations run to completion; otherwise, the simulation deadlocks when running TailBench apps with only 2-4 threads.
@gaomy3832 Thank you for your reply! In short, I think it may be a corner-case situation where Thread 2 is waiting for Thread 1 to release …. As you mention, the reason Thread 1 is waiting is that the current …
I will show the outputs here to explain why:

[S 6] [G 393216] ***Jason Print: finish() in scheduler.h is called
[S 6] [G 393219] ***Jason Print: finish() in scheduler.h is called
[H] WARN: Stalled for 20 secs so far

Basically, this example shows exactly what happens in the scheduler when finish() is called. When the thread (gid 393219) is asked to finish, the strange situation is that it is not running in any scheduler context (i.e. on any core). Hopefully this shows an example of how the deadlock I explained before happens. Thank you,
I am not 100% sure, but it seems that …
The thread that is called to finish is in the sleep state. Please review the commit I added 7 days ago. I solved this corner-case deadlock by handling SLEEPING threads in finish(). Note that G 393218 (Thread 2), G 393222 (Thread 6), G 393223 (Thread 7), and G 393224 (Thread 8) are all finished while they are in the sleep state, and with my modified code the simulation runs to successful completion.

[S 6] [G 393222] leave function ----- jz: ----- Inserted in sleepQueue, current sleepQueue size is: 4
[S 6] State: 0o ___ 393220r ___ 393219r ___ ___ ___ ___ ___ ___ 393216o
[S 6] State: 0o ___ 393220r ___ 393219r ___ ___ ___ ___ ___ ___ ___
################################
[S 6] State: 0o ___ 393220r ___ 393219r ___ ___ ___ ___ ___ ___ ___
[S 6] State: 0o ___ 393220r ___ 393219r ___ ___ ___ ___ ___ ___ ___
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
################################
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
################################
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
################################
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
[S 6] State: 0o ___ 393220r ___ ___ ___ ___ ___ ___ ___ ___ ___
[S 6] Finished, code 0
[S 0] Finished, code 0
I had the same issue when running TailBench on zsim. I fixed it by adding a check in finish() in scheduler.h, inserted after the if statement "if (th->state == RUNNING)" and before the assertion "assert_msg(th->state == STARTED.....)".
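A minimal sketch of what such a check could look like, assuming the stock scheduler.h thread states (RUNNING, SLEEPING, BLOCKED, ...) and the sleepQueue list that holds SLEEPING threads; the exact change is in the commit and PR #232 mentioned in this thread:

    // Sketch only, inside Scheduler::finish(): placed after the
    // "if (th->state == RUNNING)" block and before the assert_msg() on th->state.
    if (th->state == SLEEPING) {
        // A thread that exits while in a patched sleep syscall is never woken up
        // by the per-phase callback, so finish() must pull it out of the sleep
        // queue itself (assumes sleepQueue supports remove(), as zsim's InList does)...
        sleepQueue.remove(th);
        // ...and move it to a state that the assertion below accepts (BLOCKED here,
        // as an assumption).
        th->state = BLOCKED;
    }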
Yes. This is the simple fix I was looking for in PR #232. Good to know it works.
I am running TailBench http://tailbench.csail.mit.edu/ on a 12-core simulated system.
When running moses, single-thread and 2-thread runs complete the simulation successfully. With >= 4 worker threads, an assertion fires in scheduler.h, an "ACCESS_INVALID_ADDRESS" error follows, and then the simulation deadlocks. Other TailBench apps show similar problems once the number of threads reaches 2 to 4. It seems there might be some hidden race conditions when simulating multi-threaded, syscall-intensive apps, as opposed to traditional benchmarks such as SPLASH-2/PARSEC.
I confirmed that the TailBench apps’ pthread-based implementation is thread-safe on real servers, where they scale to 20+ threads, so the problem is not due to the app implementation.
It also does not result from improper configuration settings (#97), thread overcommit in the simulated system (#44), a too-short fake leave time on an overcommitted host machine (#15), or mismatched memory timing configuration (#25): I ran the corresponding tests and scaled up to 64 simulated cores, but the same problem persists.
I also tried different virtual memory configurations in the Linux kernel, and it seems the "ACCESS_INVALID_ADDRESS" error has nothing to do with address space exceptions.
Disabling sim.deadlockDetection, as suggested in #172, also does not help. The default 130 seconds of deadlock detection is more than enough for our 4 threads, unlike the case with 1024 worker threads (and a lot of fake leaves) in #26.
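For reference, that test is a one-line change in the zsim config file; a minimal sketch in libconfig syntax, assuming the harness reads sim.deadlockDetection as a boolean:

    sim = {
        // Assumed effect: with this set to false, the watchdog only warns about
        // stalls instead of aborting the simulation.
        deadlockDetection = false;
    };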
Has anyone tried TailBench in zsim before and/or encountered similar problems?
Jason