
Pending run_pending_tasks may cause busy loop in schedule_write_op #412

Closed
sticnarf opened this issue Apr 9, 2024 · 14 comments · Fixed by #415
Labels: bug


sticnarf commented Apr 9, 2024

It's not so easy to create a minimal reproducible example. I will try to describe my code and my findings.

As a workaround for #408, I run run_pending_tasks manually when the cache size exceeds the capacity by a wide margin. However, I occasionally find my Tokio worker threads stuck in the busy loop of schedule_write_op.

I examined what happens with gdb:

  • The write op channel is full. This is not surprising because I'm doing a lot of cache insertions.
  • None of the worker threads can acquire the current_task mutex in try_run_pending_tasks.
  • The internal state of the current_task mutex is 2, which means the mutex is not locked but there is a starved acquirer waiting. In this state, try_lock fails.

I further added some logs to moka and found that my manual call to run_pending_tasks tries to acquire the current_task mutex but never actually acquires it. This suggests that the starved mutex acquirer is the manual run_pending_tasks task.

There are a couple of reasons why the run_pending_tasks task is never woken up to acquire the lock:

  • All of the Tokio worker threads are running the busy loop in schedule_write_op, so the runtime never schedules the run_pending_tasks task to run.
  • I also combine the run_pending_tasks future with other insert futures in a join_all future. So, while the join_all future is in the busy loop, it's also impossible for the run_pending_tasks future to run (see the sketch below).
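
To make the pattern concrete, my usage looks roughly like the sketch below (simplified, with made-up key/value types and a made-up "wide margin" threshold; not my actual code):

use futures_util::future::{join_all, BoxFuture, FutureExt};
use moka::future::Cache;

// Simplified sketch of the usage pattern described above (not the actual
// application code). A manual `run_pending_tasks()` call is combined with
// ordinary inserts in a single `join_all` future.
async fn insert_batch(cache: &Cache<String, String>, batch: Vec<(String, String)>) {
    let mut ops: Vec<BoxFuture<'_, ()>> = batch
        .into_iter()
        .map(|(k, v)| {
            let cache = cache.clone();
            async move { cache.insert(k, v).await }.boxed()
        })
        .collect();

    // Workaround for #408: run maintenance manually when the cache size
    // exceeds the capacity by a wide margin (the factor of 2 is illustrative).
    let max = cache.policy().max_capacity().unwrap_or(u64::MAX);
    if cache.entry_count() > max.saturating_mul(2) {
        ops.push(cache.run_pending_tasks().boxed());
    }

    // If the insert futures start busy-looping in `schedule_write_op`, this
    // combined future never yields long enough for the starved
    // `run_pending_tasks` acquirer to make progress.
    join_all(ops).await;
}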

sticnarf commented Apr 9, 2024

Because run_pending_tasks is parked before acquiring the current_task mutex, it never enters do_run_pending_tasks, and maintenance_task_lock is never acquired.

So, this line doesn't yield to break the busy loop.

let _ = maintenance_task_lock.read().await;

It seems to me that current_task can play a similar role to maintenance_task_lock. I'm wondering whether it would be a good idea to remove maintenance_task_lock and change current_task to an RwLock. Then, we could call let _ = current_task.read().await after spinning to yield.
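
A rough sketch of the idea (struct and method names are made up and do not match moka's internals):

use async_lock::RwLock;

// Rough sketch of the proposal only. `current_task` becomes an `RwLock`:
// the maintenance run holds the write lock while it works, and the spinning
// writers await a read lock, which both waits for an in-flight maintenance
// run and gives the executor a chance to run other tasks.
struct HousekeeperSketch {
    current_task: RwLock<()>,
}

impl HousekeeperSketch {
    async fn run_pending_tasks_sketch(&self) {
        let _guard = self.current_task.write().await;
        // ... do_run_pending_tasks() would run here while the write lock is held ...
    }

    async fn schedule_write_op_sketch(&self, mut try_enqueue: impl FnMut() -> bool) {
        loop {
            if try_enqueue() {
                return;
            }
            // Replaces `let _ = maintenance_task_lock.read().await;`:
            // awaiting the read lock on `current_task` is the yield point
            // after spinning.
            let _ = self.current_task.read().await;
        }
    }
}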

sticnarf added a commit to sticnarf/moka that referenced this issue Apr 9, 2024

tatsuya6502 commented Apr 9, 2024

Hi. Thank you for reporting the issue. So, if I understand correctly, (1) unlocking maintenance_task_lock before (2) unlocking current_task causes all concurrent insert calls to go into a busy loop immediately after (1), which prevents the manual run_pending_tasks call from being scheduled by Tokio?

I'm wondering whether it would be a good idea to remove maintenance_task_lock and change current_task to an RwLock. Then, we could call let _ = current_task.read().await after spinning to yield.

Can you please try it by modifying moka's code locally?

When I implemented the loop in schedule_write_op, what I really wanted was tokio::task::yield_now. However, moka does not depend on any async runtime (Tokio, async_std, etc.), so I had to use a trick with async_lock::RwLock (maintenance_task_lock) instead.
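
For illustration only (this is not what moka does), a runtime-agnostic yield can be hand-rolled as a future that returns Pending once and immediately wakes itself; whether the task really goes to the back of the run queue still depends on the executor's scheduler:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// A runtime-agnostic "yield once" future, the kind of thing
// `tokio::task::yield_now` provides, written without depending on any
// particular executor. It returns `Pending` once and immediately wakes
// itself, so the executor can poll other tasks (such as a starved
// `run_pending_tasks` call) before this one is polled again.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// Usage inside a retry loop would simply be: `yield_now().await;`
fn yield_now() -> YieldNow {
    YieldNow { yielded: false }
}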

@tatsuya6502
Member

Oh, I realized that you added yield_now on your fork: sticnarf@8352623. Yeah, that is what I wanted (but it is not an option for moka because it would require depending on Tokio).


sticnarf commented Apr 9, 2024

Yes, I have tried inserting a yield_now before spinning. After that, I haven't seen the busy loop occur again.

@tatsuya6502
Member

That's great!

I still do not want to add an async runtime dependency to moka. So, I think I will make the busy loop in schedule_write_op occasionally call try_run_pending_tasks.


sticnarf commented Apr 9, 2024

I still do not want to add an async runtime dependency to moka. So, I think I will make the busy loop in schedule_write_op occasionally call try_run_pending_tasks.

Do you mean sometimes call run_pending_tasks instead of the try_ version?

schedule_write_op already calls try_run_pending_tasks through apply_reads_writes_if_needed. But it never succeeds because try_lock fails while there is a starved acquirer waiting.

@tatsuya6502
Member

schedule_write_op already calls try_run_pending_tasks through apply_reads_writes_if_needed. But it never succeeds because try_lock fails while there is a starved acquirer waiting.

I see. So, without being able to call yield_now in the busy loop, I think we should never call lock on current_task. If we only call try_lock, it will be okay, because try_lock never creates a starved acquirer.
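
A rough sketch of that direction (names and structure are made up; this is not the actual patch):

use async_lock::Mutex;

// Sketch only: the retry loop in `schedule_write_op` would only ever use
// `try_lock` on `current_task`, so it cannot park a waiter and create the
// starved-acquirer state that makes every later `try_lock` fail.
async fn schedule_write_op_sketch(
    current_task: &Mutex<()>,
    mut try_send_op: impl FnMut() -> bool,
) {
    let mut spins = 0u32;
    loop {
        if try_send_op() {
            return;
        }
        // Occasionally try to run the pending tasks ourselves; `try_lock`
        // never blocks and never registers a starved waiter.
        if spins % 8 == 0 {
            if let Some(_guard) = current_task.try_lock() {
                // ... do_run_pending_tasks() would run here, draining the
                // write op channel so `try_send_op` above can succeed ...
            }
        }
        spins = spins.wrapping_add(1);
        // Some form of yielding is still needed here so that other tasks
        // can make progress (see the discussion above).
    }
}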

I will try to reproduce the problem and try some ideas for a fix. I run mokabench before every release. Although the write op channel often gets full during a mokabench run, I have never hit the busy loop problem. I think this is because mokabench never calls run_pending_tasks (which uses lock); it only lets the insert method call try_run_pending_tasks (which uses try_lock).

I will modify mokabench to do something like the following to see if it reproduces the problem:

I also combine the run_pending_tasks future with other insert futures in a join_all future.


tatsuya6502 commented Apr 10, 2024

@sticnarf

I created a potential fix for the busy loop issue. Can you please try it if you have time?

[dependencies]
moka = { git = "https://github.com/moka-rs/moka", branch = "avoid-async-scheduler-busy-loop", features = ["future"] }

If you do not have time, that is okay. I have not been able to reproduce the issue using mokabench, but I will continue trying.


tatsuya6502 commented Apr 10, 2024

I have not been able to reproduce the issue using mokabench, but I will continue trying.

I found that the following program can reproduce the problem. I ran it on a MacBook Air with an Apple M2 chip.

Cargo.toml

[dependencies]
moka = { version = "0.12.5", features = ["future"] }
# moka = { git = "https://github.com/moka-rs/moka", branch = "avoid-async-scheduler-busy-loop", features = ["future"] }

futures-util = "0.3.30"
rand = "0.8.5"
rand_pcg = "0.3.1"
tokio = { version = "1.37.0", features = ["rt-multi-thread", "macros"] }

src/main.rs

use moka::future::Cache;
use rand::{distributions::Uniform, prelude::*};
use rand_pcg::Pcg64;

const NUM_TASKS: usize = 48;
const NUM_KEYS: usize = 500_000;
const NUM_OPERATIONS_PER_TASK: usize = 1_000_000;
const NUM_INSERTS_PER_STEP: usize = 16;
const MAX_CAPACITY: u64 = 100_000;

// Use a small number of the worker threads to make it easy to reproduce the problem.
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let cache = Cache::builder().max_capacity(MAX_CAPACITY).build();

    // Spawn `NUM_TASKS` tasks, each performing `NUM_OPERATIONS_PER_TASK` operations.
    let tasks = (0..NUM_TASKS).map(|task_id| {
        let cache = cache.clone();

        tokio::spawn(async move {
            let mut rng = Pcg64::seed_from_u64(task_id as u64);
            let distr = Uniform::new_inclusive(0, NUM_KEYS - 1);

            // Each task will perform `NUM_OPERATIONS_PER_TASK` operations.
            for op_number in 0..NUM_OPERATIONS_PER_TASK {
                // Each task will insert `NUM_INSERTS_PER_STEP` entries per operation in
                // parallel, and call `run_pending_tasks()` if the random number `n` is 0.
                let ns = (&mut rng)
                    .sample_iter(distr)
                    .take(NUM_INSERTS_PER_STEP)
                    .collect::<Vec<_>>();

                let ops = ns
                    .into_iter()
                    .map(|n| {
                        let key = format!("key-{}", n);
                        let value = format!("value-{}", n);
                        let cache = cache.clone();

                        async move {
                            cache.insert(key, value).await;
                            if n == 0 {
                                cache.run_pending_tasks().await;
                            }
                        }
                    })
                    .collect::<Vec<_>>();

                futures_util::future::join_all(ops).await;

                if (op_number + 1) % 1000 == 0 {
                    println!("Task {:2} - {:7}", task_id, op_number + 1);
                    // Give other tasks some chances to run.
                    tokio::task::yield_now().await;
                }
            }

            println!("Task {:2} - {:7}", task_id, NUM_OPERATIONS_PER_TASK);
        })
    });

    // Run all tasks and wait for them to complete.
    futures_util::future::join_all(tasks).await;

    cache.run_pending_tasks().await;
    println!("Done! Cache size: {}", cache.entry_count());
}

@tatsuya6502
Member

Just to be sure, I ran the above program on a Linux x86_64 PC with an Intel Core i7-12700F, and I got the same result.

@tatsuya6502
Member

I merged #415 into the main branch and am going to delete the topic branch. Please use the main branch from now on.

[dependencies]
moka = { git = "https://github.com/moka-rs/moka", branch = "main", features = ["future"] }

I will try to release v0.12.6 this coming weekend.

@tatsuya6502
Member

I published moka@v0.12.6 to crates.io.

Thanks again for reporting the bug. Your analysis really helped me reproduce and fix it.

@sticnarf
Author

Sorry, I didn't have time to try the new fix until today. I can confirm that the problem no longer appears in my application's tests with moka 0.12.6.

Thank you for the quick fix!

@tatsuya6502
Member

Thank you for trying out moka 0.12.6 while you are busy!
