
fix(stdlib): properly detect tokio runtime in dns_lookup #882

Merged (8 commits) on Jun 17, 2024

Conversation

@esensar (Contributor) commented Jun 7, 2024

No description provided.

@esensar (Contributor Author) commented Jun 7, 2024

When testing dns_lookup in Vector in a real environment, we got this error:

thread 'vector-worker' panicked at
/cargo/registry/src/index.crates.io-6f17d22bba15001f/domain-0.10.0/src/resolv/stub/mod.rs:287:17:
Cannot start a runtime from within a runtime. This happens because a
function (like `block_on`) attempted to block the current thread while
the thread is being used to drive asynchronous tasks.
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace

This happens because VRL is called on a tokio-managed thread. The solution here was to move the call to another thread, because internally the domain crate implements its blocking calls by relying on the async implementation.

Let me know if you have a better solution for this, since spawning a thread for each lookup might be too much.
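
A minimal sketch of the detection-plus-thread approach described above (blocking_lookup is a hypothetical stand-in for the call into the domain crate's stub resolver):

use std::thread;

use tokio::runtime::Handle;

// Hypothetical stand-in for the blocking resolver call into the `domain` crate.
fn blocking_lookup(host: String) -> Result<String, String> {
    Ok(format!("resolved {host}"))
}

fn lookup(host: String) -> Result<String, String> {
    match Handle::try_current() {
        // Already on a tokio-managed thread: run the blocking call on a
        // separate OS thread so the resolver's internal runtime is not
        // nested inside the current one.
        Ok(_) => thread::spawn(move || blocking_lookup(host))
            .join()
            .expect("lookup thread panicked"),
        // No runtime on this thread: it is safe to block directly.
        Err(_) => blocking_lookup(host),
    }
}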

@pront (Member) commented Jun 7, 2024

Hi @esensar, thanks for suggesting a fix.

spawning a thread for each lookup might be too much.

This might be problematic; could we have a pool or a dedicated thread for this?

stub.query((host, qtype, qclass)).await
})
let answer = match Handle::try_current() {
Ok(_) => thread::spawn(move || {
@jszwedko (Member) commented Jun 7, 2024

I think if it returns Ok you can use something like:

futures::executor::block_on(async {
    handle
        .spawn(async {
            ...
        })
})

to just execute in the current Tokio runtime instead of spawning a separate thread per https://stackoverflow.com/a/62536772

@esensar (Contributor Author):

I still have to await on the return value of block_on, which brings us back to the same issue. I think we will have to go with a pool or a dedicated thread until proper async support is added to VRL (if it ever gets added, since it may clash with the original goals).

Member:

Ah gotcha. Then yeah, a pool seems like a good approach. If/when VRL functions become async we can revisit this. I was going to suggest we thread the runtime through from Vector into VRL, but I think I'd prefer to just make the functions async rather than do that.

@jszwedko (Member) left a comment:

I'm realizing this might be the first example of an "async VRL function". Per vectordotdev/vector#20495 we've tried to avoid that until now. I think we can just hack it as you currently have, with sync lookups, but if we gather more examples like this it might motivate adding async VRL function variants (internally, not exposing the async nature to users).

@jszwedko (Member):

@esensar just a heads up that we'll be releasing Vector next week, in case you want to try to fix this before then.

@esensar (Contributor Author) commented Jun 12, 2024

I have implemented just a single dedicated thread for this. I know that is probably not the best idea, but this close to the release I didn't want to risk breaking things by adding a thread pool (which would probably mean pulling in a crate, since it would be better to rely on something that already exists).

Comment on lines 24 to 79
static WORKER: Lazy<Worker> = Lazy::new(|| Worker::new());

type Job<T> = Box<dyn FnOnce() -> T + Send + 'static>;
struct JobHandle<T> {
    job: Job<T>,
    result: Arc<mpsc::Sender<T>>,
}

struct Worker {
    thread: Option<thread::JoinHandle<()>>,
    queue: Option<mpsc::Sender<JobHandle<Result<Answer, Error>>>>,
}

impl Worker {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<JobHandle<Result<Answer, Error>>>();
        let receiver = Arc::new(Mutex::new(receiver));
        Self {
            thread: Some(thread::spawn(move || loop {
                match receiver.lock().unwrap().recv() {
                    Ok(handle) => {
                        let result = (handle.job)();
                        handle.result.as_ref().send(result).unwrap();
                    }
                    Err(_) => todo!(),
                }
            })),
            queue: Some(sender),
        }
    }

    fn execute<F>(&self, f: F) -> Result<Answer, Error>
    where
        F: FnOnce() -> Result<Answer, Error> + Send + 'static,
    {
        let job = Box::new(f);
        let (sender, receiver) = mpsc::channel();
        let receiver = Arc::new(Mutex::new(receiver));
        let handle = JobHandle {
            job,
            result: Arc::new(sender),
        };

        self.queue.as_ref().unwrap().send(handle).unwrap();
        return receiver.lock().unwrap().recv().unwrap();
    }
}

impl Drop for Worker {
    fn drop(&mut self) {
        drop(self.queue.take());
        if let Some(thread) = self.thread.take() {
            thread.join().unwrap();
        }
    }
}
@pront (Member) commented Jun 12, 2024

Hi @esensar, thank you for this patch. I think it's an acceptable workaround for an undocumented function. It would be good to add some comments here and also a module-level doc (//!) to explain what this function does and the limitations of a single-threaded solution (see the sketch after this list), especially when it comes to:

  • Blocking
  • Worker Thread Saturation
  • Unbounded Job Queue
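
A rough sketch of the kind of module-level doc comment being asked for, with placeholder wording (the exact text is up to the author):

//! Dedicated worker thread for blocking DNS lookups used by `dns_lookup`.
//!
//! Queries are sent to a single dedicated OS thread so that the blocking
//! resolver is never run directly on a tokio-managed thread. Limitations:
//!
//! * Blocking: each call blocks the caller until its result comes back.
//! * Worker thread saturation: all lookups are serialized on one thread,
//!   so a slow query delays every other caller.
//! * Job queue: jobs wait in a channel while the worker is busy; an
//!   unbounded channel would let that queue grow without limit.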

Member:

Another reasonable improvement here is to use a bounded channel.
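
For reference, a minimal illustration of the std API difference between the two channel flavors (illustrative u32 payloads, not the PR's job type):

use std::sync::mpsc;

fn main() {
    // Unbounded: `send` never blocks and the queue can grow without limit.
    let (tx, rx) = mpsc::channel::<u32>();
    tx.send(1).expect("receiver dropped");
    assert_eq!(rx.recv().unwrap(), 1);

    // Bounded: `send` blocks once CAPACITY messages are queued, which
    // applies backpressure to callers instead of growing the queue.
    const CAPACITY: usize = 16;
    let (tx, rx) = mpsc::sync_channel::<u32>(CAPACITY);
    tx.send(1).expect("receiver dropped");
    assert_eq!(rx.recv().unwrap(), 1);
}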

@esensar (Contributor Author):

Alright, that makes sense. I will try to add these as soon as possible, but I will not be available for the next 2-3 days. When is the release planned? I would love to get this ready for that release.

Member:

The release is planned for the 17th, which doesn't leave us too much time 😓 We do minor releases every 6 weeks, though, so it wouldn't have to wait too long if it didn't make it.

@jszwedko (Member) left a comment:

Thanks @esensar. I think we are OK including a single-threaded implementation to unblock use of this function for the release while we think about a better model. I left some inline comments below, but we'd also like to see a warning added as a rustdoc, just to make people aware of the potential bottleneck. Do you think you could add that?

match receiver.lock().unwrap().recv() {
    Ok(handle) => {
        let result = (handle.job)();
        handle.result.as_ref().send(result).unwrap();
@jszwedko (Member) commented Jun 12, 2024

Could we change all of these unwraps to expects?
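
For illustration, the send in the snippet above with expect instead of unwrap might read like this (the message wording is just a suggestion):

handle
    .result
    .as_ref()
    .send(result)
    .expect("result receiver was dropped before the job finished");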

        let result = (handle.job)();
        handle.result.as_ref().send(result).unwrap();
    }
    Err(_) => todo!(),
Member:

What should we do here? Panic?

    queue: Option<mpsc::Sender<JobHandle<Result<Answer, Error>>>>,
}

impl Worker {
Member:

Did you borrow this implementation from somewhere? It's slightly more complicated than I might have expected.

@esensar (Contributor Author):

It is roughly based on https://doc.rust-lang.org/book/ch20-02-multithreaded.html
I tried to simplify it as much as I could, but I thought I had to have some kind of channel to take in the function and to send back the result.

Member:

Ah, gotcha, yeah I see. It seems like that example just creates one channel that is reused, whereas this implementation creates a channel per call to execute. Is there a reason we need to create one channel per execute call?

@esensar (Contributor Author):

Oh, right. My bad, one should be enough.

Member:

Right, a single mpsc::sync_channel(CHANNEL_CAPACITY) should be enough.
Both the sender and the receiver should be lazily instantiated instances.

@esensar (Contributor Author):

It is all contained in the Worker instance now, so both of them are lazily instantiated (since Worker is lazily instantiated). There are two bounded channels though, one for jobs and one for results. I hope that is alright.

I have made the capacity 0 for testing (meaning it always blocks), but on the other hand, I'm not sure it makes sense to make it any bigger, considering there is only one thread handling jobs. I might be missing something, though.
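
As a rough sketch of the shape being described here, with a placeholder String result type instead of the real DNS answer and once_cell's Lazy as already used in the PR (illustrative only, not the exact merged code):

use std::sync::{mpsc, Mutex};
use std::thread;

use once_cell::sync::Lazy;

// Capacity 0 makes both channels rendezvous channels: every send blocks
// until the matching recv happens (see the capacity discussion below).
const CHANNEL_CAPACITY: usize = 0;

// Placeholder job type; the real worker moves `FnOnce() -> Result<Answer, Error>` jobs.
type Job = Box<dyn FnOnce() -> String + Send + 'static>;

static WORKER: Lazy<Worker> = Lazy::new(Worker::new);

struct Worker {
    // Job sender and result receiver, guarded together so that concurrent
    // callers cannot receive each other's results.
    channels: Mutex<(mpsc::SyncSender<Job>, mpsc::Receiver<String>)>,
}

impl Worker {
    fn new() -> Self {
        let (job_tx, job_rx) = mpsc::sync_channel::<Job>(CHANNEL_CAPACITY);
        let (result_tx, result_rx) = mpsc::sync_channel::<String>(CHANNEL_CAPACITY);
        // Single dedicated thread: runs jobs one at a time until the job
        // sender is dropped.
        thread::spawn(move || {
            while let Ok(job) = job_rx.recv() {
                let _ = result_tx.send(job());
            }
        });
        Self {
            channels: Mutex::new((job_tx, result_rx)),
        }
    }

    fn execute<F>(&self, f: F) -> String
    where
        F: FnOnce() -> String + Send + 'static,
    {
        // Blocks the calling thread until the dedicated thread returns a result.
        let channels = self.channels.lock().expect("worker mutex poisoned");
        channels.0.send(Box::new(f)).expect("worker thread exited");
        channels.1.recv().expect("worker thread exited")
    }
}

A call like WORKER.execute(|| "example".to_string()) then serializes all lookups through the one thread, which is the bottleneck the requested rustdoc warning is meant to call out.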

Comment on lines 59 to 65
let job = Box::new(f);
let (sender, receiver) = mpsc::channel();
let receiver = Arc::new(Mutex::new(receiver));
let handle = JobHandle {
    job,
    result: Arc::new(sender),
};
Member:

Similar to my question above, did you borrow this from somewhere? It seems like a lot of "work" to do for each execute call, enough that I'm wondering what the performance looks like. Maybe we could add a benchmark for it in benches/stdlib.rs to see?

@esensar (Contributor Author):

There are already some benchmarks there for the dns_lookup function. Did you mean we should add some benchmarks for the Worker itself?

@jszwedko (Member) left a comment:

Thanks for the updates @esensar! I think this looks good to me now.

It seems possible that having a higher channel capacity could improve performance by letting the worker thread process more messages before it switches out, but we'd need to do some benchmarking to prove that out.

@jszwedko requested a review from pront June 17, 2024 13:55
@jszwedko (Member):

I'd like to get @pront's review again too.

@pront added the no-changelog label (Changes in this PR do not need user-facing explanations in the release changelog) Jun 17, 2024
// Currently blocks on each request until result is received
// It should be avoided unless absolutely needed
static WORKER: Lazy<Worker> = Lazy::new(Worker::new);
const CHANNEL_CAPACITY: usize = 0;
Member:

Let's set a sufficiently large number here?

@esensar (Contributor Author):

I had trouble figuring out the right number for this, and then I thought it would make sense to keep it 0, considering that it is just one thread, but that is probably wrong.

Do you have a suggestion for a number?

Member:

What you have works, but we could probably go a bit further by allowing some buffering (without worrying about memory explosion). From the docs:

This channel has an internal buffer on which messages will be queued. bound specifies the buffer size. When the internal buffer becomes full, future sends will block waiting for the buffer to open up.
Note that a buffer size of 0 is valid, in which case this becomes a "rendezvous channel" where each send will not return until a recv is paired with it.
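
A small, self-contained illustration of the rendezvous behaviour described in that quote (not code from the PR):

use std::sync::mpsc;
use std::thread;

fn main() {
    // Bound of 0: each `send` blocks until a `recv` is paired with it.
    let (tx, rx) = mpsc::sync_channel::<u32>(0);
    let receiver = thread::spawn(move || rx.recv().expect("sender dropped"));
    tx.send(42).expect("receiver dropped"); // blocks until the thread above receives
    assert_eq!(receiver.join().unwrap(), 42);
}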

Member:

Tying it to the number of concurrent executions of the VRL runtime could make sense. For example, the remap transform runs one transform per available thread: https://github.com/vectordotdev/vector/blob/b3276b4cc73dee6d3854469562f1b1fcf15a419c/src/topology/builder.rs#L68-L73. To do that "right", though, it should probably be configurable on the VRL runtime.

@esensar (Contributor Author):

I guess any number higher than that would be alright, so for now we can hardcode something arbitrary, like 100, since it won't really add more jobs than there are threads, due to the way the function blocks the current thread while waiting for the result.

@esensar (Contributor Author):

Tying it to the number of concurrent executions of the VRL runtime could make sense. For example, the remap transform runs one transform per available thread: https://github.com/vectordotdev/vector/blob/b3276b4cc73dee6d3854469562f1b1fcf15a419c/src/topology/builder.rs#L68-L73. To do that "right", though, it should probably be configurable on the VRL runtime.

Sorry for just going with the hard-coded number, but I thought we could go with the configurable approach when we update this to use multiple threads (or, hopefully, sometime in the future make it properly async).

@pront (Member) commented Jun 17, 2024

I'd like to get @pront's review again too.

This is significantly improved, thanks @esensar. I had a question about the channel capacity which I believe is important to resolve before we merge this.

@pront (Member) left a comment:

Thank you @esensar!

@jszwedko added this pull request to the merge queue Jun 17, 2024
Merged via the queue into vectordotdev:main with commit f5719be Jun 17, 2024
13 checks passed
@esensar deleted the fix/dns-lookup-tokio-runtime branch June 17, 2024 18:26