createBuffer is bad - Why BQ isn't the problem #487
jumping on a call now, potentially more context to come.
For more context, here's a walkthrough of what happens using the snippet I tested with:

```js
global.buffer = createBuffer({
  limit: 5 * 1024 * 1024,
  timeoutSeconds: 1,
  // onFlush will get voided so it happens in the background
  onFlush: async (batch) => {
    // this runs immediately
    const timer1 = new Date().getTime()
    // here, the request is initiated and the worker moves on
    // to go through the other phases of the event loop.
    // but this whole block isn't being awaited by the task,
    // so the worker goes into `atomicsWaitLoop` mode.
    // if no tasks come for e.g. 15s, the request may have completed,
    // but the callback (everything after the fetch line) hasn't been called,
    // because the event loop is blocked
    await fetch('https://webhook.site/ba1c2e5c-e471-4873-be4c-b19dfd1dda86')
    // after 15s this callback is finally called
    // and the log says something like 15s.
    // but if that 15s had been 35s, an error would be thrown because of `asyncGuard`,
    // and this would never run
    const timer2 = new Date().getTime()
    console.log('onFlush took', (timer2 - timer1) / 1000, 'seconds')
  },
})
```
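To make the blocking effect concrete, here's a small standalone repro of the same mechanism. It is not plugin-server code: it assumes Node 18+ (global `fetch`) and uses a plain `Atomics.wait` as a stand-in for piscina's wait loop.

```js
// Start a request and discard the promise, like the buffer does with onFlush.
async function voidedFlush() {
  const start = Date.now()
  await fetch('https://example.com') // request initiated; promise is pending
  // this continuation can only run once the event loop is unblocked
  console.log('continuation ran after', (Date.now() - start) / 1000, 'seconds')
}

void voidedFlush()

// Stand-in for atomicsWaitLoop: block this thread for ~15s waiting on shared memory.
const shared = new Int32Array(new SharedArrayBuffer(4))
Atomics.wait(shared, 0, 0, 15_000)
// Only now does the event loop run again, so the log above reports roughly 15 seconds.
```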
Oh wow, that's some amazing digging you did there! 💯 The findings make a lot of sense as well. Not sure re: the best solution, as we don't really know and can't know if any library we use is using internal buffers, discarded awaits, etc. We're anyway using a patched piscina (with a broadcastTask that they didn't accept), so perhaps it makes sense to patch that somehow? Or we can just… I don't have good answers (yet?), but again, great work digging this out!
There's a chance we might be able to solve this by circumventing …
No timeout errors at all from the above. For the buffer at least, we can probably move it away from …
Here's a before and after: Screen.Recording.2021-06-25.at.16.36.18.mov
This is super interesting. 👏 I definitely can see other places where we'd like to …
What about creating an internal promise-tracker? Something like:

```js
const trackedPromises = new Set()

const trackPromise = (promise) => {
  if (trackedPromises.has(promise)) {
    return
  }
  trackedPromises.add(promise)
  promise.finally(() => {
    trackedPromises.delete(promise)
  })
}

function createBuffer() {
  return {
    flush: async () => {
      const actuallyFlush = async () => {
        // do whatever
      }
      const promise = actuallyFlush()
      trackPromise(promise)
      return promise
    },
  }
}
```

This won't help with random unhandled promises in plugins (unless we implement some babel magic ;)), like someone forgetting to `await`.
To clarify, this … Then in teardown, we could also …
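For the teardown part, a minimal sketch of what that could look like, assuming the `trackedPromises` set from the snippet above (the function name is illustrative):

```js
// Before shutting a worker down, wait for anything that was started but never awaited.
async function waitForTrackedPromises() {
  // copy the set so entries removing themselves via .finally() don't affect iteration
  await Promise.allSettled([...trackedPromises])
}
```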
Super interesting, great work getting to the bottom of this! :D Another option might be to move things around such that we don't end up in this situation? Where, instead of every thread getting its own … This is related to opening a channel of communication between threads & the main server, and also: PostHog/posthog#4709
The promise tracker wouldn't be generalizable enough to all the use cases, I feel. Let's see the results of configuring `useAtomics`. If that doesn't work (it solves this problem, but the question is what sort of performance dip we'll see), then we go back to the drawing board. There are probably ways to patch the piscina worker to do some processing in between looking for tasks, which would be ideal, as it would work at the foundation level rather than with some higher-level monkey patching. Could well be that this turns out to not be a good idea though. 🤷 Just rubber ducking.
We can write a babel plugin that captures all discarded promises (at least inside plugins; can't say for included libs) to make the feature more universal. Plus, this makes some sense anyway if we want a clean shutdown procedure... and/or just to track leaks inside plugins. Just trying to solve the "you really shouldn't discard promises" root cause here :). In any case, feel free to try whatever makes sense!
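A rough sketch of what such a babel plugin could look like (my interpretation, not an existing plugin): rewrite every bare call statement so a runtime helper gets a chance to register the returned value if it turns out to be a promise. `__trackPromise` is a hypothetical helper that would feed into the tracker above.

```js
// babel-plugin-track-discarded-promises (hypothetical name)
module.exports = function trackDiscardedPromises({ types: t }) {
  return {
    visitor: {
      ExpressionStatement(path) {
        const expr = path.node.expression
        // only wrap plain calls whose result is being thrown away
        if (!t.isCallExpression(expr)) return
        // avoid double-wrapping our own helper
        if (t.isIdentifier(expr.callee, { name: '__trackPromise' })) return
        path.node.expression = t.callExpression(t.identifier('__trackPromise'), [expr])
      },
    },
  }
}

// runtime side: only track things that actually look like promises
// function __trackPromise(value) {
//   if (value && typeof value.then === 'function') trackPromise(value)
//   return value
// }
```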
Makes sense 👍
Q: How did you figure out that these are related to the piscina workers "hanging" for the next task?
The Sentry issues, you mean? Well, it's not certain they're because of this, so maybe I should write "potentially related issues". They are very much in line with things I've seen locally though. But good point, it's not certain this is the only thing at play here.
Update: The original commit that added the …
Potentially worth running our own benchmarks.
Just dumping this here as a note to self for now. (Benchmark screenshots: overall time, useAtomics = true, useAtomics = false.)
Another one. (Benchmark screenshots: overall time, useAtomics = false, useAtomics = true.)
Ok, so some findings on `useAtomics = false`: …
That 27% throughput drop for avoidable software reasons doesn't seem that nice :(.
Yup, definitely not the approach to take. What's next is exploring how to make sure we keep track of those promises, preferably done at a lower level than requiring instrumentation on every unhandled promise we create. Have a few ideas to test.
However, I might put through a jobs-based buffer implementation, just so we at least get rid of the "BigQuery errors" (that happen in multiple plugins).
So this is a reasonably hard problem (as anticipated). There is a way to check for async stuff running in the background using some undocumented Node … Another initial thought was to have … So with enough searching, I found out that there's a new method for … So if we can't go up to Node 16, it's back to the drawing board.
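I can only guess which Node internals are meant above, but as one example of the general idea, the documented `async_hooks` module can approximate an "is anything still pending?" check (with exactly the noisiness problem discussed in the next comment):

```js
const asyncHooks = require('async_hooks')

const pending = new Set()

asyncHooks.createHook({
  init(asyncId) {
    pending.add(asyncId)
  },
  destroy(asyncId) {
    pending.delete(asyncId)
  },
  promiseResolve(asyncId) {
    pending.delete(asyncId)
  },
}).enable()

// In practice this rarely hits zero, because piscina and the rest of the stack
// keep their own timers, sockets, and promises alive.
function hasPendingAsyncWork() {
  return pending.size > 0
}
```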
I'd still suggest this: #487 (comment). Make sure everything we export (redis.set, etc.) is wrapped with something that noop-awaits the promise and we're done. This will add another thing to think of when we finally launch …
What about something like this? https://github.com/PostHog/piscina/pull/3/files
RE the promise tracker, I'm confused as to how it would ensure promises are fulfilled when the event loop is blocked? What ensures the event loop can e.g. close sockets in time? In the buffer, …
I basically mean that here, replace … The …

```js
import { trackPromise } from '@posthog/piscina'

redis.set = trackPromise(redis.set)
```
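If `trackPromise` is applied to a function rather than to a single promise (as in the earlier snippet), it would presumably be a small higher-order wrapper. A sketch, reusing the `trackedPromises` set from before (the names are assumptions, not the actual `@posthog/piscina` export):

```js
function trackPromiseFn(fn) {
  return (...args) => {
    const promise = Promise.resolve(fn(...args))
    trackedPromises.add(promise)
    promise.finally(() => trackedPromises.delete(promise))
    return promise
  }
}

// usage, binding so `this` inside redis.set still points at the client:
// redis.set = trackPromiseFn(redis.set.bind(redis))
```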
Hmm, but that's the same problem as this: …
I don't think these methods will give good results, because with the random number of other promises running from within piscina and other parts of the stack, I'd bet you can't reliably determine what's "zero" with them. Replied in the PR, but AFAIK the main point of atomics is to immediately react when a new message comes in. That makes sense when we have 100 messages per second, but it's a pretty "meh whatever" feature when we get just 2 messages per second. So just decreasing the default atomics wait time (and then running benchmarks) before re-running the entire wait loop, including the … As in, if we decrease it to ~100ms from the current …
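As a sketch of that idea (not piscina's actual implementation, and the 100ms value is just the number floated above): bound the wait, then fall back to the event loop so pending callbacks, like a completed `fetch`, get processed before the next wait.

```js
const SHORT_WAIT_MS = 100 // assumed value from the comment above

// sharedState[0] flips to a non-zero value when a new task message is posted
function waitBrieflyForNextTask(sharedState) {
  // 'ok' or 'not-equal' means a task is (or was already) available;
  // 'timed-out' means we should return to the event loop and let it drain
  return Atomics.wait(sharedState, 0, 0, SHORT_WAIT_MS) !== 'timed-out'
}
```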
Should be solved with #490.
As the Extensibility team disbanded to conquer the world, it was my turn to be assigned PostHog/bigquery-plugin#10.
Having spent a good amount of time pulling my beard hairs out, I believe I now have the answer. Here's what's happening and a mandatory story of how I got there.
Disclaimer: While I'm reasonably sure of quite a bit of this, some of it is still "hunches", so it may not be 100% technically correct
Context
@neilkakkar suggested I pick this up during the last planning session, noting that this was an error that only happens on local/self-hosted but not Cloud.
This was super weird, because it was reported on local Postgres instances, but also Netdata's ClickHouse instance. I kept thinking about all the things that could be different between these environments and testing them out, and went through a lot of the same steps others here have taken. Could it be logs? Could it be job queues?
Then I started looking into the buffer.
BigQuery ain't the problem
When I first read the issue, I immediately thought this wasn't a BigQuery thing. Why? I had experienced the same issues developing the RudderStack plugin, and the odd retries even featured on a video I made for them :D
It seemed that retry errors were only thrown on the first run, but everything would always complete as expected the second time.
What was different here? In both plugins, the first flush would come from the buffer, whereas all subsequent ones would come from jobs. Bingo, buffer.
The Evil Buffer
To find out if the buffer was the problem, I set up a buffer flush and a job to call the same async function that made a request to an endpoint that'd always wait 5s before responding.
The job always executed in around the same amount of time, but the buffer would vary a lot:
I went through thousands of runs, and the job was always consistent. The buffer would vary from 5s to 29s and then came the timeouts.
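Roughly the shape of that experiment (a reconstruction, not the exact test code; the endpoint URL and names are placeholders, and it assumes the plugin scaffold's `createBuffer` from `@posthog/plugin-contrib` plus the jobs API):

```js
import { createBuffer } from '@posthog/plugin-contrib'

const SLOW_ENDPOINT = 'https://example.com/responds-after-5s' // placeholder

async function timedRequest(label) {
  const start = Date.now()
  await fetch(SLOW_ENDPOINT)
  console.log(label, 'took', (Date.now() - start) / 1000, 'seconds')
}

export const jobs = {
  flushViaJob: async (_payload, _meta) => {
    await timedRequest('job') // consistently ~5s
  },
}

export function setupPlugin({ global }) {
  global.buffer = createBuffer({
    limit: 1024,
    timeoutSeconds: 1,
    onFlush: async () => {
      await timedRequest('buffer') // anywhere from ~5s up to timeouts
    },
  })
}

export async function onEvent(event, { global, jobs }) {
  global.buffer.add(event)
  await jobs.flushViaJob({}).runNow()
}
```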
So the buffer was the problem. But finding out why was where my beard hairs started to come out.
I started to notice connection reset, socket hang up, and other sorts of errors. I knew the request would take only 5s, so there were 25s to spare. This made no sense, and so my hunch was that this was a problem lower down.
Here came a period of hours and hours looking through node and piscina docs, changing config options, inspecting all the possible lines of code one by one, etc. We can skip ahead from here.
0x our savior
Finally, I decided to run the plugin server with the flamegraph tool https://github.com/davidmarkclements/0x.
Here's what that looked like:
Hmmm, what the hell is `atomicsWaitLoop`?
I went to the piscina repo and there was the answer:
"Generally, tasks should not spawn asynchronous operations without waiting for them to finish, though."
Hey! That's exactly what we do.
This by itself didn't fully answer the Cloud vs local/self-hosted question though, but with a bit more digging and testing, the answer came through:
This does happen on Cloud (10k+ events on Sentry) but it happens waaay less because there are always tasks to process.
What happens is that when a piscina worker is looking for a task and doesn't find one, it keeps blocking the event loop of that thread until a task comes through. If one does, it may have time to run through the other phases of the event loop and handle the voided steps of the buffer flush before the timeout (e.g. the `fetch` request may have completed, but its callback hasn't been processed yet). But if no tasks come, it will wait, and wait, and wait, and when it finally comes back, `asyncGuard` does its thing aaaand we're sending a request to be retried (even though it may have finished, as was pointed out with BQ).
Testing this locally confirmed this:
The fewer events you send, the more `atomicsWaitLoop` will run for. This isn't much of a problem in Cloud because there's always events (and all the other tasks) to process.
Solution
The solution is to listen to the piscina people and not void promises in tasks.
While it will be adding another task each time, we're better off flushing everything via a job.
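A hedged sketch of what "flush everything via a job" could look like (illustrative names, threshold, and endpoint; not the actual patch). The key point is that the flush request is awaited by the job task, so piscina doesn't sit in its wait loop while the request is pending:

```js
export const jobs = {
  flushBatch: async ({ batch }, _meta) => {
    // the job runner awaits this, so the continuation isn't stranded behind atomicsWaitLoop
    await fetch('https://example.com/flush', { // placeholder destination
      method: 'POST',
      body: JSON.stringify(batch),
    })
  },
}

export async function onEvent(event, { global, jobs }) {
  global.batch = global.batch || []
  global.batch.push(event)
  if (global.batch.length >= 100) { // illustrative threshold
    const batch = global.batch
    global.batch = []
    await jobs.flushBatch({ batch }).runNow()
  }
}
```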