query: fix retry query case. #297
Conversation
Maybe we should also increase the timeouts/retries while we are here, so that we can mitigate cases like this: lightningnetwork/lnd#8497.
- 2 sec timeout: Line 14 in 43f5a58
- 4 sec timeout after the first failed retry: Lines 375 to 379 in 43f5a58
- Overall batch timeout of 30 sec: Lines 10 to 12 in 43f5a58
- 2 retries: Lines 18 to 20 in 43f5a58
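As a rough illustration of how those numbers could interact (this is not neutrino's actual code; the constant names and the doubling/cap values below are hypothetical stand-ins for the defaults referenced above), a per-query timeout that grows after each failed attempt inside an overall batch deadline might look like this:

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialQueryTimeout = 2 * time.Second  // first attempt
	maxQueryTimeout     = 32 * time.Second // cap on the backoff
	defaultBatchTimeout = 30 * time.Second // the whole batch must finish by then
	defaultNumRetries   = 2                // attempts after the first one
)

func main() {
	timeout := initialQueryTimeout
	for attempt := 0; attempt <= defaultNumRetries; attempt++ {
		fmt.Printf("attempt %d: per-query timeout %v\n", attempt, timeout)

		// Double the timeout for the next attempt, but never exceed
		// the configured maximum.
		timeout *= 2
		if timeout > maxQueryTimeout {
			timeout = maxQueryTimeout
		}
	}
	fmt.Printf("overall batch timeout: %v\n", defaultBatchTimeout)
}
```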
Well, thanks a lot for handling this edge case. The fix looks good. I think we should have checked earlier whether batch == nil here:
Line 312 in 43f5a58
batch := currentBatches[batchNum]
instead of duplicating that check over and over, but maybe that can be included in the rework that you suggested.
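A minimal sketch of that suggestion, using an illustrative `batchProgress` type and `updateBatch` helper rather than the real workmanager types: look the batch up once, return early when it is nil, and let every later branch use it without repeating the check.

```go
package query

// batchProgress is a hypothetical stand-in for the per-batch bookkeeping
// kept by the work manager.
type batchProgress struct {
	rem     int  // queries still outstanding in this batch
	errored bool // whether the batch has already failed
}

// updateBatch checks for a nil batch exactly once, up front, instead of in
// every branch that touches the batch.
func updateBatch(currentBatches map[uint64]*batchProgress, batchNum uint64,
	failed bool) {

	// The batch may already have been canceled or completed; in that
	// case there is nothing left to update.
	batch := currentBatches[batchNum]
	if batch == nil {
		return
	}

	// Every path below can now use batch without re-checking for nil.
	if failed {
		batch.errored = true
		return
	}

	batch.rem--
	if batch.rem == 0 {
		delete(currentBatches, batchNum)
	}
}
```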
I have a couple of questions. Other than that looks good.
Force-pushed from f2b8212 to 535aa62.
Ok, I came up with a slightly different approach: we now use a cancel channel and close it, instead of looping through the queryJob array. Maybe we should do both (close the channel to remove active requests and also remove all the ones still queued), though I think just the first one is enough? Let me know if this is an even more hacky way; I still have to think about a new test for this.
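A generic sketch of the close-to-cancel pattern described here (hypothetical worker code, not the neutrino implementation): every in-flight job selects on the same cancel channel, so a single close() wakes all of them at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	cancelChan := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			select {
			case <-time.After(10 * time.Second):
				fmt.Printf("worker %d: finished query\n", id)
			case <-cancelChan:
				// Closing cancelChan unblocks every worker
				// waiting here, so one close cancels all
				// outstanding jobs without looping over them.
				fmt.Printf("worker %d: canceled\n", id)
			}
		}(i)
	}

	// Batch timed out: cancel everything still in flight.
	close(cancelChan)
	wg.Wait()
}
```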
Force-pushed from 535aa62 to 1ef869f.
Don't worry about it, this is fine enough for a change this small. I would like to see neutrino undergo some refactoring at some point, but I don't think we should turn your 1-commit fix into that project.
Sounds good, I am writing a test for the new approach and then we can ship it until we get the new refactor into the codebase.
@ProofOfKeags actually the former version had a bug in it; it would crash in some circumstances, which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers, so the prior version would crash. Good reminder to never ack something without tests... 😅
Good catch. I'll admit that the way we do testing makes it really hard for me to tell whether the tests are actually good or not. I'm actively trying to improve our library infrastructure. The better we factorize things, the smaller the tests will be and the easier they will be to evaluate too. It requires a lot of discipline and time though.
Force-pushed from 05d2140 to 2e5e321.
Interesting, I thought that if they had already been registered with a worker they would have been deleted from the heap, and so wouldn't be present in the purge you did earlier. Line 244 in 1ef869f
A learning point for me as well. This approach works too, it's just that we would have to wait for the job whose batch was already cancelled to be scheduled before it gets a cancel signal.
Yes so the main problem is that as soon as a job is…
@ellemouton: review reminder
Great find!
I think this is an incorrect use of the cancelChannel API though. A cancel channel given by the caller is a way for the caller to signal to us (the system) that we should cancel early, i.e. we, the system, should only ever listen on this channel. Otherwise we risk running into a "panic: close of closed channel" error.
I think we should instead have an internal-only way of canceling queries that is separate from the way that a caller cancels queries.
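A hedged sketch of that separation, with hypothetical type and field names: the caller-provided channel is only ever received from, while the work manager closes a channel it owns exclusively, so a batch timeout can never trip "close of closed channel".

```go
package main

import "fmt"

type batch struct {
	// cancelChan is supplied by the caller; the system only listens on
	// it and never closes it.
	cancelChan <-chan struct{}

	// internalCancelChan is owned by the work manager, so it is the
	// only channel the system is allowed to close, e.g. on a timeout.
	internalCancelChan chan struct{}
}

// timeout cancels the batch internally. Safe because we are the sole owner
// of internalCancelChan.
func (b *batch) timeout() {
	close(b.internalCancelChan)
}

// canceled reports whether either the caller or the system canceled.
func (b *batch) canceled() bool {
	select {
	case <-b.cancelChan: // caller asked us to stop
		return true
	case <-b.internalCancelChan: // we stopped ourselves (e.g. timeout)
		return true
	default:
		return false
	}
}

func main() {
	caller := make(chan struct{})
	b := &batch{
		cancelChan:         caller,
		internalCancelChan: make(chan struct{}),
	}

	fmt.Println(b.canceled()) // false
	b.timeout()
	fmt.Println(b.canceled()) // true
}
```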
Force-pushed from 9cc29eb to 853b4d1.
Sounds like I should revive my work on the Async API. 😅
Nice input @ellemouton 🙏, introduced a new…
Thanks for the quick update! Looks good! One little thing left over from the previous version, I think. After that, LGTM!
In case the backend is very unstable and times out the batch, we need to make sure ongoing queryJobs are dropped and already registered queryJobs are removed from the heap as well.
Force-pushed from 853b4d1 to a2d891c.
LGTM 🚀
@ziggie1984 - before we merge this, can you open an LND PR that points to this so we can make sure the CI passes?
@guggero I think we can merge this now, itests passed on the LND PR (lightningnetwork/lnd#8621).
Fixes lightningnetwork/lnd#8593
In general I will rework this code area in LND 19 because this needs an overhaul and relates to btcsuite/btcwallet#904
In the current code there is an edge case that manifests when the backend in use has very unstable connections and not many outbound peers, which results in queryJobs waiting a long time in the queue. Then we would time out the batch here:
neutrino/query/workmanager.go
Lines 420 to 433 in 43f5a58
but the queryJob was already registered for a second try here:
neutrino/query/workmanager.go
Lines 388 to 389 in 43f5a58
Now this query will be retried over and over and never purged from the queue, and because of how the heap ordering works, it will always retry the query with the lowest index number. So this causes us to never try new queries but always retry the old ones.
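A simplified sketch of that starvation behaviour, using a plain container/heap min-heap keyed by job index rather than the actual neutrino work queue: a timed-out job that is pushed back with its original, low index always sorts to the front and is dispatched before any newer job ever runs.

```go
package main

import (
	"container/heap"
	"fmt"
)

type job struct{ index uint64 }

// jobQueue is a min-heap of jobs ordered by index, so the lowest index is
// always popped first.
type jobQueue []*job

func (q jobQueue) Len() int            { return len(q) }
func (q jobQueue) Less(i, j int) bool  { return q[i].index < q[j].index }
func (q jobQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *jobQueue) Push(x interface{}) { *q = append(*q, x.(*job)) }
func (q *jobQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &jobQueue{{index: 1}, {index: 2}, {index: 3}}
	heap.Init(q)

	for round := 0; round < 3; round++ {
		next := heap.Pop(q).(*job)
		fmt.Printf("dispatching job %d\n", next.index)

		// The job times out and is re-queued with the same index,
		// so it keeps winning over the newer jobs behind it.
		heap.Push(q, next)
	}
	// Job 1 is dispatched every round; jobs 2 and 3 starve.
}
```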