
query: fix retry query case. #297

Merged (2 commits) on Apr 25, 2024

Conversation

@ziggie1984 (Contributor) commented Mar 27, 2024

Fixes lightningnetwork/lnd#8593

In general I will rework this code area in LND 19 because this needs an overhaul and relates to btcsuite/btcwallet#904

In the current code there is an edge case which manifests itself when the backend has very unstable connections and only a few outbound peers, which results in queryJobs waiting a long time in the queue. Then we would time out the batch here:

if batch != nil {
	select {
	case <-batch.timeout:
		batch.errChan <- ErrQueryTimeout
		delete(currentBatches, batchNum)
		log.Warnf("Query(%d) failed with "+
			"error: %v. Timing out.",
			result.job.index, result.err)
		log.Debugf("Batch %v timed out",
			batchNum)
	default:
but the queryJob was already registered for a second try here:

heap.Push(work, result.job)
currentQueries[result.job.index] = batchNum

Now this query is retried over and over and never purged from the queue, and because the work heap always yields the query with the lowest index, it is this stale query that gets retried every time. As a result we never try new queries but keep retrying the old ones.
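
To make the starvation concrete, here is a minimal, runnable sketch (illustrative names only, not the actual neutrino types) of an index-ordered work heap: a job from the timed-out batch that keeps being re-pushed always wins the next Pop, so newer jobs never get scheduled.

package main

import (
	"container/heap"
	"fmt"
)

// job stands in for a queryJob; the work heap orders jobs by index,
// lowest first, as described above.
type job struct{ index int }

type workQueue []*job

func (w workQueue) Len() int           { return len(w) }
func (w workQueue) Less(i, j int) bool { return w[i].index < w[j].index }
func (w workQueue) Swap(i, j int)      { w[i], w[j] = w[j], w[i] }
func (w *workQueue) Push(x any)        { *w = append(*w, x.(*job)) }
func (w *workQueue) Pop() any {
	old := *w
	item := old[len(old)-1]
	*w = old[:len(old)-1]
	return item
}

func main() {
	work := &workQueue{}
	heap.Init(work)

	heap.Push(work, &job{index: 0}) // stale job from the timed-out batch
	heap.Push(work, &job{index: 1}) // fresh jobs that never get a turn
	heap.Push(work, &job{index: 2})

	// Three scheduling rounds in which the stale job keeps timing out
	// and is re-pushed for another retry, as in the code above.
	for round := 0; round < 3; round++ {
		next := heap.Pop(work).(*job)
		fmt.Printf("round %d schedules job %d\n", round, next.index)

		// Because it has the lowest index, the re-pushed job wins
		// the next Pop again: jobs 1 and 2 starve.
		heap.Push(work, next)
	}
}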

@ziggie1984 (Contributor, Author):

Maybe we should also increase the timeouts/retries while we are here, so that we can mitigate cases like lightningnetwork/lnd#8497, where the default of 2 retries with a timeout of 2 and then 4 seconds just isn't enough, also taking the overall batch timeout of 30 seconds into account.

2 sec timeout:

minQueryTimeout = 2 * time.Second

4 sec timeout after first failed retry:

if result.err == ErrQueryTimeout {
	newTimeout := result.job.timeout * 2
	if newTimeout > maxQueryTimeout {
		newTimeout = maxQueryTimeout
	}

Overall batch timeout 30 sec:

// defaultQueryTimeout specifies the default total time a query is
// allowed to be retried before it will fail.
defaultQueryTimeout = time.Second * 30

Retries 2:

// defaultNumRetries is the default number of times that a query job
// will be retried.
defaultNumRetries = 2
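
For concreteness, here is a small sketch of the retry schedule these constants imply. The doubling/capping rule is taken from the snippet above; the exact value of maxQueryTimeout is assumed here. Two retries spend only 2s + 4s = 6s of per-query waiting inside the 30-second batch budget.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Constants quoted in the comment above. maxQueryTimeout's exact
	// value is an assumption in this sketch; only the doubling and
	// capping rule is taken from the quoted snippet.
	const (
		minQueryTimeout     = 2 * time.Second
		maxQueryTimeout     = 32 * time.Second // assumed cap
		defaultQueryTimeout = 30 * time.Second // overall batch budget
		defaultNumRetries   = 2
	)

	timeout := minQueryTimeout
	total := time.Duration(0)

	// defaultNumRetries attempts, doubling the per-query timeout after
	// each ErrQueryTimeout, capped at maxQueryTimeout.
	for attempt := 1; attempt <= defaultNumRetries; attempt++ {
		total += timeout
		fmt.Printf("attempt %d: per-query timeout %v (cumulative %v)\n",
			attempt, timeout, total)

		newTimeout := timeout * 2
		if newTimeout > maxQueryTimeout {
			newTimeout = maxQueryTimeout
		}
		timeout = newTimeout
	}

	// 2s + 4s = 6s of actual per-query waiting inside a 30s batch
	// budget, which is tight when peers are slow or unstable.
	fmt.Printf("total %v out of the %v batch budget\n",
		total, defaultQueryTimeout)
}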

@ziggie1984 marked this pull request as ready for review on March 27, 2024 15:59
@saubyk requested a review from Chinwendu20 on March 28, 2024 18:39
@ProofOfKeags self-requested a review on April 1, 2024 20:55
@Chinwendu20 (Contributor) left a comment

Well, thanks a lot for handling this edge case. The fix looks good. I think we should check earlier on whether batch == nil here:

batch := currentBatches[batchNum]

instead of duplicating that check over and over, but maybe that can be included in the rework that you suggested.
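
A minimal sketch of that suggestion (the batchProgress shape and field names are assumed here for illustration): look the batch up once and return early when it is nil, instead of guarding each use with `if batch != nil`.

package main

import (
	"errors"
	"log"
	"time"
)

// batchProgress stands in for the per-batch state kept by the work
// manager; the exact field set here is assumed for illustration.
type batchProgress struct {
	timeout <-chan time.Time
	errChan chan error
}

var errQueryTimeout = errors.New("query timed out")

// handleResult sketches the suggested pattern: look the batch up once,
// bail out early when it is nil, and then use it without re-checking.
func handleResult(currentBatches map[uint64]*batchProgress, batchNum uint64) {
	batch := currentBatches[batchNum]
	if batch == nil {
		// The batch already finished or was cancelled, so there is
		// nothing left to do for this result.
		return
	}

	select {
	case <-batch.timeout:
		batch.errChan <- errQueryTimeout
		delete(currentBatches, batchNum)
		log.Printf("batch %v timed out", batchNum)
	default:
	}
}

func main() {
	batches := map[uint64]*batchProgress{
		7: {
			timeout: time.After(0),
			errChan: make(chan error, 1),
		},
	}
	time.Sleep(10 * time.Millisecond)
	handleResult(batches, 7) // existing batch: timeout branch fires
	handleResult(batches, 8) // unknown batch: early return, no nil deref
}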

@saubyk requested a review from ellemouton on April 2, 2024 16:46
@ProofOfKeags left a comment

I have a couple of questions. Other than that looks good.

[Review threads on query/workmanager.go: resolved]
@ziggie1984 (Contributor, Author) commented Apr 3, 2024

Ok, I came up with a slightly different approach: we now use a cancel channel and close it instead of looping through the queryJob array. Maybe we should do both (close the channel to remove active requests and also remove the ones still queued), though I think just the first is enough? Let me know if this is an even hackier way; I still have to think about a new test for this.
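
A minimal sketch of the channel-close idea (illustrative code, not the actual neutrino types): closing a channel is observed by every goroutine selecting on it, so a single close cancels all in-flight jobs of the timed-out batch at once.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// cancel stands in for the per-batch cancel channel: closing it is
	// visible to every worker that selects on it.
	cancel := make(chan struct{})

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(jobIndex int) {
			defer wg.Done()
			select {
			case <-time.After(10 * time.Second):
				fmt.Printf("job %d finished\n", jobIndex)
			case <-cancel:
				// Batch timed out: drop the job instead of
				// retrying it forever.
				fmt.Printf("job %d cancelled\n", jobIndex)
			}
		}(i)
	}

	// The batch times out: a single close broadcasts the cancellation
	// to all workers, no per-job bookkeeping needed.
	time.Sleep(50 * time.Millisecond)
	close(cancel)
	wg.Wait()
}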

@ProofOfKeags:

> Ok, I came up with a slightly different approach: we now use a cancel channel and close it instead of looping through the queryJob array. Maybe we should do both (close the channel to remove active requests and also remove the ones still queued), though I think just the first is enough? Let me know if this is an even hackier way; I still have to think about a new test for this.

Don't worry about it, this is fine enough for a change this small. I would like to see neutrino undergo some refactoring at some point but I don't think we should turn your 1-commit fix into that project.

@ziggie1984 (Contributor, Author):

Sounds good. I am writing a test for the new approach, and then we can ship it until we get the refactor into the codebase.

@ziggie1984 (Contributor, Author) commented Apr 3, 2024

@ProofOfKeags actually the former version had a bug in it: it would crash in some circumstances, which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers, so the prior version would crash.

Good reminder to never ack something without tests... 😅

@ProofOfKeags:

> Good reminder to never ack something without tests...

Good catch. I'll admit that the way we do testing makes it really hard for me to tell whether the tests are actually good or not. I'm actively trying to improve our library infrastructure. The better we factorize things, the smaller the tests will be and the easier they will be to evaluate too. It requires a lot of discipline and time, though.

@Chinwendu20 (Contributor):

> @ProofOfKeags actually the former version had a bug in it: it would crash in some circumstances, which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers, so the prior version would crash.

Interesting, I thought that if they had already been registered with a worker, they would have been deleted from the heap and so wouldn't be present in the purge you did earlier.

heap.Pop(work)

> Good reminder to never ack something without tests... 😅

A learning point for me as well.

This approach works too, it's just that the job whose batch has already been cancelled has to wait to be scheduled before it gets the cancel signal.

@ziggie1984 (Contributor, Author) commented Apr 4, 2024

> Interesting, I thought that if they had already been registered with a worker, they would have been deleted from the heap and so wouldn't be present in the purge you did earlier.

Yes, so the main problem is that as soon as a job is Pop()-ed from the heap it is not immediately removed from the currentQueries map. So when the batch times out while registered queries are still ongoing with the workers, I would try to delete queries that were already Pop()-ed from the heap. I think the way it is now is better because it uses channels in a clean way. But as said at the beginning, and also underlined by @ProofOfKeags, this code part needs a refactoring.
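
A compact sketch of the lifecycle described here (the state is heavily simplified and the names are illustrative): once a job has been handed to a worker it is gone from the queue but still tracked in currentQueries, so a purge that only walks the queued jobs cannot reach it.

package main

import "fmt"

func main() {
	// Illustrative stand-ins for the work manager's state: the queue of
	// pending jobs (here just an ordered slice of job indices) and the
	// currentQueries map from job index to batch number.
	queued := []uint64{0, 1}
	currentQueries := map[uint64]uint64{0: 1, 1: 1}

	// Scheduling hands job 0 to a worker: it leaves the queue, but its
	// currentQueries entry stays until the worker reports a result.
	inFlight := queued[0]
	queued = queued[1:]
	fmt.Printf("job %d is with a worker\n", inFlight)

	// A batch-timeout purge that only walks the queue cannot reach
	// job 0, and touching jobs already handed to workers is what
	// crashed the earlier version. Cancelling via a channel that the
	// workers listen on sidesteps both problems.
	fmt.Printf("queue holds %d job(s); currentQueries still holds %d\n",
		len(queued), len(currentQueries))
}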

@lightninglabs-deploy:

@ellemouton: review reminder
@ziggie1984, remember to re-request review from reviewers when ready

@ellemouton (Member) left a comment

Great find!

I think this is an incorrect use of the cancelChannel API, though. A cancel channel given by the caller is a way for the caller to signal to us (the system) that we should cancel early. I.e., we, the system, should only ever listen on this channel. Otherwise we risk running into a "panic: close of closed channel" error.

I think we should instead have an internal-only way of canceling queries that is separate from the way that a caller cancels queries.

[Review threads on query/workmanager.go and query/workmanager_test.go: resolved]
@ziggie1984 force-pushed the sync-pruned-node branch 2 times, most recently from 9cc29eb to 853b4d1 on April 23, 2024 19:20
@ziggie1984 requested a review from ellemouton on April 23, 2024 19:20
@ProofOfKeags:

> I think we should instead have an internal-only way of canceling queries that is separate from the way that a caller cancels queries.

Sounds like I should revive my work on the Async API. 😅

@ziggie1984 (Contributor, Author):

Nice input @ellemouton 🙏, I introduced a new internalCancelChannel for the query to distinguish both cases.
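
A minimal sketch of that separation (only the name internalCancelChannel is taken from this comment; the struct shape and everything else is assumed): the worker listens on both the caller's cancel channel and the internal one, and only the work manager ever closes the internal channel, so the caller's channel is never closed by the system and the "close of closed channel" panic cannot occur.

package main

import (
	"fmt"
	"time"
)

// queryJob here is a stand-in; only the idea of carrying two separate
// cancel channels is taken from the discussion above, the field names
// and overall shape are assumed.
type queryJob struct {
	cancelChan            <-chan struct{} // provided by the caller, listen only
	internalCancelChannel <-chan struct{} // closed by the work manager itself
}

// runJob shows a worker observing both channels: it only ever listens,
// it never closes either one, so a caller closing cancelChan and the
// work manager closing internalCancelChannel can never collide.
func runJob(job queryJob, done chan<- string) {
	select {
	case <-job.cancelChan:
		done <- "cancelled by caller"
	case <-job.internalCancelChannel:
		done <- "cancelled internally (batch timeout)"
	case <-time.After(10 * time.Second):
		done <- "finished"
	}
}

func main() {
	callerCancel := make(chan struct{})
	internalCancel := make(chan struct{})
	done := make(chan string, 1)

	go runJob(queryJob{
		cancelChan:            callerCancel,
		internalCancelChannel: internalCancel,
	}, done)

	// The work manager times the batch out and closes only its own
	// internal channel; the caller's channel is left untouched.
	close(internalCancel)
	fmt.Println(<-done)
}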

@ellemouton (Member) left a comment

Thanks for the quick update! Looks good! One little thing left over from the previous version, I think! After that, LGTM!

[Review thread on query/interface.go: resolved]
In case the backend is very unstable and times out the batch, we need to make sure ongoing queryJobs are dropped and already registered queryJobs are removed from the heap as well.
@ellemouton (Member) left a comment

LGTM 🚀

@ellemouton (Member):

@ziggie1984 - before we merge this, can you open an LND PR that points to this so we can make sure the CI passes?

@ziggie1984 (Contributor, Author):

@guggero I think we can merge this now, itests passed on the LND PR (lightningnetwork/lnd#8621)

@guggero added this pull request to the merge queue on Apr 25, 2024
Merged via the queue into lightninglabs:master with commit 602843d Apr 25, 2024
4 checks passed

Successfully merging this pull request may close these issues.

[bug]: LND 0.17.4-beta still loses sync to chain after long times of inactivity as it seems