query: fix retry query case. #297
Conversation
Maybe we should also increase the timeouts/retries while we are here, so that we can mitigate cases like this: lightningnetwork/lnd#8497.
- 2 sec timeout: Line 14 in 43f5a58
- 4 sec timeout after the first failed retry: Lines 375 to 379 in 43f5a58
- Overall batch timeout of 30 sec: Lines 10 to 12 in 43f5a58
- 2 retries: Lines 18 to 20 in 43f5a58
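As a rough illustration of how those numbers could interact (this is not neutrino's actual code; the constant names and the doubling/cap values below are hypothetical stand-ins for the defaults referenced above), a per-query timeout that grows after each failed attempt inside an overall batch deadline might look like this:

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialQueryTimeout = 2 * time.Second  // first attempt
	maxQueryTimeout     = 32 * time.Second // cap on the backoff
	defaultBatchTimeout = 30 * time.Second // the whole batch must finish by then
	defaultNumRetries   = 2                // attempts after the first one
)

func main() {
	timeout := initialQueryTimeout
	for attempt := 0; attempt <= defaultNumRetries; attempt++ {
		fmt.Printf("attempt %d: per-query timeout %v\n", attempt, timeout)

		// Double the timeout for the next attempt, but never exceed
		// the configured maximum.
		timeout *= 2
		if timeout > maxQueryTimeout {
			timeout = maxQueryTimeout
		}
	}
	fmt.Printf("overall batch timeout: %v\n", defaultBatchTimeout)
}
```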
Well, thanks a lot for handling this edge case. The fix looks good. I think we should have checked earlier whether batch == nil here:
Line 312 in 43f5a58
batch := currentBatches[batchNum]
instead of duplicating that check over and over, but maybe that can be included in the rework that you suggested.
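A minimal sketch of that suggestion, using an illustrative `batchProgress` type and `updateBatch` helper rather than the real workmanager types: look the batch up once, return early when it is nil, and let every later branch use it without repeating the check.

```go
package query

// batchProgress is a hypothetical stand-in for the per-batch bookkeeping
// kept by the work manager.
type batchProgress struct {
	rem     int  // queries still outstanding in this batch
	errored bool // whether the batch has already failed
}

// updateBatch checks for a nil batch exactly once, up front, instead of in
// every branch that touches the batch.
func updateBatch(currentBatches map[uint64]*batchProgress, batchNum uint64,
	failed bool) {

	// The batch may already have been canceled or completed; in that
	// case there is nothing left to update.
	batch := currentBatches[batchNum]
	if batch == nil {
		return
	}

	// Every path below can now use batch without re-checking for nil.
	if failed {
		batch.errored = true
		return
	}

	batch.rem--
	if batch.rem == 0 {
		delete(currentBatches, batchNum)
	}
}
```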
I have a couple of questions. Other than that looks good.
Force-pushed from f2b8212 to 535aa62.
Ok, I came up with a slightly different approach: we now use a cancel channel and close it, instead of looping through the queryJob array. Maybe we should do both (close the channel to remove active requests and also remove all the ones still queued), though I think just the first one is enough? Let me know if this is an even more hacky way; I still have to think about a new test for this.
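A generic sketch of the close-to-cancel pattern described here (hypothetical worker code, not the neutrino implementation): every in-flight job selects on the same cancel channel, so a single close() wakes all of them at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	cancelChan := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			select {
			case <-time.After(10 * time.Second):
				fmt.Printf("worker %d: finished query\n", id)
			case <-cancelChan:
				// Closing cancelChan unblocks every worker
				// waiting here, so one close cancels all
				// outstanding jobs without looping over them.
				fmt.Printf("worker %d: canceled\n", id)
			}
		}(i)
	}

	// Batch timed out: cancel everything still in flight.
	close(cancelChan)
	wg.Wait()
}
```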
Force-pushed from 535aa62 to 1ef869f.
Don't worry about it, this is fine enough for a change this small. I would like to see neutrino undergo some refactoring at some point, but I don't think we should turn your 1-commit fix into that project.
Sounds good, I am writing a test for the new approach and then we can ship it until we get the new refactor into the codebase.
@ProofOfKeags actually the former version had a bug in it; it would crash in some circumstances, which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers, so the prior version would crash. Good reminder to never ack something without tests... 😅
Good catch. I'll admit that the way we do testing makes it really hard for me to tell whether the tests are actually good or not. I'm actively trying to improve our library infrastructure. The better we factorize things, the smaller the tests will be and the easier they will be to evaluate too. It requires a lot of discipline and time though.
Force-pushed from 05d2140 to 2e5e321.
Interesting, I thought that if they had already been registered with a worker they would have been deleted from the heap, and so wouldn't be present in the purge you did earlier. Line 244 in 1ef869f
A learning point for me as well. This approach works too, it's just that we would have to wait for the job whose batch was already cancelled to be scheduled before it gets a cancel signal.
Yes so the main problem is that as soon as a job is…
@ellemouton: review reminder
Great find!
I think this is an incorrect use of the cancelChannel API though. A cancel channel given by the caller is a way for the caller to signal to us (the system) that we should cancel early, i.e. we, the system, should only ever listen on this channel. Otherwise we risk running into a "panic: close of closed channel" error.
I think we should instead have an internal-only way of canceling queries that is separate from the way that a caller cancels queries.
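A hedged sketch of that separation, with hypothetical type and field names: the caller-provided channel is only ever received from, while the work manager closes a channel it owns exclusively, so a batch timeout can never trip "close of closed channel".

```go
package main

import "fmt"

type batch struct {
	// cancelChan is supplied by the caller; the system only listens on
	// it and never closes it.
	cancelChan <-chan struct{}

	// internalCancelChan is owned by the work manager, so it is the
	// only channel the system is allowed to close, e.g. on a timeout.
	internalCancelChan chan struct{}
}

// timeout cancels the batch internally. Safe because we are the sole owner
// of internalCancelChan.
func (b *batch) timeout() {
	close(b.internalCancelChan)
}

// canceled reports whether either the caller or the system canceled.
func (b *batch) canceled() bool {
	select {
	case <-b.cancelChan: // caller asked us to stop
		return true
	case <-b.internalCancelChan: // we stopped ourselves (e.g. timeout)
		return true
	default:
		return false
	}
}

func main() {
	caller := make(chan struct{})
	b := &batch{
		cancelChan:         caller,
		internalCancelChan: make(chan struct{}),
	}

	fmt.Println(b.canceled()) // false
	b.timeout()
	fmt.Println(b.canceled()) // true
}
```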
Force-pushed from 9cc29eb to 853b4d1.
Sounds like I should revive my work on the Async API. 😅
Nice input @ellemouton 🙏, introduced a new…
Thanks for the quick update! Looks good! One little thing left over from the previous version, I think. After that, LGTM!
In case the backend is very unstable and times out the batch, we need to make sure ongoing queryJobs are dropped and already registered queryJobs are removed from the heap as well.
Force-pushed from 853b4d1 to a2d891c.
LGTM 🚀
@ziggie1984 - before we merge this, can you open an LND PR that points to this so we can make sure the CI passes?
@guggero I think we can merge this now, itests passed on the LND PR (lightningnetwork/lnd#8621).
Fixes lightningnetwork/lnd#8593
In general I will rework this code area in LND 19 because this needs an overhaul and relates to btcsuite/btcwallet#904
In the current code there is an edge case that manifests when the backend in use has very unstable connections and not many outbound peers, which results in queryJobs waiting a long time in the queue. Then we would time out the batch here:
neutrino/query/workmanager.go
Lines 420 to 433 in 43f5a58
but the queryJob was already registered for a second try here:
neutrino/query/workmanager.go
Lines 388 to 389 in 43f5a58
Now this query will be retried over and over and never purged from the queue, and because of how the heap ordering works, it will always retry the query with the lowest index number. So this causes us to never try new queries but always retry the old ones.
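A simplified sketch of that starvation behaviour, using a plain container/heap min-heap keyed by job index rather than the actual neutrino work queue: a timed-out job that is pushed back with its original, low index always sorts to the front and is dispatched before any newer job ever runs.

```go
package main

import (
	"container/heap"
	"fmt"
)

type job struct{ index uint64 }

// jobQueue is a min-heap of jobs ordered by index, so the lowest index is
// always popped first.
type jobQueue []*job

func (q jobQueue) Len() int            { return len(q) }
func (q jobQueue) Less(i, j int) bool  { return q[i].index < q[j].index }
func (q jobQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *jobQueue) Push(x interface{}) { *q = append(*q, x.(*job)) }
func (q *jobQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &jobQueue{{index: 1}, {index: 2}, {index: 3}}
	heap.Init(q)

	for round := 0; round < 3; round++ {
		next := heap.Pop(q).(*job)
		fmt.Printf("dispatching job %d\n", next.index)

		// The job times out and is re-queued with the same index,
		// so it keeps winning over the newer jobs behind it.
		heap.Push(q, next)
	}
	// Job 1 is dispatched every round; jobs 2 and 3 starve.
}
```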