BdbFrontier thread safety #212

Closed · anjackson opened this issue Sep 13, 2018 · 9 comments · Fixed by #213

Comments

@anjackson (Collaborator)

We're attempting to use Heritrix3 with an external module that populates the BdbFrontier via Kafka, and we're hitting problems interacting with the frontier safely. There are some more details in ukwa/ukwa-heritrix#16, but to summarise, ToeThreads are dying because keepItem is null when it should not be.

I believe this is because peekItem is marked as transient. Occasionally, between setting peekItem (this statement) and using it (this one), the WorkQueue gets updated by a separate thread in a way that forces it to be written out to disk and then read back in again. As peekItem is transient, flushing the queue out to disk and back drops the value, and we're left with a null.
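
To illustrate the mechanism I suspect, here is a minimal standalone Java sketch (not Heritrix code; the class and field here are made up for illustration) showing how a transient field survives in memory but comes back null after a serialize/deserialize round trip, analogous to the queue being flushed to disk and reloaded:

import java.io.*;

// Standalone illustration only: a hypothetical class with a transient field,
// mimicking the way peekItem would be dropped when the object is written
// out and read back in.
class QueueLike implements Serializable {
    private static final long serialVersionUID = 1L;
    String name = "example-queue";
    transient Object peekItem = new Object(); // skipped during serialization

    QueueLike roundTrip() throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (QueueLike) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        QueueLike q = new QueueLike();
        System.out.println("before round trip: " + q.peekItem); // non-null
        System.out.println("after round trip:  " + q.roundTrip().peekItem); // null
    }
}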

NetArchive Suite have also seen this issue when using a RabbitMQ-based URL receiver, and patched it by ignoring the null.

The simplest way to avoid this would be to remove the transient modifier from peekItem, but that makes me nervous, because someone deliberately chose to make it transient and I don't understand why.

Secondly, I don't understand why we are seeing this when IA also use similar methods and are (presumably?) not seeing it. Moreover, this model does not appear to be fundamentally different to the traditional ActionDirectory approach, so I don't understand why this wasn't seen a long time ago.

Finally, this issue also made it clear that I don't actually understand how best to interact with the BdbFrontier in a thread-safe manner. If I am right in assuming that every modification to a WorkQueue needs to be followed by a .makeDirty() that serialises the queue out to disk and reads it back in again, then surely every modification needs to be an edit-then-write within a synchronized(WorkQueue) block? But it's pretty easy to find examples where this appears to be deliberately not the case:

synchronized(wq) {
    int originalPrecedence = wq.getPrecedence();
    wq.enqueue(this, curi);
    // always take budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    if(!wq.isRetired()) {
        incrementQueuedUriCount();
        int currentPrecedence = wq.getPrecedence();
        if(!wq.isManaged() || currentPrecedence < originalPrecedence) {
            // queue newly filled or bumped up in precedence; ensure enqueuing
            // at precedence level (perhaps duplicate; if so that's handled elsewhere)
            deactivateQueue(wq);
        }
    }
}
// Update recovery log.
doJournalAdded(curi);
wq.makeDirty();
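
For contrast, here is a minimal sketch (hypothetical, not existing Heritrix code) of the pattern I would have expected to be necessary, where the flush happens while the queue's monitor is still held:

// Hypothetical pattern: mutate the queue and flush it under the same lock,
// so no other thread can interleave between the update and the
// makeDirty() serialization.
synchronized(wq) {
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    wq.makeDirty(); // still holding wq's monitor
}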

I'd appreciate any information anyone has on how best to inject URLs into Heritrix3, and on whether or not I've understood how the BdbFrontier works.

@anjackson (Collaborator Author) commented Sep 13, 2018

To explore this issue, I am running some tests with a version of Heritrix where all modifications to WorkQueue instances are wrapped in synchronized blocks.

UPDATE: See these changes

@anjackson (Collaborator Author)

It took a while to get the build sorted out, but I'm now running with the synchronized(wq) version. It's only been running for a short time, but no ToeThreads have died yet, whereas before this was happening every ~15 minutes.

@anjackson (Collaborator Author)

So, this modified version looks good. No noticeable slowdown/contention and no concurrency errors or NPEs due to missing peekItem instances.

@anjackson (Collaborator Author)

I've written up my understanding of how BdbFrontier actually works here: https://github.com/internetarchive/heritrix3/wiki/Heritrix-BdbFrontier#implementation-details

Feedback welcome!

@anjackson (Collaborator Author)

Related issue here: https://webarchive.jira.com/browse/HER-507

@anjackson (Collaborator Author) commented Sep 17, 2018

It's not 100% clear why we see this so often while others have not, but I suspect the reasons are:

  • Our domain crawl was running shallow, meaning queues were rapidly flipping between active and empty.
  • Our new continuous crawl works by scheduling URLs and resetting crawled-data tallies from lots of threads.

Both of which mean more calls to wq.makeDirty are happening closer together.

@anjackson (Collaborator Author)

BTW, following this conversation with @chronodm on Twitter: while I agree that ideally the code should use explicit locks rather than the current mix of instance synchronized blocks and synchronized class methods, I'm not confident enough to leap into that kind of rewrite at the moment.
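
For the record, here is a rough sketch of what I mean by explicit locks (purely illustrative; the class and method names are hypothetical, not Heritrix's):

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: each queue owns an explicit lock, and callers hold it
// across both the mutation and the flush, instead of relying on a mix of
// synchronized(wq) blocks and synchronized methods.
class LockedQueue {
    private final ReentrantLock lock = new ReentrantLock();

    void updateAndFlush(Runnable mutation, Runnable flush) {
        lock.lock();
        try {
            mutation.run(); // e.g. adjust budgets or enqueue a URI
            flush.run();    // e.g. a makeDirty()-style write-out
        } finally {
            lock.unlock();
        }
    }
}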

@anjackson (Collaborator Author)

With thanks to all who helped out, it seems the main reason IA have not seen this is likely that they usually use a different Frontier implementation: PullingBdbFrontier. This has different threading behaviour and as such is unlikely to hit the kind of problems we've seen.

In case anyone is interested, IA's production H3 fork appears to be at kngenie/heritrix3/tree/hq.
