BdbFrontier thread safety #212

Closed · anjackson opened this issue Sep 13, 2018 · 9 comments · Fixed by #213

Comments

@anjackson (Collaborator)

We're attempting to use Heritrix3 with an external module that populates the BdbFrontier via Kafka, and we're hitting problems interacting with the frontier safely. There are some more details in ukwa/ukwa-heritrix#16, but to summarise, ToeThreads are dying because keepItem is null when it should not be.

I believe this is because peekItem is marked as transient. Occasionally, between setting peekItem (this statement) and using it (this one), the WorkQueue gets updated by a separate thread in a way that forces it to be written out to disk and then read back in again. As peekItem is transient, flushing the queue out to disk and back drops the value, and we're left with a null.
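
To illustrate the mechanism I suspect, here is a minimal standalone Java sketch (not Heritrix code; the class and field here are made up for illustration) showing how a transient field survives in memory but comes back null after a serialize/deserialize round trip, analogous to the queue being flushed to disk and reloaded:

import java.io.*;

// Standalone illustration only: a hypothetical class with a transient field,
// mimicking the way peekItem would be dropped when the object is written
// out and read back in.
class QueueLike implements Serializable {
    private static final long serialVersionUID = 1L;
    String name = "example-queue";
    transient Object peekItem = new Object(); // skipped during serialization

    QueueLike roundTrip() throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (QueueLike) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        QueueLike q = new QueueLike();
        System.out.println("before round trip: " + q.peekItem); // non-null
        System.out.println("after round trip:  " + q.roundTrip().peekItem); // null
    }
}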

NetArchive Suite have also seen this issue when using a RabbitMQ-based URL receiver, and patched it by ignoring the null.

The simplest way to avoid this would be to remove the transient modifier from peekItem, but that makes me nervous, because someone deliberately chose to make it transient and I don't understand why.

Secondly, I don't understand why we are seeing this when IA also use similar methods and are (presumably?) not seeing it. Moreover, this model does not appear to be fundamentally different to the traditional ActionDirectory approach, so I don't understand why this wasn't seen a long time ago.

Finally, this issue also made it clear that I don't actually understand how best to interact with the BdbFrontier in a thread-safe manner. If I am right in assuming that every modification to a WorkQueue needs to be followed by a .makeDirty() that serialises the queue out to disk and reads it back in again, then surely every modification needs to be an edit-then-write within a synchronized(WorkQueue) block? But it's pretty easy to find examples where this appears to be deliberately not the case:

synchronized(wq) {
    int originalPrecedence = wq.getPrecedence();
    wq.enqueue(this, curi);
    // always take budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    if(!wq.isRetired()) {
        incrementQueuedUriCount();
        int currentPrecedence = wq.getPrecedence();
        if(!wq.isManaged() || currentPrecedence < originalPrecedence) {
            // queue newly filled or bumped up in precedence; ensure enqueuing
            // at precedence level (perhaps duplicate; if so that's handled elsewhere)
            deactivateQueue(wq);
        }
    }
}
// Update recovery log.
doJournalAdded(curi);
wq.makeDirty();
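
For contrast, here is a minimal sketch (hypothetical, not existing Heritrix code) of the pattern I would have expected to be necessary, where the flush happens while the queue's monitor is still held:

// Hypothetical pattern: mutate the queue and flush it under the same lock,
// so no other thread can interleave between the update and the
// makeDirty() serialization.
synchronized(wq) {
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    wq.makeDirty(); // still holding wq's monitor
}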

I'd appreciate any information anyone has on how best to inject URLs into Heritrix3, and on whether or not I've understood how the BdbFrontier works.

@anjackson (Collaborator Author) commented Sep 13, 2018

To explore this issue, I am running some tests with a version of Heritrix where all modifications to WorkQueue instances are wrapped in synchronized blocks.

UPDATE: See these changes

@anjackson (Collaborator Author)

It took a while to get the build sorted out, but I'm now running with the synchronized(wq) version. It's only been running for a short time, but no ToeThreads have died yet, whereas before this was happening every ~15 minutes.

@anjackson (Collaborator Author)

So, this modified version looks good. No noticeable slowdown/contention and no concurrency errors or NPEs due to missing peekItem instances.

@anjackson (Collaborator Author)

I've written up my understanding of how BdbFrontier actually works here: https://github.com/internetarchive/heritrix3/wiki/Heritrix-BdbFrontier#implementation-details

Feedback welcome!

@anjackson (Collaborator Author)

Related issue here: https://webarchive.jira.com/browse/HER-507

@anjackson (Collaborator Author) commented Sep 17, 2018

It's not 100% clear why we see this so often while others have not, but I suspect the reasons are:

  • Our domain crawl was running shallow, meaning queues were rapidly flipping between active and empty.
  • Our new continuous crawl works by scheduling URLs and resetting crawled-data tallies from lots of threads.

Both of which mean more calls to wq.makeDirty are happening closer together.

@anjackson (Collaborator Author)

BTW, following this conversation with @chronodm on Twitter: while I agree that ideally the code should use explicit locks rather than the current mix of instance synchronized blocks and synchronized class methods, I'm not confident enough to leap into that kind of rewrite at the moment.
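
For the record, here is a rough sketch of what I mean by explicit locks (purely illustrative; the class and method names are hypothetical, not Heritrix's):

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: each queue owns an explicit lock, and callers hold it
// across both the mutation and the flush, instead of relying on a mix of
// synchronized(wq) blocks and synchronized methods.
class LockedQueue {
    private final ReentrantLock lock = new ReentrantLock();

    void updateAndFlush(Runnable mutation, Runnable flush) {
        lock.lock();
        try {
            mutation.run(); // e.g. adjust budgets or enqueue a URI
            flush.run();    // e.g. a makeDirty()-style write-out
        } finally {
            lock.unlock();
        }
    }
}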

@anjackson (Collaborator Author)

With thanks to all who helped out, it seems the main reason IA have not seen this is likely that they usually use a different Frontier implementation: PullingBdbFrontier. This has different threading behaviour and as such is unlikely to hit the kind of problems we've seen.

In case anyone is interested, IA's production H3 fork appears to be at kngenie/heritrix3/tree/hq.
