BdbFrontier thread safety #212
Comments
Ah, via |
To explore this issue, I am running some tests with a version of Heritrix where all modifications to UPDATE: See these changes |
Took a while to get the build sorted out, but now running with the |
So, this modified version looks good. No noticeable slowdown/contention and no concurrency errors or NPEs due to missing |
I've written up my understanding of how BdbFrontier actually works here: https://github.com/internetarchive/heritrix3/wiki/Heritrix-BdbFrontier#implementation-details Feedback welcome! |
Related issue here: https://webarchive.jira.com/browse/HER-507 |
It's not 100% clear why we see this so often while others have not, but I suspect the reasons are:
Both of which mean more calls to wq.makeDirty are happening closer together. |
BTW, following this conversation with @chronodm on Twitter, I think that while I agree that ideally the code should use explicit locks rather than the current mix of instance synchronized blocks and synchronized class methods, I'm not confident to leap into that kind of re-write at the moment. |
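For reference, a minimal sketch of what "explicit locks" could look like, as opposed to instance `synchronized` blocks. This is not Heritrix code: `ExplicitLockDemo`, `LockedQueue`, `add` and `flush` are hypothetical names, and the sketch assumes a single `ReentrantLock` guarding all of one queue's mutable state.

```java
import java.util.concurrent.locks.ReentrantLock;

class ExplicitLockDemo {
    /** Hypothetical queue state guarded by one explicit lock (not a Heritrix class). */
    static class LockedQueue {
        private final ReentrantLock lock = new ReentrantLock();
        private int count = 0;
        private boolean dirty = false;

        /** Modify the state and mark it dirty while holding the lock. */
        void add() {
            lock.lock();
            try {
                count++;
                dirty = true;
            } finally {
                lock.unlock(); // always release, even if the body throws
            }
        }

        /** Clear the dirty flag under the same lock; returns true if there was something to flush. */
        boolean flush() {
            lock.lock();
            try {
                boolean wasDirty = dirty;
                dirty = false;
                return wasDirty;
            } finally {
                lock.unlock();
            }
        }

        int count() {
            lock.lock();
            try {
                return count;
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) {
        LockedQueue q = new LockedQueue();
        q.add();
        System.out.println(q.flush()); // first flush sees the pending change
        System.out.println(q.flush()); // nothing left to flush
    }
}
```

The point of the design is that every read and write of the guarded fields goes through the same lock object, which is easier to audit than a mix of monitors on different objects.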
With thanks to all that helped out, it seems the main reason IA have not seen this is likely because they usually use a different Frontier implementation: PullingBdbFrontier. This has different threaded behaviour and as such is unlikely to see the kind of problems we've seen. In case anyone is interested, IA's production H3 fork appears to be here kngenie/heritrix3/tree/hq. |
We're attempting to use Heritrix3 with an external module that populates the BdbFrontier via Kafka, and we're hitting problems interacting with the frontier safely. There's some more details in ukwa/ukwa-heritrix#16, but to summarise, ToeThreads are dying because `keepItem` is `null` when it should not be.
I believe this is because `peekItem` is marked as `transient`. Occasionally, between setting `peekItem` (this statement) and using it (this one), the `WorkQueue` gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As `peekItem` is `transient`, flushing it out to the disk and back drops the value and we're left with a `null`.
NetArchive Suite have also seen this issue when using a RabbitMQ-based URL receiver, and patched it by ignoring the `null`.
The simplest way to avoid this would be to remove the `transient` modifier from `peekItem`, but that makes me worry because someone deliberately chose to make it `transient` and I don't understand why.
Secondly, I don't understand why we are seeing this, when IA also use similar methods and are (presumably?) not seeing this. Moreover, this model appears not to be fundamentally different to the traditional `ActionDirectory`, so I don't understand why this wasn't seen a long time ago.
Finally, this issue also made it clear that I don't actually understand how best to interact with the BdbFrontier in a thread-safe manner. If I am right in assuming that every modification to a `WorkQueue` needs to be followed by a `.makeDirty()` that serialises the queue out to disk and reads it back in again, then surely every modification needs to edit-then-write within a `synchronized(WorkQueue)` block? But it's pretty easy to find examples where this appears to be deliberately not the case:
heritrix3/engine/src/main/java/org/archive/crawler/frontier/WorkQueueFrontier.java
Lines 390 to 410 in 0581170
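
For comparison, here is a minimal sketch of the edit-then-write pattern asked about above. It is not the Heritrix code referenced. `SyncQueueDemo`, `DemoWorkQueue`, `safeEnqueue` and `runWriters` are hypothetical names, and `makeDirty()` here is only a counter standing in for the real write-out-and-reload step.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SyncQueueDemo {
    /** Hypothetical stand-in for a work queue; not the Heritrix WorkQueue. */
    static class DemoWorkQueue {
        private final Deque<String> items = new ArrayDeque<>();
        private int flushes = 0;

        void enqueue(String uri) { items.addLast(uri); }

        // Stand-in for the write-out-and-reload step; here it only counts calls.
        void makeDirty() { flushes++; }

        int size() { return items.size(); }
        int flushes() { return flushes; }
    }

    /** Edit and flush while holding the queue's monitor, so no other writer interleaves. */
    static void safeEnqueue(DemoWorkQueue wq, String uri) {
        synchronized (wq) {
            wq.enqueue(uri);
            wq.makeDirty();
        }
    }

    /** Run several writer threads against one queue and wait for them to finish. */
    static DemoWorkQueue runWriters(int threads, int perThread) {
        DemoWorkQueue wq = new DemoWorkQueue();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    safeEnqueue(wq, "http://example.org/" + id + "/" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return wq;
    }

    public static void main(String[] args) {
        DemoWorkQueue wq = runWriters(4, 1000);
        System.out.println(wq.size() + " items, " + wq.flushes() + " flushes");
    }
}
```

Because the unsynchronized `ArrayDeque` and the plain `int` counter are only ever touched inside `synchronized (wq)`, the item count and flush count stay in step. Whether the real frontier needs this on every path is exactly the open question.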
I'd appreciate any information anyone has on how best to inject URLs into Heritrix3, and on whether or not I've understood how the `BdbFrontier` works.
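
To illustrate the `transient` effect described earlier in this report, here is a small self-contained sketch using plain `java.io` serialization. Heritrix's own persistence layer may behave differently in detail. `TransientRoundTrip`, `DemoQueue` and `roundTrip` are hypothetical names, and the in-memory byte array only imitates a flush to disk and reload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientRoundTrip {
    /** Hypothetical stand-in for a work queue; not the Heritrix WorkQueue. */
    static class DemoQueue implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;               // written out with the object
        transient String peekItem; // skipped by default serialization

        DemoQueue(String name) { this.name = name; }
    }

    /** Serialize to bytes and read back, imitating a flush to disk and reload. */
    static DemoQueue roundTrip(DemoQueue q) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(q);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DemoQueue) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DemoQueue q = new DemoQueue("example.org");
        q.peekItem = "http://example.org/";
        DemoQueue reloaded = roundTrip(q);
        System.out.println(reloaded.name);     // the ordinary field survives
        System.out.println(reloaded.peekItem); // the transient field comes back as null
    }
}
```

If another thread triggers such a round trip between setting and reading the field, the reader sees `null` even though the value was set moments earlier.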