NUTCH-2959 -- upgrade Tika to 2.9.0 #776
Conversation
I'll merge this in a day or so unless anyone has objections. |
Converting to draft to manually check for version conflicts. |
I bumped some of the more common dependencies to match Tika 2.9.0. Let me know what you think. |
@tballison, any idea how to circumvent this error? Hadoop 3.4.0 will upgrade to commons-io 2.11.0 (HADOOP-18301). |
K. Thank you! Um... wrap() was introduced in commons-io 2.9.0 according to the commons-io docs, so 2.11.0 should work. Can we exclude commons-io from the Hadoop dependencies and then add it as a dependency in the main ivy.xml? I'll make that change on this PR and then see if I can get pseudo-distributed mode working. |
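A quick way to sanity-check what the Ant/Ivy build actually resolves after such a change (a hedged sketch, assuming the standard Nutch build layout with runtime/local/lib and the runtime/deploy job jar; paths may differ):

# rebuild, then see which commons-io lands in the local runtime libs
ant clean runtime
ls runtime/local/lib | grep -i commons-io
# and which commons-io is packed into the job jar used for (pseudo-)distributed runs
unzip -l runtime/deploy/apache-nutch-*.job | grep -i commons-io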
I haven't worked with ant in a while. According to … I wanted to document this in case … |
I'm guessing that commit won't work if distributed Hadoop is bringing its own jars (as you said!). Does Hadoop do any custom classloading so that the job jars are isolated from the runtime jars? Or, put more simply: should the above commit notionally work, or is that a non-starter? |
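For reference, and untested here: MapReduce does expose job-level classloading knobs (mapreduce.job.classloader and mapreduce.job.user.classpath.first) that are intended to isolate or prioritize the classes shipped in the job jar. A rough sketch of trying them on a Nutch job; the job jar name, job class, and segment path are placeholders for however you normally submit jobs:

# placeholder jar/class/segment names; the -D flags are the point
hadoop jar runtime/deploy/apache-nutch-1.x.job org.apache.nutch.parse.ParseSegment \
  -D mapreduce.job.user.classpath.first=true \
  -D mapreduce.job.classloader=true \
  crawl/segments/20230928000000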
I'm getting a ConnectException when I try to run the tika parser tests via nutch-test-single-node-cluster. On hadoop startup, I see:
I can navigate to http://localhost:8088/cluster/nodes and I see that there are zero nodes. I cannot navigate to http://localhost:9870/. When I try to run the test_tika_parser.sh, I see:
This is Ubuntu with Java 11. Any ideas what I'm doing wrong? |
When running in distributed or pseudo-distributed mode, commons-io 2.8.0 is first in the classpath, independent of which commons-io version is contained in the Nutch job jar. Hadoop methods used to write data or to communicate may rely on that specific commons-io version, and changing the Hadoop classpath is challenging, since even more may break. Btw., I've just rediscovered that using Tika in (pseudo-)distributed mode has been broken since the upgrade to Tika 2.3.0, see NUTCH-2937. It didn't seem to affect the MIME detector, though.
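One way to confirm which commons-io the cluster actually puts first (a small sketch, assuming the hadoop CLI is on the PATH):

# print the Hadoop runtime classpath one entry per line, wildcards expanded,
# and look for commons-io; the first match wins at runtime
hadoop classpath --glob | tr ':' '\n' | grep -i commons-io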
Need to set up passphraseless ssh. |
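For anyone else hitting the same startup symptoms, these are the usual single-node setup steps, roughly as in the Hadoop pseudo-distributed docs (a sketch; key type and paths are the common defaults):

# passphraseless ssh to localhost, needed by the start-dfs/start-yarn scripts
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost true   # should succeed without a password prompt
# after restarting Hadoop, verify the daemons are actually up
jps   # expect NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode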
Oh, dear...
Thank you! |
Would need to wait for Hadoop 3.4.0, then we get commons-io 2.11.0 into the classpath. See also the discussion in apache/hadoop#4455. |
Should we merge on the theory that we're already broken and we aren't breaking things any worse? Or do we just punt until 3.4.0 is available? |
Let me check whether the MIME detection works with Tika 2.3.0. I'll run the same test script with master which includes Tika 2.3.0... |
…come out with Hadoop 3.4.0.
In the latest commit, I downgraded commons-io to the version we're hoping for in Hadoop 3.4.0 (2.11.0). |
I checked https://issues.apache.org/jira/browse/TIKA-3655, and it looks like we'd have to downgrade to Tika 2.2.1 (December 2021, released during the log4shell fun) to get a working distributed version. |
Ok, here are the results:
This seems the better option. Even then (and only if successfully tested with 3.4.0), we should warn users that they need Hadoop 3.4.0 to run Nutch, and especially parse-tika. |
😭 Y, let's hold off until Hadoop 3.4.0 is released. Tika 3.x may be out by then. 🤣 ! Thank you, again! |
I suggest that we downgrade to Tika 2.2.1 to fix that regression. |
Converting this to draft until Hadoop 3.4.0 is released. |
Good point, @lewismc. I've opened NUTCH-3006 for that. |
Try the full path.
Also make sure the directory exists on HDFS.
On Thu, Sep 28, 2023 at 14:03, Tim Allison wrote:
Paging the nutch-test-single-node-cluster helpdesk.... what do I use for
the tika seeds file? Are you using our github repo, or the
tika-parsers-common package specifically
<https://github.com/apache/tika/tree/main/tika-parsers/tika-parsers-standard>?
Or something else? Sorry!
+ hadoop fs -copyFromLocal seeds_tika.txt crawl/seeds_tika
copyFromLocal: `seeds_tika.txt': No such file or directory
|
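Along the lines of the suggestion above, a sketch of staging the seed file with explicit paths (file and directory names taken from the quoted log; adjust as needed):

# create the target directory on HDFS first, then copy the seed list
# from the local working directory using its full path
hadoop fs -mkdir -p crawl
hadoop fs -copyFromLocal "$PWD/seeds_tika.txt" crawl/seeds_tika
hadoop fs -ls crawl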
see the comments in test_tika_parser.sh |
With the update to Tika 2.9.1-SNAPSHOT, I get 85 failed parses; most of them are either encrypted documents or "can't retrieve Tika Parser for x". There is still one NoSuchMethodError, also from commons-io, but this time involving UnsynchronizedByteArrayInputStream$Builder: at org.apache.tika.parser.pdf.PDFEncodedStringDecoder.decode(PDFEncodedStringDecoder.java:85) |
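A quick, hedged check of whether a given commons-io jar even ships that inner class (the jar path below is a placeholder for whichever commons-io Hadoop puts on the classpath); the Builder API only exists in newer commons-io releases, so code compiled against them fails at runtime on older jars:

# list the UnsynchronizedByteArrayInputStream classes in the jar;
# if no $Builder entry shows up, the jar predates the builder API
# that the Tika PDF parser is calling above
unzip -l commons-io-2.8.0.jar | grep 'UnsynchronizedByteArrayInputStream'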
Sorry! Yep, saw that too late. |
I reverted back to 2.2.1, and that's not far enough back -- there were 222 parse failures, many with the wrap problem. I reverted back to 2.0.0, and then had 85 parse failures again. This time, no wrap problems, just the "can't retrieve parser" and encrypted-doc exceptions. |
There is just no winning... We just upgraded POI to 5.2.4, and it uses a bunch of the newer commons-io methods. If we downgrade POI to 5.2.3, we get a clean build of Tika with commons-io 2.8.0. Can we do anything with shading? I don't think this will work... So it looks like we either downgrade Tika to 2.0.0 in Nutch, or we make a release of Tika 2.9.1 with POI 5.2.3 and then give up until Hadoop upgrades. |
Even worse, POI 5.2.4 uses features of commons-io > 2.11.0. This means that POI won't work in Hadoop 3.4.0 either. |
Stepping away from the keyboard. 😭 |
Alright, the only thing that I think might work is Tika shading commons-io in tika-app, and then Nutch using tika-app instead of the individual parser modules etc. for parse-tika. WDYT? |
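If the shading route is taken, a rough way to verify the relocation afterwards (the jar name and the relocated package prefix are placeholders; the real values depend on how the shade configuration is set up):

# unrelocated commons-io classes should be gone from the shaded jar ...
unzip -l tika-app-shaded.jar | grep 'org/apache/commons/io/'
# ... and should reappear under whatever relocated package the shade config chose
unzip -l tika-app-shaded.jar | grep 'shaded/commons/io/' | head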
And what about the dependency on tika-core in Nutch core, which is used for MIME detection? |
That should be ok if we remove CloseShieldInputStream.wrap() from tika-core (and may as well modify all of Tika). |
Local build of the shim works with 2.9.1-SNAPSHOT, which includes the latest version of POI, which will conflict with commons-io in Hadoop 3.4.0 if we don't use this shim (or if Hadoop doesn't upgrade commons-io further). I'll downgrade Tika to 2.9.0 and push a shim release to Maven Central. |
I published the shim artifacts for https://github.com/tballison/hadoop-safe-tika. It looks like they haven't made it into the main Maven repositories yet. :( Once they do, I think this PR and the hadoop-safe-tika shim will be ready for review. This is an ugly work-around, but I did get a clean build locally, and I ran @sebastian-nagel's test script. One area for improvement might be to exclude dependencies from the shim that Nutch already pulls in, like guava, for example. The release process for the shim is quite easy. Happy to do that if necessary. |
Hi @tballison, I've tested the shim artifact in local and pseudo-distributed mode: everything looks good.
+1 Note: I got marginally different results:
... but that could be because I have an outdated set of test documents. I'm using the default http.content.limit = 1 MiB. |
Thank you so much @sebastian-nagel ! |
Thanks for your contribution to Apache Nutch! Your help is appreciated!
Before opening the pull request, please verify that
- the corresponding issue ID (NUTCH-XXXX) is referenced in the pull request title ([NUTCH-XXXX] Issue or pull request title)
- the build and unit tests pass (ant clean runtime test)
- LICENSE-binary and NOTICE-binary are updated accordingly
We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Nutch in general, please sign up for the Nutch mailing list. Thanks!