
NUTCH-2959 -- upgrade Tika to 2.9.0 #776

Merged: 8 commits into apache:master on Oct 20, 2023

Conversation

@tballison (Contributor)

Thanks for your contribution to Apache Nutch! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Nutch issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (NUTCH-XXXX)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([NUTCH-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Java source code follows Nutch Eclipse Code Formatting rules
  • Nutch is successfully built and unit tests pass by running ant clean runtime test
  • there are no conflicts when merging the pull request branch into the recent master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
  • if new dependencies are added,
    • are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
    • are LICENSE-binary and NOTICE-binary updated accordingly?

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Nutch in general, please sign up for the Nutch mailing list. Thanks!

@tballison (Contributor, Author)

I'll merge this in a day or so unless anyone has objections.

@tballison tballison marked this pull request as draft September 15, 2023 11:55
@tballison (Contributor, Author)

Converting to draft to manually check for version conflicts.

@tballison tballison marked this pull request as ready for review September 18, 2023 19:09
@tballison (Contributor, Author)

I bumped some of the more common dependencies to match Tika 2.9.0. Let me know what you think.

@sebastian-nagel (Contributor)

  • afaics, no issues when running in local mode
  • in pseudo-distributed mode there is an issue with the commons-io versions: 2.8.0 provided by Hadoop (3.3.6) and 2.13.0 requested by Tika (and crawler-commons)
    • I've fetched and parsed the Tika standard parsers test documents (see test_tika_parser.sh): the following error appears 44 times in the fetcher (during MIME detection) and 467 during parsing:
      java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
      

@tballison, any idea how to circumvent this error? Hadoop 3.4.0 will upgrade to commons-io 2.11.0 (HADOOP-18301).

@tballison (Contributor, Author)

K. Thank you!

Um... wrap() was introduced in 2.9.0 according to the commons-io docs so 2.11.0 should work.

Can we exclude commons-io from hadoop and then add it as a dependency in the main ivy.xml?

I'll make that change on this PR and then see if I can get pseudo-distributed mode working?
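One way to confirm which commons-io generation actually wins on a given classpath is a reflective probe. This is a generic sketch (ClasspathProbe is a hypothetical helper, not part of Nutch or this PR):

```java
import java.io.InputStream;

// Reflective probe: reports whether the named class on the current classpath
// exposes a given public method, without linking against it at compile time.
class ClasspathProbe {
    static boolean hasMethod(String className, String method, Class<?>... params) {
        try {
            Class.forName(className).getMethod(method, params);
            return true;
        } catch (ReflectiveOperationException e) {
            return false; // class absent, or a version without that method
        }
    }

    public static void main(String[] args) {
        // true only when commons-io 2.9.0+ is first on the classpath
        System.out.println("wrap(InputStream) available: "
                + hasMethod("org.apache.commons.io.input.CloseShieldInputStream",
                            "wrap", InputStream.class));
    }
}
```

Run inside a Hadoop task JVM, this would show which commons-io the task actually resolved, regardless of what the job jar bundles.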

@tballison (Contributor, Author)

tballison commented Sep 19, 2023

I haven't worked with ant in a while. According to ant dependencytree, it looks like we don't have to exclude commons-io everywhere -- placing it in the main ivy.xml has the same effect as putting it in <dependencyManagement/> in maven. It applies throughout.

I wanted to document this in case dependencytree isn't showing what's actually happening. :D
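If that's right, Ivy's explicit analogue of a Maven <dependencyManagement/> pin would be an <override> in the main ivy.xml, roughly like this (a sketch only; the rev, and whether to also list the dependency directly, would need to be confirmed):

```xml
<dependencies>
  <!-- force every transitive request for commons-io to one version,
       similar to a Maven dependencyManagement pin -->
  <override org="commons-io" module="commons-io" rev="2.13.0"/>
  <dependency org="commons-io" name="commons-io" rev="2.13.0" conf="*->default"/>
</dependencies>
```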

@tballison (Contributor, Author)

I'm guessing that commit won't work if distributed hadoop is bringing its own jars (as you said!). Does hadoop do any custom classloading so that the job jars are isolated from the runtime jars?

Or, more simply: in principle, should the above commit work? Or is that a non-starter?

@tballison (Contributor, Author)

tballison commented Sep 19, 2023

I'm getting a ConnectException when I try to run the tika parser tests via nutch-test-single-node-cluster.

On hadoop startup, I see:

2023-09-19 10:25:15,186 INFO util.GSet: VM type       = 64-bit
2023-09-19 10:25:15,187 INFO util.GSet: 0.029999999329447746% max memory 7.7 GB = 2.4 MB
2023-09-19 10:25:15,187 INFO util.GSet: capacity      = 2^18 = 262144 entries
Re-format filesystem in Storage Directory root= /tmp/hadoop-tallison/dfs/name; location= null ? (Y or N) y
2023-09-19 10:25:17,593 INFO namenode.FSImage: Allocated new BlockPoolId: BP-993805694-127.0.1.1-1695133517585
2023-09-19 10:25:17,594 INFO common.Storage: Will remove files: [/tmp/hadoop-tallison/dfs/name/current/fsimage_0000000000000000000.md5, /tmp/hadoop-tallison/dfs/name/current/VERSION, /tmp/hadoop-tallison/dfs/name/current/fsimage_0000000000000000000, /tmp/hadoop-tallison/dfs/name/current/seen_txid]
2023-09-19 10:25:17,625 INFO common.Storage: Storage directory /tmp/hadoop-tallison/dfs/name has been successfully formatted.
2023-09-19 10:25:17,648 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-tallison/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2023-09-19 10:25:17,705 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-tallison/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 403 bytes saved in 0 seconds .
2023-09-19 10:25:17,718 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2023-09-19 10:25:17,744 INFO namenode.FSNamesystem: Stopping services started for active state
2023-09-19 10:25:17,744 INFO namenode.FSNamesystem: Stopping services started for standby state
2023-09-19 10:25:17,747 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2023-09-19 10:25:17,747 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at tallison-XPS-9320/127.0.1.1
************************************************************/
+ set -e
+ /opt/hadoop/3.3.6/sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: ssh: connect to host localhost port 22: Connection refused
Starting datanodes
localhost: ssh: connect to host localhost port 22: Connection refused
Starting secondary namenodes [tallison-XPS-9320]
tallison-XPS-9320: ssh: connect to host tallison-xps-9320 port 22: Connection refused
+ /opt/hadoop/3.3.6/sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
localhost: ssh: connect to host localhost port 22: Connection refused

I can navigate to http://localhost:8088/cluster/nodes and I see that there are zero nodes.

I cannot navigate to http://localhost:9870/.

When I try to run the test_tika_parser.sh, I see:

+ hadoop fs -mkdir -p crawl/seeds_tika/
mkdir: Call From tallison-XPS-9320/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

This is Ubuntu with Java 11.

Any ideas what I'm doing wrong?

@sebastian-nagel (Contributor)

Can we exclude commons-io from hadoop and then add it as a dependency in the main ivy.xml?

When running in distributed or pseudo-distributed mode, commons-io 2.8.0 is first in the classpath, independent from which commons-io version is contained in the Nutch job jar. Using Hadoop methods to write data or for communication may rely on that specific commons-io version and changing the Hadoop classpath is challenging, since even more may break.

Btw., I've just rediscovered that using Tika in (pseudo-)distributed mode has been broken since the upgrade to Tika 2.3.0, see NUTCH-2937. It didn't seem to affect the MIME detector, though.

localhost: ssh: connect to host localhost port 22: Connection refused
Any ideas what I'm doing wrong?

Need to set up passphraseless ssh.
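The start-dfs.sh / start-yarn.sh scripts launch daemons over ssh even on a single node, so localhost needs to accept key-based logins. A sketch of the usual setup from the Hadoop single-node guide (paths are the defaults):

```shell
# Generate a passphraseless key if one doesn't exist yet, and authorize it
# for logins to this host (standard single-node Hadoop setup).
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa -q
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# afterwards 'ssh localhost' should log in without prompting,
# and start-dfs.sh should be able to launch the namenode and datanodes
```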

@tballison (Contributor, Author)

tballison commented Sep 19, 2023

Btw., I've just rediscovered that using Tika in (pseudo)distributed mode is broken since the upgrade to Tika 2.3.0, see NUTCH-2937. Although, it didn't seem to have affected the MIME detector.

Oh, dear...

Need to set up passphraseless ssh.

Thank you!

@sebastian-nagel (Contributor)

Oh, dear...

Would need to wait for Hadoop 3.4.0, then we get commons-io 2.11.0 into the classpath. See also the discussion in apache/hadoop#4455.

@tballison (Contributor, Author)

Should we merge on the theory that we're already broken and we aren't breaking things any worse? Or do we just punt until 3.4.0 is available?

@sebastian-nagel (Contributor)

Should we merge on the theory that we're already broken and we aren't breaking things any worse? Or do we just punt until 3.4.0 is available?

Let me check whether the MIME detection works with Tika 2.3.0. I'll run the same test script with master which includes Tika 2.3.0...

@tballison (Contributor, Author)

In the latest commit, I downgraded commons-io to what we're hoping for in Hadoop 3.4.0 (2.11.0).

@tballison (Contributor, Author)

I checked https://issues.apache.org/jira/browse/TIKA-3655, and it looks like we'd have to downgrade to Tika 2.2.1 (December 2021, released during the log4shell fun) to get a working distributed version.

@sebastian-nagel (Contributor)

Ok, here are the results:

  • Tika 2.3.0 (current Nutch master)
    • fetching: not a single NoSuchMethodError
    • parsing: only one NoSuchMethodError, because of CloseShieldInputStream.wrap(...) (when parsing testTMX.tmx)
    ParserStatus
        failed=85
        success=624

  • Tika 2.9.0
    • fetching: 44 instances of NoSuchMethodError
    • parsing: 467 instances of NoSuchMethodError
    ParserStatus
        failed=526
        success=139

Or do we just punt until 3.4.0 is available?

This seems the better option. Even then (and only if successfully tested with 3.4.0), we should warn users that they need Hadoop 3.4.0 to run Nutch and especially parse-tika.

@tballison (Contributor, Author)

tballison commented Sep 19, 2023

😭 Y, let's hold off until Hadoop 3.4.0 is released. Tika 3.x may be out by then. 🤣 !

Thank you, again!

@lewismc (Member)

lewismc commented Sep 21, 2023

I suggest that we downgrade to Tika 2.2.1 to fix that regression.

@tballison tballison marked this pull request as draft September 26, 2023 10:27
@tballison (Contributor, Author)

Converting this to draft until Hadoop 3.4.0 is released.

@sebastian-nagel (Contributor)

I suggest that we downgrade to Tika 2.2.1 to fix that regression.

Good point, @lewismc. I've opened NUTCH-3006 for that.

@lewismc (Member)

lewismc commented Sep 28, 2023 via email

@sebastian-nagel (Contributor)

what do I use for the tika seeds file? Are you using our github repo, or the
tika-parsers-common package specifically

see the comments in test_tika_parser.sh

@tballison (Contributor, Author)

With the update to Tika 2.9.1-SNAPSHOT, I get 85 failed parses; most of them are either encrypted documents or "can't retrieve Tika Parser for x".
parse-segment-error-parsing.txt

There is still one NoSuchMethodError, also from commons-io, but this time for UnsynchronizedByteArrayInputStream$Builder: at org.apache.tika.parser.pdf.PDFEncodedStringDecoder.decode(PDFEncodedStringDecoder.java:85)

@tballison (Contributor, Author)

see the comments in test_tika_parser.sh

Sorry! Yep, saw that too late.

@tballison (Contributor, Author)

I reverted back to 2.2.1, and that's not far enough back: there were 222 parse failures, many with the wrap problem. I reverted back to 2.0.0 and then had 85 parse failures again. This time, no wrap problems, just the "can't retrieve parser" and encrypted-document exceptions.

@tballison (Contributor, Author)

There is just no winning...

We just upgraded POI to 5.2.4, and it uses a bunch of the newer commons-io methods. If we downgrade POI to 5.2.3, we get a clean build of Tika with commons.io=2.8.0

Can we do anything with shading? I don't think this will work...

So, it looks like we either downgrade Tika to 2.0.0 in Nutch, or we make a release of Tika 2.9.1 with POI 5.2.3 and then give up until Hadoop upgrades.

@tballison (Contributor, Author)

Even worse, POI 5.2.4 uses features of commons-io versions newer than 2.11.0. This means that POI won't work even in Hadoop 3.4.0.

@tballison (Contributor, Author)

Stepping away from the keyboard. 😭

@tballison (Contributor, Author)

Alright, the only thing that I think might work is Tika shading commons-io in tika-app, and then Nutch using tika-app instead of the individual parser modules etc. for parse-tika.

WDYT?
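For concreteness, shading here would mean relocating the commons-io packages inside the tika-app build, along these lines (a sketch of a maven-shade-plugin relocation, not the actual tika-app pom; the shaded package name is made up):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- rewrite commons-io so it cannot collide with Hadoop's copy -->
        <pattern>org.apache.commons.io</pattern>
        <shadedPattern>org.apache.tika.shaded.commons.io</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```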

@sebastian-nagel (Contributor)

Tika shading commons-io in tika-app, and then Nutch uses tika-app instead of the individual parser-modules etc. for parser-tika.

And what about the dependency on tika-core in Nutch core, which is used for MIME detection?

@tballison (Contributor, Author)

That should be ok if we remove CloseShieldInputStream.wrap() from tika-core (and may as well modify all of Tika).
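For context, the close-shield idiom itself is tiny, which is why dropping the 2.9.0-only wrap() call is feasible; a minimal stand-in (illustrative only, not the commons-io class) is just a FilterInputStream whose close() is a no-op:

```java
import java.io.FilterInputStream;
import java.io.InputStream;

// Illustration of the close-shield idiom: a parser may close() its input,
// but the wrapper prevents that from closing the underlying stream.
class ShieldedInputStream extends FilterInputStream {
    ShieldedInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // deliberately do not close the wrapped stream
    }
}
```

commons-io ships this as CloseShieldInputStream; pre-2.9.0 versions expose it via the constructor rather than the static wrap() factory.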

@tballison (Contributor, Author)

tballison commented Oct 3, 2023

Local build of the shim works with 2.9.1-SNAPSHOT, which includes the latest version of POI. That POI version will conflict with commons-io in Hadoop 3.4.0 if we don't use this shim (or if Hadoop doesn't upgrade commons-io further).

ParserStatus
failed=82
success=627

I'll downgrade Tika to 2.9.0 and push a shim release to maven central.

@tballison tballison marked this pull request as ready for review October 3, 2023 15:04
@tballison (Contributor, Author)

I published the shim artifacts for: https://github.com/tballison/hadoop-safe-tika

It looks like they haven't made it into the main Maven repositories yet. :(

Once they do, I think this PR and the hadoop-safe-tika shim will be ready for review.

This is an ugly work-around, but I did get a clean build locally, and I ran @sebastian-nagel's nutch-test-single-node-cluster Tika tests.

One area for improvement might be to exclude dependencies from the shim that Nutch already pulls in, such as guava.

The release process for the shim is quite easy. Happy to do that if necessary.

@sebastian-nagel (Contributor)

Hi @tballison, I've tested the shim artifact in local and pseudo-distributed mode: everything looks good.

  • no more exceptions about CloseShieldInputStream.wrap(...) during fetching (MIME detection) or parsing

+1

Note: I got marginally different results:

ParserStatus
       failed=84
       success=625

... but that could be because I have an outdated set of test documents. I'm using the default http.content.limit = 1 MiB.

@tballison tballison merged commit 97eb0b5 into apache:master Oct 20, 2023
1 check passed
@tballison tballison deleted the NUTCH-2959 branch October 20, 2023 18:29
@tballison (Contributor, Author)

Thank you so much @sebastian-nagel !
