Aborting ArrayIndexOutOfBoundsException while indexing huge files #1281

Closed
lfcnassif opened this issue Aug 18, 2022 · 6 comments · Fixed by #1484

@lfcnassif
Member

@gabrieldmf sent this to me:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]

It was processing an embedded 64GB flat vmdk disk partially encrypted by some ransomware, but indexing it should still work. It seems like a Lucene issue to me, so I opened:
https://issues.apache.org/jira/browse/LUCENE-10681
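
For reference, below is a rough reproducer sketch of what I believe triggers this code path: a single document whose field yields so many tokens that the per-document term/position buffers overflow. It is not taken from the actual case or from the Lucene report; the token generator and sizes are only illustrative, and running it needs a large heap and a lot of time.

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Path;
import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class HugeDocRepro {

    // Streams billions of short pseudo-random tokens without keeping them in memory.
    static final class RandomTokenReader extends Reader {
        private final Random rnd = new Random(42);
        private long remaining;
        private long pos;

        RandomTokenReader(long totalChars) {
            this.remaining = totalChars;
        }

        @Override
        public int read(char[] buf, int off, int len) {
            if (remaining <= 0) return -1;
            int n = (int) Math.min(len, remaining);
            for (int i = 0; i < n; i++, pos++) {
                // 7-letter lowercase tokens separated by spaces => lots of distinct terms
                buf[off + i] = (pos % 8 == 7) ? ' ' : (char) ('a' + rnd.nextInt(26));
            }
            remaining -= n;
            return n;
        }

        @Override
        public void close() {
        }
    }

    public static void main(String[] args) throws IOException {
        try (FSDirectory dir = FSDirectory.open(Path.of("repro-index"));
                IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // ~8 billion chars of tokenized text in a single field of a single document
            doc.add(new TextField("content", new RandomTokenReader(8L * 1024 * 1024 * 1024)));
            writer.addDocument(doc); // expected to end in the AIOOBE shown in the trace above
        }
    }
}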


@lfcnassif lfcnassif self-assigned this Aug 19, 2022
@lfcnassif lfcnassif changed the title from "Rare aborting ArrayIndexOutOfBoundsException while indexing a huge file" to "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge corrupted files" on Aug 19, 2022
@lfcnassif
Member Author

lfcnassif commented Aug 19, 2022

The workaround for the Lucene bug here was:

  • always split non-first virtual disk segments (for indexing)
  • split virtual disks' first (or single) segments that have decoding errors
  • split files larger than 1GB that have parsing errors
  • split XML/HTML files larger than 1GB

Except for the last case, all the others would be indexed using the strings parser anyway, so there is no problem splitting them. As for HTML/XML, their parsers are very lenient and don't throw exceptions for corrupted files. Files >= 1GB detected as HTML are possibly corrupted ones, so splitting them is not a problem either. XML files >= 1GB should be uncommon; our XMLParser is actually a wrapper around HTMLParser (which is lenient, unlike Tika's default XMLParser) plus RawStringParser, and the most common charsets used in XML (ISO-8859-1, UTF-8, UTF-16) are supported by RawStringsParser (for Latin-1 scripts), so I think the downsides for huge XMLs are small.
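
Just to make the rules above concrete, here is a rough sketch of the splitting decision. The ItemInfo interface and its accessors are hypothetical placeholders, not the real IPED task code; they only mirror the four rules listed above.

// Sketch only: ItemInfo and its accessors are hypothetical, not IPED's real API.
interface ItemInfo {
    boolean isVirtualDiskSegment();
    boolean isFirstSegment();
    boolean hasDecodingError();
    boolean hasParsingError();
    long getLength();
    String getMediaType();
}

final class SplitRules {

    static final long ONE_GB = 1L << 30;

    static boolean shouldSplitForIndexing(ItemInfo item) {
        // rule 1: always split non-first virtual disk segments
        if (item.isVirtualDiskSegment() && !item.isFirstSegment())
            return true;
        // rule 2: split first (or single) virtual disk segments with decoding errors
        if (item.isVirtualDiskSegment() && item.hasDecodingError())
            return true;
        // rule 3: split files larger than 1GB with parsing errors
        if (item.hasParsingError() && item.getLength() > ONE_GB)
            return true;
        // rule 4: split XML/HTML files larger than 1GB
        String mt = item.getMediaType();
        boolean markup = "text/html".equals(mt) || "application/xml".equals(mt);
        return markup && item.getLength() > ONE_GB;
    }
}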

@lfcnassif lfcnassif changed the title from "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge corrupted files" to "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge files" on Aug 25, 2022
@lfcnassif
Member Author

lfcnassif commented Oct 17, 2022

Reopening: this was thrown again while testing a fix for #1358. I knew the fix didn't cover all cases...

@lfcnassif
Member Author

lfcnassif commented Jan 20, 2023

I have two imperfect approaches to definitively fix this:

  1. If the extracted text size of an item (to be indexed) is greater than 1GB, break its binary content into 1GB chunks, creating new fragmented items, like unalloc and unknown files. The drawback is that if the item has a specific parser and was processed fine (like a huge PDF > 1GB), breaking its binary content will cause exceptions when parsing its fragments, and the item's data chunks will be processed using the RawStringsParser. Items with a specific parser and more than 1GB of decoded text are pretty uncommon (but they are the ones causing this issue);
  2. If the extracted text size of an item (to be indexed) is greater than 1GB, break its decoded text into 1GB chunks, creating new subitems, each pointing to a different text chunk (see the sketch right after this list). The drawback is that for complex formats, like PDF, it is impossible to map a decoded text chunk to a sequential binary data chunk, so where would the new subitems start and end? I implemented this approach and kept the subitems' start, end and size equal to the parent (original) item, but pointing to different text chunks for indexing. On the UI, several similar subitems with the same size are shown, each with a different indexed text, so for a specific search query some could be returned and some could not. When the user clicks on any of them, the whole original item text is decoded and shown in the TextViewer: since we don't save decoded/indexed texts into the index today, we have to decode the item again, and we don't know which text chunk was indexed for that subitem, as it is not saved. This makes the subitems look very similar on the UI, showing the same binary and text content, while they respond differently to index searches. This is acceptable to me, but it may confuse some users.
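
A minimal sketch of what I mean by option 2, using plain Lucene calls. The field names, the chunking by character count and the chunk size are only illustrative, not the actual implementation; the real code would chunk a stream instead of a single String and carry the parent item metadata along.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

final class TextChunkIndexer {

    // 1GB of chars, matching the threshold discussed above (illustrative only).
    private static final int CHUNK_CHARS = 1 << 30;

    // Splits the decoded text into chunks and indexes each chunk as its own
    // Lucene document, all pointing back to the same parent item.
    static void indexInChunks(IndexWriter writer, String parentId, String decodedText)
            throws IOException {
        long length = decodedText.length();
        int chunks = (int) Math.max(1, (length + CHUNK_CHARS - 1) / CHUNK_CHARS);
        for (int i = 0; i < chunks; i++) {
            int start = i * CHUNK_CHARS;
            int end = (int) Math.min(length, (long) start + CHUNK_CHARS);

            Document doc = new Document();
            // every chunk keeps the parent id, so hits can be mapped back to the
            // original item; start/end/size metadata would stay equal to the parent's
            doc.add(new StringField("parentId", parentId, Field.Store.YES));
            doc.add(new StringField("textChunk", Integer.toString(i), Field.Store.YES));
            doc.add(new TextField("content", decodedText.substring(start, end), Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}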

For now I vote for 2. Any opinions or better ideas?

@lfcnassif
Member Author

I'll go with approach 2. But since this fix depends on FragmentLargeBinaryTask always being enabled, maybe I'll remove its enable/disable option, as it was in the 3.x versions. Ok for you @tc-wleite?

@wladimirleite
Member

Yes @lfcnassif!
I was following this issue and I agree the proposed solution seems the best option. Removing the ability to disable FragmentLargeBinaryTask seems fine to me. In practice, users keep this option always enabled, and I can't see any scenario in which disabling it is really crucial.

@lfcnassif lfcnassif changed the title from "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge files" to "Aborting ArrayIndexOutOfBoundsException while indexing huge files" on Jan 20, 2023