Aborting ArrayIndexOutOfBoundsException while indexing huge files #1281

Closed
lfcnassif opened this issue Aug 18, 2022 · 6 comments · Fixed by #1484

@lfcnassif
Member

@gabrieldmf sent this to me:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]

It was processing an embedded 64GB flat vmdk disk partially encrypted by some ransomware, but indexing it should still work. It seems like a Lucene issue to me, so I opened:
https://issues.apache.org/jira/browse/LUCENE-10681
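
For reference, below is a rough reproducer sketch of what I believe triggers this code path: a single document whose field yields so many tokens that the per-document term/position buffers overflow. It is not taken from the actual case or from the Lucene report; the token generator and sizes are only illustrative, and running it needs a large heap and a lot of time.

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Path;
import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class HugeDocRepro {

    // Streams billions of short pseudo-random tokens without keeping them in memory.
    static final class RandomTokenReader extends Reader {
        private final Random rnd = new Random(42);
        private long remaining;
        private long pos;

        RandomTokenReader(long totalChars) {
            this.remaining = totalChars;
        }

        @Override
        public int read(char[] buf, int off, int len) {
            if (remaining <= 0) return -1;
            int n = (int) Math.min(len, remaining);
            for (int i = 0; i < n; i++, pos++) {
                // 7-letter lowercase tokens separated by spaces => lots of distinct terms
                buf[off + i] = (pos % 8 == 7) ? ' ' : (char) ('a' + rnd.nextInt(26));
            }
            remaining -= n;
            return n;
        }

        @Override
        public void close() {
        }
    }

    public static void main(String[] args) throws IOException {
        try (FSDirectory dir = FSDirectory.open(Path.of("repro-index"));
                IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // ~8 billion chars of tokenized text in a single field of a single document
            doc.add(new TextField("content", new RandomTokenReader(8L * 1024 * 1024 * 1024)));
            writer.addDocument(doc); // expected to end in the AIOOBE shown in the trace above
        }
    }
}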


@lfcnassif lfcnassif self-assigned this Aug 19, 2022
@lfcnassif lfcnassif changed the title from "Rare aborting ArrayIndexOutOfBoundsException while indexing a huge file" to "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge corrupted files" on Aug 19, 2022
@lfcnassif
Member Author

lfcnassif commented Aug 19, 2022

The workaround for the Lucene bug here was:

  • always split non-first virtual disk segments (for indexing)
  • split virtual disks' first (or single) segments that have decoding errors
  • split files larger than 1GB that have parsing errors
  • split XML/HTML files larger than 1GB

Except for the last case, all the others would be indexed using the strings parser anyway, so there is no problem splitting them. As for HTML/XML, their parsers are very lenient and don't throw exceptions for corrupted files. Files >= 1GB detected as HTML are possibly corrupted ones, so splitting them is not a problem either. XML files >= 1GB should be uncommon; our XMLParser is actually a wrapper around HTMLParser (which is lenient, unlike Tika's default XMLParser) plus RawStringParser, and the most common charsets used in XML (ISO-8859-1, UTF-8, UTF-16) are supported by RawStringsParser (for Latin-1 scripts), so I think the downsides for huge XMLs are small.
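
Just to make the rules above concrete, here is a rough sketch of the splitting decision. The ItemInfo interface and its accessors are hypothetical placeholders, not the real IPED task code; they only mirror the four rules listed above.

// Sketch only: ItemInfo and its accessors are hypothetical, not IPED's real API.
interface ItemInfo {
    boolean isVirtualDiskSegment();
    boolean isFirstSegment();
    boolean hasDecodingError();
    boolean hasParsingError();
    long getLength();
    String getMediaType();
}

final class SplitRules {

    static final long ONE_GB = 1L << 30;

    static boolean shouldSplitForIndexing(ItemInfo item) {
        // rule 1: always split non-first virtual disk segments
        if (item.isVirtualDiskSegment() && !item.isFirstSegment())
            return true;
        // rule 2: split first (or single) virtual disk segments with decoding errors
        if (item.isVirtualDiskSegment() && item.hasDecodingError())
            return true;
        // rule 3: split files larger than 1GB with parsing errors
        if (item.hasParsingError() && item.getLength() > ONE_GB)
            return true;
        // rule 4: split XML/HTML files larger than 1GB
        String mt = item.getMediaType();
        boolean markup = "text/html".equals(mt) || "application/xml".equals(mt);
        return markup && item.getLength() > ONE_GB;
    }
}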

@lfcnassif lfcnassif changed the title from "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge corrupted files" to "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge files" on Aug 25, 2022
@lfcnassif
Member Author

lfcnassif commented Oct 17, 2022

Reopening: this was thrown again while testing a fix for #1358. I knew the fix didn't cover all cases...

@lfcnassif
Member Author

lfcnassif commented Jan 20, 2023

I have two imperfect approaches to definitively fix this:

  1. If the extracted text size of an item (to be indexed) is greater than 1GB, break its binary content into 1GB chunks, creating new fragmented items, like unalloc and unknown files. The drawback is that if the item has a specific parser and was processed fine (like a huge PDF > 1GB), breaking its binary content will cause exceptions when parsing its fragments, and the item's data chunks will be processed using the RawStringsParser. Items with a specific parser and more than 1GB of decoded text are pretty uncommon (but they are the ones causing this issue);
  2. If the extracted text size of an item (to be indexed) is greater than 1GB, break its decoded text into 1GB chunks, creating new subitems, each pointing to a different text chunk (see the sketch right after this list). The drawback is that for complex formats, like PDF, it is impossible to map a decoded text chunk to a sequential binary data chunk, so where would the new subitems start and end? I implemented this approach and kept the subitems' start, end and size equal to the parent (original) item, but pointing to different text chunks for indexing. On the UI, several similar subitems with the same size are shown, each with a different indexed text, so for a specific search query some could be returned and some could not. When the user clicks on any of them, the whole original item text is decoded and shown in the TextViewer: since we don't save decoded/indexed texts into the index today, we have to decode the item again, and we don't know which text chunk was indexed for that subitem, as it is not saved. This makes the subitems look very similar on the UI, showing the same binary and text content, while they respond differently to index searches. This is acceptable to me, but it may confuse some users.
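
A minimal sketch of what I mean by option 2, using plain Lucene calls. The field names, the chunking by character count and the chunk size are only illustrative, not the actual implementation; the real code would chunk a stream instead of a single String and carry the parent item metadata along.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

final class TextChunkIndexer {

    // 1GB of chars, matching the threshold discussed above (illustrative only).
    private static final int CHUNK_CHARS = 1 << 30;

    // Splits the decoded text into chunks and indexes each chunk as its own
    // Lucene document, all pointing back to the same parent item.
    static void indexInChunks(IndexWriter writer, String parentId, String decodedText)
            throws IOException {
        long length = decodedText.length();
        int chunks = (int) Math.max(1, (length + CHUNK_CHARS - 1) / CHUNK_CHARS);
        for (int i = 0; i < chunks; i++) {
            int start = i * CHUNK_CHARS;
            int end = (int) Math.min(length, (long) start + CHUNK_CHARS);

            Document doc = new Document();
            // every chunk keeps the parent id, so hits can be mapped back to the
            // original item; start/end/size metadata would stay equal to the parent's
            doc.add(new StringField("parentId", parentId, Field.Store.YES));
            doc.add(new StringField("textChunk", Integer.toString(i), Field.Store.YES));
            doc.add(new TextField("content", decodedText.substring(start, end), Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}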

For now I vote for 2. Any opinions or better ideas?

@lfcnassif
Member Author

I'll go with approach 2. But since this fix depends on FragmentLargeBinaryTask always being enabled, maybe I'll remove its enable/disable option, as it was in the 3.x versions. Ok for you @tc-wleite?

@wladimirleite
Member

Yes @lfcnassif!
I was following this issue and I agree the proposed solution seems the best option. Removing the ability to disable FragmentLargeBinaryTask seems fine to me. In practice, users keep this option always enabled, and I can't see any scenario in which disabling it is really crucial.

@lfcnassif lfcnassif changed the title from "Aborting ArrayIndexOutOfBoundsException while indexing embedded virtual disk segments or huge files" to "Aborting ArrayIndexOutOfBoundsException while indexing huge files" on Jan 20, 2023