Aborting ArrayIndexOutOfBoundsException while indexing huge files #1281
The workaround for the Lucene bug here was:
Except for the last case, all the others would be indexed using the strings parser anyway, so there is no problem in splitting them. Regarding HTML/XML, their parsers are very lenient and don't throw exceptions for corrupted files. Files >= 1 GB detected as HTML are possibly corrupted ones, so splitting them is fine. XML files >= 1 GB should be uncommon; our XMLParser is actually a wrapper around HTMLParser (which is lenient, while Tika's default XMLParser is not) plus RawStringParser, and the most common charsets used in XML (ISO-8859-1, UTF-8, UTF-16) are supported by RawStringParser (for Latin-1 scripts), so the downside for huge XMLs is small.
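The splitting idea above can be sketched as follows. This is a minimal, hypothetical illustration (the class name `FragmentSketch`, the fragment size, and the overlap value are assumptions, not IPED's actual `FragmentLargeBinaryTask` implementation): a huge input is covered by fixed-size fragments with a small overlap, so strings crossing a fragment boundary are not lost, and each fragment stays well below the size that triggers the Lucene exception.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentSketch {

    // Assumed values for illustration only; the real task may use different sizes.
    static final long FRAGMENT_SIZE = 512L * 1024 * 1024; // 512 MB per fragment
    static final long OVERLAP = 1024;                     // bytes repeated across boundaries

    /**
     * Returns [offset, length] pairs covering an input of the given total size.
     * Consecutive fragments overlap by OVERLAP bytes (except the last one,
     * which is clamped to the end of the input).
     */
    static List<long[]> fragments(long totalSize) {
        List<long[]> result = new ArrayList<>();
        long offset = 0;
        while (offset < totalSize) {
            long len = Math.min(FRAGMENT_SIZE + OVERLAP, totalSize - offset);
            result.add(new long[] { offset, len });
            offset += FRAGMENT_SIZE; // advance by fragment size, not by len
        }
        return result;
    }

    public static void main(String[] args) {
        long size = 64L * 1024 * 1024 * 1024; // e.g. a 64 GB disk image
        System.out.println(fragments(size).size() + " fragments"); // prints "128 fragments"
    }
}
```

Each fragment would then be fed to the strings parser as a separate document, so no single Lucene document ever approaches the problematic size.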
Reopening: this was thrown again while testing a fix for #1358. I knew the fix didn't cover all cases...
I have two imperfect approaches to definitively fix this:
For now I vote for option 2. Any opinions or better ideas?
I'll go with approach 2. But since this fix depends on FragmentLargeBinaryTask always being enabled, maybe I'll remove its enable/disable option, as it was in the 3.x versions. OK with you @tc-wleite?
Yes @lfcnassif!
@gabrieldmf sent this to me:
It was processing an embedded 64 GB flat VMDK disk partially encrypted by some ransomware, but it should have worked. This seems like a Lucene issue to me, so I opened:
https://issues.apache.org/jira/browse/LUCENE-10681