Tika Metadata Add performance optimization #2043

Open
patrickdalla opened this issue Jan 5, 2024 Discussed in #2039 · 1 comment


Discussed in #2039

Originally posted by patrickdalla January 2, 2024
Hi @wladimirleite, I'm back at work after my leave/vacation. I opened this discussion because of the performance issue you also noted while implementing #1999: the inefficient implementation of value addition on multi-valued metadata fields. In that PR you seem to have found a different approach that does not need multiple values.
But the performance issue still exists and affects any other parser that needs to add multiple metadata values. Have you opened a specific issue to address this problem?

patrickdalla added a commit that referenced this issue Jan 5, 2024
without the need to add dependency on IPED engine on IPED parsers
module.
patrickdalla added a commit that referenced this issue Jan 5, 2024
patrickdalla added a commit that referenced this issue Jan 5, 2024
object does not already implementes SyncMetadata interface.

patrickdalla commented Jan 5, 2024

Hi @wladimirleite, I was able to bypass and override Tika's original protected metadata implementation, at least to a degree that solves the problem of adding multiple values at once.
I created a MetadataWriteFilter implementation that does not set the metadata on the first and subsequent value additions, only on the last one. The only drawback is that, for the filter to know which value is the last, an initial call to the allocateSpace method must be made.

This was enough to set multiple values at once from an ArrayList within a single parser, avoiding the slow Tika implementation. This is what I needed in the SQLite split PR.
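
To make the idea concrete, here is a minimal, dependency-free sketch of the buffering approach described above. It is not the code from the referenced commits. The class name `BatchedFieldWriter`, the `allocateSpace(String, int)` signature, and the plain `Map<String, String[]>` standing in for the metadata store are all assumptions made for illustration. It also assumes the slowness comes from growing a `String[]` by one element on every add, which is what the fallback path mimics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: buffers the values of a field and writes them in one step. */
final class BatchedFieldWriter {
    // Stands in for the metadata backing store (assumption, not Tika's class).
    private final Map<String, String[]> data;
    // Values announced via allocateSpace but not yet written to the store.
    private final Map<String, List<String>> pending = new HashMap<>();
    // Number of values announced for each field.
    private final Map<String, Integer> expected = new HashMap<>();

    BatchedFieldWriter(Map<String, String[]> data) {
        this.data = data;
    }

    /** Announces how many values will follow for the field (assumed signature). */
    void allocateSpace(String field, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        expected.put(field, count);
        pending.put(field, new ArrayList<>(count));
    }

    /** Adds a value; announced fields are written only when the last value arrives. */
    void add(String field, String value) {
        List<String> buffer = pending.get(field);
        if (buffer == null) {
            // No announcement: copy-on-append, one array copy per added value.
            String[] old = data.getOrDefault(field, new String[0]);
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            data.put(field, grown);
            return;
        }
        buffer.add(value);
        if (buffer.size() == expected.get(field)) {
            // Last announced value: merge with existing values using a single copy.
            String[] old = data.getOrDefault(field, new String[0]);
            String[] merged = Arrays.copyOf(old, old.length + buffer.size());
            for (int i = 0; i < buffer.size(); i++) {
                merged[old.length + i] = buffer.get(i);
            }
            data.put(field, merged);
            pending.remove(field);
            expected.remove(field);
        }
    }
}
```

In this sketch a parser would call `allocateSpace("field", values.size())` once and then `add` for each element of its list. The store is touched only on the final `add`, so the total cost is linear in the number of values instead of quadratic. The trade-off is the one already noted: the caller must know the count up front.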
