Migrate JabRef search to Lucene #8857
Comments
I am sure this could be solved via a preference that allows users to choose where to store such files. This might be especially useful for the full-text search index, since that one will probably be larger than the normal search index. We already have a preference to disable the full-text search index, added after some reports of indexing taking too long and re-indexing triggering too often for large databases. As a default I also favour a system folder, so as not to pollute the folders holding the library file. Hashing seems an interesting idea and the post is really well explained. Thanks :)
(Working on the ADR on how - feel free to edit this comment to reach a final ADR - I copied the text from the issue)
How to link bib entries to the index
Context and Problem Statement
To synchronize the index with the bibliography database, we need a mechanism to link an entry to the index.
Considered Options
Decision Outcome
Pros and Cons of the Options
Use

When
Context and Problem Statement
Every time an entry changes, the index needs to change with it:
Decision Outcome
For this, we have the abstract

Status: Needs testing --> what if 50k entries exist?
How to co-exist with the fulltext index
Context and Problem Statement
We noticed for the fulltext-search functionality that indexing takes too much time to be done by the GUI thread. We assume that this problem does not arise for the normal bib-fields (they contain a few hundred words at most, and no file needs to be opened and parsed).
Considered Options
Pros and Cons of the Options
Run in the GUI thread
I suggest at least trying to index bib-fields in the foreground and keeping the fulltext-indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext-indexer could block the indexing of the bib-fields. A solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.

Status: proposed
Note: I excluded the .rtf discussion. We should focus on .bib only.
Which contents of the Bib Library to index?
Context and Problem Statement
Indexed fields need to be transferred to Lucene explicitly.
Considered Options
Decision Outcome
Chosen option: "Index all fields", because the expectation on a search is that everything in the .bib file is searched and this is close enough.

Where to store the index
Context and Problem Statement
The search index needs to be stored somewhere.
Considered Options
Decision Outcome
Chosen option: "Use the directory provided by AppDirs", because the index is generated data, which can be regenerated on demand. AppDirs provides reasonable defaults for application data. We should implement a cleanup of the
Pros and Cons of the Options
Use the directory provided by AppDirs
Make it configurable
The default setting might be the directory returned by AppDirs

How to show the search results
(No ADR until now)
It should work as it currently works (needs double check)

Currently, JabRef implements its own search syntax and backend for bib-fields. Fulltext PDF files are indexed by a Lucene backend. Since we already manage an index for the fulltext search, we could also index the bib-fields for a faster, more versatile search functionality.
However, this is no easy task, as keeping the index up-to-date raises multiple questions: how to link bib-entries to their index entry, when to update the index, which fields to index, where to store the index, and how to show the results.
I summarize some thoughts below. I would like to work on these ideas over the next weeks and then maybe implement the functionality during JabCon2022.
How to link bib-entries to the index
One problem that I already struggled with when implementing the fulltext search is the absence of a unique key connecting a JabRef bibentry object to a corresponding entry in the Lucene index. Citation keys are not necessarily present. JabRef's entry identifier is volatile and may be different each time JabRef opens. To synchronize the index, however, we need a mechanism to link an entry to the index.
One solution could be hashes. When the user changes an entry, we would need to generate the hash before the change, update the indexed fields and then update the hash to the hash after the change. This would also allow us to easily check which entries need to be re-indexed at startup. Just compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed, hashes that are not found in the library need to be deleted.
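The hash comparison at startup can be sketched roughly as follows. This is only an illustration under assumptions: the class name `IndexSync`, the choice of SHA-256 over the sorted field name/value pairs, and the representation of an entry as a `Map<String, String>` are all hypothetical and not taken from JabRef's code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of JabRef.
public class IndexSync {

    /** Content hash of an entry, assuming an entry is just a map of field name to value. */
    public static String hashEntry(Map<String, String> fields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Sort by field name so the hash does not depend on map iteration order.
            for (Map.Entry<String, String> field : new TreeMap<>(fields).entrySet()) {
                digest.update(field.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between name and value
                digest.update(field.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between fields
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Hashes found in the library but not in the index: these entries need to be indexed. */
    public static Set<String> toIndex(Set<String> libraryHashes, Set<String> indexHashes) {
        Set<String> result = new HashSet<>(libraryHashes);
        result.removeAll(indexHashes);
        return result;
    }

    /** Hashes found in the index but not in the library: these index entries need to be deleted. */
    public static Set<String> toDelete(Set<String> libraryHashes, Set<String> indexHashes) {
        Set<String> result = new HashSet<>(indexHashes);
        result.removeAll(libraryHashes);
        return result;
    }
}
```

One caveat of a pure content hash: two entries with identical fields (exact duplicates) get the same hash, so they would collapse into one index entry unless something else distinguishes them.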
When
Every time an entry changes, the index needs to change with it. This can be:
Also, we noticed for the fulltext-search functionality that indexing takes too much time to be done by the GUI thread. I assume that this problem does not arise for the normal bib-fields (they contain a few hundred words at most, and no file needs to be opened and parsed). I suggest at least trying to index bib-fields in the foreground and keeping the fulltext-indexing in the background. A problem that immediately comes to mind: locks. Only one thread may write to the index at a time. If we keep everything in the same index, the background fulltext-indexer could block the indexing of the bib-fields. A solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.
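One alternative to two indices that might be worth investigating is to funnel all writes through a single writer thread and let bib-field updates jump ahead of queued fulltext work. The sketch below only uses `java.util.concurrent` and does not touch Lucene; `IndexWriteQueue`, `Kind` and `submit` are hypothetical names, and the `Runnable` stands in for whatever actually writes to the index. Note that this only reorders queued tasks: a long fulltext task that is already running still delays the bib-field update behind it.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one writer thread, bib-field updates are served before fulltext tasks.
public class IndexWriteQueue {

    /** Declaration order defines priority: BIB_FIELDS runs before FULLTEXT. */
    public enum Kind { BIB_FIELDS, FULLTEXT }

    private static final class Task implements Runnable, Comparable<Task> {
        private final Kind kind;
        private final long sequence;
        private final Runnable work;

        Task(Kind kind, long sequence, Runnable work) {
            this.kind = kind;
            this.sequence = sequence;
            this.work = work;
        }

        @Override
        public int compareTo(Task other) {
            int byKind = kind.compareTo(other.kind);
            // Within the same kind, keep submission order.
            return byKind != 0 ? byKind : Long.compare(sequence, other.sequence);
        }

        @Override
        public void run() {
            work.run();
        }
    }

    private final AtomicLong sequence = new AtomicLong();

    // Exactly one worker thread, so index writes never run concurrently.
    private final ThreadPoolExecutor writer =
            new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());

    public void submit(Kind kind, Runnable work) {
        // execute() is used on purpose: submit() would wrap the task in a non-comparable FutureTask.
        writer.execute(new Task(kind, sequence.getAndIncrement(), work));
    }

    /** Stops accepting tasks and waits up to one minute for queued work to finish. */
    public boolean shutdownAndWait() {
        writer.shutdown();
        try {
            return writer.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The GUI thread would only call `submit(...)`, so it never blocks on the index lock itself.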
What
ALL bib-fields and linked files (if files can be parsed by JabRef, currently only .pdf but could probably easily be extended to txt, rtf... if that is a valid use-case).
Uncertain: How to treat custom fields. I am unsure whether the set of fields needs to be fixed in the Lucene index or whether fields can be added on the fly. This needs further investigation.
Where
Personally, I would prefer having the index close to the bibfile, but the fulltext-index is currently stored in app-data folders (~/.local on Linux), and AFAIK that is what programs are supposed to do, so I suggest keeping that location.
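If the index lives in an app-data folder, each library needs its own subdirectory there. A minimal sketch of one way to derive it, assuming the directory name is a hash of the library's absolute path: `IndexLocation`, `indexDirFor`, the `lucene` subfolder and the 16-character prefix are all made up for illustration, and `appDataDir` stands for whatever base directory the application already uses.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of JabRef.
public class IndexLocation {

    /** Maps a library file to a stable index directory below the given app-data directory. */
    public static Path indexDirFor(Path appDataDir, Path libraryFile) {
        // Normalize so that "a/../lib.bib" and "lib.bib" end up in the same index directory.
        String key = libraryFile.toAbsolutePath().normalize().toString();
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                                       .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            // A 16-character prefix keeps the folder name short; this length is an arbitrary choice.
            return appDataDir.resolve("lucene").resolve(hex.substring(0, 16));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

A drawback of keying on the path is that moving or renaming the library file leaves an orphaned index directory behind, so some kind of cleanup would be needed.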
How to show the results
I would like to highlight search matches in the table. Fulltext-results are currently shown in a tab in the entry editor - which I really do not like. Back when I implemented the feature, @calixtus proposed a way to show the results directly in the table by inserting a row under the corresponding entry that spans the whole table and shows the results. I cannot currently find the link Carl sent back then, but will look it up again. I think that would be a great way to highlight the search results.