Merge libraries #12290

Open
koppor opened this issue Jan 30, 2018 · 9 comments

Comments

@koppor
Member

koppor commented Jan 30, 2018

As a researcher, I have created dozens of .bib files, which I want to consolidate into one: I would like to point JabRef to a directory. Then, it recursively crawls the directory for *.bib files. For each file found: import it into the currently opened library.

For each entry:

  • If there is an equal entry, silently ignore it.
  • If there is an entry with the same key, do not import it (or open the merge entries dialog; maybe configurable).
  • If there is a duplicate entry (according to our algorithms), either a) pop up the merge entries dialog or b) do not import the entry at all. This may be configurable for "silent dropping" (see above).
  • If the attached file is not stored relatively under the directory of the bib file where it is imported, ask to copy it.
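The per-entry rules above can be read as a small decision procedure. The following is only a sketch under stated assumptions: `Entry`, `Decision`, and `MergeSketch` are hypothetical names, not JabRef classes, and the duplicate check is passed in as a parameter because the issue does not fix which algorithm is used. The attached-file rule is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical, simplified stand-in for a library entry (NOT JabRef's BibEntry).
record Entry(String key, Map<String, String> fields) {}

// Possible outcomes for one incoming entry, following the list in the issue.
enum Decision { IGNORE_EQUAL, KEY_CONFLICT, POSSIBLE_DUPLICATE, IMPORT }

final class MergeSketch {

    // Classifies one incoming entry against the currently opened library.
    static Decision classify(List<Entry> library, Entry incoming,
                             BiPredicate<Entry, Entry> isDuplicate) {
        // Rule 1: an equal entry already exists, so it is silently ignored.
        for (Entry existing : library) {
            if (existing.equals(incoming)) {
                return Decision.IGNORE_EQUAL;
            }
        }
        // Rule 2: same key but different content; skip or open the merge dialog.
        for (Entry existing : library) {
            if (existing.key().equals(incoming.key())) {
                return Decision.KEY_CONFLICT;
            }
        }
        // Rule 3: a duplicate according to the supplied (assumed) algorithm.
        for (Entry existing : library) {
            if (isDuplicate.test(existing, incoming)) {
                return Decision.POSSIBLE_DUPLICATE;
            }
        }
        return Decision.IMPORT;
    }

    // "Silent dropping" variant: only entries classified as IMPORT are added.
    static List<Entry> mergeSilently(List<Entry> library, List<Entry> incoming,
                                     BiPredicate<Entry, Entry> isDuplicate) {
        List<Entry> result = new ArrayList<>(library);
        for (Entry entry : incoming) {
            if (classify(result, entry, isDuplicate) == Decision.IMPORT) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Entry known = new Entry("Leymann2022", Map.of("title", "Quantum"));
        Entry fresh = new Entry("coolbook", Map.of("title", "Cool Book"));
        BiPredicate<Entry, Entry> never = (a, b) -> false;
        System.out.println(classify(List.of(known), fresh, never));
    }
}
```

A configurable variant would map `KEY_CONFLICT` and `POSSIBLE_DUPLICATE` to either "drop" or "open the merge entries dialog" instead of always dropping.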

Note that this issue refs #160. That issue is about updated paper-bib-files and a main file, whereas this issue here is about merging data of "old" bib files.

@koppor added and then removed the "good first issue" label (an issue intended for project newcomers; varies in difficulty) on Jan 30, 2018
@koppor
Member Author

koppor commented Jun 13, 2022

Test cases need to be implemented. Either create a separate sub folder in JabRef or use jimfs:

    testImplementation ('com.google.jimfs:jimfs:1.2') {
        exclude group: "com.google.auto.service"
        exclude group: "com.google.code.findbugs"
        exclude group: "org.checkerframework"
    }
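As a rough illustration of what such a test could exercise, here is a sketch of the recursive crawl written purely against `java.nio.file.Path`. The class name `BibFileCrawler` is hypothetical, and matching the extension case-insensitively is an assumption. The demo uses a temporary directory so that it needs only the JDK; since the method takes a `Path`, a path from a jimfs in-memory file system could presumably be passed in instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of JabRef.
final class BibFileCrawler {

    // Recursively collects all regular files below root whose name ends in ".bib".
    // Assumption: the extension is matched case-insensitively.
    static List<Path> findBibFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".bib"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny directory tree in a temp folder and crawl it.
        Path root = Files.createTempDirectory("bib-crawl-demo");
        Files.createDirectories(root.resolve("paper-a"));
        Files.writeString(root.resolve("main.bib"), "");
        Files.writeString(root.resolve("paper-a").resolve("paper.bib"), "");
        Files.writeString(root.resolve("paper-a").resolve("notes.txt"), "");
        findBibFiles(root).forEach(System.out::println);
    }
}
```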

@leonzolati

Hi, we are a group of students from the ANU, and we would really like to work on this issue. What is the procedure for doing that?

@leonzolati

In addition to the above question, we would like to ask for some clarification surrounding definitions.
What is the difference between an equal and a duplicate entry? When an entry has the same key, why should we not import it? From our understanding, a key is like an in-text reference, which can have duplicates, right?
Thank you very much.

@koppor
Member Author

koppor commented Oct 14, 2022

@leonzolati I assigned you, so it should be clear to others that someone is working on it.

Providing some background:

  • .bib files can have several thousand entries
  • When merging different .bib files, they may originate from different researchers.
  • The result of a merge is a single .bib file
  • I want a single .bib file open and then execute the function "Merge other bib files into current library..."
  • Merging 10 .bib files could lead to many thousands of entries
  • As a researcher, I do not want to have duplicates in my database.
  • As a user, I like the concept of wizards guiding me through the features of JabRef
  • BibTeX does not allow having multiple entries in the same file having the SAME BibTeX key (because the key needs to be unique).

What is the difference between an equal and duplicate entry?

org.jabref.model.entry.BibEntry#equals compares two entries following the Java conventions for equality. org.jabref.logic.database.DuplicateCheck#compareEntriesStrictly uses JabRef's duplicate algorithm.
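To make the distinction concrete, here is a toy sketch. It does not reproduce JabRef's `DuplicateCheck`; the `SimpleEntry` record and the title/year normalization are assumptions chosen only to show how two entries can fail strict equality and still be flagged as likely duplicates.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified entry type (NOT JabRef's BibEntry).
record SimpleEntry(String key, Map<String, String> fields) {}

final class DuplicateSketch {

    // Lowercases the value and collapses everything except letters and digits
    // into single spaces, so braces, punctuation, and case no longer matter.
    static String normalize(String value) {
        if (value == null) {
            return "";
        }
        return value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Illustrative heuristic only: same normalized title and same normalized year.
    static boolean looksLikeDuplicate(SimpleEntry a, SimpleEntry b) {
        String titleA = normalize(a.fields().get("title"));
        String titleB = normalize(b.fields().get("title"));
        String yearA = normalize(a.fields().get("year"));
        String yearB = normalize(b.fields().get("year"));
        return !titleA.isEmpty() && titleA.equals(titleB) && yearA.equals(yearB);
    }

    public static void main(String[] args) {
        SimpleEntry mine = new SimpleEntry("coolbook",
                Map.of("title", "A Cool Book", "year", "2020"));
        SimpleEntry yours = new SimpleEntry("Smith2020",
                Map.of("title", "{A} cool book.", "year", "2020"));
        System.out.println("equal: " + mine.equals(yours));
        System.out.println("duplicate: " + looksLikeDuplicate(mine, yours));
    }
}
```

Here the two entries differ in key and in the raw title string, so `equals` does not match, while the heuristic still treats them as the same work.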

from our understanding a key is like an intext reference which can have duplicates right?

I don't get your question fully.

I think you mean that the BibTeX key coolbook could refer to different entries for you and me. Thus, it is NOT enough to check for BibTeX key equivalence.

However, Leymann2022 could be the same entry. As a researcher, I do NOT want to have the same entries in my database.

Maybe the following helps: Please check the paragraph at https://en.wikipedia.org/wiki/BibTeX#Basic_structure. With \cite{KEY} (from a .tex file), a reference to an entry in a .bib database is made. JabRef manages .bib files only. Example .bib file: https://github.com/JabRef/jabref/blob/main/src/test/resources/testbib/complex.bib

I also have a small presentation at https://speakerdeck.com/koppor/jabref-and-open-source-development?slide=6 - however, it is somehow incomplete as \cite{KEY} is missing.

@leonzolati

leonzolati commented Oct 18, 2022

Thank you for this explanation; it helped us a lot. Just to keep you informed, we have a working prototype of the code with some black-box testing, but we plan to increase code coverage in the coming days. We would like to ask for some additional clarification on this point:

What is meant by the following: "If the attached file is not stored relatively under the directory of the bib file where it is imported, ask to copy it."? Do you mean that the directory containing the .bib files to merge should be a child of the working directory and, if it isn't, we should ask to copy it into a new directory that is a child of the working directory?

Thank you very much, Leon.

@koppor
Member Author

koppor commented Nov 2, 2022

what is meant by the following: If the attached file is not stored relatively under the directory of the bib file where it is imported, ask to copy it.

Maybe this is too confusing for the user and should be a separate functionality. :) Forget about it in your PR.

@claell
Contributor

claell commented Apr 27, 2023

Just out of curiosity, in which way is this different to opening both files in JabRef, copying all entries from one file and pasting them in the other file? Isn't there already some check for duplicates and possibilities of merging? And if not sufficient, should this be enhanced on the go while implementing this functionality?

@koppor
Member Author

koppor commented Apr 28, 2023

It is more of a convenience feature. JabRef currently does not allow for having a "view" on all BibTeX libraries. JabRef is still file-based. Situation: the research group manages papers at c:\git-repositories\publications. I manage my bib at c:\git-repositories\private-library. I just want to know all entries of the whole group. I can open all bib files, but with > 300 publications of the whole group, the usability of JabRef would suffer. Moreover, I do not want to manually open the bib file of each new publication and put it into JabRef. When I am in the mode of collecting references, I just want to have a "sync" of existing references. Thereby, I do not want to think: Which publications are new? Which bib files might have changed...

I know

Nevertheless, it is a good exercise for students to think of cases, edge cases, ...

For sure, the duplication check needs to be adapted (in any case) - refs #9769.

@claell
Contributor

claell commented Oct 14, 2023

Got it. In your use case, with a ton of different publications that each have their own .bib file, that indeed makes sense!
One more thing to add to your suggestion: Having the imported entries grouped in the central database by their original source would be pretty helpful (at least to me). That also means that duplicates won't simply be skipped: if the duplicate comes from a new source, it should get assigned to the corresponding group in the central database.

Additionally, there might be use cases where one wants to remove a group from an entry in the central database (possibly also removing the entry altogether if no groups from sources are left) when an entry is removed from a source. That might be even harder to implement in a robust way, though.
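The bookkeeping described here could be sketched roughly as follows. This is only an illustration: `SourceGroups` is a hypothetical class, entries are identified by their key for simplicity, and how this would map onto JabRef's actual group model is left open.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical bookkeeping of which source .bib files an entry came from.
final class SourceGroups {

    // Entry key -> names of the source files (groups) that contain the entry.
    private final Map<String, Set<String>> sourcesByKey = new TreeMap<>();

    // Called on import; a duplicate from a new source just adds another group.
    void recordImport(String entryKey, String source) {
        sourcesByKey.computeIfAbsent(entryKey, k -> new TreeSet<>()).add(source);
    }

    // Called when a source no longer contains the entry.
    // Returns true if no source group is left, i.e. the entry itself
    // could be removed from the central database.
    boolean recordRemoval(String entryKey, String source) {
        Set<String> sources = sourcesByKey.get(entryKey);
        if (sources == null) {
            return false;
        }
        sources.remove(source);
        if (sources.isEmpty()) {
            sourcesByKey.remove(entryKey);
            return true;
        }
        return false;
    }

    // Read-only view of the groups currently assigned to an entry.
    Set<String> sourcesOf(String entryKey) {
        return Collections.unmodifiableSet(
                sourcesByKey.getOrDefault(entryKey, Collections.emptySet()));
    }

    public static void main(String[] args) {
        SourceGroups groups = new SourceGroups();
        groups.recordImport("Leymann2022", "paper-a.bib");
        groups.recordImport("Leymann2022", "paper-b.bib");
        System.out.println(groups.sourcesOf("Leymann2022"));
    }
}
```

The hard part mentioned above, detecting robustly that an entry was removed from a source, is outside this sketch; it only shows what would happen once that is known.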

@koppor koppor transferred this issue from JabRef/jabref-koppor Dec 11, 2024
Projects
Status: Free to take