Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added option to import CFF files #7946

Merged
merged 9 commits into from
Sep 1, 2021
Merged

Conversation

AidanM11
Copy link
Contributor

@AidanM11 AidanM11 commented Jul 29, 2021

Added ability to import CFF files as requested in #7945. As per CFF specifications, there is only one entry per CFF file, and CFF entries are always of the type "Software". JabRef does not support all the fields that CFF does, so a fair amount of information could be lost on import (specifically details about the authors beyond their names). I still need to test this more thoroughly and add automated tests, so I've marked this as a draft. I also plan on adding export support.

fixes #7945

Images:
Capture
Real2
Capture3

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

@Siedlerchr
Copy link
Member

Thanks for tackling this issue. For the Authors take a look at Jabrefs Author List and Authors parser. Maybe you can use that.
For fields that do not map to standard fields, you could import/add them with a prefix

@Siedlerchr
Copy link
Member

For the java classes from the yaml, you can generate them from https://www.jsonschema2pojo.org/

@Siedlerchr
Copy link
Member

@AidanM11 What is the status here? It would be really cool if you could continue working on this

@AidanM11
Copy link
Contributor Author

Sorry about the delay, had a busy couple weeks. I've added tests and made some minor changes to the code. As of now, the importer can parse and map most of the fields CFF supports. The main exception is the person/entity system from CFF. In CFF, the authors and contact fields are lists of Person or Entity objects, which each have their own extensive lists of fields for all kinds of information. It seems like it would be very difficult to fit that system into the way jabref handles fields, so instead I set it up so that the importer just extracts the names of each object in the 'authors' list and concatenates them into an author string. In this case, the import operation seems unavoidably lossy.

Similarly, CFF supports a list of 'identifiers' of different types (url, swh, other). JabRef does have support for these fields, but the CFF specs appear allow the user to add as many of these identifiers as they like (I'm not entirely sure about this, but it seems like it). So a single entry could have multiple SWH identifiers, which seems tricky to work into the JabRef system.

However, there are also some 'static' fields in CFF that don't have JabRef counterparts (specifically: commit, license-url, repository-code, repository-artifact). You mentioned earlier that if CFF fields don't map to JabRef standard fields, I could "import/add them with a prefix". I wasn't sure what that meant, and was unable to find any examples of it in the codebase, so I was wondering if I could get a little more clarification on that.

So, overall, the importer is nearly finished, I just need to work out what to do with identifiers (maybe just use the first one, or do a comma-separated list?) and also what to do about the 'static' fields without mappings.

hm.put("abstract", StandardField.ABSTRACT);
hm.put("message", StandardField.COMMENT);
hm.put("date-released", StandardField.DATE);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fields that can't be mapped, you can simply create a new UnkownField: (example is taken from the MSBibImporter)
And set CFF_PREFIX to sth ilke cff-
new UnknownField(CFF_PREFIX + "medium"), "Medium");

JabRef can show non-standard fields in the entry editor as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a better solution than prefixing those fields, since none of the citation styles would pick them up - but you might want to have the repository printed in the reference list. I've started a discussion at the biblatex repo how to handle those fields: plk/biblatex#1169

String aAlias = auth.vals.get("alias");

if (aName != null) {
authors.add(aName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JabRef has a class called Author which will handle the construction of authors, see org.jabref.model.entry
public Author(String first, String firstabbr, String von, String last, String jr)
Afterwards, you put in an AuthorLIst and return the and separated value
See for example the Medra-Fetcher (org.jabref.logic.importer.fetcher)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from String firstabbr, CFF has the same fields for person type objects (e.g., authors). The main difference between these and the entity type objects (for this use case at least) is that entities only have a single field name instead, which I guess could be imported as is?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that seems to be correct, for reference:

/**
* Creates the Author object. If any part of the name is absent, <CODE>null</CODE> must be passed; otherwise other methods may return erroneous results.
*
* @param first the first name of the author (may consist of several tokens, like "Charles Louis Xavier Joseph" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
* @param firstabbr the abbreviated first name of the author (may consist of several tokens, like "C. L. X. J." in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin"). It is a responsibility of the caller to create a reasonable abbreviation of the first name.
* @param von the von part of the author's name (may consist of several tokens, like "de la" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
* @param last the last name of the author (may consist of several tokens, like "Vall{\'e}e Poussin" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
* @param jr the junior part of the author's name (may consist of several tokens, like "Jr. III" in "Smith, Jr. III, John")
*/
public Author(String first, String firstabbr, String von, String last, String jr) {

@sdruskat
Copy link

Hi 👋, CFF co-lead here. It's really great to see this work happening for JabRef!

As per CFF specifications, there is only one entry per CFF file, and CFF entries are always of the type "Software".

FYI: as of version 1.2.0, CFF also supports datasets experimentally (type: dataset).

@sdruskat
Copy link

Similarly, CFF supports a list of 'identifiers' of different types (url, swh, other). JabRef does have support for these fields, but the CFF specs appear allow the user to add as many of these identifiers as they like (I'm not entirely sure about this, but it seems like it). So a single entry could have multiple SWH identifiers, which seems tricky to work into the JabRef system.

Yes, people can supply several identifiers for a software (which reflects reality).
Perhaps you could take a hierarchical approach, checking for some things first, then others, etc., and pick one entry based on a heuristic? E.g.,

  1. CFF has root level doi field or exactly one doi entry in the identifiers section: take that
  2. CFF has no DOIs, but has exactly one swh entry in the identifiers: take that
  3. CFF has more than one swh-type identifier, including one with the rel type: take that
  4. etc.

Happy to help with such a heuristic which I think should ideally be based on the software citation principles as much as possible.

However, there are also some 'static' fields in CFF that don't have JabRef counterparts (specifically: commit, license-url, repository-code, repository-artifact). You mentioned earlier that if CFF fields don't map to JabRef standard fields, I could "import/add them with a prefix". I wasn't sure what that meant, and was unable to find any examples of it in the codebase, so I was wondering if I could get a little more clarification on that.

In terms of what the software citation principles suggest, repository-code will contain a value that's potentially more important to keep than the others, and may even be used to override a generic url field. But that's semantics which I'm not sure you want to have in JabRef?

@AidanM11
Copy link
Contributor Author

Alright, so I've changed the author handling to incorporate the JabRef Author system, as @Siedlerchr suggested, and it seems to be working well.
For the identifiers, I've implemented something similar to the heuristic @sdruskat suggested. As JabRef has separate fields for DOIs and SWHIDs, they have separate heuristics. Currently, they are:

For DOI:

  1. If there is a root-level DOI, use that
  2. If there is a single DOI in the identifiers list, use that
  3. Otherwise, do not use any DOI.

For SWHID

  1. If there is a single SWHID, use that.
  2. If there are multiple SWHIDs, but only one with the "rel" tag, use that one.
  3. Otherwise, do not use any SWHID.

As you can see, when there are multiple identifiers of the same type, the current system just discards them. This doesn't seem ideal, but as the CFF file does not contain any other information that could differentiate those identifiers, it seems like the only alternative is to choose a random one, and that seems like a bad option. I guess it could also put them all into the field in a comma-separated list, but that seems like it goes against the intended use of those fields.

Anyway, besides that, I've still got a small list of things to do before this pull request is finished (updated tests, handling fields without mappings, dataset type support, etc) so I'll keep working on it. Just wanted to make sure this identifier handling is satisfactory.

@Siedlerchr
Copy link
Member

I think this sounds good for the heuristic, DOI is always a good/valid option. Provides the easiest way to get any addtional metadata if needed.
I also added a devcall label, so that we @JabRef/developers can discuss this in detail in the next devcall if we come up with another idea or addtional considerations. But your approach so far sounds good

@jspaaks
Copy link

jspaaks commented Aug 24, 2021

Thanks @AidanM11 for doing this work! I like your heuristics. Perhaps one thing to consider is that it's relatively common to have two DOIs in identifiers: one for the version of the software, and one for the concept / collection / all versions. For such cases, neither would make it through if there isn't a root doi, right? I guess that is by design, since it is a bit of an undecidable problem. Anyway, an alternative could be to pick the first one and use that, something like identifiers.filter(type=doi)[0]. I understand though if you don't like to make this arbitrary choice. I've made a suggestion over at Citation File Format to introduce some kind of qualifier like datacite's isVersionOf, isCitedBy, etc in order to help tooling decide which element of identifiers is most appropriate for their purposes.

@AidanM11 AidanM11 marked this pull request as ready for review August 28, 2021 07:16
@Siedlerchr
Copy link
Member

Thanks for the work, can you please merge in the latest upstream main branch of JabRef and resolve the conflicts?

List<String> relSwhIds = swhIds.stream()
.filter(id -> id.split(":").length > 3) // quick filter for invalid swhids
.filter(id -> id.split(":")[2].equals("rel"))
.collect(Collectors.toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here you can also use findFirst again as above

Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me and thanks for the tests! I have just some minor improvements regarding the code style.

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Aug 30, 2021
@AidanM11
Copy link
Contributor Author

AidanM11 commented Sep 1, 2021

@Siedlerchr I've made the requested changes. Thanks for the review!

Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm!

Copy link
Member

@calixtus calixtus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed some minor style issues. Otherwise LGTM, thanks for making JabRef better.
As soon as tests are green, we're merging.

@calixtus calixtus merged commit 9f0430d into JabRef:main Sep 1, 2021
Siedlerchr added a commit that referenced this pull request Sep 2, 2021
* upstream/main:
  Enable copy&paste for IntelliJ run configuration
  GitBook: [main] one page modified
  GitBook: [main] one page modified
  GitBook: [main] 19 pages and 6 assets modified
  Refine documentation
  Oobranch f : add frontend (#7791)
  Observable Preferences C (General) (#8047)
  Welcome btut 🎉
  Observable preferences B (ProtectedTerms, EntryEditor and MrDlib) (#8046)
  Refine howto (#8049)
  Decision on handling of localized preferences.
  Added option to import CFF files (#7946)
  Observable preferences A (Appearance and Importer) (#8041)
  Squashed 'buildres/csl/csl-locales/' changes from ec6d62a9e7..7a507fc008
  Squashed 'buildres/csl/csl-styles/' changes from ec4a4c0..2b37392
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Citation File Format (CFF, yaml) as an import format
6 participants