Review date handling with header consolidation and date serialization in references #807

Merged
merged 10 commits into from
Aug 18, 2021

kermitt2
Owner

@kermitt2 kermitt2 commented Aug 2, 2021

This is a follow-up of PR #761, with:

  • merging of dates with header consolidation: e.g. when the extracted date is <date type="published" when="2011-05-23">23 May 2011</date> and the consolidated date is <date type="published" when="2011-05"/>, we keep the first one.

  • better date serialization in references: we output both the normalized and the raw extracted date (as done in the header), in both the XML TEI and BibTeX formats

Note: move toISOString() to class Date.java

Add tests for structured date merging.

@kermitt2
Owner Author

kermitt2 commented Aug 2, 2021

A test PDF to see the header date merging in action :)

fulltext02.pdf

@btut btut left a comment

I added some comments on how to deal with normalized vs non-normalized dates. Curious what you think!

* "2011" "2010" -> 2011
*/
public static Date merge(Date date1, Date date2) {
if (date1.getYear() != -1 && date2.getYear() != -1) {

if date1.getYear() == -1, this still returns date1.
What about something like

if (date1.getYear() == -1) {
    return date2;
}
if (date2.getYear() == -1) {
    return date1;
}

// rest of the method as-is

Collaborator

Indeed, I share the same concern about this. If date1 is nonexistent, doesn't it make sense to take date2?

Owner Author

yes! when I wrote it, I was considering that it is irrelevant to try to merge something when date1 is undefined, but this does not cover the normal general usage.
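
A minimal sketch of the guard-clause version suggested in this review thread. `PartialDate` is a hypothetical stand-in for Grobid's `Date` class (it is not the merged implementation), and it assumes, as in the snippet under review, that `-1` marks an undefined field. The same-year refinement rule is also an assumption, chosen to match the "keep the more precise date" example from the PR description.

```java
/** Hypothetical stand-in for Grobid's Date class; -1 marks an undefined field. */
final class PartialDate {
    final int year;
    final int month;
    final int day;

    PartialDate(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /** Merge two dates, preferring date1 on conflict; the inputs are never mutated. */
    static PartialDate merge(PartialDate date1, PartialDate date2) {
        // Guard clauses from the review: if one side is undefined, take the other.
        if (date1 == null || date1.year == -1) {
            return date2;
        }
        if (date2 == null || date2.year == -1) {
            return date1;
        }
        // Conflicting years: keep the first date, e.g. "2011" "2010" -> 2011.
        if (date1.year != date2.year) {
            return date1;
        }
        // Same year (assumed rule): only fill fields that date1 is missing.
        int month = date1.month;
        int day = date1.day;
        if (month == -1) {
            month = date2.month;
            day = date2.day;
        } else if (day == -1 && month == date2.month) {
            day = date2.day;
        }
        return new PartialDate(date1.year, month, day);
    }
}
```

With this sketch, merging an extracted 2011-05-23 with a consolidated 2011-05 keeps the day, in either argument order, and an undefined first date simply yields the second one.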

Comment on lines 1994 to 1996
if (publication_date != null) {
bibtex.add(" year = {" + publication_date + "}");
}

I think date is preferred over year and if used, year should only consist of a numeric year.
I see two options here:

  1. Put this in an else branch to the if before it and write it to the date field.
    So, if a normalized_publication_date exists, use it as iso string, else if the publication_date exists use that.
  2. Don't format the normalized_publication_date as ISO string but use the year, month and day fields instead. This would ensure that year, month and day are numeric and the (more detailed, with day ranges) publication_date could be put as a string in the date field. If both are given, it is up to the user to decide which to keep.

@kermitt2
Owner Author

kermitt2 commented Aug 5, 2021

Linking to #800 (comment) from @koppor

I am asking a few questions for clarification below.

@kermitt2
Owner Author

kermitt2 commented Aug 5, 2021

So what I understand from @koppor explanations, correct me if I am again wrong :D

  • preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?

  • general bibtex does not support the range format and anyway the Grobid ISO conversion neither. But if Grobid ISO conversion would support date range in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.

  • if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?

Note that the current bibtex format implementation for references output raw dates like that currently, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

Naive remarks:

  • I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?

  • the XML format will provide to the users more information at different level of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

@btut

btut commented Aug 5, 2021

Hi!
First things first, I am not a pro-bibtex-user either, so take my opinions and thoughts with a grain of salt and rely on @koppor's expertise. I am sure he will give his input as well, but I can't help but give my thoughts too :)

So what I understand from @koppor explanations, correct me if I am again wrong :D

  • preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?

That's what I understood as well.

  • general bibtex does not support the range format and anyway the Grobid ISO conversion neither. But if Grobid ISO conversion would support date range in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.

Yes. I think best case would be to parse the range and put the beginning of the range to the bibtex fields. So Januarry 16-28th 1987 would result in year=1987, month=1 and day=16. I don't know how easy the parsing would be for this case.
For now, I think it would be acceptable to put the full date string January 16-18th 1987 into the date field, as most citation styles would just use that verbatim (again, I would rely on @koppor's expertise to confirm this).

  • if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?

As above, 'wrong/malformatted' information is better than no information at all, so I would just put the raw string into the date field. If a user then uses a citation style that is not compatible, there is not much we can do. At least there is the raw date and the user can change that to fit his/her citation style.

Note that the current bibtex format implementation for references output raw dates like that currently, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

I plan on making another PR next week to put my ideas into code. I would then put the raw string into the date field if year, month and day cannot be determined (normalized_publication_date == null). I think it is better to put the 'malformed' string into the date field, as the year field is mostly expected to be an integer.

Naive remarks:

  • I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?

@koppor is the pro here :)

  • the XML format will provide to the users more information at different level of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

Not that I know of. But what kind of granularity do you mean? I think JabRef users (and bibtex users in general) are only interested in the fields that show up in their publication when they cite a work. So mainly author (without affiliation...), journal, year, ... The additional information provided by Grobid (very impressive by the way) is not interesting for this use case, right? Am I missing something?

Thanks for putting your time into this. I am sure it is a really nice feature for JabRef users and people considering a migration to JabRef (as they would just have to import pdfs and Grobid would do the hard part!)

@kermitt2
Owner Author

kermitt2 commented Aug 6, 2021

But what kind of granularity do you mean?

I was thinking giving normalized/parsed and raw data in explicit fields like in XML:

<date type="published" when="2014">April 7 - 10, 2014</date>

so just pushing the idea " 'wrong/malformatted' information is better than no information at all " (which I agree with, on behalf of Grobid's routine production of 'wrong/malformatted' information :D )
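
As a rough illustration of carrying both values in one element, here is a small sketch that builds such a `<date>` element from a normalized ISO string and a raw string. `TeiDateSketch`, `toTei` and `escape` are hypothetical names for this example only and are not Grobid's TEI serializer; the escaping is deliberately minimal.

```java
/** Hypothetical helper, not part of Grobid: builds a TEI date element. */
final class TeiDateSketch {

    /** Minimal XML escaping for attribute values and text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * isoWhen is the normalized date (may be null when normalization failed),
     * raw is the date string as extracted (may be null, e.g. for consolidated data).
     */
    static String toTei(String isoWhen, String raw) {
        StringBuilder sb = new StringBuilder("<date type=\"published\"");
        if (isoWhen != null) {
            sb.append(" when=\"").append(escape(isoWhen)).append('"');
        }
        if (raw == null || raw.isEmpty()) {
            return sb.append("/>").toString();
        }
        return sb.append('>').append(escape(raw)).append("</date>").toString();
    }
}
```

Calling `TeiDateSketch.toTei("2014", "April 7 - 10, 2014")` reproduces the element shown above, and a null raw string gives the self-closing form used for consolidated dates.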

I think JabRef users (and bibtex users in general) are only interested in the fields that show up in their publication when they cite a work.

Yes !

Collaborator

@lfoppiano lfoppiano left a comment

Good idea to return a new object. 👍

Another small detail. I was thinking to change it directly but I might have overlooked something else.
The class implements Comparable. I suggest implementing Comparable<Date> instead and remove the "compareTo(Object another)" so that the "client" will take care of passing the correct object.
If you agree, @kermitt2 I can implement it and update the usage in the other parts.

@kermitt2
Owner Author

The class implements Comparable. I suggest implementing Comparable instead and remove the "compareTo(Object another)" so that the "client" will take care of passing the correct object.

implements Comparable<Date> correct ? -> sure it would be better indeed !
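
For illustration, a sketch of what the typed variant could look like. `ComparableDate` is a hypothetical class, not Grobid's `Date`, and the ordering (year, then month, then day, with undefined `-1` fields sorting first) is an assumption made for the example.

```java
/** Hypothetical class showing Comparable<T> instead of the raw Comparable. */
final class ComparableDate implements Comparable<ComparableDate> {
    final int year;
    final int month;
    final int day;

    ComparableDate(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    // The compiler now rejects calls with a non-date argument,
    // so no cast or instanceof check is needed inside the method.
    @Override
    public int compareTo(ComparableDate other) {
        int c = Integer.compare(year, other.year);
        if (c != 0) {
            return c;
        }
        c = Integer.compare(month, other.month);
        if (c != 0) {
            return c;
        }
        return Integer.compare(day, other.day);
    }
}
```

Callers such as `Collections.sort` then work on a `List<ComparableDate>` without unchecked warnings.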

@btut btut mentioned this pull request Aug 12, 2021
@btut

btut commented Aug 12, 2021

Sorry for the delay, I was finally able to implement my thoughts in #814.

I was thinking giving normalized/parsed and raw data in explicit fields like in XML

I like that idea as well. Introducing new fields in bibtex is not an issue (something like 'rawdate'). I am afraid that if we put the raw string into the date field, users will assume that it is in a valid format and not bother fixing potential issues. Using an additional field would help here, as the 'rawdate' would not end up in citations and users would notice and fix the format themselves (or look up the correct date).
I would be interested in @koppor's thoughts about that.

@btut

btut commented Aug 17, 2021

@kermitt2 and @lfoppiano Google Summer-of-Code ends this week, it would be amazing if we could get this (or #814) merged before then so we can update the JabRef server and merge the feature. Is there anything I can do to help make that happen?

If not, we probably will update the JabRef server to #814 instead of the master branch, which is still fine, but it would be a great personal achievement for me to have everything merged and ready for users to try out.

Thanks for your support!

@kermitt2
Owner Author

@btut I thought we were waiting for @koppor's feedback?

@btut

btut commented Aug 17, 2021

Oh, sorry. I just assumed that since he did not disagree he would be ok with this. But you are right, let's see if @koppor has something to add.

@kermitt2
Owner Author

I think there's just one point a bit unclear/contradictory:

@koppor
Contributor

koppor commented Aug 17, 2021

So what I understand from @koppor explanations, correct me if I am again wrong :D

  • preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?
    That's what I understood as well.

+1

  • general bibtex does not support the range format and anyway the Grobid ISO conversion neither. But if Grobid ISO conversion would support date range in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.
    Yes. I think best case would be to parse the range and put the beginning of the range to the bibtex fields. So Januarry 16-28th 1987 would result in year=1987, month=1 and day=16. I don't know how easy the parsing would be for this case.
    For now, I think it would be acceptable to put the full date string January 16-18th 1987 into the date field, as most citation styles would just use that verbatim (again, I would rely on @koppor's expertise to confirm this).

+1 for year, month, day.

The date seems to be parsable (assuming the typo in Januarry is not intentional and it should be January). Then, it is:

1987-01-16/1987-01-28

(Other references listed at #800 (comment))
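
As a sketch of how the start of such a range could feed the year, month and day fields: the helper below takes an ISO date or a `start/end` interval and parses only the start. `IsoRangeStart` and `startOf` are hypothetical names, Grobid does not currently normalize ranges according to this thread, and the sketch only handles full `yyyy-MM-dd` dates.

```java
import java.time.LocalDate;

/** Hypothetical helper, not part of Grobid: extracts the start of an ISO date interval. */
final class IsoRangeStart {

    /** Accepts "yyyy-MM-dd" or "yyyy-MM-dd/yyyy-MM-dd" and returns the first date. */
    static LocalDate startOf(String isoDateOrRange) {
        int slash = isoDateOrRange.indexOf('/');
        String start = slash >= 0 ? isoDateOrRange.substring(0, slash) : isoDateOrRange;
        // LocalDate.parse expects the ISO local date format and throws
        // DateTimeParseException for anything else (e.g. a year-only value).
        return LocalDate.parse(start);
    }
}
```

For the range above, `startOf("1987-01-16/1987-01-28")` gives a date whose `getYear()`, `getMonthValue()` and `getDayOfMonth()` are 1987, 1 and 16.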

  • if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?
    As above, 'wrong/malformatted' information is better than no information at all,

Yeah - the bibtex processors are robust for that case and will output some errors during processing. I also think it's better to have the information there than to use other fields (and thus create another standard).

Note that the current bibtex format implementation for references output raw dates like that currently, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

I plan on making another PR next week to put my ideas into code. I would then put the raw string into the date field if year, month and day cannot be determined (normalized_publication_date == null). I think it is better to put the 'malformed' string into the date field, as the year field is mostly expected to be an integer.

+1.

  • I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?
    @koppor is the pro here :)

The standard BibTeX styles just copy the year information verbatim into the library (Source: https://tex.stackexchange.com/a/121511/9075). Depends on the other bibtex tooling (natbib, plain latex, ...) how to process it.

natbib, for instance, needs a workaround to have year-ranges working: https://tex.stackexchange.com/a/385126/9075

It is difficult to support all "flavours" of bibtex (plain, natbib, biblatex, ...) in one .bib output. Especially, because the different bibtex styles (.bst files) process the field contents

I was thinking giving normalized/parsed and raw data in explicit fields like in XML

I like that idea as well. Introducing new fields in bibtex is not an issue (something like 'rawdate').

+1 for rawdate.

  • the XML format will provide to the users more information at different level of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

On the one hand, JabRef's main focus is BibTeX with tailored support of BibTeX special functionalities (such as pre-defined reusable strings, cleaning up entries, cross-references, ...). On the other hand, BibTeX is "just" a key/value serialization, JabRef offers any fields (https://docs.jabref.org/advanced/fields#define-your-own-fields) and treats unknown fields as string-typed. For instance, a "summary" field could be added (not being "just" a comment).

@koppor
Contributor

koppor commented Aug 17, 2021

Note that only biblatex handles date (and possibly some bibtex styles; not all). Thus, the fields year, month, and day should always be filled if possible.

@btut

btut commented Aug 17, 2021

So if I understood correctly:

Thus, the fields year, month, and day should always be filled if possible.

So then let's fill them using the integers from normalized_publication_date if present, otherwise place the raw date into the year field.
This also matches:

date in BibTeX is always ISO normalized -> "iso8601-2 Extended Format specification level 1", which is a "yes" to your answer

but contradicts:

If date cannot be normalized, raw date-string goes in date field (from #814)

That was my assumption. But if year needs to be filled, let's put the raw string there (as is in this PR, changed it back in #814).

So #814 only adds that, if the date is parseable, year, month and day are filled in addition to date.

I changed the base of #814 to just highlight the differences.
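
The rule described in this comment can be sketched roughly as follows. `BibtexDateFields` and `dateFields` are hypothetical names, not the code from this PR or from #814; the sketch assumes `-1` marks an undefined component and that the raw string goes into the year field only when no normalized year is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Grobid: builds the BibTeX date-related fields. */
final class BibtexDateFields {

    /** year/month/day use -1 for "undefined"; rawDate may be null. */
    static List<String> dateFields(int year, int month, int day, String rawDate) {
        List<String> fields = new ArrayList<>();
        if (year != -1) {
            // Normalized date available: numeric year/month/day plus an ISO date.
            StringBuilder iso = new StringBuilder(String.format(Locale.ROOT, "%04d", year));
            fields.add("  year = {" + year + "}");
            if (month != -1) {
                iso.append(String.format(Locale.ROOT, "-%02d", month));
                fields.add("  month = {" + month + "}");
                if (day != -1) {
                    iso.append(String.format(Locale.ROOT, "-%02d", day));
                    fields.add("  day = {" + day + "}");
                }
            }
            fields.add(0, "  date = {" + iso + "}");
        } else if (rawDate != null) {
            // Normalization failed: fall back to the raw string in the year field.
            fields.add("  year = {" + rawDate + "}");
        }
        return fields;
    }
}
```

A fully normalized date yields date, year, month and day; a year-only normalization yields just date and year; and only an entry with no normalized year at all carries the raw string.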

@kermitt2
Owner Author

@btut I've added the changes from #814 here (with a typo fixed) and updated the tests to reflect the new expected output.

Not sure it's what is expected:

  • for
Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service
Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014.

we have:

@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {2014},
  address = {Oxford, United Kingdom}
}

and not:

@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {April 7 - 10, 2014},
  address = {Oxford, United Kingdom}
}

If it's ok like that, it's ready to merge!

@btut

btut commented Aug 18, 2021

Thanks for your effort @kermitt2!
I like this result! It should be broadly compatible with BibTeX.

@kermitt2 kermitt2 merged commit e450e4f into master Aug 18, 2021
@btut

btut commented Aug 18, 2021

Thanks again for your support!

@lfoppiano lfoppiano deleted the follow-up-761 branch August 19, 2021 08:28