Review date handling with header consolidation and date serialization in references #807

kermitt2 · 2021-08-02T01:31:51Z

This is a follow-up of PR #761, with:

merging of the date with header consolidation, e.g. when the extracted date is <date type="published" when="2011-05-23">23 May 2011</date> and the consolidated date is <date type="published" when="2011-05"/>, we keep the first one.
better date serialization in references: we output both the normalized and the raw extracted date (as done in the header), for both the XML TEI and BibTeX formats
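
To make the second point concrete, here is a minimal sketch of a TEI date element that carries the normalized value in the when attribute and the raw extracted string as element text. The class TeiDateSketch and its methods are hypothetical names used only for illustration, not the Grobid code, and the convention that -1 marks an unknown field is an assumption borrowed from the getYear() != -1 check in the diff.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only, not the Grobid implementation. */
final class TeiDateSketch {

    /** Partial ISO 8601 string ("2011", "2011-05", "2011-05-23"); -1 marks an unknown field. */
    static String toIsoString(int year, int month, int day) {
        if (year == -1) {
            return null; // without a year there is nothing to normalize
        }
        StringBuilder iso = new StringBuilder(String.format(Locale.ROOT, "%04d", year));
        if (month != -1) {
            iso.append(String.format(Locale.ROOT, "-%02d", month));
            if (day != -1) {
                iso.append(String.format(Locale.ROOT, "-%02d", day));
            }
        }
        return iso.toString();
    }

    /** Serializes the normalized date as the "when" attribute and the raw string as text. */
    static String toTei(int year, int month, int day, String raw) {
        StringBuilder tei = new StringBuilder("<date type=\"published\"");
        String iso = toIsoString(year, month, day);
        if (iso != null) {
            tei.append(" when=\"").append(iso).append('"');
        }
        if (raw == null || raw.isEmpty()) {
            return tei.append("/>").toString();
        }
        return tei.append('>').append(escape(raw)).append("</date>").toString();
    }

    /** Minimal XML text escaping for the raw string. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With these assumptions, toTei(2011, 5, 23, "23 May 2011") reproduces the first element quoted above, and toTei(2011, 5, -1, null) reproduces the empty consolidated one.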

Note: move toISOString() to class Date.java

Add tests for structured date merging.

kermitt2 · 2021-08-02T01:38:20Z

A test PDF to see the header date merging in action :)

fulltext02.pdf

btut

I added some comments on how to deal with normalized vs non-normalized dates. Curious what you think!

btut · 2021-08-02T05:33:13Z

grobid-core/src/main/java/org/grobid/core/data/Date.java

+     * "2011" "2010" -> 2011
+     */
+    public static Date merge(Date date1, Date date2) {
+        if (date1.getYear() != -1 && date2.getYear() != -1) {


If date1.getYear() == -1, this still returns date1.
What about something like

    if (date1.getYear() == -1) {
        return date2;
    }
    if (date2.getYear() == -1) {
        return date1;
    }
    // rest of the method as-is

Indeed, I share the same concern about this. If date1 is nonexistent, doesn't it make sense to take date2?

Yes! When I wrote it, I was considering that it is irrelevant to try to merge something when date1 is undefined, but this does not cover the normal general usage.
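
For illustration, the guard discussed in this thread could be combined with the merge examples from the PR description roughly as follows. This is only a sketch: DateSketch is a hypothetical stand-in for the real Date class, -1 is assumed to mean "undefined" for every field, and the rule for completing a missing month or day is an assumption rather than the merged code.

```java
/** Hypothetical stand-in for the Date class; -1 means the field is undefined. */
final class DateSketch {
    final int year;
    final int month;
    final int day;

    DateSketch(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /** Merges two dates, preferring date1 and completing it from date2 where compatible. */
    static DateSketch merge(DateSketch date1, DateSketch date2) {
        // Guards suggested in the review: an undefined date never wins.
        if (date1.year == -1) {
            return date2;
        }
        if (date2.year == -1) {
            return date1;
        }
        // Conflicting years ("2011" vs "2010"): keep the first date.
        if (date1.year != date2.year) {
            return date1;
        }
        // Same year: keep date1's fields and fill only the gaps from date2.
        int month = date1.month != -1 ? date1.month : date2.month;
        int day = date1.day;
        if (day == -1 && month != -1 && month == date2.month) {
            day = date2.day; // borrow the day only when the months agree
        }
        return new DateSketch(date1.year, month, day);
    }
}
```

Under these assumptions, merging 2011-05-23 with 2011-05 keeps 2011-05-23, and an undefined first date yields the second one.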

btut · 2021-08-02T05:40:01Z

grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java

+            if (publication_date != null) {
+                bibtex.add("  year = {" + publication_date + "}");
+            }


I think date is preferred over year and if used, year should only consist of a numeric year.
I see two options here:

Put this in an else branch to the if before it and write it to the date field.
So, if a normalized_publication_date exists, use it as iso string, else if the publication_date exists use that.

Don't format the normalized_publication_date as ISO string but use the year, month and day fields instead. This would ensure that year, month and day are numeric and the (more detailed, with day ranges) publication_date could be put as a string in the date field. If both are given, it is up to the user to decide which to keep.

kermitt2 · 2021-08-05T20:47:13Z

Linking to #800 (comment) from @koppor

I am asking a few questions for clarification below.

kermitt2 · 2021-08-05T21:14:13Z

So what I understand from @koppor explanations, correct me if I am again wrong :D

preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?
general bibtex does not support the range format, and neither does the Grobid ISO conversion anyway. But if the Grobid ISO conversion supported date ranges in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.
if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?

Note that the current bibtex format implementation for references outputs raw dates like that, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

Naive remarks:

I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?
the XML format will provide to the users more information at different levels of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

btut · 2021-08-05T22:00:32Z

Hi!
First things first, I am not a pro BibTeX user either, so take my opinions and thoughts with a grain of salt and rely on @koppor's expertise. I am sure he will give his input as well, but I can't help but give my thoughts too :)

So what I understand from @koppor explanations, correct me if I am again wrong :D

preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?

That's what I understood as well.

general bibtex does not support the range format, and neither does the Grobid ISO conversion anyway. But if the Grobid ISO conversion supported date ranges in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.

Yes. I think best case would be to parse the range and put the beginning of the range to the bibtex fields. So Januarry 16-28th 1987 would result in year=1987, month=1 and day=16. I don't know how easy the parsing would be for this case.
For now, I think it would be acceptable to put the full date string January 16-18th 1987 into the date field, as most citation styles would just use that verbatim (again, I would rely on @koppor's expertise to confirm this).

if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?

As above, 'wrong/malformatted' information is better than no information at all, so I would just put the raw string into the date field. If a user then uses a citation style that is not compatible, there is not much we can do. At least there is the raw date and the user can change that to fit his/her citation style.

Note that the current bibtex format implementation for references outputs raw dates like that, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

I plan on making another PR next week to put my ideas into code. I would then put the raw string into the date field if year, month and day cannot be determined (normalized_publication_date==null). I think it is better to put the 'malformed' string into the date field as the year field is mostly expected to be an integer.

Naive remarks:

I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?

@koppor is the pro here :)

the XML format will provide to the users more information at different levels of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

Not that I know of. But what kind of granularity do you mean? I think JabRef users (and bibtex users in general) are only interested in the fields that show up in their publication when they cite a work. So mainly author (without affiliation...), journal, year, ... The additional information provided by Grobid (very impressive by the way) is not interesting for this use case, right? Am I missing something?

Thanks for putting your time into this. I am sure it is a really nice feature for JabRef users and people considering a migration to JabRef (as they would just have to import pdfs and Grobid would do the hard part!)

kermitt2 · 2021-08-06T22:59:02Z

But what kind of granularity do you mean?

I was thinking of giving normalized/parsed and raw data in explicit fields like in XML:

<date type="published" when="2014">April 7 - 10, 2014</date>

so just pushing the idea " 'wrong/malformatted' information is better than no information at all " (which I agree with, on behalf of Grobid's routine production of 'wrong/malformatted' information :D )

I think JabRef users (and bibtex users in general) are only interested in the fields that show up in their publication when they cite a work.

Yes !

lfoppiano

Good idea to return a new object. 👍

Another small detail. I was thinking to change it directly but I might have overlooked something else.
The class implements Comparable. I suggest implementing Comparable<Date> instead and removing the "compareTo(Object another)" so that the "client" will take care of passing the correct object.
If you agree, @kermitt2 I can implement it and update the usage in the other parts.

kermitt2 · 2021-08-10T02:25:25Z

The class implements Comparable. I suggest implementing Comparable<Date> instead and removing the "compareTo(Object another)" so that the "client" will take care of passing the correct object.

implements Comparable<Date> correct ? -> sure it would be better indeed !
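
A small sketch of what the typed interface buys: with Comparable<T>, passing the wrong type to compareTo becomes a compile-time error instead of a runtime cast. ComparableDateSketch is a hypothetical class, and the ordering (year, then month, then day, with undefined fields stored as -1 and therefore sorting first) is an assumption, not necessarily the ordering used in Grobid.

```java
/** Hypothetical date holder illustrating Comparable<T>; -1 means the field is undefined. */
final class ComparableDateSketch implements Comparable<ComparableDateSketch> {
    final int year;
    final int month;
    final int day;

    ComparableDateSketch(int year, int month, int day) {
        this.year = year;
        this.month = month;
        this.day = day;
    }

    /** Typed comparison: no instanceof check or cast is needed, the compiler enforces the type. */
    @Override
    public int compareTo(ComparableDateSketch other) {
        int result = Integer.compare(year, other.year);
        if (result != 0) {
            return result;
        }
        result = Integer.compare(month, other.month);
        if (result != 0) {
            return result;
        }
        return Integer.compare(day, other.day);
    }
}
```

Callers such as java.util.Collections.sort then work on a List<ComparableDateSketch> without unchecked warnings.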

btut · 2021-08-12T09:11:55Z

Sorry for the delay, I was finally able to implement my thoughts in #814.

I was thinking of giving normalized/parsed and raw data in explicit fields like in XML

I like that idea as well. Introducing new fields in bibtex is not an issue (something like 'rawdate'). I am afraid that if we put the raw string into the date field, users will assume that it is in a valid format and not bother fixing potential issues. Using an additional field would help here, as the 'rawdate' would not end up in citations and users will notice and fix the format themselves (or look up the correct date).
I would be interested in @koppor's thoughts about that.

btut · 2021-08-17T09:17:44Z

@kermitt2 and @lfoppiano Google Summer-of-Code ends this week, it would be amazing if we could get this (or #814) merged before then so we can update the JabRef server and merge the feature. Is there anything I can do to help make that happen?

If not, we probably will update the JabRef server to #814 instead of the master branch, which is still fine, but it would be a great personal achievement for me to have everything merged and ready for users to try out.

Thanks for your support!

kermitt2 · 2021-08-17T10:16:05Z

@btut I thought we were waiting for @koppor's feedback?

btut · 2021-08-17T10:25:31Z

Oh, sorry. I just assumed since he did not disagree he would be ok with this. But you are right, let's see if @koppor has something to add.

kermitt2 · 2021-08-17T10:35:39Z

I think there's just one point a bit unclear/contradictory:

date in BibTeX is always ISO normalized -> "iso8601-2 Extended Format specification level 1", which is a "yes" to your answer (from Accept application/x-bibtex for processHeaderDocument #800 (comment))
If date cannot be normalized, raw date-string goes in date field (from Revise BibTeX date output #814)

koppor · 2021-08-17T10:50:48Z

So what I understand from @koppor explanations, correct me if I am again wrong :D

preferably provide year and month for supporting common .bib processors and optionally date too. So I guess we can output all of them (year, month, day, date), based on the normalized ISO date?
That's what I understood as well.

+1

general bibtex does not support the range format, and neither does the Grobid ISO conversion anyway. But if the Grobid ISO conversion supported date ranges in the future, the range would go to the date field and year / month would be limited to the fixed part of the range.
Yes. I think best case would be to parse the range and put the beginning of the range to the bibtex fields. So Januarry 16-28th 1987 would result in year=1987, month=1 and day=16. I don't know how easy the parsing would be for this case.
For now, I think it would be acceptable to put the full date string January 16-18th 1987 into the date field, as most citation styles would just use that verbatim (again, I would rely on @koppor's expertise to confirm this).

+1 for year, month, day.

The date seems to be parsable (assuming the typo in Januarry is not intentional and it should be January). Then, it is:

1987-01-16/1987-01-28

(Other references listed at #800 (comment))
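
As a rough sketch of how such a range could be turned into the interval notation above, the snippet below handles only the single pattern "Month D-D[th][,] YYYY" with an English month name. DateRangeSketch and its regular expression are hypothetical and deliberately narrow; they do not describe the Grobid date parser.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.Month;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for day ranges within one month, e.g. "April 7 - 10, 2014". */
final class DateRangeSketch {

    // month name, start day, end day with optional ordinal suffix, optional comma, 4-digit year
    private static final Pattern RANGE = Pattern.compile(
            "([A-Za-z]+)\\s+(\\d{1,2})\\s*-\\s*(\\d{1,2})(?:st|nd|rd|th)?,?\\s+(\\d{4})");

    /** Returns "start/end" in ISO 8601 form, or empty if the string does not fit the pattern. */
    static Optional<String> toIsoInterval(String raw) {
        Matcher matcher = RANGE.matcher(raw.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        try {
            // Month.valueOf rejects misspelled names such as "JANUARRY"
            Month month = Month.valueOf(matcher.group(1).toUpperCase(Locale.ROOT));
            int year = Integer.parseInt(matcher.group(4));
            LocalDate start = LocalDate.of(year, month, Integer.parseInt(matcher.group(2)));
            LocalDate end = LocalDate.of(year, month, Integer.parseInt(matcher.group(3)));
            return Optional.of(start + "/" + end);
        } catch (IllegalArgumentException | DateTimeException e) {
            return Optional.empty(); // unknown month name or impossible day
        }
    }
}
```

The start of the interval can then feed year, month and day, while the full interval goes to date; strings that do not parse (including the misspelled month) fall through as empty and would need the raw-string fallback discussed in this thread.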

if I understand well, there is no way to express a "raw" date string, like April 7 - 10, 2014 ("This should not happen. :)"). It means it won't appear in the bibtex format unfortunately. And it means that publication_date can never be used in the BibTeX output and that if the ISO normalization fails (for whatever reasons) there will be no date information at all due to constraints of the BibTeX format?
As above, 'wrong/malformatted' information is better than no information at all,

Yeah - the bibtex processors are robust for that case and will output some errors during processing. I also think it's better to have the information there than using other fields (and thus creating another standard).

Note that the current bibtex format implementation for references outputs raw dates like that, so it can output year = {April 7 - 10, 2014}, -> it has to be changed too.

I plan on making another PR next week to put my ideas into code. I would then put the raw string into the date field if year, month and day cannot be determined (normalized_publication_date==null). I think it is better to put the 'malformed' string into the date field as the year field is mostly expected to be an integer.

+1.

I am a random BibTeX user and I have plenty of non-numerical year fields, the format year = {2010--2011} looks very common. So that should be considered as legacy bibtex entries?
@koppor is the pro here :)

The standard BibTeX styles just copy the year information verbatim into the library (Source: https://tex.stackexchange.com/a/121511/9075). Depends on the other bibtex tooling (natbib, plain latex, ...) how to process it.

natbib, for instance, needs a workaround to have year-ranges working: https://tex.stackexchange.com/a/385126/9075

It is difficult to support all "flavours" of bibtex (plain, natbib, biblatex, ...) in one .bib output, especially because the different bibtex styles (.bst files) process the field contents differently.

I was thinking of giving normalized/parsed and raw data in explicit fields like in XML

I like that idea as well. Introducing new fields in bibtex is not an issue (something like 'rawdate').

+1 for rawdate.

the XML format will provide to the users more information at different levels of granularity. Just wondering, in the case of extracted information in JabRef, have you considered using first a more expressive format to interact with users before converting to BibTeX based on user's correction?

On the one hand, JabRef's main focus is BibTeX with tailored support of BibTeX special functionalities (such as pre-defined reusable strings, cleaning up entries, cross-references, ...). On the other hand, BibTeX is "just" a key/value serialization, JabRef offers any fields (https://docs.jabref.org/advanced/fields#define-your-own-fields) and treats unknown fields as string-typed. For instance, a "summary" field could be added (not being "just" a comment).

koppor · 2021-08-17T11:05:05Z

Note that only biblatex handles date (and possibly some bibtex styles; not all). Thus, the fields year, month, and day should always be filled if possible.

btut · 2021-08-17T11:37:25Z

So if I understood correctly:

Thus, the fields year, month, and day should always be filled if possible.

So then let's fill them using the integers from normalized_publication_date if present, otherwise place the raw date into the year field.
This also matches:

date in BibTeX is always ISO normalized -> "iso8601-2 Extended Format specification level 1", which is a "yes" to your answer

but contradicts:

If date cannot be normalized, raw date-string goes in date field (from #814)

That was my assumption. But if year needs to be filled, let's put the raw string there (as is in this PR, changed it back in #814).

So #814 only adds that, if the date is parseable, year, month and day are filled in addition to date.

I changed the base of #814 to just highlight the differences.
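
The rule summarized in this comment can be sketched as below: when a normalized date exists, write date as an ISO string plus numeric year, month and day; otherwise fall back to the raw string in year. BibtexDateSketch and dateFields are hypothetical names, the -1 convention for missing fields is an assumption, and the exact field order and formatting in Grobid may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical BibTeX date field writer; -1 means the field could not be normalized. */
final class BibtexDateSketch {

    /** Returns the BibTeX lines for the date-related fields of one entry. */
    static List<String> dateFields(int year, int month, int day, String rawDate) {
        List<String> fields = new ArrayList<>();
        if (year != -1) {
            // Normalized case: ISO date plus numeric year / month / day.
            StringBuilder iso = new StringBuilder(String.format(Locale.ROOT, "%04d", year));
            if (month != -1) {
                iso.append(String.format(Locale.ROOT, "-%02d", month));
                if (day != -1) {
                    iso.append(String.format(Locale.ROOT, "-%02d", day));
                }
            }
            fields.add("  date = {" + iso + "}");
            fields.add("  year = {" + year + "}");
            if (month != -1) {
                fields.add("  month = {" + month + "}");
                if (day != -1) {
                    fields.add("  day = {" + day + "}");
                }
            }
        } else if (rawDate != null && !rawDate.isEmpty()) {
            // Fallback: normalization failed, keep the raw string rather than nothing.
            fields.add("  year = {" + rawDate + "}");
        }
        return fields;
    }
}
```

For a reference whose raw date is April 7 - 10, 2014 and whose normalized date only has the year, this yields date = {2014} and year = {2014}.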

kermitt2 · 2021-08-18T07:14:02Z

@btut I've added the changes from #814 here (with a typo fixed) and updated the tests to reflect the new expected output.

Not sure it's what is expected:

for

Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service
Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014.

we have:

@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {2014},
  address = {Oxford, United Kingdom}
}

and not:

@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {April 7 - 10, 2014},
  address = {Oxford, United Kingdom}
}

If it's ok like that, it's ready to merge!

btut · 2021-08-18T07:36:37Z

Thanks for your effort @kermitt2!
I like this result! It should be broadly compatible with BibTeX.

btut · 2021-08-18T08:10:08Z

Thanks again for your support!

merge date with header consolidation, better date serialization in re…

01b4d35

…ferences

kermitt2 requested a review from lfoppiano August 2, 2021 01:31

kermitt2 mentioned this pull request Aug 2, 2021

Accept application/x-bibtex for processHeaderDocument #800

Merged

btut reviewed Aug 2, 2021

lfoppiano added 3 commits August 5, 2021 16:31

update naming

5f86ee9

add more tests on toIsoString

cc36bc3

add more tests on toIsoString

25d8182

lfoppiano and others added 3 commits August 6, 2021 10:00

Merge branch 'master' into follow-up-761

c938eb8

return date2 if date1's year is not valid

a953965

rewrite merge to have a new instance Date, avoid loosing some raw str…

1b73754

…ing fields and simpler/shorter

lfoppiano reviewed Aug 10, 2021

implements Comparable<Date> + cosmetics

b02dcd7

btut mentioned this pull request Aug 12, 2021

Revise BibTeX date output #814

Closed

kermitt2 added 2 commits August 18, 2021 08:27

addition from #814

9a41c57

update tests

0bd69f0

kermitt2 merged commit e450e4f into master Aug 18, 2021

lfoppiano deleted the follow-up-761 branch August 19, 2021 08:28
