-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review date handling with header consolidation and date serialization in references #807
Conversation
A test PDF to see the header date merging in action :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments on how to deal with normalized vs non-normalized dates. Curious what you think!
* "2011" "2010" -> 2011 | ||
*/ | ||
public static Date merge(Date date1, Date date2) { | ||
if (date1.getYear() != -1 && date2.getYear() != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if date1.getYear() == -1, this still returns date1.
What about something like
if (date1.getYear() == -1) {
return date2;
}
if (date2.getYear() == -1) {
return date1;
}
// rest of the method as-is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I share the same concern about this. If dat 1 is non existent doesn't make sense to take date2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! when I wrote it, I was considering that it is irrelevant to try to merge something when data1 is undefined, but this is not covering the normal general usage.
if (publication_date != null) { | ||
bibtex.add(" year = {" + publication_date + "}"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think date
is preferred over year
and if used, year
should only consist of a numeric year.
I see two options here:
- Put this in an else branch to the if before it and write it to the
date
field.
So, if anormalized_publication_date
exists, use it as iso string, else if thepublication_date
exists use that. - Don't format the
normalized_publication_date
as ISO string but use theyear
,month
andday
fields instead. This would ensure thatyear
,month
andday
are numeric and the (more detailed, with day ranges)publication_date
could be put as a string in thedate
field. If both are given, it is up to the user to decide which to keep.
Linking to #800 (comment) from @koppor I am asking a few questions for clarification below. |
So what I understand from @koppor explanations, correct me if I am again wrong :D
Note that the current bibtex format implementation for references output raw dates like that currently, so it can output Naive remarks:
|
Hi!
Thaths what I understood as well.
Yes. I think best case would be to parse the range and put the beginning of the range to the bibtex fields. So
As above, 'wrong/malformatted' information is better than no information at all, so I would just put the raw string into the
I plan on making another PR next week to put my ideas into code. I would then put the raw string into the date field if year, month and day cannot be determined (normalized_pubblication_date==null). I think it is better to put the 'malformed' string into the date field as the year field is mostly expected to be an integer.
@koppor is the pro here :)
Not that I know of. But what kind of granularity do you mean? I think JabRef users (and bibtex users in general) are only interested in the fields that show up in their pubblication when they cite a work. So mainy author (without affiliation...), Journal, year, ... The addidtional information provided by Grobid (very impressive by the way) is not interesting for this use case, right? Am I missing something? Thanks for putting your time into this. I am sure it is a really nice feature for JabRef users and people considering a migration to JabRef (as they would just have to import pdfs and Grobid would do the hard part!) |
I was thinking giving normalized/parsed and raw data in explicit fields like in XML: <date type="published" when="2014">April 7 - 10, 2014</date> so just pushing the idea " 'wrong/malformatted' information is better than no information at all " (which I agree on behalf Grobid routine production of wrong/malformatted' information :D )
Yes ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to return a new object. 👍
Another small detail. I was thinking to change it directly but I might have overlooked something else.
The class implements Comparable. I suggest implementing Comparable<Date>
instead and remove the "compareTo(Object another)" so that the "client" will take care of passing the correct object.
If you agree, @kermitt2 I can implement it and update the usage in the other parts.
|
Sorry for the delay, I was finally able to implement my thoughts in #814.
I like that idea as well. Introducing new fields in bibtex is not an issue (something like 'rawdate'). I am afraid that if we put the raw-string into the date field, users will assume that it is in a valid format and not bother fixing potential issues. Using an additional field would help here as the 'rawdate' would not end up in citations and users will notice and fix the format themselves (or lookup the correct date). |
@kermitt2 and @lfoppiano Google Summer-of-Code ends this week, it would be amazing if we could get this (or #814) merged before then so we can update the JabRef server and merge the feature. Is there anything I can do to help make that happen? If not, we probably will update the JabRef server to #814 instead of the master branch, which is still fine, but it would be a great personal achievement for me to have everything merged and ready for users to try out. Thanks for your support! |
Oh, sorry. I just assumed since he did not disagree he would be ok with this. But you are right, let's see if @koppor has to add something. |
I think there's just one point a bit unclear/contradictory:
|
+1
+1 for year, month, day. The date seems to be parsable (assuming the typo in Januarry is not intentional and it should be 1987-01-16/1987-01-28 (Other references listed at #800 (comment))
Yeah - the bibtex processors are robust for that case and will output some erros during procesisng. I also think, its better to have the information there than using other fields (and thus creating another standard).
+1.
The standard BibTeX styles just copy the year information verbatim into the library (Source: https://tex.stackexchange.com/a/121511/9075). Depends on the other bibtex tooling (natbib, plain latex, ...) how to process it. natbib, for instance, needs a workaround to have year-ranges working: https://tex.stackexchange.com/a/385126/9075 It is difficult to support all "flavours" of bibtex (plain, natbib, biblatex, ...) in one .bib output. Especially, because the different bibtex styles (.bst files) process the field contents
+1 for
On the one hand, JabRef's main focus is BibTeX with tailored support of BibTeX special functionalities (such as pre-defined reusable strings, cleaning up entries, cross-references, ...). On the other hand, BibTeX is "just" a key/value serialization, JabRef offers any fields (https://docs.jabref.org/advanced/fields#define-your-own-fields) and treats unknown fields as string-typed. For instance, a "summary" field could be added (not being "just" a comment). |
Note that only biblatex handles |
So if I understood correctly:
So then let's fill them using the integers from normalized_publication_date if present, otherwise place the raw date into the year field.
but contradicts:
That was my assumption. But if year needs to be filled, let's put the raw sting there (as is in this PR, changed it back in #814). So #814 only adds that if the date is parseable, fill year, month and day additionally to only date. I changed the base of #814 to just highlight the differences. |
@btut I've added the changes from #814 here (with a typo fixed) and updated the tests to reflect the new expected output. Not sure it's what is expected:
we have:
and not:
If it's ok like that, it's ready to merge! |
Thanks for your effort @kermitt2! |
Thanks again for your support! |
This is a follow-up of PR #761, with:
merging of date with header consolidation, e.g. when extracted is
<date type="published" when="2011-05-23">23 May 2011</date>
and consolidation is<date type="published" when="2011-05"/>
, we keep the first one.better date serialization in references, we output both normalized and raw extracted date (like done in the header), this is realized both for XML TEI and bibTeX format
Note: move
toISOString()
to classDate.java
Add tests for structured date merging.