Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: issues in exporters and citations for PermaLink/non-DOI PIDs #10790

Open
wants to merge 16 commits into
base: develop
Choose a base branch
from

Conversation

vera
Copy link
Contributor

@vera vera commented Aug 22, 2024

What this PR does / why we need it:

This PR fixes three issues with PermaLink PIDs in metadata exporters and citations.

  1. DDI & DDI HTML exporters: Previously, the export included <IDNo agency="perma">perma:N4H-/M015D56B</IDNo> (DDI XML) and "Identification Number: perma:N4H-/M015D56B" (DDI HTML). This change removes the extra slash in the middle of the identifier.

  2. EndNote XML citation: Previously, the citation included <electronic-resource-num>perma/N4H-/M015D56B</electronic-resource-num>. With this change, perma/ is replaced with perma: and the extra slash in the middle of the identifier is also removed. (Note, this is the only change that also affects DOIs: for DOIs, doi/ is replaced with doi:.)

  3. BibTeX citation: Previously, doi = {...} was always included in the citation. With this change, it is only included when the PID is actually a DOI. A url = {...} is always included (this was already the case).

Which issue(s) this PR closes:

In combination with #10775 and #10759 (or #10615), this addresses #10769 and #10768

Special notes for your reviewer:

-

Suggestions on how to test this:
(from @qqmyers)

  1. Configure permalink PID providers w/ "" and "/" separators, create datasets using them and verify that the DDI/DDI Html exports do not/do contain the / separator (respectively). Confirm that a Fake DOI provider (or test DataCite provider) still work as before w.r.t. how they appear in the DDI/DDI Html exports.

  2. Verify that the EndNote citation for datasets using DOIs and PermaLinks include the PIDs as specified in the best option as discussed in this comment and below. Handles could also be tested but I'm not sure if there's a test setup for that.

  3. Verify that the BibTeX citation format uses doi= for DOIs and only url= for PermaLinks/Handles (if we can test them). (pre-PR PermaLinks would have an extra doi= line.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

-

Is there a release notes update needed for this change?:
Provided

Additional documentation:

-

@@ -163,7 +163,7 @@ private static void createStdyDscr(XMLStreamWriter xmlw, DatasetDTO datasetDto)
String persistentAuthority = datasetDto.getAuthority();
String persistentId = datasetDto.getIdentifier();

String pid = persistentProtocol + ":" + persistentAuthority + "/" + persistentId;
String pid = persistentProtocol + ":" + persistentAuthority + ("perma".equals(persistentAgency) ? "" : "/") + persistentId;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

permalinks can have a separator - seems like we need to add that to the DTO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I just tested with a PermaLink PID that uses a separator and it is indeed missing in the DDI exports.

I thought maybe it would be merged into the authority, like I think a configured DOI authority and shoulder are merged in persistentAuthority.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shoulder just becomes part of the identifier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just pushed a commit adding the separator to the DTO. It's marked as WIP because for some reason it doesn't work yet (separator is always null). I will have time to troubleshoot tomorrow

If I am on the wrong track with that commit, please let me know!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to change the domain object? Can't we delegate the PID string generation to GlobalId#toString()?

Copy link
Contributor Author

@vera vera Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just pushed a commit adding the separator to the DTO. It's marked as WIP because for some reason it doesn't work yet (separator is always null). I will have time to troubleshoot tomorrow

This issue is now fixed. Tested with DOI, PermaLink without separator and PermaLink with separator

For PermaLinks with separators, the database needs to be manually updated (alter table dvobject set separator = 'whatever' where protocol = 'perma' and authority = 'myauthority';). This is a little bit cumbersome but I don't know of a better way to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we delegate the PID string generation to GlobalId#toString()?

@qqmyers would you prefer a solution using the GlobalId method instead of adding the separator to the database? I could take a look at implementing it that way instead (later today or on Monday)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was thinking that you didn't need the new column on dvobject (you can use PidUtil.parseAsGlobalID(String protocol, String authority, String identifier) to get a GlobalId), but it actually solves a limitation of permalinks right now - you can't have two providers that differ only by the separator because we don't store the separator and have to infer it. With that in mind, I think the new column is a good way to go. (I'd still use the GlobalId convenience methods whenever they're available, to avoid replicating the formatting code - looked like you mostly do that now except in DDIExportUtil where there's weirdness from tests sending null identifiers and for the one format with { in it.)

@@ -619,8 +621,7 @@ private void createEndNoteXML(XMLStreamWriter xmlw) throws XMLStreamException {
}
if (persistentId != null) {
xmlw.writeStartElement("electronic-resource-num");
String electResourceNum = persistentId.getProtocol() + "/" + persistentId.getAuthority() + "/"
+ persistentId.getIdentifier();
String electResourceNum = persistentId.asString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has it been verified that an initial : after protocol is OK for EndNote import? It seems like the original code used / because ICPSR did. Can someone look at what EndNote exports and make sure we match it here? @adam3smith says he can check this next week if no one else has access.

Copy link
Contributor Author

@vera vera Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I wondered about that slash there. I tried to find some info on the EndNote XML format but found only this schema, which didn't specify anything about the format of electronic-resource-num, so I thought it might have been a mistake to use the slash. It would be great if someone could check.

"<electronic-resource-num>doi/10.5072/FK2/LK0D1H</electronic-resource-num>" +
"<electronic-resource-num>doi:10.5072/FK2/LK0D1H</electronic-resource-num>" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that this old test expected doi/10... instead of doi:10.... It feels like this PR is cleaning things up! ❤️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers commented above that this slash might be required by EndNote, although it does look weird to me 🤔

String pidUri = pid;
//Some tests don't send real PIDs - don't try to get their URL form
if(!pidUri.equals("null:null/null")) {
if(!pidUri.equals("null:nullnullnull")) {
pidUri= PidUtil.parseAsGlobalID(persistentProtocol, persistentAuthority, persistentId).asURL();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid line 167 by checking for parseAsGlobalId returning null here and setting pid/pidUri using GlobalId if it exists and to the constant "null:nullnullnull" if it doesn't? (Mostly a minor style thing but it avoids writing out unexpected invalid identifier entries as well.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


xmlw.writeStartElement("docDscr");
xmlw.writeStartElement("citation");
xmlw.writeStartElement("titlStmt");
writeFullElement(xmlw, "titl", dto2Primitive(version, DatasetFieldConstant.title), datasetDto.getMetadataLanguage());
xmlw.writeStartElement("IDNo");
writeAttribute(xmlw, "agency", persistentAgency);
xmlw.writeCharacters(persistentProtocol + ":" + persistentAuthority + "/" + persistentId);
xmlw.writeCharacters(persistentProtocol + ":" + persistentAuthority + persistentSeparator + persistentId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same - can you parse the authority/separator/identifier and use GlobalId.asString() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@vera
Copy link
Contributor Author

vera commented Aug 26, 2024

Hm. I tested this with DOIs, PermaLinks without separators and PermaLinks with separators and the export looks fine. However, I just saw that when I create a new dataset with a PermaLink with a separator, it doesn't actually get stored in the db (separator is null).

I think a setSeparator call needs to be added here in AbstractPidProvider#generatePid?

https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/pidproviders/AbstractPidProvider.java#L183

I also briefly tried adding unit tests for the export of PermaLink datasets to DdiExportUtilTest, but I didn't find an easy way to configure a PermaLink provider for a unit test, is there one?

@qqmyers
Copy link
Member

qqmyers commented Aug 26, 2024

You probably need to set the separator to / in the Abstract class and to the real one in the Perma provider.

There are some unit tests for the providers - see the PidUtilTest class and I think you can set up whatever provider you want that way.

@vera
Copy link
Contributor Author

vera commented Aug 27, 2024

Thanks, fixed the bug and added two tests. Now I think the only open issue is the question about the format of the EndNote citation

@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label Sep 12, 2024
@pdurbin pdurbin added the Type: Bug a defect label Oct 9, 2024
@qqmyers
Copy link
Member

qqmyers commented Oct 22, 2024

@vera we're trying to move this forward through review/QA. Can you update from develop and fix the merge issues? I'm checking on the question about EndNote again.

@cmbz cmbz added the FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) label Oct 23, 2024
# Conflicts:
#	src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java
#	src/main/resources/db/migration/V6.3.0.2.sql
@vera
Copy link
Contributor Author

vera commented Oct 23, 2024

@qqmyers that's great, thank you. I've resolved the merge issues.

@qqmyers
Copy link
Member

qqmyers commented Oct 23, 2024

Thanks @vera - now it looks like there are DdiExportUtilTest failures. The results are missing /citation?persistentId=perma:

OK - looks like you saw it already :-)

@coveralls
Copy link

coveralls commented Oct 23, 2024

Coverage Status

coverage: 22.583% (+0.01%) from 22.573%
when pulling 0d75b88 on vera:fix/perma-exporters-and-citations
into ed391eb on IQSS:develop.

@vera
Copy link
Contributor Author

vera commented Oct 23, 2024

Yep should be fixed :)

@cmbz cmbz added the FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) label Oct 23, 2024
@adam3smith
Copy link
Contributor

Jim asked me to look at the Endnote XML export change: this doesn't cause new issues, but doi: isn't ideal either.
Both Endnote and Zotero will import the string in <electronic-resource-num> as is to the DOI field. That means both doi:10.1234/5678 and doi/10.1234/5678 produce incorrect citations unless fixed.
Here are some suggestion for better handling this:

Best option: Completely separate the handling of DOI and handle on export:
DOI goes into <electronic-resource-num> without any prefix
handle goes into <urls><web-urls><url>

Second best option: Parse the perma prefix separately:
DOI goes into <electronic-resource-num> without any prefix
handle goes into <electronic-resource-num> with either hdl/ or the full handle URL prefix

Third best option: Put all identifiers into <electronic-resource-num> without prefix (will get handles wrong, but they can be detected otherwise.

I think the behavior in this PR is fourth best if the goal of metadata export is to get metadata into reference managers and citations so they produce correct&accurate references

@vera
Copy link
Contributor Author

vera commented Oct 24, 2024

Thanks for checking!

Should I go ahead and implement one of those options? (the best option?)

@qqmyers
Copy link
Member

qqmyers commented Oct 24, 2024

Given that the EndNote format now changes, I'd suggest adding a short release note pointing out this improvement/backward incompatibility.

@vera
Copy link
Contributor Author

vera commented Oct 24, 2024

@adam3smith Currently, all PID URLs (https://doi.org/..., https://hdl.handle.net/... or the permalink) are already inserted under <urls><related-urls><url>:

xmlw.writeStartElement("urls");
xmlw.writeStartElement("related-urls");
xmlw.writeStartElement("url");
xmlw.writeCharacters(getPersistentId().asURL());
xmlw.writeEndElement(); // url
xmlw.writeEndElement(); // related-urls
xmlw.writeEndElement(); // urls

Would you recommend changing that to <urls><web-urls><url> in general?

@vera
Copy link
Contributor Author

vera commented Oct 25, 2024

I've now implemented the best option as suggested:

  • removing prefix from electronic-resource-num
  • inserting handles and PermaLinks as web-urls
  • (no change to current behavior) inserting DOIs as related-urls

If DOIs should also be inserted as web-urls, we could also do that.

(FYI, I am out of office next week. :))

@qqmyers
Copy link
Member

qqmyers commented Oct 25, 2024

Thanks @vera - can you add a quick release note?

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I added a release note. Thanks for the extra work with the EndNote format!

@qqmyers qqmyers unassigned qqmyers and vera Oct 25, 2024
@vera
Copy link
Contributor Author

vera commented Nov 4, 2024

Thanks for adding the release note! @qqmyers

@cmbz cmbz added the FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) label Nov 7, 2024
@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 21, 2024
@cmbz cmbz added the FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) label Dec 5, 2024
@cmbz cmbz added the FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) Size: 10 A percentage of a sprint. 7 hours. Type: Bug a defect
Projects
Status: Ready for QA ⏩
7 participants