Make Dataverse produce valid DDI 3648 #9484

landreev · 2023-03-29T14:58:42Z

What this PR does / why we need it:

copy-and-pasting my last comment from the issue:

Just to clarify a couple of things from an earlier discussion:

sizing:

* We will address the immediate issue of the bad ddi xml exports by looking specifically at what has been reported.
...
* If we find that the validator needs work, we will create a new separate issue when this is complete

"Looking specifically at what has been reported" may not easily apply. This is a very old issue, with a lot of back-and-forth (that's very hard to read), and many of the things reported earlier have already been fixed in other PRs. So I assumed that the goal of the PR was "make Dataverse produce valid DDI". (i.e., if something not explicitly mentioned here is obviously failing validation, it needed to be fixed too - it did not make sense to make a PR that would fix some things, but still produce ddi records that fail validation; especially since people have been waiting for it to be fixed since 2017).

The previously discussed automatic validation - adding code to the exporter that would validate in real time every ddi record produced, and only cache it if it passes the validation - does make sense to be left as a separate sprint-sized task. (the validation itself is not hard to add; but we'll need to figure out how to report the errors). I have enabled the validation test in DDIExporterTest.testExportDataset() however, so, in the meantime, after we merge this PR, any developer working on the ddi exporter will be alerted if they break it by introducing something invalid, because they won't be able to build their branch.
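To make the "validate every record, only cache it if it passes" idea concrete, here is a minimal sketch using the JDK's `javax.xml.validation` API. It is not the code in this PR or in `DDIExporterTest`: the class name, the `firstViolation` helper and the tiny inline schema (a stand-in for the real DDI Codebook XSD) are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class XsdValidationSketch {

    // Toy stand-in for the DDI schema (assumption): any number of <nation>
    // elements, followed by at most one <geoBndBox>.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"sumDscr\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"nation\" type=\"xs:string\" minOccurs=\"0\" maxOccurs=\"unbounded\"/>"
        + "<xs:element name=\"geoBndBox\" type=\"xs:string\" minOccurs=\"0\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns null when the XML validates, otherwise the first error message. */
    static String firstViolation(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
            // With no custom ErrorHandler set, validate() throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid: prints "null".
        System.out.println(firstViolation(
            "<sumDscr><nation>US</nation><geoBndBox>box</geoBndBox></sumDscr>"));
        // Invalid (two bounding boxes): prints the parser's error message.
        System.out.println(firstViolation(
            "<sumDscr><geoBndBox>a</geoBndBox><geoBndBox>b</geoBndBox></sumDscr>"));
    }
}
```

An exporter could call something like `firstViolation` before caching a record; how to report the returned message to the user is the open question mentioned above.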

To clarify, in the current state, the exporter in my branch is producing valid ddi xml for our control "all fields" dataset, plus all the other datasets used in our tests, and whatever I could think of to test. It does NOT guarantee that there is no possible scenario where it can still output something illegal! So, yes, it is important to add auto-validation. And, if and when somebody finds another such scenario, we will treat it as a new issue.

A couple of arbitrary decisions had to be made; I will spell them out in the PR description. My general approach was: if something does not translate from our metadata to the ddi format 1:1, just drop it and move on. We don't assume that preserving all of our metadata is a goal when exporting DC; it's obvious that only a subset of our block fields can be exported in that format. But it's not a possibility with the ddi either, now that we have multiple blocks and the application is no longer centered around quantitative social science. So, no need to sweat a lost individual field here and there.

Which issue(s) this PR closes:

Closes #3648

Special notes for your reviewer:

See the description above. Read the linked issue... at your own risk - it goes for miles, and years. Many things reported there have been fixed already. Some things reported there as late as in 2022 had already been fixed as of 2020.

The "arbitrary decisions" mentioned above:

In our metadata block, both the "Producer" and the "Distributor" have these same 4 sub-fields:
* Affiliation
* Abbreviation
* Logo
* URL

In the DDI however the producer and distrbtr have the attributes as follows:

<producer affiliation="..." abbr="..." role="...">
but 
<distrbtr affiliation="..." abbr="..." URI="...">

but our exporter was writing all FOUR of the attributes above in each section. I addressed this by simply dropping the URI= from the former, and role= from the latter.
(That said, does anyone have any idea as to WHY we were putting the logo into the role attribute?? - as in, ending up with this in our test dataset: role="http://DistributorLogoURL2.org"; the export util is still doing this in the <producer> section.)
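A rough sketch of the attribute split described above, using `XMLStreamWriter`. This is not the actual `DdiExportUtil` code: the class and method names are hypothetical, and the sketch leaves out `role` on `<producer>` entirely, since the only thing we ever put there was the logo URL.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class ProducerDistributorSketch {

    // Writes the attribute only when there is a non-empty value for it.
    static void writeIfPresent(XMLStreamWriter xmlw, String name, String value)
            throws XMLStreamException {
        if (value != null && !value.isEmpty()) {
            xmlw.writeAttribute(name, value);
        }
    }

    /** element is "producer" or "distrbtr"; only distrbtr gets the URI attribute. */
    static String write(String element, String text, String affiliation, String abbr, String url) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xmlw.writeStartElement(element);
            writeIfPresent(xmlw, "affiliation", affiliation);
            writeIfPresent(xmlw, "abbr", abbr);
            if ("distrbtr".equals(element)) {
                writeIfPresent(xmlw, "URI", url); // not written for <producer>
            }
            xmlw.writeCharacters(text);
            xmlw.writeEndElement();
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write("producer", "Some Producer", "Some Univ", "SP", "http://example.org"));
        System.out.println(write("distrbtr", "Some Distributor", "Some Univ", "SD", "http://example.org"));
    }
}
```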

The other thing:
Our geospatial block allows for multiple bounding boxes. But the sumDscr in the DDI only allows one:

<xs:sequence>
   <xs:element ref="timePrd" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="collDate" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="nation" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geogCover" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geogUnit" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geoBndBox" minOccurs="0"/>
   <xs:element ref="boundPoly" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="anlyUnit" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="universe" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="dataKind" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>

I've addressed this by simply using the first one, and disregarding the rest, if the export util encounters a dataset with multiple bounding boxes. There may be a cleaner solution, but I can't think of one now (and I needed to keep the amount of work on this PR manageable since it was supposed to be a "33"). In the discussion in the issue, it was suggested that we make the field non-multiple-allowed on the Dataverse side as well. That would be easy... except it's not clear at all what we would then do with existing datasets that already have multiple bounding boxes... With simple text fields, when we need to turn a multiple-value field into a single-value one, it's easy to just concatenate the values... but you can't really combine bounding boxes.
On the other hand, while only one <geoBndBox> is allowed in a <sumDscr>, multiple <sumDscr> sections ARE actually allowed. So we could use that to be able to export such multiple values... but that would also be difficult... for reasons. If anyone has any constructive ideas, please speak up, but it will need to be handled as a separate issue. For now, producing valid DDI xml is the priority.
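For illustration, the "use the first bounding box, disregard the rest" rule could look roughly like this. The `Box` holder and the method name are hypothetical, and the child element names (`westBL`, `eastBL`, `southBL`, `northBL`) are what I understand the DDI Codebook 2.5 schema to use inside `geoBndBox`; this is a sketch, not the code in the PR.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class FirstBoundingBoxSketch {

    // Hypothetical value holder for one bounding box from the geospatial block.
    static final class Box {
        final String west, east, south, north;
        Box(String west, String east, String south, String north) {
            this.west = west; this.east = east; this.south = south; this.north = north;
        }
    }

    static void writeLeaf(XMLStreamWriter xmlw, String name, String value)
            throws XMLStreamException {
        xmlw.writeStartElement(name);
        xmlw.writeCharacters(value);
        xmlw.writeEndElement();
    }

    /** Writes at most one geoBndBox: the first box wins, the rest are ignored. */
    static String writeGeoBndBox(List<Box> boxes) {
        if (boxes == null || boxes.isEmpty()) {
            return "";
        }
        Box first = boxes.get(0);
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xmlw.writeStartElement("geoBndBox");
            writeLeaf(xmlw, "westBL", first.west);
            writeLeaf(xmlw, "eastBL", first.east);
            writeLeaf(xmlw, "southBL", first.south);
            writeLeaf(xmlw, "northBL", first.north);
            xmlw.writeEndElement();
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeGeoBndBox(List.of(
            new Box("-70", "-60", "40", "45"),
            new Box("10.5", "11.5", "50", "51"))));
    }
}
```

Exporting the remaining boxes in additional `<sumDscr>` sections would need a different structure and is left out here on purpose.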

Suggestions on how to test this:

CESSDA Metadata Validator (https://cmv.cessda.eu/#!validation) is an excellent tool for testing DDI records. I'm assuming "CESSDA DATA CATALOGUE (CDC) DDI2.5 PROFILE - MONOLINGUAL: 1.0.4" is the correct validation profile to use.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

github-actions · 2023-03-29T14:59:39Z

src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java

+               <xs:element ref="sampProc" minOccurs="0" maxOccurs="unbounded"/>
+               <xs:element ref="sampleFrame" minOccurs="0" maxOccurs="unbounded"/>
+               <xs:element ref="targetSampleSize" minOccurs="0" maxOccurs="unbounded"/>
+	       <xs:element ref="deviat" minOccurs="0" maxOccurs="unbounded"/>


🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.whitespace.FileTabCharacterCheck> _{reported by reviewdog 🐶}
File contains tab characters (this is the first instance).

coveralls · 2023-03-29T15:01:32Z

Coverage: 20.199% (+0.02%) from 20.181% when pulling 9545531 on 3648-invalid-ddi into 49ef7f8 on develop.

mreekie · 2023-03-29T19:11:19Z

Sprint Kickoff:

about 10 points left:

qqmyers

Other than a couple of comments, this all looks good. Mostly reordering to match what the schema requires. Looks like the tests are passing. The style checker is complaining about tab chars though.

src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java

qqmyers · 2023-04-05T21:53:33Z

src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java

@@ -946,9 +1064,10 @@ private static void writeDistributorsElement(XMLStreamWriter xmlw, DatasetVersio
                                if (!distributorURL.isEmpty()) {
                                    writeAttribute(xmlw, "URI", distributorURL);
                                }
-                                if (!distributorLogoURL.isEmpty()) {
+                                /* NOT IN THE SCHEMA! -L.A.if (!distributorLogoURL.isEmpty()) {
+                                   (and why were we putting this logo into the "role" field anyway?? - same with producerLogo above!)


Does it make sense to remove the producer logo URL now with these other breaking changes? No strong opinion, but if it really isn't useful, it might be easier to change now than in a later PR.

jggautier · 2023-04-05

Just in case it's helpful: I wrote about this a bit. Back then I guessed that the distributor logo url was put into this "role" field "to preserve that metadata during migrations from Dataverse 3.x to 4.x".

landreev · 2023-04-14

@qqmyers @jggautier Thank you for the comments. I've been meaning to finalize this - after everything else is fixed, that is. But hoping to wrap this up first thing next week.
I'm leaning towards the simplest/brutest force solution that Jim mentioned - just dropping this "role" attribute, since the logo url makes zero sense in it.
Julian, thank you for the writeup. I dug a bit further, by reading the old DVN code. And it doesn't look like it was done for the purposes of the migration either. It seems like this is even more ancient history, that it may be tracing its history to a hack that DVNs used to rely on to pass the logos when harvesting from each other. However, unless I'm really not reading it right, it just doesn't seem like it was doing the same thing as the current Dataverse exporter - it was never putting that logo url in the "role" attribute; it was encoding it as an HTML link inside the text of the producer field, with a "role=..." attribute of its own... But then, when the export and import code were being ported to Dataverse, this may have been changed simply by mistake (?). Does this make any sense? - it doesn't make complete sense to me, I feel like I may be missing something - but I am fairly confident by now that nobody could possibly be relying on that attribute in its current form now. So I am indeed feeling like killing it.

landreev · 2023-04-20

Just confirming that I have dropped this dubious "role=logo" attribute from the export.

kcondon · 2023-04-06T18:05:06Z

@landreev Does using the notes field as an extension device per Micah Altman and our early medical metadata attempts provide any useful advantage in edge cases or would that just add "junk" that then becomes equally hard to manage?

landreev · 2023-04-17T23:49:43Z

(deleted the previous version of the comment; since I'm correcting it, and also because it somehow got attached to an unrelated thread above)

@kcondon I was able to confirm that both of the "all fields" json examples that we provide in the source tree: ./src/test/java/edu/harvard/iq/dataverse/export/ddi/dataset-create-new-all-ddi-fields.json and scripts/api/data/dataset-create-new-all-default-fields.json are now importable (I applied your fixes to both of the above in my branch). And, both of the resulting datasets are exportable as ddi, and the produced ddi is valid. Tested both in my build env. and on dataverse-internal (deployed my build there).
The dataset that you showed me earlier there, FK2/JA1SDI that was created via the form, is still failing to export, with the same misleading error message. I'll be investigating it first thing in the morning. (something controlled vocab.-related, it looks like, but I need to figure out what exactly).

…made multiple in PR #9254; would be great to put together a process for developers who need to make changes to fields in metadata blocks that would help them to know of all the places where changes like this need to be made. (not the first time, when something breaks, in ddi export specifically, after a field is made multiple). #3648

landreev · 2023-04-19T22:52:50Z

The export failure on the test dataset FK2/JA1SDI mentioned in the previous comment, that delayed the QA of the PR, was only happening on account of the productionPlace having been made multiple recently (pr #9254). Kevin's manually-created dataset actually did contain 2 productionPlace entries. Our supplied sample “all fields” importable json however only had 1, so the automated test export wasn't bombing on it.
I checked in a fix. But it would be great to work out a process for developers making changes to fields in metadata blocks that would help them to know of all the places where changes like this need to be made. (This was not the first time when something broke, in the ddi export specifically, after a field was made multiple).
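A small sketch of the kind of change this calls for: once a field becomes multiple, the export code has to loop over its values instead of assuming exactly one. The class and method names are hypothetical, and I'm assuming `prodPlac` (inside `prodStmt`) is the DDI element for production place and that it may repeat; this is not the fix that was checked in.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class RepeatedFieldSketch {

    /** Writes one <prodPlac> per non-empty value, whether there is one value or many. */
    static String writeProdPlaces(List<String> places) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xmlw.writeStartElement("prodStmt");
            for (String place : places) {
                if (place == null || place.isEmpty()) {
                    continue; // skip blanks rather than emit empty elements
                }
                xmlw.writeStartElement("prodPlac");
                xmlw.writeCharacters(place);
                xmlw.writeEndElement();
            }
            xmlw.writeEndElement();
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeProdPlaces(List.of("Cambridge", "Boston")));
    }
}
```

A sample "all fields" json with at least two values in every multiple-allowed field would let the automated export test catch this class of breakage.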

landreev · 2023-04-20T13:16:06Z

(I need to take a look at the failed tests run)

landreev · 2023-04-20T20:09:28Z

> @landreev Does using the notes field as an extension device per Micah Altman and our early medical metadata attempts provide any useful advantage in edge cases or would that just add "junk" that then becomes equally hard to manage?

I had to think about how to answer this. The short answer is: yes, we can add as much information as we want in custom notes. This extra, custom information is only useful to someone who knows where to look for it. But then it can't hurt either, provided nothing in these notes violates the schema. We had a few notes with illegal attributes that were invalidating the xml, but we got rid of them.

But it should be a matter of some case-by-case consideration, whether any specific piece of metadata actually warrants a custom note. The medical metadata experiments from way back were an attempt to work around the DDI being THE main import/export format/vehicle around which the application was built back then. We now have our json, with its capacity for accommodating any arbitrary metadata block, serving this purpose. So it seems reasonable to use DDI only for its intended purpose, as a format custom-built for QSS. There are things we definitely want to keep using custom notes for, IMO. For example, all of our files have mime types. This is a universally useful piece of information. We do want to have it encoded, but there is no standard place for it in the <otherMat> section of the DDI. So we made up a specially-formatted note, with some fixed attributes, and we use it for the content type of every file in a dataset. It's another no-brainer (again, IMO) that we don't want DDI to be an "everything" format. If we have a dataset with metadata from the History, Linguistics and Chemistry blocks - we don't really have any practical need to shove it all into the custom notes.

For the cases in between, case-by-case, really. Is there a reason to insist that everything from our Citation block be included in the DDI? - probably not? Like, that "distributor logo url", that we were putting into an illegal attribute of the <distrbtr> section; sure, it could be moved into a dedicated note, and/or attached to the free text in the section. But the chances of anyone outside Dataverse actually needing that bit of information seemed so unlikely that I just dropped it in this PR.

kaczmirek · 2023-05-11T10:07:49Z

Good to see that a milestone has been attached to this. Will this issue solve the following violations reported by the CESSDA validator, or is there a way for me to check it? The first ones seem to be DDI schema violations and are thus part of this ticket, I assume.

Schema Violations
org.xml.sax.SAXParseException; lineNumber: 72; columnNumber: 13; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":nation}'. One of '{"ddi:codebook:2_5":dataKind}' is expected.
org.xml.sax.SAXParseException; lineNumber: 87; columnNumber: 15; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":collMode}'. One of '{"ddi:codebook:2_5":collSitu, "ddi:codebook:2_5":actMin, "ddi:codebook:2_5":ConOps, "ddi:codebook:2_5":weight, "ddi:codebook:2_5":cleanOps}' is expected.
org.xml.sax.SAXParseException; lineNumber: 103; columnNumber: 15; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":setAvail}'. One of '{"ddi:codebook:2_5":notes}' is expected.

Constraint Violations
'/codeBook/@xml:lang' is mandatory

The schema violations seem to arise from non-compliance with sequence of tags. Here are some fixes that have been suggested to me:
Moved notes to be the last element in /codeBook/stdyDscr/dataAccs/ (check <xs:complexType name="dataAccsType"> in XSD)
Moved dataKind to be the last element in /codeBook/stdyDscr/stdyInfo/sumDscr/ (check <xs:complexType name="sumDscrType"> in XSD)
Moved sources to be after collMode and resInstru in /codeBook/stdyDscr/method/dataColl/ (check <xs:complexType name="dataCollType"> in XSD)
Added missing required element titl in /codeBook/stdyDscr/othrStdyMat/relPubl/citation/titlStmt/ (check <xs:complexType name="titlStmtType"> in XSD)
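Since most of these violations come down to the order of child elements inside an `xs:sequence`, here is a hedged sketch of one way to enforce it: keep the schema's order in a list and sort whatever is about to be written against it. The order below is copied from the `sumDscr` sequence quoted in the PR description; the class and method names are hypothetical, and this is not how `DdiExportUtil` is actually structured.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class SequenceOrderSketch {

    // Child order of <sumDscr>, as listed in the xs:sequence quoted in the PR description.
    static final List<String> SUM_DSCR_ORDER = List.of(
        "timePrd", "collDate", "nation", "geogCover", "geogUnit",
        "geoBndBox", "boundPoly", "anlyUnit", "universe", "dataKind");

    /** Each entry is {elementName, text}; entries are written in schema order. */
    static String writeSumDscr(List<String[]> entries) {
        List<String[]> ordered = new ArrayList<>();
        for (String[] entry : entries) {
            if (SUM_DSCR_ORDER.contains(entry[0])) {
                ordered.add(entry); // names the sequence does not list are dropped
            }
        }
        // List.sort is stable, so repeated elements keep their relative order.
        ordered.sort(Comparator.comparingInt((String[] e) -> SUM_DSCR_ORDER.indexOf(e[0])));
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xmlw.writeStartElement("sumDscr");
            for (String[] entry : ordered) {
                xmlw.writeStartElement(entry[0]);
                xmlw.writeCharacters(entry[1]);
                xmlw.writeEndElement();
            }
            xmlw.writeEndElement();
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"dataKind", "survey data"});
        entries.add(new String[] {"nation", "Germany"});
        System.out.println(writeSumDscr(entries));
    }
}
```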

pdurbin · 2023-05-11T14:46:38Z

@kaczmirek hi! Yes, I put the 5.14 milestone on this issue because it was closed by this pull request, which will be part of our next release (5.14):

Make Dataverse produce valid DDI 3648 #9484

If it's straightforward for you to build and install Dataverse from the "develop" branch (where we merge pull requests), please go ahead and check for anything you think we have missed. Thanks!

kaczmirek · 2023-08-23T10:41:01Z

@pdurbin I did not see this solved in the release notes for 5.14 or did I miss something? Was it rescheduled into 5.15?

pdurbin · 2023-08-23T15:48:06Z

@kaczmirek ah. We didn't highlight this issue in the release notes apart from this:

"For the complete list of code changes in this release, see the 5.14 milestone on GitHub."

Maybe we should have! It's probably a big deal to a lot of people.

landreev added 13 commits March 23, 2023 16:49

first quick rearrangements in the ddi export util (#3648)

5b015a8

rewrote the sumDscr section method to comply with the order of elemen…

d5f1e84

…ts as specified in the schema. #3648

added a couple of todo:s as reminders #3648

1517b59

rearranged the geographicCoverage fields, to maintain the order in th…

b8dddac

…e ddi schema #3648

This should be enough to produce a valid ddi w/no constraint or schem…

645c343

…a violations *for the control dataset we are using in our tests*. There is almost certainly more that needs to be done. #3648

more/better sequence order fixes. #3648

e79a414

changed this test so that it does NOT expect the exported xml to be i…

ce9a941

…nvalid. #3648

updated mini-dataset ddi export #3648

9eacbe3

wrong file, oops #3648

5297fb8

updated mini-dataset ddi export #3648

bfb72cd

the "full", all-fields-populated dataset export, updated. #3648

bf21975

tests (should be passing now). #3648

82debde

Merge branch 'develop' into 3648-invalid-ddi

de66928

github-actions bot reviewed Mar 29, 2023

mreekie added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 29, 2023

mreekie added Size: 10 A percentage of a sprint. 7 hours. and removed Size: 3 A percentage of a sprint. 2.1 hours. labels Mar 29, 2023

comments (#3648)

308a8ee

This was referenced Mar 30, 2023

DDI 2.5 OtherMat and FileDesc #9489

Closed

Make Dataverse produce valid DDI codebook 2.5 XML #3648

Closed

qqmyers approved these changes Apr 5, 2023

qqmyers reviewed Apr 5, 2023

src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java

qqmyers reviewed Apr 5, 2023

kcondon self-assigned this Apr 6, 2023

landreev added 2 commits April 11, 2023 17:18

the json "all fields" that @kcondon fixed (making it importable via t…

496ab1c

…he API), and the corresponding control ddi export. #3648

just a comment that was no longer needed. #3648

35db2a8

pdurbin assigned landreev Apr 17, 2023

made the sample "all default fields" dataset json we provide in the g…

b6c2599

…uide importable, using kcondon's fixes. #3648

landreev added 3 commits April 19, 2023 17:38

Got rid of the dubious "role=logo" attribute export. #3648

74497e5

cleaned up a comment to satisfy style checker. #3648

4717311

changed an exception catch that was resulting in a misleading erro me…

9545531

…ssage in the logs. (#3648)

Merge branch 'develop' into 3648-invalid-ddi

b9bb909

kcondon merged commit aba08e6 into develop Apr 24, 2023

kcondon deleted the 3648-invalid-ddi branch April 24, 2023 14:49

pdurbin mentioned this pull request Apr 26, 2023

make Alternative Title repeatable #9440

Merged

pdurbin added this to the 5.14 milestone May 10, 2023
