Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Dataverse produce valid DDI 3648 #9484

Merged
merged 22 commits into from
Apr 24, 2023
Merged

Make Dataverse produce valid DDI 3648 #9484

merged 22 commits into from
Apr 24, 2023

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Mar 29, 2023

What this PR does / why we need it:

copy-and-pasting my last comment from the issue:

Just to clarify a couple of things from an earlier discussion:

sizing:

* We will address the immediate issue of the bad ddi xml exports by looking specifically at what has been reported.
...
* If we find that the validator needs work, we will create a new separate issue when this is complete

"Looking specifically at what has been reported" may not easily apply. This is a very old issue, with a lot of back-and-forth (that's very hard to read), and many of the things reported earlier have already been fixed in other PRs. So I assumed that the goal of the PR was "make Dataverse produce valid DDI". (i.e., if something not explicitly mentioned here is obviously failing validation, it needed to be fixed too - it did not make sense to make a PR that would fix some things, but still produce ddi records that fail validation; especially since people have been waiting for it to be fixed since 2017).

The previously discussed automatic validation - adding code to the exporter that would validate in real time every ddi record produced, and only cache it if it passes the validation - does make sense to be left as a separate sprint-sized task. (the validation itself is not hard to add; but we'll need to figure out how to report the errors). I have enabled the validation test in DDIExporterTest.testExportDataset() however, so, in the meantime, after we merge this PR, any developer working on the ddi exporter will be alerted if they break it by introducing something invalid, because they won't be able to build their branch.

To clarify, in the current state, the exporter in my branch is producing valid ddi xml for our control "all fields" dataset, plus all the other datasets used in our tests, and whatever I could think of to test. It does NOT guarantee that there is no possible scenario where it can still output something illegal! So, yes, it is important to add auto-validation. And, if and when somebody finds another such scenario, we will treat it as a new issue.

A couple of arbitrary decisions had to be made. I will spell it out in the PR description. My general approach was, if something does not translate from our metadata to the ddi format 1:1, just drop it and move on. We don't assume that it's a goal, to preserve all of our metadata when exporting DC, it's obvious that only a subset of our block fields can be exported in that format. But it's not a possibility with the ddi either, now that we have multiple blocks and the application is no longer centered around quantitative social science. So, no need to sweat a lost individual field here and there.

Which issue(s) this PR closes:

Closes #3648

Special notes for your reviewer:

See the description above. Read the linked issue... at your own risk - it goes for miles, and years. Many things reported there have been fixed already. Some things reported there as late as in 2022 had already been fixed as of 2020.

The "arbitrary decisions" mentioned above:

In our metadata block, both the "Producer" and the "Distributor" have these same 4 sub-fields:
*Affiliation
*Abbreviation
*Logo
*URL

In the DDI however the producer and distrbtr have the attributes as follows:

<producer affiliation="..." abbr="..." role="...">
but 
<distrbtr affiliation="..." abbr="..." URI="...">

but our exporter was writing all FOUR of the attributes above in each section. I addressed this by simply dropping the URI= from the former, and role= from the latter.
(that said, anyone has any idea as to WHY we were putting the logo into the role attribute?? - as in, ending up with this in our test dataset: role="http://DistributorLogoURL2.org"; the export util is still doing this in the <producer> section.)

The other thing:
Our geospatial block allows for multiple bounding boxes. But the sumDscr in the DDI only allows one:

<xs:sequence>
   <xs:element ref="timePrd" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="collDate" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="nation" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geogCover" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geogUnit" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="geoBndBox" minOccurs="0"/>
   <xs:element ref="boundPoly" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="anlyUnit" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="universe" minOccurs="0" maxOccurs="unbounded"/>
   <xs:element ref="dataKind" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>

I've addressed this by simply using the first one, and disregarding the rest, if the export util encounters a dataset with multiple bounding boxes. There may be a cleaner solution, but I can't think of one now (and I needed to keep the amount of work on this PR manageable since it was supposed to be a "33"). In the discussion in the issue, it was suggested that we make the field non-multiple-allowed on the Dataverse side as well. That would be easy... except it's not clear at all what we would then do with existing datasets that already have multiple boundingboxes... With simple text fields, when we need to make something multiple single, it's easy to just concatenate multiple values... but you can't really combine bounding boxes.
On the other hand, while only one <geoBndBox> is allowed in a <sumDscr>, multiple <sumDscr> sections ARE actually allowed. So we could use that to be able to export such multiple values... but that would also be difficult... for reasons. If anyone has any constructive ideas, please speak up, but it will need to be handled as a separate issue. For now, producing valid DDI xml is the priority.

Suggestions on how to test this:

CESSDA Metadata Validator (https://cmv.cessda.eu/#!validation) is an excellent tool for testing DDI records. I'm assuming "CESSDA DATA CATALOGUE (CDC) DDI2.5 PROFILE - MONOLINGUAL: 1.0.4" is the correct validation profile to use.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

<xs:element ref="sampProc" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="sampleFrame" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="targetSampleSize" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="deviat" minOccurs="0" maxOccurs="unbounded"/>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.whitespace.FileTabCharacterCheck> reported by reviewdog 🐶
File contains tab characters (this is the first instance).

@coveralls
Copy link

coveralls commented Mar 29, 2023

Coverage Status

Coverage: 20.199% (+0.02%) from 20.181% when pulling 9545531 on 3648-invalid-ddi into 49ef7f8 on develop.

@mreekie mreekie added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 29, 2023
@mreekie
Copy link

mreekie commented Mar 29, 2023

Sprint Kickoff:

  • about 10 points left:

@mreekie mreekie added Size: 10 A percentage of a sprint. 7 hours. and removed Size: 3 A percentage of a sprint. 2.1 hours. labels Mar 29, 2023
Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than a a couple comments, this all looks good. Mostly reordering to match what the schema requires. Looks like the tests are passing. The style checker is complaining about tab chars though.

@@ -946,9 +1064,10 @@ private static void writeDistributorsElement(XMLStreamWriter xmlw, DatasetVersio
if (!distributorURL.isEmpty()) {
writeAttribute(xmlw, "URI", distributorURL);
}
if (!distributorLogoURL.isEmpty()) {
/* NOT IN THE SCHEMA! -L.A.if (!distributorLogoURL.isEmpty()) {
(and why were we putting this logo into the "role" field anyway?? - same with producerLogo above!)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to remove the producer logo URL now with these other breaking changes? No strong opinion, but if it really isn't useful, it might be easier to change now than in a later PR.

Copy link
Contributor

@jggautier jggautier Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case it's helpful: I wrote about this a bit. Back then I guessed that the distributor logo url was put into this "role" field "to preserve that metadata during migrations from Dataverse 3.x to 4.x".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers @jggautier Thank you for the comments. I've been meaning to finalize this - after everything else is fixed, that is. But hoping to wrap this up first thing next week.
I'm leaning towards the simplest/brutest force solution that Jim mentioned - just dropping this "role" attribute, since the logo url makes zero sense in it.
Julian, thank you for the writeup. I dug a bit further, by reading the old DVN code. And it doesn't look like it was done for the purposes of the migration either. It seems like this is even more ancient history, that it may be tracing its history to a hack that DVNs used to rely on to pass the logos when harvesting from each other. However, unless I'm really not reading it right, it just doesn't seem like it was doing the same thing as the current Dataverse exporter - it was never putting that logo url in the "role" attribute; it was encoding it as an HTML link inside the text of the producer field, with a "role=..." attribute of its own... But then, when the export and import code were being ported to Dataverse, this may have been changed simply by mistake (?). Does this make any sense? - it doesn't make complete sense to me, I feel like I may be missing something - but I am fairly confident by now that nobody could possibly be relying on that attribute in its current form now. So I am indeed feeling like killing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming that I have dropped this dubious "role=logo" attribute from the export.

@kcondon kcondon self-assigned this Apr 6, 2023
@kcondon
Copy link
Contributor

kcondon commented Apr 6, 2023

@landreev Does using the notes field as an extension device per Micah Altman and our early medical metadata attempts provide any useful advantage in edge cases or would that just add "junk" that then becomes equally hard to manage?

@landreev
Copy link
Contributor Author

(deleted the previous version of the comment; since I'm correcting it, and also because it somehow got attached to an unrelated thread above)

@kcondon I was able to confirm that both of the "all fields" json examples that we provide in the source tree: ./src/test/java/edu/harvard/iq/dataverse/export/ddi/dataset-create-new-all-ddi-fields.json and scripts/api/data/dataset-create-new-all-default-fields.json are now importable (I applied your fixes to both of the above in my branch). And, both of the resulting datasets are exportable as ddi, and the produced ddi is valid. Tested both in my build env. and on dataverse-internal (deployed my build there).
The dataset that you showed me earlier there, FK2/JA1SDI that was created via the form, is still failing to export, with the same misleading error message. I'll be investigating it first thing in the morning. (something controlled vocab.-related, it looks like, but I need to figure out what exactly). w

…made

multiple in PR #9254; would be great to put together a process for developers
who need to make changes to fields in metadata blocks that would help
them to know of all the places where changes like this need to be made.
(not the first time, when something breaks, in ddi export specifically, after
a field is made multiple). #3648
@landreev
Copy link
Contributor Author

The export failure on the test dataset FK2/JA1SDI mentioned in the previous comment, that delayed the QA of the PR, was only happening on account of the productionPlace having been made multiple recently (pr #9254). Kevin's manually-created dataset actually did contain 2 productionPlace entries. Our supplied sample “all fields” importable json however only had 1, so the automated test export wasn't bombing on it.
I checked in a fix. But it would be great to work out a process for developers making changes to fields in metadata blocks that would help them to know of all the places where changes like this need to be made. (This was not the first time when something broke, in the ddi export specifically, after a field was made multiple).

@landreev
Copy link
Contributor Author

(I need to take a look at the failed tests run)

@landreev
Copy link
Contributor Author

@landreev Does using the notes field as an extension device per Micah Altman and our early medical metadata attempts provide any useful advantage in edge cases or would that just add "junk" that then becomes equally hard to manage?

I had to think about how to answer this. Short answer should be, yes, we can add as much information as we want in custom notes. This extra, custom information is only useful to someone who knows where to look for it. But then it can't hurt either. Providing nothing in these notes violates the schema. We had a few notes with illegal attributes that were invalidating the xml, but we got rid of them.

But it should be a matter of some case-by-case consideration, whether any specific piece of metadata actually warrants a custom note. The medical metadata experiments from way back were an attempt to work around the DDI being THE main import/export format/vehicle around which the application was built back then. We now have our json, with its capacity for accommodating any arbitrary metadata block serving this purpose. So it seems reasonable to only use DDI for its intended purpose only, as a format custom-built for QSS. There are things we definitely want to keep using custom notes for, IMO. For example, all of our files have mime types. This is a universally useful piece of information. We do want to have it encoded, but there is no standard place for it in the <otherMat> section of the DDI. So we made up a specially-formatted note, with some fixed attributes and we use it for the content type of every file in a dataset. It's another no-brainer (again, IMO) that we don't want DDI to be "everything" format. If we have a dataset with metadata from the History, Linguistics and Chemistry blocks - we don't really have any practical need to shove it all into the custom notes.

For the cases in between, case-by-case, really. Is there a reason to insist that everything from our Citation block be included in the DDI? - probably not? Like, that "distributor logo url", that we were putting into an illegal attribute of the <distrbtr> section; sure, it could be moved into a dedicated note, and/or attached to the free text in the section. But the chances of anyone outside Dataverse actually needing that bit of information seemed so unlikely that I just dropped it in this PR.

@kcondon kcondon merged commit aba08e6 into develop Apr 24, 2023
@kcondon kcondon deleted the 3648-invalid-ddi branch April 24, 2023 14:49
@pdurbin pdurbin added this to the 5.14 milestone May 10, 2023
@kaczmirek
Copy link

kaczmirek commented May 11, 2023

Good to see that a milestone has been attached to this. Will this issue solve the following violations with the CESSDA validator or is there a way for me to check it. The first seem to be DDI schema violations and are thus part of this ticket I assume.

Schema Violations
org.xml.sax.SAXParseException; lineNumber: 72; columnNumber: 13; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":nation}'. One of '{"ddi:codebook:2_5":dataKind}' is expected.
org.xml.sax.SAXParseException; lineNumber: 87; columnNumber: 15; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":collMode}'. One of '{"ddi:codebook:2_5":collSitu, "ddi:codebook:2_5":actMin, "ddi:codebook:2_5":ConOps, "ddi:codebook:2_5":weight, "ddi:codebook:2_5":cleanOps}' is expected.
org.xml.sax.SAXParseException; lineNumber: 103; columnNumber: 15; cvc-complex-type.2.4.a: Invalid content was found starting with element '{"ddi:codebook:2_5":setAvail}'. One of '{"ddi:codebook:2_5":notes}' is expected.

Constraint Violations
'/codeBook/@xml:lang' is mandatory

The schema violations seem to arise from non-compliance with sequence of tags. Here are some fixes that have been suggested to me:
Moved notes to be the last element in /codeBook/stdyDscr/dataAccs/ (check <xs:complexType name="dataAccsType"> in XSD)
Moved dataKind to be the last element in /codeBook/stdyDscr/stdyInfo/sumDscr/ (check <xs:complexType name="sumDscrType"> in XSD)
Moved sources to be after collMode and resInstru in /codeBook/stdyDscr/method/dataColl/ (check <xs:complexType name="dataCollType"> in XSD)
Added missing required element titl in /codeBook/stdyDscr/othrStdyMat/relPubl/citation/titlStmt/ (check <xs:complexType name="titlStmtType"> in XSD)

@pdurbin
Copy link
Member

pdurbin commented May 11, 2023

@kaczmirek hi! Yes, I put the 5.14 milestone on this issue because it was closed by this pull request, which will be part of our next release (5.14):

If it's straighforward for you to build and install Dataverse from the "develop" branch (where we merge pull requests), please go ahead and check for anything you think we have missed. Thanks!

@kaczmirek
Copy link

@pdurbin I did not see this solved in the release notes for 5.14 or did I miss something? Was it rescheduled into 5.15?

@pdurbin
Copy link
Member

pdurbin commented Aug 23, 2023

@kaczmirek ah. We didn't highlight this issue in the release notes apart from this:

"For the complete list of code changes in this release, see the 5.14 milestone on GitHub."

Maybe we should have! It's probably a big deal to a lot of people.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 10 A percentage of a sprint. 7 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Dataverse produce valid DDI codebook 2.5 XML
8 participants