Make Dataverse produce valid DDI #3648 #9484
Conversation
<xs:element ref="sampProc" minOccurs="0" maxOccurs="unbounded"/> | ||
<xs:element ref="sampleFrame" minOccurs="0" maxOccurs="unbounded"/> | ||
<xs:element ref="targetSampleSize" minOccurs="0" maxOccurs="unbounded"/> | ||
<xs:element ref="deviat" minOccurs="0" maxOccurs="unbounded"/> |
🚫 [reviewdog] <com.puppycrawl.tools.checkstyle.checks.whitespace.FileTabCharacterCheck> reported by reviewdog 🐶
File contains tab characters (this is the first instance).
Other than a couple of comments, this all looks good. Mostly reordering to match what the schema requires. Looks like the tests are passing. The style checker is complaining about tab chars, though.
src/main/java/edu/harvard/iq/dataverse/export/ddi/DdiExportUtil.java
@@ -946,9 +1064,10 @@ private static void writeDistributorsElement(XMLStreamWriter xmlw, DatasetVersio
        if (!distributorURL.isEmpty()) {
            writeAttribute(xmlw, "URI", distributorURL);
        }
-       if (!distributorLogoURL.isEmpty()) {
+       /* NOT IN THE SCHEMA! -L.A.
+       if (!distributorLogoURL.isEmpty()) {
+       (and why were we putting this logo into the "role" field anyway?? - same with producerLogo above!)
Does it make sense to remove the producer logo URL now with these other breaking changes? No strong opinion, but if it really isn't useful, it might be easier to change now than in a later PR.
Just in case it's helpful: I wrote about this a bit. Back then I guessed that the distributor logo url was put into this "role" field "to preserve that metadata during migrations from Dataverse 3.x to 4.x".
@qqmyers @jggautier Thank you for the comments. I've been meaning to finalize this - after everything else is fixed, that is. But hoping to wrap this up first thing next week.
I'm leaning towards the simplest/most brute-force solution that Jim mentioned - just dropping this "role" attribute, since the logo url makes zero sense in it.
Julian, thank you for the writeup. I dug a bit further, by reading the old DVN code, and it doesn't look like it was done for the purposes of the migration either. It seems this is even more ancient history: it may trace back to a hack that DVNs used to rely on to pass the logos when harvesting from each other. However, unless I'm really misreading it, it wasn't doing the same thing as the current Dataverse exporter - it was never putting that logo url in the "role" attribute; it was encoding it as an HTML link inside the text of the producer field, with a "role=..." attribute of its own... But then, when the export and import code were being ported to Dataverse, this may have been changed simply by mistake (?). Does this make any sense? It doesn't make complete sense to me - I feel like I may be missing something - but I am fairly confident by now that nobody could possibly be relying on that attribute in its current form. So I am indeed feeling like killing it.
Just confirming that I have dropped this dubious "role=logo" attribute from the export.
@landreev Does using the notes field as an extension device, per Micah Altman and our early medical metadata attempts, provide any useful advantage in edge cases, or would that just add "junk" that then becomes equally hard to manage?
…he API), and the corresponding control ddi export. #3648
…uide importable, using kcondon's fixes. #3648
(Deleted the previous version of this comment, since I'm correcting it, and also because it somehow got attached to an unrelated thread above.) @kcondon I was able to confirm that both of the "all fields" json examples that we provide in the source tree:
…made multiple in PR #9254; would be great to put together a process for developers who need to make changes to fields in metadata blocks, to help them know all the places where changes like this need to be made. (Not the first time something has broken, in ddi export specifically, after a field was made multiple.) #3648
The export failure on the test dataset
(I need to take a look at the failed test run)
…ssage in the logs. (#3648)
I had to think about how to answer this. The short answer should be: yes, we can add as much information as we want in custom notes. This extra, custom information is only useful to someone who knows where to look for it, but then it can't hurt either, provided nothing in these notes violates the schema. (We had a few notes with illegal attributes that were invalidating the xml, but we got rid of them.) It should still be a matter of case-by-case consideration whether any specific piece of metadata actually warrants a custom note.

The medical metadata experiments from way back were an attempt to work around the DDI being THE main import/export format/vehicle around which the application was built back then. We now have our json, with its capacity for accommodating any arbitrary metadata block, serving this purpose. So it seems reasonable to use DDI only for its intended purpose, as a format custom-built for QSS.

There are things we definitely want to keep using custom notes for, IMO. For example, all of our files have mime types. This is a universally useful piece of information, and we do want to have it encoded, but there is no standard place for it in the DDI.

For the cases in between: case-by-case, really. Is there a reason to insist that everything from our Citation block be included in the DDI? Probably not. Like that "distributor logo url" that we were putting into an illegal attribute of the distrbtr element.
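To make the custom-note idea concrete, here is a minimal sketch of carrying a file's mime type in a DDI notes element. The type and subject values are hypothetical placeholders, not necessarily what DdiExportUtil actually emits.

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

public class CustomNoteSketch {
    // Writes a <notes> element carrying a file's mime type.
    // The "type" and "subject" values below are assumptions for illustration.
    static void writeMimeTypeNote(XMLStreamWriter xmlw, String mimeType) throws XMLStreamException {
        xmlw.writeStartElement("notes");
        xmlw.writeAttribute("type", "DATAVERSE:CONTENTTYPE"); // hypothetical type label
        xmlw.writeAttribute("subject", "Content/MIME Type");  // hypothetical subject
        xmlw.writeCharacters(mimeType);
        xmlw.writeEndElement();
    }

    public static void main(String[] args) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writeMimeTypeNote(xmlw, "text/tab-separated-values");
        xmlw.flush();
        System.out.println(out); // <notes type="..." subject="...">text/tab-separated-values</notes>
    }
}
```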
Good to see that a milestone has been attached to this. Will this issue solve the following violations reported by the CESSDA validator, or is there a way for me to check? The first ones seem to be DDI schema violations and are thus part of this ticket, I assume.

Schema Violations

Constraint Violations

The schema violations seem to arise from non-compliance with the required sequence of tags. Here are some fixes that have been suggested to me:
@kaczmirek hi! Yes, I put the 5.14 milestone on this issue because it was closed by this pull request, which will be part of our next release (5.14): If it's straightforward for you to build and install Dataverse from the "develop" branch (where we merge pull requests), please go ahead and check for anything you think we have missed. Thanks!
@pdurbin I did not see this solved in the release notes for 5.14 - or did I miss something? Was it rescheduled for 5.15?
@kaczmirek ah. We didn't highlight this issue in the release notes apart from this: "For the complete list of code changes in this release, see the 5.14 milestone on GitHub." Maybe we should have! It's probably a big deal to a lot of people.
What this PR does / why we need it:
copy-and-pasting my last comment from the issue:
Just to clarify a couple of things from an earlier discussion:
"Looking specifically at what has been reported" may not easily apply. This is a very old issue, with a lot of back-and-forth (that's very hard to read), and many of the things reported earlier have already been fixed in other PRs. So I assumed that the goal of the PR was "make Dataverse produce valid DDI". (i.e., if something not explicitly mentioned here is obviously failing validation, it needed to be fixed too - it did not make sense to make a PR that would fix some things, but still produce ddi records that fail validation; especially since people have been waiting for it to be fixed since 2017).
The previously discussed automatic validation - adding code to the exporter that would validate every ddi record produced in real time, and only cache it if it passes validation - does make sense to leave as a separate sprint-sized task. (The validation itself is not hard to add, but we'll need to figure out how to report the errors.) I have, however, enabled the validation test in DDIExporterTest.testExportDataset(), so, in the meantime, after we merge this PR, any developer working on the ddi exporter will be alerted if they break it by introducing something invalid, because they won't be able to build their branch.

To clarify, in the current state, the exporter in my branch is producing valid ddi xml for our control "all fields" dataset, plus all the other datasets used in our tests, and whatever else I could think of to test. It does NOT guarantee that there is no possible scenario where it can still output something illegal! So, yes, it is important to add auto-validation. And, if and when somebody finds another such scenario, we will treat it as a new issue.
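As a sketch of what that automatic validation could look like (a minimal sketch using the standard JAXP validation API, not the code in this PR; the schema URL is an assumption, and a real implementation would more likely bundle a local copy of the XSD):

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;
import java.net.URL;

public class DdiValidationSketch {
    // Validates an exported DDI record against the DDI Codebook 2.5 XSD.
    // Returns true if the XML is schema-valid; a caller could decide to
    // cache the export only when this passes.
    static boolean isValidDdi(String ddiXml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // Assumed schema location; double-check against the DDI Alliance site.
            Schema schema = factory.newSchema(
                new URL("https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd"));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(ddiXml)));
            return true;
        } catch (Exception e) {
            // Schema violations surface as SAXParseExceptions with line/column info,
            // which is exactly what we would need to report back somehow.
            return false;
        }
    }
}
```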
A couple of arbitrary decisions had to be made; I will spell them out in the PR description. My general approach was: if something does not translate from our metadata to the ddi format 1:1, just drop it and move on. We don't assume that it's a goal to preserve all of our metadata when exporting DC; it's obvious that only a subset of our block fields can be exported in that format. It's not a possibility with the ddi either, now that we have multiple blocks and the application is no longer centered around quantitative social science. So, no need to sweat a lost individual field here and there.
Which issue(s) this PR closes:
Closes #3648
Special notes for your reviewer:
See the description above. Read the linked issue... at your own risk - it goes for miles, and years. Many things reported there have been fixed already. Some things reported there as late as in 2022 had already been fixed as of 2020.
The "arbitrary decisions" mentioned above:
In our metadata block, both the "Producer" and the "Distributor" have these same 4 sub-fields:
* Affiliation
* Abbreviation
* Logo
* URL
In the DDI, however, the producer and distrbtr elements each allow only some of these as attributes, yet our exporter was writing all FOUR of the attributes above in each section. I addressed this by simply dropping the URI= from the former, and role= from the latter.

(That said, does anyone have any idea as to WHY we were putting the logo into the role attribute? As in, ending up with this in our test dataset: role="http://DistributorLogoURL2.org"; the export util is still doing this in the <producer> section.)
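A minimal sketch of what the distrbtr write can look like with only the schema-legal attributes kept. This is an illustration, not the actual DdiExportUtil code; the method and parameter names are assumptions.

```java
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class DistributorSketch {
    // Writes a <distrbtr> element using only schema-legal attributes
    // (abbr, affiliation, URI); the logo URL is simply not exported.
    static void writeDistributor(XMLStreamWriter xmlw, String name, String abbr,
                                 String affiliation, String url) throws XMLStreamException {
        xmlw.writeStartElement("distrbtr");
        if (abbr != null && !abbr.isEmpty()) {
            xmlw.writeAttribute("abbr", abbr);
        }
        if (affiliation != null && !affiliation.isEmpty()) {
            xmlw.writeAttribute("affiliation", affiliation);
        }
        if (url != null && !url.isEmpty()) {
            xmlw.writeAttribute("URI", url);
        }
        // No role= attribute here: that is where the logo URL used to go.
        xmlw.writeCharacters(name);
        xmlw.writeEndElement();
    }
}
```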
The other thing: our geospatial block allows for multiple bounding boxes, but the sumDscr in the DDI only allows one. I've addressed this by simply using the first one and disregarding the rest, if the export util encounters a dataset with multiple bounding boxes. There may be a cleaner solution, but I can't think of one now (and I needed to keep the amount of work on this PR manageable, since it was supposed to be a "33"). In the discussion in the issue, it was suggested that we make the field non-multiple on the Dataverse side as well. That would be easy... except it's not at all clear what we would then do with existing datasets that already have multiple bounding boxes... With simple text fields, when we need to make a multiple field single, it's easy to just concatenate the values... but you can't really combine bounding boxes.
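For illustration, a minimal sketch of that "first box only" approach, assuming a hypothetical value holder for each box; the class, record, and method names are not the PR's actual code, and the child-element order should be double-checked against the schema.

```java
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.List;

public class BoundingBoxSketch {
    // Hypothetical value holder for one geographic bounding box.
    record BoundingBox(String west, String east, String north, String south) {}

    // Exports only the first bounding box into the single <geoBndBox>
    // that a <sumDscr> section may contain; any further boxes are skipped.
    static void writeGeoBndBox(XMLStreamWriter xmlw, List<BoundingBox> boxes) throws XMLStreamException {
        if (boxes == null || boxes.isEmpty()) {
            return;
        }
        BoundingBox first = boxes.get(0);
        xmlw.writeStartElement("geoBndBox");
        writeBound(xmlw, "westBL", first.west());
        writeBound(xmlw, "eastBL", first.east());
        writeBound(xmlw, "southBL", first.south());
        writeBound(xmlw, "northBL", first.north());
        xmlw.writeEndElement();
    }

    private static void writeBound(XMLStreamWriter xmlw, String tag, String value) throws XMLStreamException {
        xmlw.writeStartElement(tag);
        xmlw.writeCharacters(value);
        xmlw.writeEndElement();
    }
}
```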
On the other hand, while only one <geoBndBox> is allowed in a <sumDscr>, multiple <sumDscr> sections ARE actually allowed. So we could use that to export such multiple values... but that would also be difficult... for reasons. If anyone has any constructive ideas, please speak up, but it will need to be handled as a separate issue. For now, producing valid DDI xml is the priority.

Suggestions on how to test this:
CESSDA Metadata Validator (https://cmv.cessda.eu/#!validation) is an excellent tool for testing DDI records. I'm assuming "CESSDA DATA CATALOGUE (CDC) DDI2.5 PROFILE - MONOLINGUAL: 1.0.4" is the correct validation profile to use.
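If it helps, a minimal sketch of pulling the DDI export for a dataset so it can be fed to the validator; the base URL and persistent ID below are placeholders, and I'm assuming the documented native API export endpoint with these parameters.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchDdiExport {
    // Fetches the DDI export for one dataset so it can be uploaded to the
    // CESSDA validator. The installation URL and persistent ID are placeholders.
    public static void main(String[] args) throws Exception {
        String baseUrl = "https://demo.dataverse.org";      // placeholder installation
        String persistentId = "doi:10.5072/FK2/EXAMPLE";    // placeholder PID
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/datasets/export?exporter=ddi&persistentId=" + persistentId))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the DDI xml to upload to cmv.cessda.eu
    }
}
```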
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: