7844 codemeta schema #7877

Merged
merged 19 commits into from
Dec 21, 2022

poikilotherm
Contributor

@poikilotherm poikilotherm commented May 17, 2021

What this PR does / why we need it:
This adds the CodeMeta schema as a default, out-of-the-box schema for (new) installations.
This pull request is a first step. Please see the discussion points below for your review. We need to be careful about the scope of this first step to keep compatibility in mind. (There is no schema migration present in the Dataverse application, so when changing data types etc, we need to write SQL database migrations manually!)

TODOs

  • Test TSV, make screenshots
  • Add release note
  • Sort out remaining questions (see below)

Which issue(s) this PR closes:

Closes #7844

Special notes for your reviewer:

  1. With the new feature of "metadata block facets" per collection, should we use a different displayName for the block? (It is currently "Software Metadata (CodeMeta 2.0)".)
  2. Should we use the W3C proposed vocabulary for applicationCategory?
    • Should we go ahead and add ResearchApplication to this list and reach out to schema.org and CodeMeta people to push for adding it to the list? (Maybe Google, too?)
    • Should we go ahead and reach out to CodeMeta about a field on scientific method used in the software? (Not covered by subject field, which is very coarse anyway)
  3. Should we make the *Requirements fields use integer values in a fixed unit (byte? kilobyte? megabyte? or similar for CPU) instead of arbitrary text values?
  4. Do we want to add docs about the crosswalk of "Dataverse Metadata" to "CodeMeta" to the guides?
  5. What other docs do we want to include?
  6. Do we want to add https://github.com/SoftwareUnderstanding/software_types (which would extend this beyond pure CodeMeta)?
  7. Do we want to add a field to allow documenting computational methods in use?
    • There is no standard, vocabulary, schema or ontology for this yet, we'd be on our own.
    • This might as well be done via Citation Blocks Keywords
    • We could leave this for a later extension of the block
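For point 3, here is a minimal sketch of what normalizing a free-text requirement to an integer byte count could look like. The function name `to_bytes`, the accepted unit spellings, and the binary (1024-based) multipliers are assumptions for illustration only; neither CodeMeta nor Dataverse prescribes them.

```python
import re

# Hypothetical unit table; using binary multiples is an assumption.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(text: str) -> int:
    """Convert a value such as '4 GB' or '512MB' to an integer number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if match is None:
        # Free text that does not follow the "<number> <unit>" pattern is rejected.
        raise ValueError(f"unrecognized requirement: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])
```

Storing the result as an integer would make values comparable and facetable, at the cost of rejecting entries like "as much as possible" that a free-text field accepts today.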

Suggestions on how to test this:

  • Load the TSV via the usual API call.
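A minimal sketch of that call in Python, assuming a local installation at `http://localhost:8080`, the admin endpoint `/api/admin/datasetfield/load`, and a TSV file named `codemeta.tsv` (the file name and base URL are placeholders, not taken from this PR). The helper only builds the request; sending it is left as a commented usage line.

```python
import urllib.request


def build_load_request(tsv_path: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request that uploads a metadata block TSV to the admin API."""
    with open(tsv_path, "rb") as handle:
        body = handle.read()
    return urllib.request.Request(
        url=f"{base_url}/api/admin/datasetfield/load",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "text/tab-separated-values"},
        method="POST",
    )


# Usage (requires a running installation with the admin API reachable):
# with urllib.request.urlopen(build_load_request("codemeta.tsv")) as response:
#     print(response.status)
```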

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

On dataset creation:
image

On dataset editing:
image

As JSON-LD export:
image

Is there a release notes update needed for this change?:

  • Yet to be done, as review/extension/discussion needed.

Additional documentation:

Tagging @doigl @atrisovic @4tikhonov @jggautier @djbrooke @pdurbin (I don't know the GH names of the other WG members)

@coveralls

coveralls commented May 17, 2021

Coverage Status

Coverage remained the same at 19.326% when pulling fcc36d0 on poikilotherm:7844-codemeta-schema into 2dbf9b7 on IQSS:develop.

@jggautier
Contributor

jggautier commented May 17, 2021

This is great! I'd like to point out two issues that I think are most pressing and I hope could be resolved before this is merged:

  1. I think there's a stray tab in line 11, applicationCategory, that's splitting its displayName into two columns:

    Screen Shot 2021-05-17 at 10 55 40 AM
  2. There are a few different ways that the Citation metadatablock's fields are still designed to describe data as opposed to software. It looks like we'll be tackling these issues in future work (such as how metadata is exported), but I hope some of the issues that users will see when depositing software can be resolved:

    • The tooltips for most of the fields, even fields that make sense for describing software, such as Title, include the word "Dataset".

      One solution might be to generalize the text in the tooltips of the fields in the Citation metadatablock, for example by replacing the word "Dataset" with "deposit".

    • Some of the fields wouldn't make sense for describing software at all, such as "Series", "Date of collection" and "Type of data". If someone is depositing software, I would think they wouldn't need to see these fields.

      To prevent depositors of software from seeing fields that they don't need, one solution might be to recommend that repositories use a Dataverse collection only for software deposits. When setting up that collection, they should hide the fields in the Citation metadatablock that describe data (e.g. "Date of collection" and "Type of data") and enable the Software metadatablock for it. Repositories, especially "self curated" ones, should then not have users deposit a mix of datasets and software into the same Dataverse collection: the Dataverse software cannot know whether a user is depositing data or software, so it has no way of showing only the relevant metadata fields.
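A stray tab like the one in point 1 could be caught before loading with a quick column-count check. This is a rough sketch, assuming that each section of the TSV starts with a header line beginning with `#` and that no data row should have more cells than its section header; the function name is hypothetical.

```python
def find_overlong_rows(tsv_text: str) -> list[tuple[int, int, int]]:
    """Return (line_number, expected, found) for rows wider than their section header."""
    problems = []
    expected = None
    for number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        cells = line.split("\t")
        if line.startswith("#"):
            # A header line such as "#datasetField ..." sets the width for its section.
            expected = len(cells)
        elif expected is not None and len(cells) > expected:
            problems.append((number, expected, len(cells)))
    return problems
```

The check only flags rows that are too wide. Rows with fewer cells are ignored on purpose, since trailing empty columns are often omitted.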

@poikilotherm
Contributor Author

poikilotherm commented May 17, 2021

I once had the idea to actually make the citation block pluggable, non-mandatory. I know this requires A LOT, but maybe it's a way to go, if we don't want other architectural changes like abstracting the concept of sets.

However, this seems beyond scope. Thanks for the pointer for the description issue, I'll fix that right away.

@poikilotherm
Contributor Author

@doigl has some data about the candidates for displayOnCreate from DaRuS:

image

I agree on all of those, except for applicationCategory, which has been used with free text rather than with the W3C vocabulary. I still think we should not do that.

@poikilotherm
Contributor Author

poikilotherm commented May 18, 2021

There is a list of programming languages in WikiData, containing ~1500 entries. (Via https://en.wikiversity.org/wiki/Research_in_programming_Wikidata/Programming_languages)

There is an extensive list of operating systems (not names alone) in WikiData with ~1100 entries. (Via https://en.wikiversity.org/wiki/Research_in_programming_Wikidata/Operating_systems)

We might wanna play with the OS query to select only instances that are not a subclass of another OS and not "based on" to gain the top level ones only.
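A rough sketch of such a query, built as a Python string. The Wikidata identifiers used here (`Q9135` for operating system, `P31` instance of, `P279` subclass of, `P144` based on) are written from memory and should be verified on Wikidata before use. Reading "not a subclass of another OS" as "has no `P279` link to another OS item" is also an assumption.

```python
import urllib.parse

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def top_level_os_query() -> str:
    """Return a SPARQL query for OS items with no OS parent class and no 'based on' link."""
    return """
SELECT ?os ?osLabel WHERE {
  ?os wdt:P31 wd:Q9135 .
  FILTER NOT EXISTS { ?os wdt:P279 ?parent . ?parent wdt:P31 wd:Q9135 . }
  FILTER NOT EXISTS { ?os wdt:P144 ?base . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""".strip()


def query_url(query: str) -> str:
    """Build a GET URL for the query service that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WIKIDATA_SPARQL_ENDPOINT}?{params}"
```

The resulting URL can be opened in a browser or fetched with any HTTP client to see how much the two filters shrink the ~1100 entries.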

@poikilotherm
Contributor Author

poikilotherm commented May 20, 2021

I checked on the autocomplete/filtering support for controlled vocabulary fields in compound fields. Here's what I found:

  1. For primitive fields using a CV, we use a filter input in the dropdown for "check multiple" metadata fields, but not for single values. This was done as part of #6000 ("UI proposal: as a Dataverse user, I want autocompletion for (long) controlled vocab metadata") / PR #6339 ("Subjects disappear when clicked in metadata editing"). (I knew there was an old issue for this... 😄 )
  2. This change has not been introduced for "check multiple" in compound fields. No idea why. Tagging @mheppler here.
  3. The remaining issue of single value fields has never been addressed, but @TaniaSchlatter mentioned a few thoughts.

I guess adding the filter functionality to the "check multiple" fields in compound fields is an easy way forward. As this seems like a good discussion for Dataverse software decoupled from this issue about CodeMeta, I'm going to create that little issue now. ↪️#7888

After revisiting the schema, I see that the field operatingSystem is "allow multiple", but the (potential) CV field for the OS name would - of course - not be "check multiple". So we still need a solution for number 3 above, if we want this. ↪️#7889

@mfenner

mfenner commented May 27, 2021

@poikilotherm as Codemeta is close to version 3.0 (https://blog.datacite.org/codemeta-we-need-your-feedback/), applicationCategory and scientific method are good topics to discuss now. Would the Dataverse community want them to become part of Codemeta?

And what is the timing for this pull request with regards to Codemeta 2.0 vs. Codemeta 3.0 (which is still a few months away)?

@poikilotherm
Contributor Author

poikilotherm commented May 27, 2021

> @poikilotherm as Codemeta is close to version 3.0 (https://blog.datacite.org/codemeta-we-need-your-feedback/), applicationCategory and scientific method are good topics to discuss now. Would the Dataverse community want them to become part of Codemeta?

@mfenner I think there is a high demand for these fields not only within the boundaries of the Dataverse community. I know that @sdruskat is also looking into this matter for his PhD thesis.

Are you aware of any existing, reusable controlled vocabularies, preferably as RDF/SKOS/JSON-LD/sth. with a PID, we could reuse for a field like scientificMethod? Dataverse will soonish support using those kinds of sources within the UI (#7712).

> And what is the timing for this pull request with regards to Codemeta 2.0 vs. Codemeta 3.0 (which is still a few months away)?

I'm not so sure about this. Maybe it would be a good start to create a schema for 2.0 now and upgrade to 3.0 later on. It's a rather low hanging fruit. It might become necessary to introduce a migration method in Dataverse, but this seems like a good addition beyond the CodeMeta use case.

- Add missing displayOrder values
- Fix missing type for software requirements
- Avoid splitting up compound fields too much,
  otherwise data is not exportable to schema.org
  or CodeMeta JSON-LD without special handling (IQSS#7856)
- Tweak order
- Tweak descriptions and examples
- Fix whitespaces and line endings
@pdurbin
Member

pdurbin commented Jul 21, 2022

@poikilotherm I couldn't get this tsv to load without making a few changes. I put them in a pull request for you to review and perhaps merge: poikilotherm#553

@poikilotherm
Contributor Author

Thanks @pdurbin!

Just today I picked up working on this again (not yet pushed).

There's lots of stuff to be moved around, which will also incorporate your changes😉

@poikilotherm
Contributor Author

@pdurbin @mreekie I just pushed the necessary changes to revert the addition to the schema. Also updated to latest develop. Dunno why the RTD CI fails, but seems unrelated.

@poikilotherm
Contributor Author

(We'll make a PR to back out the schema.xml change for computational workflow as well, for consistency.)

Chop chop here we go #9225

@mreekie

mreekie commented Dec 14, 2022

added to sprint Dec 15, 2022

@mreekie mreekie added the NIH OTA: 1.3.1 3 | 1.3.1 | Support software metadata | 5 prdOwnThis is an item synched from the product planning... label Dec 15, 2022
Member

@pdurbin pdurbin left a comment


Just a quick review. I haven't loaded up the block.

datasetfieldtype.softwareHelp.title=Software Help/Documentation
datasetfieldtype.softwareHelp.description=Link to help texts or documentation
datasetfieldtype.softwareHelp.watermark=e.g. https://user.github.io/project/docs
datasetfieldtype.readme.title=Readme
Member

@pdurbin pdurbin Dec 16, 2022


Isn't "README" a little more standard? (Instead of "Readme".) If others agree, we should change the tsv as well.

Contributor Author


I agree the filename should be sth with README. But do we want an all caps field name in the UI?

Member


Yes, I was trying to suggest all caps README in the UI. That's what you have in the description ("Link to the README of the project") and the watermark ("e.g. https://github.com/user/project/blob/main/README.md"), both of which appear in the UI, so it should probably be consistent, right?

It's weird, Codemeta itself has "link to software Readme file" as a description at https://codemeta.github.io/terms/ but before codemeta/codemeta@0818c31 it was all caps README:

  • before: "A URL for the software README file"
  • after: "link to software Readme file"

Member


I haven't committed and pushed this, but here's how more consistent all uppercase README would look (name, description, and watermark):

Screen Shot 2022-12-21 at 9 39 55 AM

src/main/java/propertyFiles/codeMeta20.properties (outdated, resolved)
Member

@pdurbin pdurbin left a comment


I played around with this locally and it's looking good!

I'm sending it to QA but I'll make a few observations:

  • A lot of these fields would benefit from a picklist (programming languages, etc.) so I hope that we'll see a pull request to add some external controlled vocabularies.
  • For the tooltips, there is some inconsistency of final periods being present or absent.
  • It's weird that SVN is listed before Git, but that's because of CodeMeta and Schema.org.
  • I'm slightly weirded out by the inconsistency between Readme (title) and README (tooltip and watermark).
  • For some fields, it would be nice to have units (memory requirements, for example) but this is feedback to give upstream to CodeMeta and Schema.org, I imagine.
  • I find "Target Product" to be a bit odd. Again, this is feedback to send upstream. I think the idea is that if, for example, you're creating a plugin for WordPress, you can put WordPress as the target product.
  • There is a failing API test (FilesIT.test_008_ReplaceFileAlreadyDeleted) but I'm sure it has nothing to do with this metadata block, which isn't even loaded.
  • It seems like Oliver would like more feedback earlier in the process. He posted about this at https://groups.google.com/g/dataverse-community/c/heNotzADbaQ/m/DJItrFjFBAAJ but in practice, developers like me don't take a serious look until the work (a PR in this case) makes it into a sprint. So maybe we could improve our process here.
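The final-period inconsistency in the tooltips could be listed mechanically. Below is a minimal sketch, assuming the tooltips live in `*.description` keys of a `key=value` properties file like the `codeMeta20.properties` snippet quoted earlier; the function name is hypothetical.

```python
def split_by_final_period(properties_text: str) -> tuple[list[str], list[str]]:
    """Group '*.description' keys by whether their value ends with a period."""
    with_period, without_period = [], []
    for line in properties_text.splitlines():
        key, separator, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Skip comments, blank lines, and keys that are not tooltips.
        if not separator or not key.endswith(".description") or not value:
            continue
        (with_period if value.endswith(".") else without_period).append(key)
    return with_period, without_period
```

If both returned lists are non-empty, the file mixes the two styles, and the shorter list is probably the one to change.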

@kcondon kcondon merged commit ee019ab into IQSS:develop Dec 21, 2022
@kcondon kcondon self-assigned this Dec 21, 2022
@pdurbin pdurbin added this to the 5.13 milestone Dec 21, 2022
@jggautier
Contributor

jggautier commented Jan 12, 2023

I was pinged a while back but thought I should reply now that I finally found the time to answer after the winter break.

> We'd like to back out the schema.xml change.

> (We'll make a PR to back out the schema.xml change for computational workflow as well, for consistency.)

> Seems like @jggautier has given his blessing, especially since it's experimental.

I'm not sure what the schema.xml change was and how that's related to this being experimental. Is that what I gave my blessing to? Is the effect of the schema.xml change that this won't be a default metadatablock in future Dataverse installations? Does that mean that experimental, as it's been used for this and the workflow metadatablock, means that it'll be included in a release but the feature won't be turned on by default in Dataverse installations?

I agree about more feedback earlier in the process (and @poikilotherm has been using many opportunities over the years to encourage feedback), and I'd like to add that I think it's important to plan, as early in the process as possible, for evaluating solutions after they've been merged, too, even more so if we're so uncertain about a solution that we label it experimental.

@pdurbin
Member

pdurbin commented Jan 13, 2023

@jggautier you probably missed the discussion but to sum up, only changes to non-experimental blocks should result in a change to schema.xml.

That is, schema.xml contains fields for all the blocks that we ship. All these blocks are enabled by default and will "just work" because schema.xml has the fields already.

I hope this helps. This whole experimental blocks concept is quite new, of course!

@jggautier
Contributor

Ah thanks. That's how I understood it. Experimental metadatablocks shouldn't be enabled in installations by default when those installations use the version of the software that includes that experimental metadatablock. Those installations will need to take extra steps to enable it.

It's just not clear to me how a metadatablock becomes not experimental.

@pdurbin
Member

pdurbin commented Jan 13, 2023

It hasn't happened yet! 😄 I hope we find out with CodeMeta!

@poikilotherm poikilotherm deleted the 7844-codemeta-schema branch January 16, 2023 12:00
@mreekie mreekie added pm.GREI-d-1.3.1 NIH, yr1, aim3, task1: Support software metadata pm.GREI-d-1.3.2 NIH, yr1, aim3, task2: R & D phase biomedical workflows support labels Mar 20, 2023
Labels
Feature: Metadata HERMES related to @hermes-hmc work on Dataverse code NIH OTA: 1.3.1 3 | 1.3.1 | Support software metadata | 5 prdOwnThis is an item synched from the product planning... pm.GREI-d-1.3.1 NIH, yr1, aim3, task1: Support software metadata pm.GREI-d-1.3.2 NIH, yr1, aim3, task2: R & D phase biomedical workflows support Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: No status
Status: Valuation
Development

Include CodeMeta schema out of the box
10 participants