resourceType for dataset files #5086

Open
philippconzett opened this issue Sep 24, 2018 · 30 comments
Labels
Feature: Metadata Type: Feature a feature request User Role: Curator Curates and reviews datasets, manages permissions

Comments

@philippconzett
Contributor

File DOIs from Dataverse are marked with "Dataset" in DataCite Fabrica, thus in the same way as dataset DOIs are; see this screenshot:

image

According to @pdurbin (cf. this post in the Dataverse Google Group),

"Dataset" is coming from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml#L12 which is referenced from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L279 . As you can see, it's hard coded to "Dataset". You're saying that for files it should be something other than "Dataset", right? "File" or whatever. If so, can you please open a GitHub issue about this? We recently worked on this part of the code at #4795 for #4782 if you'd like to take a look.

I suggest that the metadata of files in Dataverse be changed, so that their DOIs show up not as "Dataset", but as "Dataset file" in DataCite Fabrica. I'm not sure which metadata field we should use for this. The DataCite metadata property resourceTypeGeneral (a sub-property of resourceType) is mandatory, and I guess it is the value of this property that is reflected in DataCite Fabrica. But according to the DataCite Metadata Schema 4.0, resourceTypeGeneral can only contain the following controlled list values:

Audiovisual
Collection
Dataset
Event
Image
InteractiveResource
Model
PhysicalObject
Service
Software
Sound
Text
Workflow
Other

The list does not contain "Dataset file" or similar. So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType.
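
To make the Dataset/File combination concrete, here is a minimal Python sketch of the `resourceType` element as it might appear in DataCite XML. The helper name `resource_type_element` is hypothetical, and the sketch leaves out the XML namespace and all other properties; it only shows where the two values would go.

```python
import xml.etree.ElementTree as ET


def resource_type_element(general: str, specific: str) -> ET.Element:
    """Build a bare <resourceType> element (hypothetical helper, no namespace).

    `general` goes into the controlled resourceTypeGeneral attribute,
    `specific` becomes the free-text resourceType value.
    """
    element = ET.Element("resourceType", {"resourceTypeGeneral": general})
    element.text = specific
    return element


# Dataset = resourceTypeGeneral, File = resourceType
xml_text = ET.tostring(resource_type_element("Dataset", "File"), encoding="unicode")
print(xml_text)
# <resourceType resourceTypeGeneral="Dataset">File</resourceType>
```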

@philippconzett
Contributor Author

I'm not sure whether I understand your question, @jggautier. But DataCite now displays all our files as datasets in the search engine; cf. . This search results in 1 041 datasets, but we only have 178 datasets. So the rest are files.

@jggautier
Contributor

Hi @philippconzett. Did you mean this question?: "Are dataset and file metadata records already sent to EZID/DataCite being updated?"

I referenced this github issue in that issue (#5060), which is about investigating if EZID and DataCite are getting any new metadata that Dataverse sends (as Dataverse changes things like the resourceType values for files) and making sure that the existing metadata records that EZID and DataCite have are updated to reflect those changes. Please let me know if you have any questions.

But DataCite now displays all our files as datasets in the search engine

In the Google Group conversation I thought we were discussing only how the datasets and files were displayed in Fabrica. But here do you mean the list of resource types in DataCite Search?

screen shot 2018-10-05 at 10 40 41 am

@philippconzett
Contributor Author

Hi @jggautier, sorry for the confusion, but I think the display behavior in DataCite Fabrica and in DataCite Search are both based on the Resource type. But I'm not sure whether there is a Resource type = File (or Dataset File) in DataCite. I guess other data repository applications also are interested in getting their file DOIs viewed as files and not as datasets in both DataCite Fabrica and in DataCite Search.

@jggautier
Contributor

I agree that in DataCite Search, the resource type is based on the controlled vocab you listed, and there's nothing like file. I like your earlier suggestion:

So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType

As long as we don't get too semantic with the word "file," since I imagine some people might ask "what about archived files, like zip files, or things in datasets that are collections of files?" Would you say the value is in being able to, in Dataset Search and Fabrica, distinguish between and filter for datasets versus the things within datasets that have bytes?

We'll have to get DataCite involved, and their metadata team has been responsive during similar conversations about resourceType in their DataCite Metadata forum.

Would you mind writing them about this use case?

@philippconzett
Contributor Author

Thanks, @jggautier, I have raised this issue in the DataCite Metadata forum; see this posting.

@mfenner

mfenner commented Oct 7, 2018

I suggest distinguishing between what can be done with the DataCite Metadata Schema now, and how the metadata schema could be updated in the future (the next schema release, planned for the end of 2018, is basically finalized, so that would be the second half of 2019 at the earliest).

With the current schema resourceTypeGeneral Dataset is the best fit, and you can add granularity via resourceType (which is a free text field). I like DataFile, but would also consider DataDownload, which is used in DCAT and schema.org: https://schema.org/DataDownload.

@pdurbin
Member

pdurbin commented Oct 9, 2018

@mfenner thanks for mentioning DataDownload, which seems like an emerging standard for providing the URLs to download individual files. Last week I wrote about it at whole-tale/whole-tale#35 (comment) in the context of #4371.

@philippconzett
Contributor Author

I just noted that this issue is still discussed also by other users; cf. this thread in the Dataverse Google group.

@philippconzett
Contributor Author

I'd like to urge DataCite (@mfenner) to follow up on this issue. The current situation is quite unsatisfactory as file metadata is confused with dataset metadata, resulting in i.a. a proliferation of file metadata records listed in DataCite Search result lists and ORCID record search result lists.

Currently, DataCite (in DataCite Fabrica) offers the following values for Resource Type General:

image

For files within a dataset, I suggest we use Dataset file or Dataset part or Part of Dataset.

Thanks!

@philippconzett
Contributor Author

See also the discussion thread Granularity of datasets in the PID Forum.

@mfenner

mfenner commented Jul 10, 2020

@philippconzett you beat me to it, I was just about to post the link.

@jggautier
Contributor

jggautier commented Feb 11, 2022

I'm helping look into an issue with how the metadata that Dataverse sends to DataCite affects how datasets and files are displayed in an Elsevier product called Data Monitor (https://www.elsevier.com/solutions/data-monitor). Data Monitor apparently grabs from DataCite the metadata of Harvard Dataverse Repository datasets and files (for files that were assigned PIDs before the feature was turned off). And apparently Data Monitor uses some sort of algorithm to figure out which files are parts of which datasets so that it's possible in their product to display only datasets.

I'm planning on contacting Elsevier to find out more and all of this reminded me of this issue. Might be helpful to learn what Elsevier is doing with the DataCite metadata it gets.

@philippconzett
Contributor Author

I guess the answer might be as simple as the file DOIs in Dataverse having the structure

dataset DOI + file suffix

Example from DataverseNO:
Dataset DOI: https://doi.org/10.18710/QBSWEH
File DOI: https://doi.org/10.18710/QBSWEH/FJR0YN
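
A rough Python sketch of that heuristic, assuming the file DOI is always the dataset DOI plus exactly one extra path segment. The function name `parent_dataset_doi` is made up for illustration. DOI suffixes are allowed to contain slashes, so a dataset DOI that itself has three segments would be misread as a file DOI; this is a guess at what an aggregator could do, not a reliable rule.

```python
from typing import Optional

DOI_RESOLVER = "https://doi.org/"


def parent_dataset_doi(doi_url: str) -> Optional[str]:
    """Return the presumed dataset DOI URL for a dependent file DOI, else None.

    Assumes the structure prefix/dataset_suffix/file_suffix (hypothetical rule).
    """
    if not doi_url.startswith(DOI_RESOLVER):
        return None
    parts = doi_url[len(DOI_RESOLVER):].split("/")
    if len(parts) != 3:
        # Two segments looks like a dataset DOI; anything else is unknown here.
        return None
    prefix, dataset_suffix, _file_suffix = parts
    return f"{DOI_RESOLVER}{prefix}/{dataset_suffix}"


print(parent_dataset_doi("https://doi.org/10.18710/QBSWEH/FJR0YN"))
print(parent_dataset_doi("https://doi.org/10.18710/QBSWEH"))
```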

@jggautier
Contributor

Hmm, that might be a factor, too.

Folks at Elsevier confirmed today that, in the metadata they get from DataCite, the "HasPart" relationType in dataset metadata and the "IsPartOf" relationType in file metadata are used to figure out which files are part of which datasets. Doesn't sound very foolproof to me, since a dataset could be a part of another dataset. But maybe publishers aren't sending that kind of relationship metadata to DataCite and maybe, despite DataCite's reservations, most publishers are registering DOIs with the kind of structure from your examples.
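
To illustrate why relying on relationType alone isn't foolproof, here is a small Python sketch. It is not Elsevier's actual algorithm, which we haven't seen. The record layout, the function `looks_like_file` and the `10.5555` DOIs are made up for illustration, and the relations attached to the DataverseNO DOIs are assumed rather than taken from real records.

```python
from typing import Any, Dict, List


def looks_like_file(record: Dict[str, Any]) -> bool:
    """Guess that a record is a file: it IsPartOf something and HasPart nothing."""
    relation_types = {
        rel["relationType"] for rel in record.get("relatedIdentifiers", [])
    }
    return "IsPartOf" in relation_types and "HasPart" not in relation_types


records: List[Dict[str, Any]] = [
    {   # dataset that points at its file
        "doi": "10.18710/QBSWEH",
        "relatedIdentifiers": [
            {"relationType": "HasPart", "relatedIdentifier": "10.18710/QBSWEH/FJR0YN"}
        ],
    },
    {   # file that points back at its dataset
        "doi": "10.18710/QBSWEH/FJR0YN",
        "relatedIdentifiers": [
            {"relationType": "IsPartOf", "relatedIdentifier": "10.18710/QBSWEH"}
        ],
    },
    {   # hypothetical dataset that is part of a larger dataset
        "doi": "10.5555/SUBSET",
        "relatedIdentifiers": [
            {"relationType": "IsPartOf", "relatedIdentifier": "10.5555/PARENT"}
        ],
    },
]

guessed_files = [r["doi"] for r in records if looks_like_file(r)]
print(guessed_files)
```

The third record is a dataset, yet it lands in `guessed_files` next to the real file, which is the ambiguity described above.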

@j-n-c
Contributor

j-n-c commented Sep 21, 2022

This issue came up in a recent discussion in the community group.

In the (current) latest version of Dataverse Software (5.11.1), resourceTypeGeneral is still hardcoded to Dataset: https://github.com/IQSS/dataverse/blob/develop/src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml

It would be great if the priority of this issue could be raised, so that interoperability between the Dataverse software and other platforms can be improved.

@qqmyers
Member

qqmyers commented Sep 21, 2022

Attempting to summarize this issue - there are ~3 proposals for what is needed to have files recognized:

On the community call there was discussion of checking on the proposal for the next DataCite schema to see if something is included w.r.t. a different resourceType for files. If someone checks that, we could either provide feedback on the proposal and/or plan to change the file resourceType when that option is available.

Are there other things proposed here that could/should be acted on?

@pdurbin
Member

pdurbin commented Sep 21, 2022

From a quick look at the 4.5 draft at https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/appendices/appendix_1/resourceTypeGeneral.html via the RFC at https://docs.google.com/document/d/1UyQQwtjnu-4_4zXE4TFZ74-mjLZI3NkEf8RrF0WeOdI/edit?usp=sharing we still don't have a good way to distinguish between a dataset and a datafile. Here's the list:

  • Audiovisual
  • Book
  • BookChapter
  • Collection
  • ComputationalNotebook
  • ConferencePaper
  • ConferenceProceeding
  • DataPaper
  • Dataset
  • Dissertation
  • Event
  • Image
  • Instrument
  • InteractiveResource
  • Journal
  • JournalArticle
  • Model
  • OutputManagementPlan
  • PeerReview
  • PhysicalObject
  • Preprint
  • Report
  • Service
  • Software
  • Sound
  • Standard
  • StudyRegistration
  • Text
  • Workflow
  • Other

Dataset is described like this...

... and it seems in line with what we call a dataset in Dataverse. (Here's a link to the example: https://doi.org/10.1594/PANGAEA.804876 .) It's metadata, and from that metadata you can figure out how to download the actual data files.

If something like DataFile or just Data appeared in the list above, I'd probably say we should use it for files in Dataverse. But there's nothing there so we're sort of stuck unless we get something like DataFile or Data added to the DataCite schema.

Another thought... would it help to use multiple resourceTypeGeneral types for files? That is, send a different resourceTypeGeneral to DataCite based on the file type. Based on the most popular file types in Harvard Dataverse, here's a proposed mapping:

  • Image -> Image
  • Data -> ???
  • Text -> Text
  • Unknown -> Other
  • Document -> Text
  • Tabular Data -> ???
  • Archive -> ???
  • Code -> Software
  • FITS -> Other
  • Audio -> Audiovisual
  • Shape -> ???
  • Video -> Audiovisual
  • Other -> Other
  • Network Data -> Other
  • Model -> Model
  • Chemical -> Other
  • Binary -> Other
  • Biosequence -> Other
  • Test -> Other
  • Message -> Other

Obviously, this falls down for Data and Tabular Data. I simply put ??? above for those. Here's a screenshot to make this a bit more concrete:
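
The proposed mapping could be sketched in Python roughly as follows. The `???` entries are kept as `None`, and falling back to "Dataset" for them (i.e. today's hard-coded value) is only an assumption for the sake of the example, not something decided in this issue. The names `FILE_TYPE_TO_RESOURCE_TYPE_GENERAL` and `resource_type_general_for` are hypothetical.

```python
from typing import Dict, Optional

# Dataverse file type -> DataCite resourceTypeGeneral; None marks the "???" cases.
FILE_TYPE_TO_RESOURCE_TYPE_GENERAL: Dict[str, Optional[str]] = {
    "Image": "Image",
    "Data": None,
    "Text": "Text",
    "Unknown": "Other",
    "Document": "Text",
    "Tabular Data": None,
    "Archive": None,
    "Code": "Software",
    "FITS": "Other",
    "Audio": "Audiovisual",
    "Shape": None,
    "Video": "Audiovisual",
    "Other": "Other",
    "Network Data": "Other",
    "Model": "Model",
    "Chemical": "Other",
    "Binary": "Other",
    "Biosequence": "Other",
    "Test": "Other",
    "Message": "Other",
}


def resource_type_general_for(file_type: str, fallback: str = "Dataset") -> str:
    """Look up the mapped value; use `fallback` for unmapped or unknown types.

    The "Dataset" fallback is an assumption that mirrors the current behavior.
    """
    mapped = FILE_TYPE_TO_RESOURCE_TYPE_GENERAL.get(file_type)
    return mapped if mapped is not None else fallback


print(resource_type_general_for("Code"))          # Software
print(resource_type_general_for("Tabular Data"))  # Dataset (fallback)
```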

Screen Shot 2022-09-21 at 10 52 40 AM

@mfenner

mfenner commented Sep 21, 2022

The discussion Dataset/Datafile is an older one, going back to for example how schema.org and DCAT handle this. I think it is worth discussing again for the 4.5 schema.

@qqmyers
Member

qqmyers commented Sep 22, 2022

@mfenner - what's the best way for us to do this as a community? I think there are multiple people and groups interested in this. I see the online v4.5 material has comment forms. Should we just use those?

@philippconzett
Contributor Author

Glad you are revitalizing this discussion. The issue was recently discussed on a Dataverse community call (see notes) and I also pitched it at the DataCite member meeting earlier this week.

@mfenner

mfenner commented Sep 23, 2022

I would provide feedback via https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/index.html, there is a comments box at the bottom. I am also interested in this topic via my involvement in the InvenioRDM project.

@philippconzett
Contributor Author

Thanks, @mfenner. Is there a deadline for feedback to be considered for v. 4.5?
@qqmyers Maybe we could have a Dataverse Metadata IG call about this?

@mfenner

mfenner commented Sep 26, 2022

I don't know the timeline of the 4.5 release, as I am no longer involved.

@philippconzett
Contributor Author

Sorry, @mfenner, I keep forgetting you no longer are at DataCite :-/
I just saw that the Google doc which the GitHub page links to will be open for comment through October 17, 2022.

@mfenner

mfenner commented Sep 26, 2022

No problem, I still care about the DataCite metadata schema, now mainly in the context of my work on InvenioRDM.

@pdurbin
Member

pdurbin commented Sep 30, 2022

I was just on a call with @mjbuys and I wanted to ask him about his thoughts on resourceTypeGeneral for files. 😄

Again, my take is adding a very generic type such as "Data" or "DataFile" would help.

(Alongside "Audiovisual", "Book", etc. from the list at https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/appendices/appendix_1/resourceTypeGeneral.html .)

@mjbuys

mjbuys commented Oct 3, 2022

Thanks @pdurbin. Unfortunately as this is a proposed change to the schema (rather than comments on the draft 4.5 schema), this would need to be considered for future releases. For context, all changes to the schema go through community validation and extensive discussion in the metadata sub-groups. It would be great if you can submit an idea through our roadmap (https://datacite.org/roadmap.html), see metadata changes at the bottom of each tab.

I am tagging @KellyStathis who leads our work with the metadata working group. It may be that we can address this use case through use of both resourceTypeGeneral and resourceType properties to describe the different entities; and use the relationType field to describe the relationship between these entities (the dataset and the datafile). @KellyStathis what are your thoughts? (@pdurbin Kelly is out through next Monday so you will likely only get a response then. Let me know if this is more pressing).

@pdurbin
Member

pdurbin commented Oct 6, 2022

@mjbuys thanks for clarifying about the Oct 17 deadline, that it's to leave comments rather than make proposed changes (late in the game! 😄 ).

A bunch of us just met about metadata and what to send to DataCite in the future. Notes are here: https://docs.google.com/document/d/1tNnvVh8jYY1g53BEwpJmMmm9w6Vgy_Q7RrmFjGnYOyA/edit?usp=sharing

In summary, we're pretty sure we'd like to use our OpenAIRE export as a basis for making improvements to what we send to DataCite.

That doesn't really address this issue (#5086) about files, so I'm getting a little off-topic. 😄 Some day.

Anyway, yes, we'd love to chat more with you and @KellyStathis some day. No, it isn't pressing. 🏖️ Thanks! We'll be in touch!

@KellyStathis

My initial 2 cents:

  • With the current schema (4.x) I would suggest resourceTypeGeneral=Dataset with a specific resourceType of "DataFile" (or similar), and linkage through RelatedIdentifiers with HasPart/IsPartOf relationTypes.
  • I would not rely exclusively on the "dependent file DOIs" to encode a whole/part relationship between datasets and files.
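
As a rough illustration of the first bullet, here is a Python sketch that builds the relevant part of a file-level record: resourceTypeGeneral "Dataset", resourceType "DataFile", and an IsPartOf link to the dataset DOI. The function name `file_metadata_fragment` is hypothetical, and the fragment omits the XML namespace and every other mandatory property, so it is not a complete or valid DataCite record.

```python
import xml.etree.ElementTree as ET


def file_metadata_fragment(dataset_doi: str) -> ET.Element:
    """Build a partial, namespace-free <resource> for a file (illustration only)."""
    resource = ET.Element("resource")

    # Controlled value stays "Dataset"; the free-text value carries "DataFile".
    resource_type = ET.SubElement(
        resource, "resourceType", {"resourceTypeGeneral": "Dataset"}
    )
    resource_type.text = "DataFile"

    # Explicit whole/part link instead of relying on the DOI structure.
    related = ET.SubElement(resource, "relatedIdentifiers")
    part_of = ET.SubElement(
        related,
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": "IsPartOf"},
    )
    part_of.text = dataset_doi
    return resource


fragment = file_metadata_fragment("10.18710/QBSWEH")
print(ET.tostring(fragment, encoding="unicode"))
```

The dataset-level record would carry the mirror image: a relatedIdentifier with relationType "HasPart" pointing at each file DOI.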

Going forward, I see the benefit of having a more structured way to distinguish files (like a specific ResourceTypeGeneral), among other reasons because it is important for aggregators to be able to filter these out. As @jggautier mentioned above, HasPart/IsPartOf can also be used for a dataset that is part of another dataset, so it isn't foolproof. I've saved a link to this discussion in our internal system for tracking schema suggestions, so we can take this suggestion into account for version 5.0. Additional thoughts via our Roadmap are also welcome!

It is also worth considering how this would intersect with the proposed Distribution property in 4.5. At the dataset DOI level, there could be some redundancy between the RelatedIdentifier property (HasPart) and the Distribution property's contents—both of which may include references to file-level DOIs. Discussion about this proposed Distribution property is in datacite/schema-docs#7 and in the RFC Google Doc: DataCite Metadata Schema 4.5: Request for Comments.

I'm also curious if anyone knows of other repository platforms registering DOIs for files, in addition to Dataverse? That would also be helpful for us in understanding the use case. (It sounds like InvenioRDM is interested in this, @mfenner?)

@mfenner

mfenner commented Oct 15, 2022

Thanks @KellyStathis, distribution is basically about the same idea (e.g. dataset and distribution in DCAT), I missed that in my initial comment. The InvenioRDM community is currently mainly focused on launching repositories in production, a DataCite metadata schema change is probably more interesting a bit later, e.g. 2024.
