Use case: Dataset size characteristics #161

Closed
VladimirAlexiev opened this issue Mar 8, 2018 · 12 comments
Labels
catalog dcat due for closing Issue that is going to be closed if there are no objection within 6 days statistics ucr

@VladimirAlexiev

Submitting a new USE CASE:


Dataset subsets and size characteristics

Status:

Identifier: ID51 (proposed)

Creator: Vladimir Alexiev, Ontotext

Deliverable(s): DCAT1.1

Tags

semantics statistics size

Stakeholders

Data consumers often need to know how many entities of each kind a dataset contains.
In an aggregation scenario, different subsets (parts of a dataset) need to be described, e.g. because they come from different data providers.

E.g. in the euBusinessGraph project we need to describe Company datasets from different providers:
which properties each one includes (e.g. ebg:isStartup, org:orgActivity),
and some partition information, e.g. "the dataset covers jurisdiction Italy" or "the dataset has 1000 Italian startups"
(i.e. rov:RegisteredOrganization with ebg:isStartup=true and jurisdiction Italy).

Problem statement

DCAT 1.0 has only one size property, dcat:byteSize, which is of little use for describing any aspect of dataset content or value.

It also has no means of expressing subsets.

Existing approaches

VoID statistics include these void: counting properties: triples, entities, classes, properties, distinctSubjects, distinctObjects, documents.

Very importantly, these can be used on subsets such as classPartition and propertyPartition, which provide a very powerful means to describe exactly what kinds of entities are present in the dataset, and how many of each.
Thus I believe that subsets are instrumental in expressing the fine-grained content of a dataset.
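
To illustrate the VoID approach described above, here is a minimal Python sketch that emits a VoID description with one class partition and one property partition. The `void_description` helper, the dataset IRI, the `http://example.org/ebg#` namespace and all counts are assumptions made up for illustration; only the void: terms come from the VoID vocabulary.

```python
VOID = "http://rdfs.org/ns/void#"


def void_description(dataset_iri, triples, class_counts, property_counts):
    """Return a Turtle string with VoID statistics for one dataset.

    class_counts maps a class IRI to its number of entities;
    property_counts maps a property IRI to its number of triples.
    """
    lines = [
        f"@prefix void: <{VOID}> .",
        "",
        f"<{dataset_iri}> a void:Dataset ;",
        f"    void:triples {triples}",
    ]
    for cls, n in class_counts.items():
        lines[-1] += " ;"  # continue the predicate list
        lines.append(
            f"    void:classPartition [ void:class <{cls}> ; void:entities {n} ]"
        )
    for prop, n in property_counts.items():
        lines[-1] += " ;"
        lines.append(
            f"    void:propertyPartition [ void:property <{prop}> ; void:triples {n} ]"
        )
    lines[-1] += " ."  # close the description
    return "\n".join(lines)


# Hypothetical IRIs and counts, for illustration only.
print(
    void_description(
        "http://example.org/dataset/companies",
        5000,
        {"http://www.w3.org/ns/regorg#RegisteredOrganization": 1000},
        {"http://example.org/ebg#isStartup": 1000},
    )
)
```

Each partition is itself a void:Dataset, which is why the same counting properties can be attached to it.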

Links

Schema issue https://github.com/schemaorg/schemaorg/issues/1855

Requirements

Ability to express the fine-grained content of a dataset:

  • Ability to express subsets of a dataset.
  • Describe subsets by kind of entity (e.g. Companies vs Events) and/or entity characteristics (e.g. Italian companies, Startups, Startups in Italy)
  • The kinds and characteristics should be expressed by URLs
  • Express the count of entities in a dataset or subset
  • Optionally, express other dataset size characteristics. E.g. in an RDF context, the number of triples and nodes
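
As a rough, format-neutral illustration of the requirements above, the following Python sketch models a dataset whose subsets are characterized by URLs and carry an entity count. The `Subset` and `Dataset` classes, the `count_of` method, the `EBG` and `EX` namespaces and the jurisdiction property are all hypothetical; this is one possible shape under those assumptions, not a proposed DCAT model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ROV = "http://www.w3.org/ns/regorg#"
# Placeholder namespaces: the real ebg: namespace and the jurisdiction
# property are not given in this issue.
EBG = "http://example.org/ebg#"
EX = "http://example.org/"


@dataclass
class Subset:
    """A part of a dataset, characterized by URLs rather than free text."""
    entity_class: str  # URL of the kind of entity
    characteristics: Dict[str, str] = field(default_factory=dict)  # property URL -> value
    entity_count: Optional[int] = None


@dataclass
class Dataset:
    iri: str
    entity_count: Optional[int] = None
    subsets: List[Subset] = field(default_factory=list)

    def count_of(self, entity_class, characteristics=None):
        """Return the declared count of the subset with exactly this characterization."""
        wanted = characteristics or {}
        for subset in self.subsets:
            if subset.entity_class == entity_class and subset.characteristics == wanted:
                return subset.entity_count
        return None  # no such subset was declared


# "The dataset has 1000 Italian startups"
italian_startups = Subset(
    entity_class=ROV + "RegisteredOrganization",
    characteristics={EBG + "isStartup": "true", EX + "jurisdiction": EX + "country/IT"},
    entity_count=1000,
)
companies = Dataset(iri=EX + "dataset/companies", subsets=[italian_startups])
```

Because nothing here depends on RDF triples, the same shape could in principle describe non-RDF datasets as well.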

Notes:

  • It's pretty clear how to do this for RDF datasets (see VoID). The real challenge is how to do it for other datasets.
  • I think it's mandatory to express subset characterization with URLs and not text.
  • I think there is also a need to express the assertions used for characterization, e.g. ebg:isStartup=true

Related use cases

ID33, ID7, RDSAT, RSS.

This one could be merged into ID33 to provide further details.

@dr-shorthair
Contributor

This looks like a clear gap in DCAT capabilities.
It applies primarily to Dataset, but also to Catalog.

Could the UCR team extract requirements from this?

@andrea-perego
Contributor

andrea-perego commented Mar 9, 2018

@VladimirAlexiev
Author

So both this one (ID51) and 5.44 (ID44) talk of subsets but also about other topics.
I feel that subsets are such a crucial topic, they should be split into their own requirement.

@jpullmann

The proposal seems to merge two concerns: 1) expressing the structure of datasets, as modeled in DATS (hasPart), and 2) indicating the size of the composite or of the individual parts. My suggestion is to rephrase this UC to cover only 2), and to create a new UC for 1) and link it here.

@VladimirAlexiev
Author

Hi @jpullmann! I've reread this proposal and I agree with you. As I wrote, "subsets are such a crucial topic, they should be split into their own requirement". Requirement 5.44 (ID44) also talks of subsets, but for another reason.

Someone needs to take subset characterization parts from this one (ID51), 5.44 (ID44), and your suggestion (DATS hasPart) and consolidate them. I could try it but I'm not sure I can capture other people's suggestions adequately. Are there designated editors in this WG?

@dr-shorthair
Contributor

The editors' names are on each draft: https://www.w3.org/2017/dxwg/wiki/Main_Page#Deliverables
For the UCR document they are Ixchel Faniel, Jaroslav Pullmann, Rob Atkinson.

aisaac removed the dataset label on May 29, 2018
@dr-shorthair
Contributor

dr-shorthair commented May 30, 2018

The relationship of a dataset to its subsets is part of #81.
This issue should focus on better description of dataset size, perhaps including extent along different dimensions.

I've renamed it to reflect that focus.

dr-shorthair changed the title from "Dataset subsets and size characteristics" to "Dataset size characteristics" on May 30, 2018
@agbeltran
Member

agbeltran commented Aug 14, 2018

As per the last comment, assuming that related datasets (sub-datasets) are covered by requirement #81, should this UC be added to the UCR document with a focus on dataset size characteristics? The main requirements would then be (copied from above):

  • Express the count of entities in a dataset or subset
  • Optionally, express other dataset size characteristics. E.g. in an RDF context, the number of triples and nodes

Ping to @fanieli @jpullmann @rob-metalinkage

@davebrowning
Contributor

Also related to existing issue/requirement #84 and some of the discussion of #313.

andrea-perego changed the title from "Dataset size characteristics" to "Use case: Dataset size characteristics" on Sep 26, 2019
@dr-shorthair
Contributor

@VladimirAlexiev are you interested in reviving this?

andrea-perego added this to the DCAT3 2PWD milestone on Mar 13, 2021
andrea-perego added the "due for closing" label (Issue that is going to be closed if there are no objection within 6 days) on Mar 13, 2021
@andrea-perego
Contributor

Unless there are any objections, I propose we close this issue.

@andrea-perego
Contributor

Noting no objections, I'm closing this issue.

9 participants