Use case: Dataset size characteristics #161

VladimirAlexiev · 2018-03-08T09:21:43Z

Submitting a new USE CASE:

Dataset subsets and size characteristics

Status:

Identifier: ID51 (proposed)

Creator: Vladimir Alexiev, Ontotext

Deliverable(s): DCAT1.1

Stakeholders

Data consumers often need to know how many entities of each kind a dataset contains.
In an aggregation scenario, different subsets (parts of a dataset) need to be expressed, e.g. because they come from different data providers.

For example, in the euBusinessGraph project we need to describe Company datasets from different providers,
which properties each one includes (e.g. ebg:isStartup, org:orgActivity),
and some partition info, e.g. "the dataset covers jurisdiction Italy" or "the dataset has 1000 Italian startups"
(i.e. rov:RegisteredOrganization with ebg:isStartup=true and jurisdiction Italy).

Problem statement

DCAT 1.0 has only one size property, dcat:byteSize, which is pretty useless for describing any aspect of dataset content or value.

It also has no means of expressing subsets.

Existing approaches

VoID statistics include these counting properties in the void: namespace: triples, entities, classes, properties, distinctSubjects, distinctObjects, documents.

Very importantly, these can be used on subsets such as void:classPartition and void:propertyPartition, which provide a very powerful means to describe exactly what kinds of entities are present in the dataset, and how many.
Thus I believe that subsets are instrumental in expressing the fine-grained content of a dataset.
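To make the partition idea concrete, here is a minimal Python sketch that computes counts analogous to some of the VoID statistics named above. It is an illustration only, not a VoID implementation: the sample triples, the `ex:` names and the function `void_style_stats` are assumptions made up for this example, and prefixed names stand in for full URLs.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical sample data: (subject, predicate, object) tuples with prefixed names.
TRIPLES = [
    ("ex:c1", RDF_TYPE, "rov:RegisteredOrganization"),
    ("ex:c1", "ebg:isStartup", "true"),
    ("ex:c2", RDF_TYPE, "rov:RegisteredOrganization"),
    ("ex:c2", "org:orgActivity", "ex:nace6201"),
    ("ex:e1", RDF_TYPE, "ex:Event"),
]


def void_style_stats(triples):
    """Compute counts analogous to some VoID statistics (hypothetical helper)."""
    triples = set(triples)  # an RDF graph is a set, so duplicate triples count once
    # Per-class counts: how many subjects are typed with each class.
    class_partition = Counter(o for _, p, o in triples if p == RDF_TYPE)
    # Per-property counts: how many triples use each predicate.
    property_partition = Counter(p for _, p, _ in triples)
    return {
        "triples": len(triples),
        "distinctSubjects": len({s for s, _, _ in triples}),
        "distinctObjects": len({o for _, _, o in triples}),
        "classes": len(class_partition),
        "properties": len(property_partition),
        "classPartition": dict(class_partition),
        "propertyPartition": dict(property_partition),
    }


stats = void_style_stats(TRIPLES)
```

With the sample data, `stats["classPartition"]` reports two registered organizations and one event, which is the kind of "how many of what" answer a data consumer is after.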

Links

Schema issue https://github.com/schemaorg/schemaorg/issues/1855

Requirements

Ability to express the fine-grained content of a dataset:

Ability to express subsets of a dataset.
Describe subsets by kind of entity (e.g. Companies vs Events) and/or entity characteristics (e.g. Italian companies, Startups, Startups in Italy)
The kinds and characteristics should be expressed by URLs
Express the count of entities in a dataset or subset
Optionally, express other dataset size characteristics, e.g. in an RDF context, the number of triples and nodes
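The requirements above could be captured by a structure along these lines. This is a rough sketch under stated assumptions, not a proposed DCAT model: the class `SubsetDescription`, its field names, and the `ex:jurisdiction` / `ex:Italy` identifiers are hypothetical, and prefixed names stand in for URLs. The count of 1000 is the figure from the Italian startups example in this use case; the parent count is deliberately left unknown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubsetDescription:
    """Hypothetical descriptor for a dataset or one of its subsets."""
    label: str
    entity_kind: Optional[str] = None  # URL of the kind of entity
    # Characteristics as property URL -> value (URL or literal).
    characteristics: Dict[str, str] = field(default_factory=dict)
    entity_count: Optional[int] = None  # None means "not stated"
    subsets: List["SubsetDescription"] = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, description) for this node and all nested subsets."""
        yield depth, self
        for child in self.subsets:
            yield from child.walk(depth + 1)


companies = SubsetDescription(
    label="Companies",
    entity_kind="rov:RegisteredOrganization",
    subsets=[
        SubsetDescription(
            label="Italian startups",
            entity_kind="rov:RegisteredOrganization",
            # ex:jurisdiction and ex:Italy are placeholders, not real terms.
            characteristics={"ebg:isStartup": "true", "ex:jurisdiction": "ex:Italy"},
            entity_count=1000,
        )
    ],
)

summary = [(depth, d.label, d.entity_count) for depth, d in companies.walk()]
```

Keeping kind and characteristics as URL-valued fields rather than free text is what would let a consumer match subsets across providers.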

Notes:

It's pretty clear how to do this for RDF datasets (see VoID). The real challenge is how to do it for other datasets.
I think it's mandatory to express subset characterization with URLs and not text.
I think there's also a need to express the assertions used for characterization, e.g. ebg:isStartup=true.
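For a non-RDF dataset, one possible approach is to map columns to property URLs and count the rows that satisfy a set of assertions. The sketch below assumes a tabular (CSV) dataset; the column names, the `COLUMN_TO_PROPERTY` mapping, the `ex:jurisdiction` property, the sample rows and the helper `count_matching` are all hypothetical. For brevity the values are plain codes such as `IT`, where a real description would use URLs.

```python
import csv
import io

# Hypothetical mapping from CSV column names to property URLs (prefixed names).
COLUMN_TO_PROPERTY = {
    "is_startup": "ebg:isStartup",
    "jurisdiction": "ex:jurisdiction",
}

# Made-up sample rows, for illustration only.
SAMPLE_CSV = """name,is_startup,jurisdiction
Alfa,true,IT
Beta,false,IT
Gamma,true,BG
"""


def count_matching(csv_text, assertions):
    """Count rows that satisfy every (property URL -> value) assertion."""
    property_to_column = {prop: col for col, prop in COLUMN_TO_PROPERTY.items()}
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        all(row[property_to_column[prop]] == value for prop, value in assertions.items())
        for row in reader
    )


italian_startups = count_matching(
    SAMPLE_CSV, {"ebg:isStartup": "true", "ex:jurisdiction": "IT"}
)
all_startups = count_matching(SAMPLE_CSV, {"ebg:isStartup": "true"})
```

The same assertion dictionary could serve both as the subset's description and as the filter that produces its entity count.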

Related use cases

ID33, ID7, RDSAT, RSS.

This one could be merged into ID33 to provide further details.


dr-shorthair · 2018-03-09T02:27:10Z

This looks like a clear gap in DCAT capabilities.
Primarily applies to Dataset, but also to Catalog

Could UCR team extract Requirements from this?

andrea-perego · 2018-03-09T10:19:40Z

@dr-shorthair , @VladimirAlexiev ,

I think this use case relates very much to the following reqs:

and also to the following UC:

5.44 Identification of versioned datasets and subsets

VladimirAlexiev · 2018-03-10T21:48:29Z

So both this one (ID51) and 5.44 (ID44) talk of subsets but also about other topics.
I feel that subsets are such a crucial topic, they should be split into their own requirement.

jpullmann · 2018-03-16T15:46:34Z

The proposal seems to merge two concerns: 1) expressing the structure of datasets, as modeled in DATS (hasPart), and 2) indicating the size of the composite or of the individual parts. My suggestion is to rephrase this UC to cover only 2), and to create a new UC for 1) and link it here.

VladimirAlexiev · 2018-04-25T07:42:45Z

hi @jpullmann! I've reread this proposal and I agree with you. As I wrote "subsets are such a crucial topic, they should be split into their own requirement". Requirement 5.44 (ID44) also talks of subsets, but for another reason.

Someone needs to take subset characterization parts from this one (ID51), 5.44 (ID44), and your suggestion (DATS hasPart) and consolidate them. I could try it but I'm not sure I can capture other people's suggestions adequately. Are there designated editors in this WG?

dr-shorthair · 2018-04-25T08:31:41Z

The editors' names are listed on each draft: https://www.w3.org/2017/dxwg/wiki/Main_Page#Deliverables
For UCR document it is Ixchel Faniel, Jaroslav Pullmann, Rob Atkinson.

dr-shorthair · 2018-05-30T08:47:39Z

Relationship of a dataset to subsets is part of #81
This issue should focus on better description of dataset size, perhaps including extent on different dimensions.

I've renamed it to reflect that focus.

agbeltran · 2018-08-14T18:47:20Z

As per the last comment, assuming that related datasets (sub-datasets) are covered by requirement #81, should this UC be added to the UCR document with a focus on dataset size characteristics? The main requirements would then be (copied from above):

Express the count of entities in a dataset or subset

Optionally, express other dataset size characteristics. E.g. in RDF context, that's number of triples and nodes

Ping to @fanieli @jpullmann @rob-metalinkage

davebrowning · 2019-09-25T06:45:45Z

Also related to existing issue/requirement #84 and some of the discussion of #313.

dr-shorthair · 2021-02-24T01:19:46Z

@VladimirAlexiev are you interested in reviving this?

andrea-perego · 2021-03-20T09:26:35Z

Unless there are any objections, I propose we close this issue.

andrea-perego · 2021-03-27T19:36:37Z

Noting no objections, I'm closing this issue.

andrea-perego added the ucr label Mar 8, 2018

dr-shorthair assigned jpullmann, rob-metalinkage and fanieli Mar 9, 2018

dr-shorthair added dataset catalog labels Mar 9, 2018

VladimirAlexiev mentioned this issue Mar 12, 2018

Related vocabularies mapping [RVM] #88

Closed

dr-shorthair added the dcat label Mar 13, 2018

aisaac removed the dataset label May 29, 2018

dr-shorthair changed the title ~~Dataset subsets and size characteristics~~ Dataset size characteristics May 30, 2018

dr-shorthair added the statistics label Feb 6, 2019

andrea-perego changed the title ~~Dataset size characteristics~~ Use case: Dataset size characteristics Sep 26, 2019

andrea-perego added this to the DCAT3 2PWD milestone Mar 13, 2021

andrea-perego added the "due for closing" label (issue that is going to be closed if there are no objections within 6 days) Mar 13, 2021

andrea-perego closed this as completed Mar 27, 2021
