Summary statistics [RSS] #84

Closed
jpullmann opened this issue Jan 18, 2018 · 39 comments

Comments

@jpullmann

Summary statistics [RSS]

Express summary statistics and descriptive metrics to characterize a Dataset.


Related use cases: Summarization/Characterization of datasets [ID33] 
@makxdekkers
Contributor

DQV should be able to meet these requirements.

@dr-shorthair
Contributor

Do we also need statistics on distributions? This requirement is suggested by the comment submitted by Daniel Pop [1]. Of course it also depends on how we resolve the matter of 'information equivalence' of different distributions.

[1] https://lists.w3.org/Archives/Public/public-dxwg-comments/2019Jan/0013.html

@andrea-perego
Contributor

I think we shouldn't prevent this - as done for other information. The question is how to put this option in the spec.

I'm a bit reluctant to explicitly add properties in class definitions where we don't have real-world use cases and/or implementation evidence. So, this could be included in the "guidance" part of the spec.

About "information equivalence", (again) -1 to it. This ends up being a matter of the "granularity" of the notion of dataset, which is mainly a data provider choice (possibly also based on the requirements of the intended users).

@makxdekkers
Contributor

@dr-shorthair I am not quite sure how you derive a requirement for statistics on datasets? If there is a need for it, maybe we could refer to DQV or Data Cube?
In my mind, Daniel's point (1) could be resolved by modelling the real-time data stream as a dcat:DataService and modelling the CSVs as separate datasets.

@dr-shorthair
Contributor

The statistic that Daniel mentioned is the frequency or spacing of members in a time series, where various distributions might have fixed spacing that is different (usually coarser) than what is available from the underlying dataset. I was on the point of creating an explicit issue for this aspect alone, but since this is an aspect of dataset statistics I thought it would be best to open the discussion here first.

@dr-shorthair
Contributor

@makxdekkers I did not derive a new requirement for dataset statistics - this was one of the original requirements taken from UCR.

However, I do wonder if time-series are such a common case that they might deserve special treatment, i.e. complementing dct:temporal (coverage) with one more number: the item-accrual-periodicity. And since dct:accrualPeriodicity has been hijacked (in the DCAT context) to describe the publication period, it might have to be a new property? See #728

@smrgeoinfo
Contributor

If I understand correctly, the concept @dr-shorthair is looking for is named temporalResolution in ISO19115-1, and is important for evaluating datasets that have temporal coverage. There is a corresponding spatialResolution property that is equally important if you're evaluating spatial data.

@dr-shorthair
Contributor

dr-shorthair commented Feb 17, 2019

@smrgeoinfo yes - I think we need to pair

  • dct:temporal - Temporal Coverage - i.e. the temporal extent of the dataset, the time interval that this dataset describes
    with
  • dcat:temporalResolution - smallest time period resolvable in the data; e.g. temporal spacing of a regular time series

And, while I'm a little wary of treading too far down a path that should be managed through a geospatial profile, since we already have

  • dct:spatial - Geospatial coverage - i.e. the spatial extent of the dataset
    it is not much of a stretch to match this with
  • dcat:spatialResolution - smallest distance separating items in the data

(and stop there).

@andrea-perego
Contributor

For spatial / temporal resolution, see UC15, which describes the general context and provides the relevant references.

These topics were discussed by the SDW WG, and then with the DWBP WG (in particular, with @aisaac and @riccardoAlbertoni ), which led to a proposal on how to specify it by using DQV.

The proposal is included as an example (focussing on spatial resolution only) in DQV, §6.13 (Express dataset precision and accuracy), which was in turn re-used into SDW's Best Practice 14 (Describe the positional accuracy of spatial data).

We should therefore re-use and consolidate that approach.

About consolidation, I summarised what I see as issues to be addressed in the context of the possible revisions to GeoDCAT-AP ( see SEMICeu/GeoDCAT-AP#3).

For our convenience, I copy-paste below the relevant text from SEMICeu/GeoDCAT-AP#3:

Basically, DQV models this information as observations / measurements of a given quality metric (which corresponds to a given type of resolution).

[...]

[Adopting] This [solution] would however require the definition of two groups of individuals:

  1. Those corresponding to the different types of resolution (denoting a quality metric).
  2. Those corresponding to each of the different levels of resolution (denoting the measurement of a specific quality metric).

As far as the first group is concerned (i.e., the different types of resolution), these individuals can be defined in DQV as follows:

:SpatialResolutionAsEquivalentScale a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as equivalent scale, by using a representative fraction (e.g., 1:1,000, 1:1,000,000)."@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

This initial list can be further extended. E.g.:

:SpatialResolutionAsHorizontalGroundDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as horizontal ground distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsVerticalDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as vertical distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsAngularDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as angular distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .    

The question is in which space such individuals should be defined [...].

The definition of individuals in the second group is however more problematic, since the level of resolution and unit of measurement are arbitrary (1:1000, 1:100, 1m, 1km, 100m, 10 decimal degrees, etc.).
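For illustration, a measurement of the kind meant by the second group, defined at the data level, might look like the sketch below. It follows the general DQV measurement pattern; the default namespace, the dataset and measurement IRIs, the value, and the choice of the QUDT unit IRI are all assumptions made for the example.

```turtle
@prefix :     <http://example.org/resolution#> .   # hypothetical namespace
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical dataset pointing at one resolution measurement.
:myDataset a dcat:Dataset ;
  dqv:hasQualityMeasurement :myDatasetSpatialResolution .

# The measurement ties together the metric (type of resolution),
# the numeric value, and the unit of measurement.
:myDatasetSpatialResolution a dqv:QualityMeasurement ;
  dqv:isMeasurementOf :SpatialResolutionAsDistance ;
  dqv:value "1000"^^xsd:decimal ;
  sdmx-attribute:unitMeasure <http://qudt.org/vocab/unit/M> .
```

Every dataset with the same resolution would need its own copy of such a measurement unless the individuals are shared, which is the duplication problem the options below try to address.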

Possible options include the following ones:

  1. Define only the individuals corresponding to the types of spatial / temporal resolution, whereas the individuals expressing the actual resolution will be defined at the data level. This solution is not optimal, since it will result in multiple definitions of the same individuals.
  2. Define individuals only for some levels of resolution and units of measurements - e.g., the most common ones. This solution may address the majority of (but not all) the cases.
  3. Set up a URI space supporting arbitrary levels of resolution and units of measurements. This register will dynamically generate the corresponding individuals based on information included in their URI.

An example of the last option, including also a proposal for how these individuals could be defined, is available at:

http://geodcat-ap.semic.eu/id/resolution/

@dr-shorthair
Contributor

dr-shorthair commented Feb 18, 2019

I agree that DQV is competent to satisfy the requirement, as shown in the examples.
However, I'm not sure it is optimal for meeting it in the DCAT context.

For example, the examples and the summary above present multiple kinds of 'spatial resolution', which may be important for sophisticated users.
But pushing the basic case into this structure, and then depending on a subsidiary vocabulary for labels like 'SpatialResolutionAsDistance', adds two additional layers for concepts that are widely relevant and can be easily explained (and also note the dependency on SDMX as well ...).

Access to a single summary statistic for each would help a lot in the initial discovery phase.
Interoperability is almost always helped by limiting the options.

My proposition (above) is that for DCAT to work better for a large number of datasets, two statistics might be worth 'promoting' to be first-class properties for datasets, i.e. corresponding to:

  • SpatialResolutionAsDistance
  • TemporalResolutionAsDuration

This was referenced Feb 18, 2019
@makxdekkers
Contributor

@dr-shorthair It would indeed be good if there was a simple way to expose resolutions. There is in any case a need to express both value and unit, so for spatial resolution the range would be (something like) schema:Distance, and for temporal resolution (something like) schema:Duration.
Unfortunately, DCMI only has a class dct:SizeOrDuration, but not separate classes for Size and Duration. Should we define classes dcat:Distance and dcat:Duration?

@andrea-perego
Contributor

@dr-shorthair , I also agree that we need to address first the simplest use cases - and actually the reasoning in SEMICeu/GeoDCAT-AP#3 was along those lines (the first example was about the two typical ways of expressing spatial resolution: distance and equivalent scale).

As @makxdekkers says, I see more of an issue in the fact that we need to express both value and unit of measurement. However we do it, we are unlikely to end up with something simpler than the DQV approach, unless we fold all these semantics into one single term and allow the use of just one unit of measurement. E.g., by using properties like:

  • dcat:spatialResolutionAsDistanceInMeters
  • dcat:temporalResolutionAsDurationInSeconds

or

  • dcat:resolution / dcat:SpatialResolutionAsDistance / dcat:distanceInMeters
  • dcat:resolution / dcat:TemporalResolutionAsDuration / dcat:durationInSeconds

(or something along those lines).
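A minimal sketch of the first variant, with the unit folded into the property name. Both property names are the ones floated in this comment, not published DCAT terms, and the dataset IRI and values are made up for illustration.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical dataset; property names as floated above.
<http://example.org/dataset/1> a dcat:Dataset ;
    dcat:spatialResolutionAsDistanceInMeters   "30"^^xsd:decimal ;
    dcat:temporalResolutionAsDurationInSeconds "900"^^xsd:decimal .
```

With a fixed unit the value is a plain number that can be compared directly, at the cost of asking publishers to convert to metres and seconds.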

@smrgeoinfo
Contributor

One issue with dqv is that in some engineering situations, resolution and precision are different. Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

@andrea-perego
Contributor

@smrgeoinfo wrote:

One issue with dqv is that in some engineering situations, resolution and precision are different.

Yes, the wording of the relevant section in DQV does not make this distinction, but the formal definition of the resolution in the examples does not bind the notion of resolution with the one of precision.

Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

Maybe schema:Duration can work, as it is using a standard syntax encoding, but schema:Distance uses a literal where value and a code for unit of measurement are separated by a space. Besides the problem of ensuring that codes for units of measurement are used consistently, this value is not machine-actionable. E.g., I won't be able to make a query to get the datasets using a spatial resolution with a distance less than 100 m.
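To make the contrast concrete, here is a sketch of the two styles side by side. The `ex:` properties and dataset IRIs are hypothetical; only the shape of the literals matters.

```turtle
@prefix ex:  <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Schema.org style: number and unit code packed into one string.
# Comparing against 100 m requires parsing the string and knowing the unit code.
ex:datasetA ex:spatialResolutionText "30 m" .

# Fixed-unit numeric style: the value can be compared directly,
# e.g. with a SPARQL FILTER (?res < 100).
ex:datasetB ex:spatialResolutionInMeters "30"^^xsd:decimal .
```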

Besides this, IMO, re-using Schema.org properties may lead to the issues mentioned in #85 (comment) (in that case in relation to schema:startDate and schema:endDate).

@dr-shorthair
Contributor

dr-shorthair commented Feb 19, 2019

@andrea-perego yes this is a bit of a perma-issue. There are too many representations of 'measure' or 'quantity' already, but none have achieved universal acceptance. Furthermore, most come with a lot of baggage (or at least are just one tiny part of some huge vocabulary, the rest of which we have little interest in in this context). That is the problem with your original DQV proposal: it makes the simple case hard.

So, taking a leaf out of Randall Munroe's book, I suggest crashing through and specifying this as the range of both dcat:temporalResolution and dcat:spatialResolution:

dcat:Measure a owl:Class . 
dcat:unitOfMeasure a rdf:Property ;
    rdfs:domain dcat:Measure .
dcat:amount a owl:DatatypeProperty ;
    rdfs:domain dcat:Measure ;
    rdfs:range xsd:decimal .

Which would mean that an instance would look like

<> a dcat:Dataset ;
    ...
    dcat:temporalResolution [
        a dcat:Measure ;
        dcat:amount 15.0 ;
        dcat:unitOfMeasure <http://www.w3.org/2006/time#unitMinute> ;
    ] ;
    dcat:spatialResolution [
        a dcat:Measure ;
        dcat:amount 30.0 ;
        dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ;
    ] ;
    ...
.

@makxdekkers
Contributor

@dr-shorthair While I do like the approach of providing a 'simple' solution for 'simple' cases, I do feel a bit uneasy about replicating something that is already there, i.e. the more 'fundamental' solution in DQV. If we promote this 'simple' solution, 'simple' cases -- using the DCAT-specific solution -- are not going to be interoperable with more 'complex' cases using a DQV-based solution. One could argue that by promoting a DCAT-specific approach, we are discouraging people from using a DQV-based approach, and thus DCAT would only cater for 'simple' cases.

@dr-shorthair
Contributor

Yeah. On the one hand, I'm usually one of the first to advocate strongly for re-use of existing solutions, particularly if they are from the W3C stable and have clearly been designed to integrate.
On the other I was somewhat put off by the complexity that is introduced as a further controlled vocabulary is required for the property semantics. I understand why DQV does it that way, to remain scalable and general. But we need to be sure that we want this to be reflected into DCAT. Furthermore, as has been noted before, DQV is not a Rec therefore officially it cannot be cited normatively;-(

Of course, all of these spatial and temporal properties (including the classic DCT ones) have non-simple values, so the complexity just re-appears a layer down anyway.

However, I think the mappings to DQV can almost certainly be formally expressed using OWL Restrictions and property-chain-axioms (e.g. see mappings from DCT to PROV here: https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat-prov.ttl#L63 ) so I'm not sure the interoperability argument made by @makxdekkers is strictly true.

@andrea-perego
Contributor

@dr-shorthair , working towards a simple solution:

Following up from @makxdekkers 's and @smrgeoinfo 's comment on schema:Duration, cannot we make dcat:temporalResolution a datatype property, with range xsd:duration?

Re-using your example, this would be something like:

<> a dcat:Dataset ;
    ...
    dcat:temporalResolution "PT15M"^^xsd:duration ;
    ...
.

Unfortunately, the same cannot be done for spatial resolution.
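For reference, a few xsd:duration lexical forms that could describe common time-series spacings, as a sketch only. The dataset IRIs are hypothetical, and dcat:temporalResolution is the property as proposed in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/dataset/> .   # hypothetical namespace

ex:hourly  a dcat:Dataset ; dcat:temporalResolution "PT1H"^^xsd:duration .   # one hour
ex:daily   a dcat:Dataset ; dcat:temporalResolution "P1D"^^xsd:duration .    # one day
ex:monthly a dcat:Dataset ; dcat:temporalResolution "P1M"^^xsd:duration .    # one month (PT1M would be one minute)
```

Note that xsd:duration values mixing months with days or seconds are only partially ordered, so comparisons are most reliable when publishers stick to a single component.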

@dr-shorthair
Contributor

Good point. Temporal resolution was the thing that triggered this discussion, and it is more mainstream - one dimension is so much easier than two or three.

Spatial resolution (as distance) is still relatively simple conceptually but does need an explicit UOM. If only XSD had a 'measure' type (and every other programming language for that matter ... computer-science fail IMHO)

@riccardoAlbertoni
Contributor

@dr-shorthair wrote:

...
dcat:spatialResolution [
    a dcat:Measure ;
    dcat:amount 30.0 ;
    dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ;
] ;

I am not very convinced about the need to mint a new property for dcat:unitOfMeasure.

sdmx-attribute:unitMeasure is widely used, W3C recommendations such as RDF data cube use it, and I am concerned about introducing new patterns when there is one which is more or less well-accepted.

I see pros and cons in having both approaches : DQV/RDF DATA CUBE style and the DCAT properties.
If we go for defining new dcat properties, I guess that we should anyway explicitly refer to SDW best practice which reuses DQV/RDF DATA CUBE for the more general cases.

@dr-shorthair
Contributor

dr-shorthair commented Feb 20, 2019

Mind you, xsd:duration is not an OWL built-in https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes . So I'm thinking perhaps to leave the range open, but recommend use of xsd:duration?

@dr-shorthair
Contributor

(shame there isn't an ISO standard for 'Length' complementing what ISO 8601 did for 'Time')

@dr-shorthair
Contributor

See revised proposal for dcat:spatialResolutionM in branch https://github.com/w3c/dxwg/tree/dcat-issue84-sres-simon - simplified with units of measure fixed to metres:

@andrea-perego
Contributor

+1 from me.

@andrea-perego
Contributor

I wonder whether we could consider adding properties for spatial resolutions not expressed as distance, namely, as equivalent scale - which is the other most common way of specifying spatial resolution.

@dr-shorthair
Contributor

dr-shorthair commented Feb 25, 2019

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

This was referenced Feb 27, 2019
@dr-shorthair
Contributor

dr-shorthair commented Mar 3, 2019

@riccardoAlbertoni Your contributions in Chapter 8 show some patterns for use of DQV for quality information.

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

@agbeltran
Member

I was looking for the same thing, and the relevant bit that I found is this DQV section on statistics, which relies on an extension of VoID and is thus too oriented towards RDF datasets.

@riccardoAlbertoni
Contributor

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

I am not aware of anything except the examples mentioned by @agbeltran for the statistics oriented to RDF datasets; perhaps @makxdekkers knows more?

Anyway, I guess there is more than one way to do it. For example, using RDF data cube you can define your own qb:DataStructureDefinition.

If you want to describe statistics of a dataset, such as Average, Max, Min for the "fields" in the dataset, you might define a qb:DataStructureDefinition whose dimensions/components include

  • the considered dataset
  • the considered field
  • the considered operator ( i.e. Average, Max, Min.. etc)
  • the actual measures

If you provide statistics as quality indicators, you could use dqv:QualityMeasurement, for example by defining a new dqv:Dimension for each pair of field and operator.
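The Data Cube option could be sketched roughly as follows. Everything in the `ex:` namespace (component properties, operator concepts, dataset, field name, and the value itself) is hypothetical and only meant to show the shape of such a structure; no standard vocabulary for these terms is implied.

```turtle
@prefix ex:   <http://example.org/stats#> .   # hypothetical namespace
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Structure: three dimensions (dataset, field, operator) and one measure.
ex:statsStructure a qb:DataStructureDefinition ;
  qb:component [ qb:dimension ex:dataset ] ,
               [ qb:dimension ex:field ] ,
               [ qb:dimension ex:operator ] ,
               [ qb:measure   ex:value ] .

ex:dataset  a rdf:Property, qb:DimensionProperty ; rdfs:range dcat:Dataset .
ex:field    a rdf:Property, qb:DimensionProperty .
ex:operator a rdf:Property, qb:DimensionProperty .
ex:value    a rdf:Property, qb:MeasureProperty ; rdfs:range xsd:decimal .

# The cube holding the summary statistics.
ex:statsCube a qb:DataSet ;
  qb:structure ex:statsStructure .

# One observation: the maximum of the "temperature" field (made-up value).
ex:obs1 a qb:Observation ;
  qb:dataSet  ex:statsCube ;
  ex:dataset  ex:airQualityDataset ;
  ex:field    "temperature" ;
  ex:operator ex:Max ;
  ex:value    "41.5"^^xsd:decimal .
```

Each additional statistic (Min, Average, and so on) becomes one more observation with a different `ex:operator`, so the structure itself does not need to change.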

@andrea-perego
Contributor

@dr-shorthair wrote:

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

I would also prefer to have one solution that fits all use cases, but we should also recognise that these two ways of expressing spatial resolution (i.e., distance and equivalent scale) are not comparable or convertible. So, IMO, the use of two different properties is more than acceptable.

BTW, my request is based also on an explicit requirement from GeoDCAT-AP - which is defining mappings from ISO 19115:2003, where spatial resolution is expressed either as distance or equivalent scale.

@andrea-perego
Contributor

Re-thinking about this, probably we should consider the option of specifying spatial resolution in 2 steps (which was one of the options discussed earlier):

a:Dataset a dcat:Dataset ;
  dcat:spatialResolution [
    dcat:distanceInMeters "15"^^xsd:decimal
  ] .

One of the advantages is that it would be easier for people to reuse the main pattern dcat:spatialResolution / "specific property" in case they need to express this information in other ways (e.g., as per ISO 19115-1:2014, which includes also resolution as horizontal ground distance, vertical distance and angular distance).
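As a sketch of how the pattern might be reused, the same dcat:spatialResolution node could carry a different "specific property". The dcat:spatialResolution and dcat:distanceInMeters terms are the ones proposed above; the `ex:` properties, the dataset IRIs, and the values are hypothetical.

```turtle
@prefix a:    <http://example.org/data#> .   # hypothetical namespace
@prefix ex:   <http://example.org/ns#> .     # hypothetical namespace
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Resolution as distance, using the property proposed above.
a:GridDataset a dcat:Dataset ;
  dcat:spatialResolution [
    dcat:distanceInMeters "15"^^xsd:decimal
  ] .

# Same pattern, resolution as equivalent scale (1:50,000).
a:MapDataset a dcat:Dataset ;
  dcat:spatialResolution [
    ex:equivalentScaleDenominator "50000"^^xsd:integer
  ] .

# Same pattern, resolution as angular distance.
a:GlobalDataset a dcat:Dataset ;
  dcat:spatialResolution [
    ex:angularDistanceInDegrees "0.25"^^xsd:decimal
  ] .
```

A consumer interested only in the common case can query for dcat:distanceInMeters and ignore the other properties.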

@davebrowning
Contributor

@andrea-perego - do you see this issue as critical or can this be moved to the backlog?

@andrea-perego
Contributor

Partially critical (for the reasons I explained) but it can be moved to the backlog, provided that it will be possible to come back to this after DCAT v1.1 is out and possibly address it in the v1.2 release.

@andrea-perego
Contributor

I created a new issue to work on the discussion points still open:

#1266

Closing this one.


9 participants