Support for observational error measurements in data #281

Open
steko opened this issue Aug 15, 2016 · 2 comments
steko commented Aug 15, 2016

Hey all, based on a discussion with @danfowler I'm submitting this proposal to add support for observational error measurements in data, a rather common occurrence in scientific datasets. I can't draft a full spec at the moment but I hope others will chime in with comments from their specific experience. Examples below are archaeology-based.

While the idea came up in the context of data packages, JSON Table Schema seems to be the place where this kind of support should be added.

Examples

Radiocarbon dates

As can be seen in the Mediterranean Radiocarbon dates dataset (one of the largest open datasets of this kind), a radiocarbon date must be expressed by at least the conventional radiocarbon age and its error. While it's common to write 3340 ± 45 in text, datasets usually record the two values separately. Either way, the radiocarbon age has no meaning without the attached error.

Neutron activation analysis

Compositional data from INAA (instrumental neutron activation analysis) are expressed as parts per million with an attached measurement error, as can be seen in the Chemical Composition by Neutron Activation Analysis (INAA) of Neo-Assyrian Palace Ware dataset (a rather common case). In this case, measurement and error are recorded in a single column, separated by ±.

Existing implicit conventions

Separate columns

id, data, error
0, 34, 0.2

Single column

id, data
0, 34 ± 0.2
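
Both conventions carry the same information, so the single-column form can be normalized into the two-column one. A minimal Python sketch, assuming the separator is ± (the "+/-" fallback is my own addition) and that both parts are plain decimal numbers; `split_measurement` is a hypothetical helper name, not part of any existing library:

```python
import re

# Matches cells such as "34 ± 0.2" or "34 +/- 0.2".
# The "+/-" alternative is an assumption, not something the datasets above use.
_PATTERN = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?)\s*(?:±|\+/-)\s*(\d+(?:\.\d+)?)\s*$"
)


def split_measurement(cell):
    """Split a 'value ± error' cell into a (value, error) tuple of floats."""
    match = _PATTERN.match(cell)
    if match is None:
        raise ValueError(f"not a 'value ± error' cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))
```

For example, `split_measurement("34 ± 0.2")` yields the pair that the separate-columns layout would store in `data` and `error`. Cells that do not match are rejected rather than guessed at.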

Proposed approach

Add a field descriptor in the JSON schema to explicitly mark the values in one field as linked to another field, e.g.:

{
    "fields": [
      {
        "name": "measurement",
        "title": "The numeric value",
        "type": "number"
      },
      {
        "name": "error",
        "title": "The error attached to the numeric value",
        "type": "number",
        "errorOf": "measurement"
      }
    ]
}

An alternate approach:

{
    "fields": [
      {
        "name": "measurement",
        "title": "The numeric value",
        "type": "number",
        "errorField": "error"
      },
      {
        "name": "error",
        "title": "The error attached to the numeric value",
        "type": "number"
      }
    ]
}
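
To show how a consumer might read either variant, here is a rough Python sketch that resolves both `errorOf` and `errorField` into the same measurement-to-error mapping. The function name `error_links` and the validation behaviour are assumptions for illustration only; nothing here is part of an existing spec or library.

```python
def error_links(schema):
    """Return a dict mapping each measurement field name to its error field name."""
    names = {field["name"] for field in schema["fields"]}
    links = {}
    for field in schema["fields"]:
        if "errorOf" in field:
            # First approach: the error field points at the measurement.
            measurement, error = field["errorOf"], field["name"]
        elif "errorField" in field:
            # Alternate approach: the measurement points at its error field.
            measurement, error = field["name"], field["errorField"]
        else:
            continue
        if measurement not in names or error not in names:
            raise ValueError(f"link refers to an unknown field: {(measurement, error)}")
        links[measurement] = error
    return links


# The two example descriptors above, reduced to the relevant keys.
schema_error_of = {"fields": [
    {"name": "measurement", "type": "number"},
    {"name": "error", "type": "number", "errorOf": "measurement"},
]}
schema_error_field = {"fields": [
    {"name": "measurement", "type": "number", "errorField": "error"},
    {"name": "error", "type": "number"},
]}
```

Both descriptors resolve to the same link, which suggests the choice between them is mostly about which field should own the annotation.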

This is just a basic description of the issue to get the discussion started, with no presumption of formal correctness nor exhaustive coverage of the various issues in other disciplines.

@djvanderlaan

I work mainly with statistical output tables (unemployment figures and such) where we sometimes also have the uncertainty. However, most often this is specified using the lower and upper bounds of the confidence interval. We currently encode this in the variable names (e.g. "measurement_lb" and "measurement_ub"), and it has been on our todo list for a while to encode this in the metadata. So +1.

However, I think we need more than errorOf. As mentioned above, we often have a lower and an upper bound. Relative errors (%) are also used. The most flexible way would be to allow arbitrary relations between columns to be specified. Perhaps something along the lines of:

{
    "fields": [
      {
        "name": "measurement",
        "title": "The numeric value",
        "type": "number"
      },
      {
        "name": "error",
        "title": "The error attached to the numeric value",
        "type": "number",
        "relation" : { "type": "errorOf", "column": "measurement"}
      }
    ]
}

This would also allow people to specify custom relations, although a list of suggested or default supported relations would be nice.
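
As a rough illustration of how the generic `relation` descriptor could be consumed, the Python sketch below groups related fields by the column they refer to. Only "errorOf" appears in this thread; the relation types "lowerBoundOf" and "upperBoundOf" and the function name `collect_relations` are hypothetical names made up for the example.

```python
from collections import defaultdict


def collect_relations(schema):
    """Group related fields by target column: {column: {relation_type: field_name}}."""
    relations = defaultdict(dict)
    for field in schema["fields"]:
        relation = field.get("relation")
        if relation is None:
            continue  # plain field without a relation descriptor
        relations[relation["column"]][relation["type"]] = field["name"]
    return dict(relations)


# Confidence-interval bounds expressed as relations instead of name suffixes.
# "lowerBoundOf" and "upperBoundOf" are hypothetical relation types.
schema = {"fields": [
    {"name": "measurement", "type": "number"},
    {"name": "measurement_lb", "type": "number",
     "relation": {"type": "lowerBoundOf", "column": "measurement"}},
    {"name": "measurement_ub", "type": "number",
     "relation": {"type": "upperBoundOf", "column": "measurement"}},
]}

bounds = collect_relations(schema)
```

Here `bounds["measurement"]` holds the names of the lower and upper bound columns, so the "_lb" and "_ub" suffixes no longer need to carry the meaning.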

@rufuspollock
Contributor

@steko @djvanderlaan I think this is a perfect candidate for a "pattern" proposal. A pattern offers a suggestion for how to solve a particular problem (in this case, linking error information to the main measurement) without being a formal spec.
