Skip to content

Using Data Packages

Ben Best edited this page Apr 25, 2014 · 8 revisions

Motivation

Previously all "layers" used in the OHI toolbox had the following maximum set of fields (as determined by the layers.csv fed into CheckLayers():

  • id spatial identifier:
    • id_num if numeric
    • id_chr if character
  • category category, as character
  • year year, as integer
  • value value:
    • val_num if numeric
    • val_chr if character

The layers.*.csv at ohicore/inst/extdata, populated by data_create.R*, generally generate these data by slurping a Google spreadsheet (eg layers_navigation_2012a_2013a in csv output format).

Limiting a layer to this set of fields then requires many seperate files for data which is best maintained and viewed together, such as region: id, name, area_km2, etc. More recently this example as an LE input layer le_jobs_status_model_curref begs for a single file since all field values are related to one another:

metric cntry_key sector cur_year ref_year cur_base_value ref_base_value cur_adj_value ref_adj_value
jobs ABW cf 2008 2003 8.621900e+02 786.886800 91.38355 89.31041
jobs ABW mar 2008 2003 8.615858e-01 7.929788 91.38355 89.31041
jobs ABW tour 2010 2005 3.640000e+04 29700.000000 92.21258 90.13986
jobs AGO cf 2002 1995 4.150852e+04 48035.848000 90.19738 90.26034
jobs AGO tour 2010 2005 6.678587e+03 7472.289297 90.12543 90.17040
jobs AIA cf 2008 2003 2.425206e+02 498.330000 91.38355 89.31041
jobs AIA tour 2010 2005 3.900000e+03 4000.000000 92.21258 90.13986
jobs ALB cf 2008 2003 5.162066e+03 3986.640000 87.25179 86.80258
jobs ALB mar 2008 2003 6.820084e+02 168.035411 87.25179 86.80258
jobs ALB tour 2010 2005 1.045295e+05 74564.996669 87.12611 86.79412

There are several approaches to layerizing this data one may take.

  1. Since we're only allowed id, category (year is an annotation like value not only applicable to the given year) so even after splitting the data into per-metric layer files (jobs, wage, rev), they would have to be based on cntry_key (=id), sector (=category) and a new file per columns to the right.

  2. We could consider adding a subcategory generic field, so that we could melt the data.

metric cntry_key sector variable value
jobs ABW cf cur_year 2008.0000000
jobs ABW cf ref_year 2003.0000000
jobs ABW cf cur_base_value 862.1900000
jobs ABW cf ref_base_value 786.8868000
jobs ABW cf cur_adj_value 91.3835505
jobs ABW cf ref_adj_value 89.3104096
jobs ABW mar cur_year 2008.0000000
jobs ABW mar ref_year 2003.0000000
jobs ABW mar cur_base_value 0.8615858
jobs ABW mar ref_base_value 7.9297879

into Even after filtering new layers per metric (jobs, rev, wage), we still have and variable (subcategory?) if then melting the data. The *_year fields clearly have a different range of values (and integer data type vs floating) over the other *_value fields so simply melting to a metric, cntry_key, sector, variable, value format wouldn't work.

Entrez Data Packages

  • simple tabular data packages format:
    • data.csv csv format
    • datapackage.json metadata like name of package and data types of fields

For example, here's the datapackage.json for World GDP: in github, in data viewer:

{
  "name": "gdp",
  "title": "Country, Regional and World GDP (Gross Domestic Product)",
  "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV.",
  "licenses": [
    {
      "id": "odc-pddl",
      "url": "http://opendatacommons.org/licenses/pddl/"
    }
  ],
  "keywords": [ "GDP", "World", "Gross Domestic Product", "Time series" ],
  "version": "2011",
  "last_updated": "2011-09-21",
  "image": "http://assets.okfn.org/p/opendatahandbook/img/data-wrench.png",
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "resources": [
    {
      "path": "data/gdp.csv",
      "schema": {
        "fields": [
          {
            "id": "Country Name",
            "type": "string"
          },
          {
            "id": "Country Code",
            "type": "string",
            "foreignkey": "iso-3-geo-codes/id"
          },
          {
            "id": "Year",
            "type": "date",
            "format": "yyyy"
          },
          {
            "id": "Value",
            "description": "GDP in current USD", 
            "type": "number"
          }
        ]
      }
    }
  ]
}

Primary and Foreign Keys

The JSON Table Schema importantly incluedes:

  • primaryKey to uniquely identify rows and able to use multiple fields
  • foreignKeys to identify referencing data

Tools

Clone this wiki locally