Using Data Packages

Motivation

Previously all "layers" used in the OHI toolbox had the following maximum set of fields (as determined by the layers.csv fed into CheckLayers():

id spatial identifier:
- id_num if numeric
- id_chr if character
category category, as character
year year, as integer
value value:
- val_num if numeric
- val_chr if character

The layers.*.csv at ohicore/inst/extdata, populated by data_create.R*, generally generate these data by slurping a Google spreadsheet (eg layers_navigation_2012a_2013a in csv output format).

Limiting a layer to this set of fields then requires many seperate files for data which is best maintained and viewed together, such as region: id, name, area_km2, etc. More recently this example as an LE input layer le_jobs_status_model_curref begs for a single file since all field values are related to one another:

metric	cntry_key	sector	cur_year	ref_year	cur_base_value	ref_base_value	cur_adj_value	ref_adj_value
jobs	ABW	cf	2008	2003	8.621900e+02	786.886800	91.38355	89.31041
jobs	ABW	mar	2008	2003	8.615858e-01	7.929788	91.38355	89.31041
jobs	ABW	tour	2010	2005	3.640000e+04	29700.000000	92.21258	90.13986
jobs	AGO	cf	2002	1995	4.150852e+04	48035.848000	90.19738	90.26034
jobs	AGO	tour	2010	2005	6.678587e+03	7472.289297	90.12543	90.17040
jobs	AIA	cf	2008	2003	2.425206e+02	498.330000	91.38355	89.31041
jobs	AIA	tour	2010	2005	3.900000e+03	4000.000000	92.21258	90.13986
jobs	ALB	cf	2008	2003	5.162066e+03	3986.640000	87.25179	86.80258
jobs	ALB	mar	2008	2003	6.820084e+02	168.035411	87.25179	86.80258
jobs	ALB	tour	2010	2005	1.045295e+05	74564.996669	87.12611	86.79412

There are several approaches to layerizing this data one may take.

Since we're only allowed id, category (year is an annotation like value not only applicable to the given year) so even after splitting the data into per-metric layer files (jobs, wage, rev), they would have to be based on cntry_key (=id), sector (=category) and a new file per columns to the right.
We could consider adding a subcategory generic field, so that we could melt the data.

metric	cntry_key	sector	variable	value
jobs	ABW	cf	cur_year	2008.0000000
jobs	ABW	cf	ref_year	2003.0000000
jobs	ABW	cf	cur_base_value	862.1900000
jobs	ABW	cf	ref_base_value	786.8868000
jobs	ABW	cf	cur_adj_value	91.3835505
jobs	ABW	cf	ref_adj_value	89.3104096
jobs	ABW	mar	cur_year	2008.0000000
jobs	ABW	mar	ref_year	2003.0000000
jobs	ABW	mar	cur_base_value	0.8615858
jobs	ABW	mar	ref_base_value	7.9297879

into Even after filtering new layers per metric (jobs, rev, wage), we still have and variable (subcategory?) if then melting the data. The *_year fields clearly have a different range of values (and integer data type vs floating) over the other *_value fields so simply melting to a metric, cntry_key, sector, variable, value format wouldn't work.

Entrez Data Packages

simple tabular data packages format:
- data.csv csv format
- datapackage.json metadata like name of package and data types of fields

For example, here's the datapackage.json for World GDP: in github, in data viewer:

{
  "name": "gdp",
  "title": "Country, Regional and World GDP (Gross Domestic Product)",
  "description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV.",
  "licenses": [
    {
      "id": "odc-pddl",
      "url": "http://opendatacommons.org/licenses/pddl/"
    }
  ],
  "keywords": [ "GDP", "World", "Gross Domestic Product", "Time series" ],
  "version": "2011",
  "last_updated": "2011-09-21",
  "image": "http://assets.okfn.org/p/opendatahandbook/img/data-wrench.png",
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "resources": [
    {
      "path": "data/gdp.csv",
      "schema": {
        "fields": [
          {
            "id": "Country Name",
            "type": "string"
          },
          {
            "id": "Country Code",
            "type": "string",
            "foreignkey": "iso-3-geo-codes/id"
          },
          {
            "id": "Year",
            "type": "date",
            "format": "yyyy"
          },
          {
            "id": "Value",
            "description": "GDP in current USD", 
            "type": "number"
          }
        ]
      }
    }
  ]
}

Primary and Foreign Keys

The JSON Table Schema importantly incluedes:

primaryKey to uniquely identify rows and able to use multiple fields
foreignKeys to identify referencing data

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using Data Packages

Motivation

Entrez Data Packages

Primary and Foreign Keys

Tools

Clone this wiki locally