-
Notifications
You must be signed in to change notification settings - Fork 9
Using Data Packages
Previously all "layers" used in the OHI toolbox had the following maximum set of fields (as determined by the layers.csv
fed into CheckLayers():
-
id
spatial identifier:-
id_num
if numeric -
id_chr
if character
-
-
category
category, as character -
year
year, as integer -
value
value:-
val_num
if numeric -
val_chr
if character
-
The layers.*.csv at ohicore/inst/extdata, populated by data_create.R*, generally generate these data by slurping a Google spreadsheet (eg layers_navigation_2012a_2013a in csv output format).
Limiting a layer to this set of fields then requires many seperate files for data which is best maintained and viewed together, such as region: id, name, area_km2, etc. More recently this example as an LE input layer le_jobs_status_model_curref
begs for a single file since all field values are related to one another:
metric | cntry_key | sector | cur_year | ref_year | cur_base_value | ref_base_value | cur_adj_value | ref_adj_value |
---|---|---|---|---|---|---|---|---|
jobs | ABW | cf | 2008 | 2003 | 8.621900e+02 | 786.886800 | 91.38355 | 89.31041 |
jobs | ABW | mar | 2008 | 2003 | 8.615858e-01 | 7.929788 | 91.38355 | 89.31041 |
jobs | ABW | tour | 2010 | 2005 | 3.640000e+04 | 29700.000000 | 92.21258 | 90.13986 |
jobs | AGO | cf | 2002 | 1995 | 4.150852e+04 | 48035.848000 | 90.19738 | 90.26034 |
jobs | AGO | tour | 2010 | 2005 | 6.678587e+03 | 7472.289297 | 90.12543 | 90.17040 |
jobs | AIA | cf | 2008 | 2003 | 2.425206e+02 | 498.330000 | 91.38355 | 89.31041 |
jobs | AIA | tour | 2010 | 2005 | 3.900000e+03 | 4000.000000 | 92.21258 | 90.13986 |
jobs | ALB | cf | 2008 | 2003 | 5.162066e+03 | 3986.640000 | 87.25179 | 86.80258 |
jobs | ALB | mar | 2008 | 2003 | 6.820084e+02 | 168.035411 | 87.25179 | 86.80258 |
jobs | ALB | tour | 2010 | 2005 | 1.045295e+05 | 74564.996669 | 87.12611 | 86.79412 |
There are several approaches to layerizing this data one may take.
-
Since we're only allowed
id
,category
(year is an annotation like value not only applicable to the given year) so even after splitting the data into per-metric layer files (jobs, wage, rev), they would have to be based oncntry_key
(=id
),sector
(=category
) and a new file per columns to the right. -
We could consider adding a
subcategory
generic field, so that we could melt the data.
metric | cntry_key | sector | variable | value |
---|---|---|---|---|
jobs | ABW | cf | cur_year | 2008.0000000 |
jobs | ABW | cf | ref_year | 2003.0000000 |
jobs | ABW | cf | cur_base_value | 862.1900000 |
jobs | ABW | cf | ref_base_value | 786.8868000 |
jobs | ABW | cf | cur_adj_value | 91.3835505 |
jobs | ABW | cf | ref_adj_value | 89.3104096 |
jobs | ABW | mar | cur_year | 2008.0000000 |
jobs | ABW | mar | ref_year | 2003.0000000 |
jobs | ABW | mar | cur_base_value | 0.8615858 |
jobs | ABW | mar | ref_base_value | 7.9297879 |
into Even after filtering new layers per metric (jobs, rev, wage), we still have and variable (subcategory
?) if then melting the data. The *_year
fields clearly have a different range of values (and integer data type vs floating) over the other *_value
fields so simply melting to a metric, cntry_key, sector, variable, value format wouldn't work.
- simple tabular data packages format:
-
data.csv
csv format -
datapackage.json
metadata like name of package and data types of fields
-
For example, here's the datapackage.json
for World GDP: in github, in data viewer:
{
"name": "gdp",
"title": "Country, Regional and World GDP (Gross Domestic Product)",
"description": "Country, regional and world GDP in current US Dollars ($). Regional means collections of countries e.g. Europe & Central Asia. Data is sourced from the World Bank and turned into a standard normalized CSV.",
"licenses": [
{
"id": "odc-pddl",
"url": "http://opendatacommons.org/licenses/pddl/"
}
],
"keywords": [ "GDP", "World", "Gross Domestic Product", "Time series" ],
"version": "2011",
"last_updated": "2011-09-21",
"image": "http://assets.okfn.org/p/opendatahandbook/img/data-wrench.png",
"sources": [{
"name": "World Bank and OECD",
"web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}],
"resources": [
{
"path": "data/gdp.csv",
"schema": {
"fields": [
{
"id": "Country Name",
"type": "string"
},
{
"id": "Country Code",
"type": "string",
"foreignkey": "iso-3-geo-codes/id"
},
{
"id": "Year",
"type": "date",
"format": "yyyy"
},
{
"id": "Value",
"description": "GDP in current USD",
"type": "number"
}
]
}
}
]
}
The JSON Table Schema importantly incluedes:
-
primaryKey
to uniquely identify rows and able to use multiple fields -
foreignKeys
to identify referencing data