Support multiple data sets for each component #180
AJamesPhillips
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
TODO:
g
andpi
. How might this be leveraged?After #178 we could implement a place for people to deposit and later improve, discuss, use (calculations, graphics, simulations)
primarily temporal and spatialdata relevant to situations of public interest.A lot of the data we may be interested in has a spatial and or temporal component to it though some data like constants do not have either.
We also need a specific spatial data set type that can be used by other data sets.
Potential schema and use
space
tableIs similar to a component in that it has a title & description but it is always public so that it is accessible across all bases as it is used as a "type" for the
data
table.Or could add a "space" type to wcomponent. Would then need to:
I imagine this will be used similar to the
data_set
entries in that there might be two entries for area A, one from the point of view (POV) of country B and second from the POV of country C. They would have the similar titles of "Area A (POV Country B)", and a second with "Area A (POV Country C)". When adding POV functionality, will likely want to add a many-to-many table between wcomponent and space to match wcomponent political actors to the specificspace
table entries that form their POV.space_version
tablegeometry(LinestringZM,4326)
space_parent_child
tableWhen a user adds a "parent" space, this table allows the user to capture the children spaces that the parent space encompasses. We have to capture the space_version.version of both parent and child otherwise a child of a new version might have changed borders with lie outside its parent.
data_set
tablesource_data_set_id
,derived_data_set_id
to allow for data sets derived from other data sets to be updated when the source data sets get updated with a newer version.data_set_version
tabledata_set_group
table?Perhaps we also need a data_set_group to accommodate a large data provider like a government agency that can provide data_sets on many hundreds of different components simultaneously and who does not want to duplicate a similar description for each data_set entry, instead wants to see: "This DataSet is provided by abc organisation. See data_set_group xyz for explanation of strengths and weakness." Perhaps can use the wcomponent for this? And add a many-to-many component_id to data_set, to allow data_sets to know what they relate to and allow the parent component (that's used as the label) to know what it has labeled / what it "contains". (to be developed further, in context of existing wcomponent.label_ids)
data
A table for normalised data values
data_qualifiers
tableThis schema could result in a large amount of duplication of data on new versions and on alternative data sets. An alternative is that there is a "qualifier set" that the
data
table references with aqualifies_set_id
, and the following table becomes 2 tables: one forqualifier set
and a second fordata_qualifier_set_component_label_ids
.data_qualifiers_set
tablepeople in @@space_id1 who are @@component_id1, @@component_id2 and @@component_id3
, where @@space_id1 references a space with title of "Nottinghamshire", @@component_id1 references component with title === "20-30 years old", component_id3.title === "female", component_id2.title === "farmer",data_qualifier_set_component_label_ids
tableAdditional questions to resolve
Some examples (todo)
Vaccination rates over time by city by age group
e.g. the "Weekly COVID-19 Municipality Vaccination Report" from: https://www.mass.gov/info-details/archive-of-covid-19-vaccination-reports
Can make a component for "Covid-19 Vaccination rates over time by city by age group in Massachusetts".
The spatial and temporal components are easy to capture.
How is the age group captured?
data_qualifiers
table.Alternatively:
When someone comes to this data set they would find the top level component with only one possible data set.
The data set would have the author_id of the Massachusetts public health department. Would have a description linking to the source page.
There would be an indicator saying "30 versions".
You could click on the versions and it would present a list with the dates of the
data_set_version.created_at
You could click on an individual version, or on the previous view, you could click on latest version and be taken to the version of information for that dataset for that component.
That version would have "3232 rows" where the
data
table matches thedata_set_version_id
anddata_set_version
.There would be a columns of "space", "time", "value", "value_str"
Under the first plan for capturing age group it would join each row of
data
withdata_qualifiers
containing all the labels that were applied to them. It would get a list of uniquedata_qualifiers.component_label_id
and fetch those component_ids to show then as labels next to each of the data rows.Under the second plan, the "age group 0-19", "age group 20-25" etc would just be shown in the value_str column.
Later when someone comes to use the data, i.e. join it by age group e.g. they might have education level, or time spent on social media by age group for cities in Massachusetts for 2021... then the user would want to ensure that join was unique... otherwise we don't solve the problem of joining data.
If someone also wants to get the vaccination rate over time for all 20-25 year olds then this is also brittle if it's just a string in the value_str as it relies on the upload always keeping the same format of the data, e.g. not using "Age Group 20-25" then "age_group20 to 25". Instead with the
data_qualifiers
joined ontodata
, a WHERE clause for the matching id can be used.If someone wants to aggregate by the age group then again, the
data_qualifiers
can be used, though unless they are all also tagged with the parent component of "age group" then the recursive sql call would be needed or a subselect on wcomponent.label_ids where they contain the parent "age group" wcomponent.id.Population data for a country
Explore how the following data examples would be stored in each of the different tables, including how they would be disputed, and versioned:
Simpson's paradox data
https://web.archive.org/web/20180201210227/http://vudlab.com/simpsons/
Renewable energy, wind and sun
https://github.com/TheWorldSim/world-sim-data/tree/master/data/solarpv_capacity_summary/data
https://github.com/TheWorldSim/world-sim-data/tree/master/data/wind_turbine_capacity_summary/data
https://github.com/TheWorldSim/world-sim-data/blob/master/data/solarpv_capacity_summary/data/_2017_texas_loss10percent_tracking1_tilt35_azim180_month_average%40core%400.0.10.csv
Estimating cost of undersea power cable & connections
Ideally allow for an aggregate of undersea cable costs
Data growth: two extremes. At one end the user would create a single component for the name of the undersea cable, and either add an attribute like
"cost in 2015 USD": 8 million
. At the other extreme, they would upload a whole table of undersea cables, with their names, locations, paths, costs, construction start and end times, etc etc.Estimate hydro energy storage of region X
A couple of examples (need to update these now)
You might want to store "Population of Chelsea, MA, USA", with a description of: "1 year resolution, beginning in 1950"
And someone else like the US census bureau might want to store: "Population of USA by municipality in 2021".
--
There would be two top level components:
--
For each USA municipality there would be one entry in the spatial_data_set table and a corresponding one in the
spatial_data_set_version
table with version == 1.--
There would be a
data_set
entry with:component_id == id of component "Population of Chelsea, MA, USA"
description == "Population over time, 1 year resolution, beginning in 1950"
valid_at == null
(temporal_resolution_s) == 31557600
space == (Chelsea, MA, USA).id
There would be another
data_set
entry with:component_id == id of component "Population of USA by municipality in 2021"
description == "Population of USA by municipality in 2021"
valid_at == 2021-01-01
space == null
Q) what happens when next year, they want to add the next round of data?
A) This is where the
data_set_group
comes in. You'd probably make a component for "Population of USA by municipality (over time)" and then use that as a label on all the data_sets where it applied. (to be developed further)--
There would also be two corresponding
data_set_version
entries for both of thedata_set
entries above.--
Then there would be entries in
data
, for the "Population of Chelsea, MA, USA" (over time), each would be like:data_set_version_id == (value from data_set_version entry).id
data_set_version == (value from data_set_version entry).version
valid_at == some date
(valid_from) == null
(valid_to) == null
space == null // takes value from joining to data_set, alternatively could denormalise but then how to keep up to date?
value == 1234
value_str == null
--
The
data
entries for "Population of USA by municipality in 2021" would be like:data_set_version_id == (value from above)
data_set_version == (value from above)
valid_at == null // again could denormalise here or get through a join
valid_from == null
valid_to == null
space == the municipality id
value == 1234
value_str == null
Beta Was this translation helpful? Give feedback.
All reactions