Design
The Community object is the central data structure of the library. It needs to:
- provide a relatively efficient space-time data structure
- be relatively transparent, meaning that it's easy to decompose and pull out the constituent parts (like the geodataframe of neighborhood boundaries for each time period)
- a community is a concept
- it is defined by the particular boundaries that make up the Community at any given time
This would probably be an ideal use of the new dataclass, but that would mean we target only Python >=3.7 (there's a 3.6 backport)
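As a rough illustration of the dataclass idea, the sketch below shows what a minimal container might look like. The field names (`boundaries`, `attributes`, `harmonized`) and the `time_periods` helper are hypothetical and not part of any existing geosnap API, and plain strings stand in for the (geo)dataframes so the example runs without geopandas.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Community:
    """Hypothetical container; all field names are illustrative assumptions."""

    # {time_period: boundaries}; a real version would hold GeoDataFrames
    boundaries: Dict[int, Any] = field(default_factory=dict)
    # {time_period: attribute table}; a real version would hold DataFrames
    attributes: Dict[int, Any] = field(default_factory=dict)
    # True when every time period shares one set of boundaries
    harmonized: bool = False

    def time_periods(self) -> List[int]:
        """Return the sorted time periods that have attribute data."""
        return sorted(self.attributes)


# Placeholder strings stand in for real frames.
sandiego = Community(
    boundaries={1990: "tracts_1990", 2000: "tracts_2000"},
    attributes={1990: "attrs_1990", 2000: "attrs_2000"},
)
print(sandiego.time_periods())
```

A dataclass keeps the object transparent: every constituent part is a plain, named field that can be pulled out directly.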
For a given Community we need:
- dataframe(s) of neighborhood-level attributes for >=1 time period
- geodataframe(s) of neighborhood boundary(s) for >=1 time period
- For harmonized data, we need only a single set of neighborhood boundaries
- For unharmonized data we need boundaries for each time period
- we may also need new, original boundaries so we need a system to register new ones and keep track of which is "active" using a property
- another property that stores the method of geometry harmonization
- e.g. a gdf with index as rows and different geoms as cols
- maybe store the full intersection gdf somewhere if it gets calculated, because it's expensive? - this is a can of worms. not doing this yet.
- we need to support both, conversion between the two, and an attribute that knows which condition is true
- (is this all that's absolutely essential?)
- if so, maybe the solution is just a clever multi index?
- right now we also store state and county boundaries because why not? we use them for plotting
- but if that's the only reason, we could add a utility inside the plot() function that gets extra geoms as needed
- do we want to provide a standard location for known aux data we might want to pass to harmonize functions (maybe not attached at init, but added later with a method)? Don't want to bloat the object...
- osm
- nlcd clip
- no... this is overloading the class. we can attach this stuff during the harmonize process
- do we repeat geometries in the harmonized case?
- drawback: storage/mem inefficient
- bonus: single API for both the harmonized and unharmonized case
- bonus: maintains 1:1 relationship between attributes and geoms (less chance for error on join)
- bonus: geometry/geospatial queries are much easier
- bonus: easy to slice/decompose
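To make the multi-index idea and the repeated-geometry tradeoff concrete, here is a small sketch using pandas only. The column names, tract IDs, population values, and WKT strings are made up for illustration; in practice the geometry column would live in a GeoDataFrame rather than being stored as text.

```python
import pandas as pd

# Hypothetical long-form table: one row per (neighborhood, time period).
# In the harmonized case the geometry is simply repeated across years.
records = [
    ("06073000100", 1990, 1200, "POLYGON ((0 0, 1 0, 1 1, 0 0))"),
    ("06073000100", 2000, 1350, "POLYGON ((0 0, 1 0, 1 1, 0 0))"),
    ("06073000200", 1990, 800, "POLYGON ((1 0, 2 0, 2 1, 1 0))"),
    ("06073000200", 2000, 950, "POLYGON ((1 0, 2 0, 2 1, 1 0))"),
]
long_form = pd.DataFrame(
    records, columns=["geoid", "year", "population", "geometry"]
).set_index(["geoid", "year"])

# Slicing out one time period keeps attributes and geometry together.
tracts_1990 = long_form.xs(1990, level="year")

# Harmonized data should have exactly one distinct geometry per neighborhood.
is_harmonized = bool(
    (long_form.groupby(level="geoid")["geometry"].nunique() == 1).all()
)
print(tracts_1990)
print(is_harmonized)
```

Repeating the geometry costs memory, but the same slicing code works whether or not the boundaries change between periods.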
geosnap.data.import_ltdb(full_path, sample_path)
- this method reads ltdb data and stores it in a local quilt database
geosnap.data.geolytics_to_quilt(full_path, sample_path)
- function that needs to be called first if the data are not present in the local quilt db (hopefully this will remove some of the current ltdb confusion)
If we instantiate a Community from certain datasets, the schema is implied (e.g. similar to the way we use source='ltdb' now, but we can simplify and abstract the Community signature while using a @classmethod to encapsulate the ltdb-specific logic)
- from_ltdb(filter): instantiate from ltdb; prompts for a separate import call if the data are not yet in the local quilt db
- from_geolytics(filter): same as above
- from_census(years, filter): assumes unharmonized. user specifies time periods, data come from quilt
- from_lehd(dataset, years, filter): harmonized or unharmonized depending on selected time periods
- from_gdfs(gdfs, harmonized=False): a dict of geodataframes (unharmonized), or a gdf and a dict of {time_period: df} (harmonized)?
- from_parquet(path): a previously saved Community?
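The alternate-constructor pattern described above might look roughly like the following. This is only a sketch: the `__init__` signature and the body of `from_gdfs` are assumptions, and placeholder strings stand in for geodataframes. Dataset-specific constructors such as from_ltdb would follow the same shape, doing their own loading before calling `cls(...)`.

```python
class Community:
    """Sketch of a Community with a dataset-agnostic signature."""

    def __init__(self, gdfs, harmonized=False):
        # gdfs: {time_period: geodataframe-like object} (assumed layout)
        self.gdfs = dict(gdfs)
        self.harmonized = harmonized

    @classmethod
    def from_gdfs(cls, gdfs, harmonized=False):
        """Build a Community from user-supplied frames keyed by time period."""
        if not gdfs:
            raise ValueError("at least one time period is required")
        return cls(gdfs, harmonized=harmonized)


# Placeholder strings stand in for real GeoDataFrames.
community = Community.from_gdfs({1990: "gdf_1990", 2000: "gdf_2000"})
print(sorted(community.gdfs), community.harmonized)
```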
sandiego = GeoDataFrame()
sandiego_community = Community()
sandiego_community.geodataframe
- harmonized: @property (bool)
- harmonization: @property (str), defaults to None (note that harmonization and harmonized are connected)
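One way to keep the two properties connected is to derive `harmonized` from `harmonization`, so they can never disagree. The sketch below assumes that design; the private attribute name and the example method string are hypothetical.

```python
from typing import Optional


class Community:
    """Sketch showing only the two linked properties."""

    def __init__(self, harmonization: Optional[str] = None):
        # Name of the geometry harmonization method, or None if unharmonized
        self._harmonization = harmonization

    @property
    def harmonization(self) -> Optional[str]:
        return self._harmonization

    @property
    def harmonized(self) -> bool:
        # Derived, so it cannot drift out of sync with `harmonization`
        return self._harmonization is not None


raw = Community()
interpolated = Community(harmonization="area_weighted")  # hypothetical method name
print(raw.harmonized, interpolated.harmonized)
```

This derivation would not cover sources that arrive already harmonized with no recorded method, which may be a reason to store the two values separately instead.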
sandiego_community.geodataframe[1990]
- census
- will be called geodataframe
- currently this is a single long-form attribute table
- should maybe be a multiindex or dict like {time_periods: DataFrame}?
- do all possible community data come from a census of some kind?
- tracts, counties, states: currently hardcoded, should be abstracted and may need one for each time period
- states and counties go, and only source/target geoms get stored
- function/method to attach MSA name/ID to Community.geodataframe
- plot()
- plot_interactive()?
- to_parquet(path): store for later use
- to_crs(): convenience for reprojecting all geoms at once