-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
36 add garden size #41
Conversation
- add generic geodata loaders for geojson on s3 - add specific getters to load LAD polygons - update config with ONS LAD bounds file and CRS - update requirements with geopandas
- add functions to get and tranform building footprint gdf and MS datalinks df
- add functions to match building footprint with land parcel and generate garden size
- add base geo getters to read geo files - add specific getters to load MS building footprints and INSPIRE land parcels - add MS and INSPIRE sources to config - add MS building footprint file schema
- add functions to get bounding polygon/bbox of INSPIRE files - add functions to transform INSPIRE gdfs
- build run script to generate estimated garden size for each property in EPC
- add epc_path arg to script
…ng_footprint.py, garden_size.py
…ounds function to geo_utils.py
… to estimate garden size per EPC property
…ilding_footprint.py
- correct filename of Microsoft source in README and base config - correct docstring in garden_size.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To review: click three dots in right hand corner of file and select 'view file'
- remove unused params from gpd.read_file() with pyogrio engine in get_datasets.py
…in convert_bounds_to_points()
…ner' instead of 'left' join - update spatial join in calculate_garden_size.py to join EPC with gardens instead of land_parcels
asf_heat_pump_suitability/pipeline/run_scripts/calculate_garden_size.py
Outdated
Show resolved
Hide resolved
asf_heat_pump_suitability/pipeline/prepare_features/garden_size.py
Outdated
Show resolved
Hide resolved
gdf["building_intersection_area_m2"] | ||
>= gdf["min_size_of_large_building"] | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% confident in this logic as I find it quite confusing to make sense of.
To be included as a valid intersection between a building polygon and land parcel, the following conditions must be met, either:
The candidate intersection is less than or equal to the maximum out building size and greater than or equal to 45% of the original building size.
or,
The candidate intersection is larger than the maximum size of an outbuilding and greater than or equal to 5% of the original building size.
I think this makes sense, but I think it'd be worthwhile revisiting once you've adjusted the building footprint coordinate conversions to make sure it's still performing as you expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an impressive PR with a lot of work in it. My recommendations are as follows:
- Replace any ONS/OS geojsons (e.g. local authority areas) with geopackages (or another format of your choice). This will ensure you get authoritative data in British National Grid format. The geojson standard requires data to be in WGS84, hence why they're best avoided for this use case.
- Update code to use
convertbng
instead ofpyproj
for converting building polygons to BNG to ensure greater accuracy. - Update code for faster loading of building footprints as per mine and @sqr00t recommendations (gdal drivers, virtual files systems, arrow and openmp).
- Unfortunately this won't apply to loading land registry gml files, so they can stay as is.
- Revise the
generate_gdf_map_file_to_bounds()
function as suggested to better fill nas.- the working approach I've outlined uses a sample of plot centroids to find the modal boundary intersected. This should be robust to edge cases unlike earlier attempts.
- Fix bug in
garden_size.match_dict_files_land_building()
where casting to a dictionary loses information. - Having made changes, check logic for building-plot intersections still makes sense.
There are a few additional suggests in the comments that may also be worth your attention, but aren't super important.
Happy to re-review a revised version of this PR.
- remove conversion to dict which reduces file matches and keep as series - use sjoin instead of overlay to match files for increased efficiency - update calculate_garden_size.py run script to use updated file-matching function
…file instead of geojson so that polygons are already in BNG. update source of LAD bounds in config file to use shapefile. increase loading speed of microsoft building footprint files by updating file reading method to read files directly in geopandas.
…ng package - new function to transform coordinates added to building_footprint.py - convert footprints from WGS84 to BNG with convertbng instead of pyproj for increased accuracy with OSTN15 - add convertbng to requirements.txt
…se file loading speeds in geospatial pipeline
…bounding box - additionally update quadkey conversion to polygon with pyproj to use shapely.box() and remove excess lines - clean geo_utils to use shapely.box() for getting polygons from bounds
… nulls with full council boundary instead of with file boundary
- update help string in argparser in calculate_garden_size.py and remove TODO note
…ep_geom_type to False This is updated due to 1 geometry being dropped in N3 sample when keep_geom_type=True
# TODO: conversion back to polygons is rate-limiting step | ||
s = coords.groupby(coords.index).apply(lambda l: shapely.Polygon(zip(l.x, l.y))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any ideas on speeding this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import numpy
and replace lines 143 and 144 with:
splits = numpy.split(coords.to_numpy(), coords.index.value_counts().sort_index().to_numpy().cumsum())[:-1]
return gpd.GeoSeries([shapely.Polygon(pnts) for pnts in splits]).set_crs(epsg=27700)
Basically, the optimisation I hit on was splitting the x,y coords by their index, rather than using the expensive groupby.
To get the split indices: coords.index
is the index from get_coordinates()
, .value_counts()
is the number of vertices per building shape, .sort_index()
returns it to index order (rather than ascending on count) .to_numpy()
gets the numpy array representation and .cumsum()
produces the cumulative sum of the geometry lengths which are the points at which to split the list of coords to recreate the individual building polygons. Finally, you need to index to -1
as otherwise you get a spare empty array on the end.
…ecorator to land extent file loader
… land_extent.py and calculate_garden_size.py
( | ||
polygons | ||
if (polygons := convertbng_quadkey_to_polygon(qk)) | ||
else convertpyproj_quadkey_to_polygon(qk, transformer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this use of the walrus operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heheh i knew you would :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I've tested and it's working as expected.
Have suggested a simple speed up for the coordinate conversion.
Happy for you to push the speed up and a couple of cosmetic things we discussed (wrapping warnings and retry for loading from s3). Then for you to merge.
This one's done!
… numpy conversion to increase speed of returning coordinates into polygons
Fixes #36 and #40
Description
Estimate garden size / outdoor area for each EPC property where possible using INSPIRE land extent files and Microsoft building footprint files. The script will save a separate file of all EPC properties with available garden size estimates which can then be joined to the full EPC on UPRN.
Steps:
Summary of changes:
Key files with geospatial manipulations:
asf_heat_pump_suitability/pipeline/prepare_features/building_footprint.py
containing functions to prepare Microsoft building footprint geodataframes for spatial joinsasf_heat_pump_suitability/pipeline/prepare_features/land_extent.py
containing functions to prepare INSPIRE land extent geodataframes for spatial joinsasf_heat_pump_suitability/pipeline/prepare_features/garden_size.py
containing functions to find intersection of land polygons and building polygons to calculate outdoor space / garden sizeasf_heat_pump_suitability/pipeline/run_scripts/calculate_garden_size.py
containing run script to estimate garden size per propertyAdditional files:
save_utils.py
andgeo_utils.py
with generic util functionsconfig/README.md
asf_heat_pump_suitability/getters/base_getters.py
andasf_heat_pump_suitability/getters/get_datasets.py
with new geo-getters to load geospatial dataasf_heat_pump_suitability/pipeline/prepare_features/lat_lon.py
to generate point geometries from x, y coordinatesconfig/base.yaml
with new data sourcesschemas.py
with new schemasInstructions for Reviewer
In order to test the code in this PR you need to run:
python -i asf_heat_pump_suitability/pipeline/run_scripts/calculate_garden_size.py --use_mapping s3://asf-heat-pump-suitability/source_data/2023_land_parcels_with_file_polygons_sample_N3.geojson --save_epc_gardens s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_garden_size_estimates_N3.parquet --epc_path s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_and_features.parquet
This will run the script on a sample of 3 land extent : building footprint file matches, rather than running through all the matches as this takes 16+ hours. You can then check the outputs are as expected in your terminal in interactive python.
Please could you:
Checklist:
notebooks/
pre-commit
and addressed any issues not automatically fixeddev
README
s