
36 add garden size #41

Merged: 46 commits into dev on Aug 15, 2024

Conversation

@crispy-wonton (Collaborator) commented on Jul 17, 2024

Fixes #36 and #40

Description

Estimate garden size / outdoor area for each EPC property where possible, using INSPIRE land extent files and Microsoft building footprint files. The script will save a separate file of all EPC properties with available garden size estimates, which can then be joined to the full EPC dataset on UPRN.

Steps:

  1. Match building footprint and land extent files
  2. Find intersections of building footprint and land extent polygons
  3. Calculate garden size (m2) (see the sketch after this list)
  4. Join garden size estimates to EPC data
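
For orientation, here is a minimal sketch of steps 2 and 3, assuming both layers are already in EPSG:27700 and using an illustrative land_parcel_id column; the actual implementation lives in the files listed under "Summary of changes" below.

import geopandas as gpd

def estimate_garden_size(land_parcels: gpd.GeoDataFrame, buildings: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # One row per land-parcel / building intersection polygon.
    intersections = gpd.overlay(land_parcels, buildings, how="intersection", keep_geom_type=False)
    intersections["building_intersection_area_m2"] = intersections.geometry.area

    # Total built-up area per parcel, subtracted from the parcel area to
    # estimate outdoor space / garden size in m2.
    built = intersections.groupby("land_parcel_id")["building_intersection_area_m2"].sum()
    out = land_parcels.set_index("land_parcel_id")
    out["garden_size_m2"] = out.geometry.area - built.reindex(out.index).fillna(0)
    return out.reset_index()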

Summary of changes:

Key files with geospatial manipulations:

  • Add asf_heat_pump_suitability/pipeline/prepare_features/building_footprint.py containing functions to prepare Microsoft building footprint geodataframes for spatial joins
  • Add asf_heat_pump_suitability/pipeline/prepare_features/land_extent.py containing functions to prepare INSPIRE land extent geodataframes for spatial joins
  • Add asf_heat_pump_suitability/pipeline/prepare_features/garden_size.py containing functions to find intersection of land polygons and building polygons to calculate outdoor space / garden size
  • Add asf_heat_pump_suitability/pipeline/run_scripts/calculate_garden_size.py containing run script to estimate garden size per property

Additional files:

  • Add save_utils.py and geo_utils.py with generic util functions
  • Add data sources to config/README.md
  • Update asf_heat_pump_suitability/getters/base_getters.py and asf_heat_pump_suitability/getters/get_datasets.py with new geo-getters to load geospatial data
  • Add new function to asf_heat_pump_suitability/pipeline/prepare_features/lat_lon.py to generate point geometries from x, y coordinates (see the sketch after this list)
  • Update config/base.yaml with new data sources
  • Update schemas.py with new schemas
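
On the lat_lon.py point above, a minimal sketch of generating point geometries from x, y coordinates with geopandas; the column names and values are illustrative, not the pipeline's actual schema.

import geopandas as gpd
import pandas as pd

# Illustrative easting/northing columns in British National Grid (EPSG:27700).
df = pd.DataFrame({"uprn": [1, 2], "x": [530000.0, 531500.0], "y": [180000.0, 182000.0]})
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["x"], df["y"]), crs="EPSG:27700")
print(gdf)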

Instructions for Reviewer

In order to test the code in this PR you need to run:
python -i asf_heat_pump_suitability/pipeline/run_scripts/calculate_garden_size.py --use_mapping s3://asf-heat-pump-suitability/source_data/2023_land_parcels_with_file_polygons_sample_N3.geojson --save_epc_gardens s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_garden_size_estimates_N3.parquet --epc_path s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_and_features.parquet

This will run the script on a sample of 3 land extent to building footprint file matches, rather than running through all the matches, which takes 16+ hours. You can then check that the outputs are as expected in your terminal in interactive Python.

Please could you:

  • pay special attention to all geospatial joins and any coordinate conversions, to ensure they are done correctly and achieve the expected result. The coordinate conversions and spatial joins were the parts I found most difficult.
  • let me know if there are better / more efficient and, especially, less error-prone ways to do any of the geospatial (or other) manipulations. I think there might be built-in functionality in geopandas that I could use for some of the conversions, but I couldn't find it - especially for converting bounds to a polygon? (see the sketch after this list)
  • feel free to respond to comments marked #TODO if you have ideas/answers :)
  • check docstrings are clear and correct, especially with respect to coordinate reference systems
  • check for general mistakes / bugs
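
On the bounds-to-polygon question in the second bullet: shapely has a built-in for this (shapely.box()), which is what later commits in this PR switch to. A minimal sketch, with an illustrative file path and assuming the data is in a projected CRS:

import geopandas as gpd
import shapely

gdf = gpd.read_file("land_extent_file.gml")  # illustrative path
xmin, ymin, xmax, ymax = gdf.total_bounds    # bounds of the whole file
bounding_polygon = shapely.box(xmin, ymin, xmax, ymax)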

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

- add generic geodata loaders for geojson on s3
- add specific getters to load LAD polygons
- update config with ONS LAD bounds file and CRS
- update requirements with geopandas
- add functions to get and transform building footprint gdf and MS datalinks df
- add functions to match building footprint with land parcel and generate garden size
- add base geo getters to read geo files
- add specific getters to load MS building footprints and INSPIRE land parcels
- add MS and INSPIRE sources to config
- add MS building footprint file schema
- add functions to get bounding polygon/bbox of INSPIRE files
- add functions to transform INSPIRE gdfs
- build run script to generate estimated garden size for each property in EPC
- correct filename of Microsoft source in README and base config
- correct docstring in garden_size.py

@crispy-wonton (Collaborator, Author):

To review: click three dots in right hand corner of file and select 'view file'

- remove unused params from gpd.read_file() with pyogrio engine in get_datasets.py
…ner' instead of 'left' join

- update spatial join in calculate_garden_size.py to join EPC with gardens instead of land_parcels
gdf["building_intersection_area_m2"]
>= gdf["min_size_of_large_building"]
)
)


I'm not 100% confident in this logic as I find it quite confusing to make sense of.

For an intersection between a building polygon and a land parcel to count as valid, one of the following conditions must be met:

  • the candidate intersection is less than or equal to the maximum outbuilding size and at least 45% of the original building size; or
  • the candidate intersection is larger than the maximum outbuilding size and at least 5% of the original building size.

I think this makes sense, but it'd be worth revisiting once you've adjusted the building footprint coordinate conversions, to make sure it's still performing as you expect. A sketch of the rule is below.
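
A sketch of that rule as a boolean mask, with hypothetical column names and an assumed max_outbuilding_m2 threshold; the 45% and 5% shares come from the description above, not from the code itself.

import geopandas as gpd
import pandas as pd

def is_valid_intersection(gdf: gpd.GeoDataFrame, max_outbuilding_m2: float) -> pd.Series:
    # Share of the original building footprint covered by the candidate intersection.
    share = gdf["building_intersection_area_m2"] / gdf["building_area_m2"]
    small = (gdf["building_intersection_area_m2"] <= max_outbuilding_m2) & (share >= 0.45)
    large = (gdf["building_intersection_area_m2"] > max_outbuilding_m2) & (share >= 0.05)
    return small | large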

@danlewis85 left a comment:

This is an impressive PR with a lot of work in it. My recommendations are as follows:

  1. Replace any ONS/OS geojsons (e.g. local authority areas) with geopackages (or another format of your choice). This will ensure you get authoritative data in British National Grid format. The geojson standard requires data to be in WGS84, hence they're best avoided for this use case.
  2. Update code to use convertbng instead of pyproj for converting building polygons to BNG, to ensure greater accuracy.
  3. Update code for faster loading of building footprints as per mine and @sqr00t's recommendations (GDAL drivers, virtual file systems, Arrow and OpenMP); a sketch of points 2 and 3 follows this list.
    • Unfortunately this won't apply to loading land registry gml files, so they can stay as is.
  4. Revise the generate_gdf_map_file_to_bounds() function as suggested to better fill NAs.
    • The working approach I've outlined uses a sample of plot centroids to find the modal boundary intersected; this should be robust to edge cases, unlike earlier attempts.
  5. Fix bug in garden_size.match_dict_files_land_building() where casting to a dictionary loses information.
  6. Having made changes, check logic for building-plot intersections still makes sense.
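
A minimal sketch of points 2 and 3, assuming footprints arrive in WGS84 (EPSG:4326) and assuming convertbng's convert_bng(lons, lats) helper; the file path and the single-polygon handling are illustrative only.

import geopandas as gpd
import shapely
from convertbng.util import convert_bng

# Point 3: faster reads via the pyogrio engine with Arrow enabled.
footprints = gpd.read_file("buildings_part.geojson", engine="pyogrio", use_arrow=True)

# Point 2: convert one polygon's exterior ring to BNG with convertbng
# (OSTN15-corrected) rather than a plain pyproj transform.
lons, lats = zip(*footprints.geometry.iloc[0].exterior.coords)
eastings, northings = convert_bng(list(lons), list(lats))
bng_polygon = shapely.Polygon(zip(eastings, northings))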

There are a few additional suggestions in the comments that may also be worth your attention, but they aren't super important.

Happy to re-review a revised version of this PR.

- remove conversion to dict which reduces file matches and keep as series
- use sjoin instead of overlay to match files for increased efficiency
- update calculate_garden_size.py run script to use updated file-matching function
…file

instead of geojson so that polygons are already in BNG.
update source of LAD bounds in config file to use shapefile.
increase loading speed of Microsoft building footprint files by updating the file reading method to read files directly in geopandas.
…ng package

- new function to transform coordinates added to building_footprint.py
- convert footprints from WGS84 to BNG with convertbng instead of pyproj for increased accuracy with OSTN15
- add convertbng to requirements.txt
…se file loading speeds in geospatial pipeline
…bounding box

- additionally update quadkey conversion to polygon with pyproj to use shapely.box() and remove excess lines
- clean geo_utils to use shapely.box() for getting polygons from bounds
… nulls with full council boundary instead of with file boundary
- update help string in argparser in calculate_garden_size.py and remove TODO note
…ep_geom_type to False

This is updated due to 1 geometry being dropped in N3 sample when keep_geom_type=True
Comment on lines 142 to 143
# TODO: conversion back to polygons is rate-limiting step
s = coords.groupby(coords.index).apply(lambda l: shapely.Polygon(zip(l.x, l.y)))

@crispy-wonton (Collaborator, Author):

any ideas on speeding this up?

@danlewis85 commented on Aug 14, 2024:

import numpy and replace lines 143 and 144 with:

splits = numpy.split(coords.to_numpy(), coords.index.value_counts().sort_index().to_numpy().cumsum())[:-1]
return gpd.GeoSeries([shapely.Polygon(pnts) for pnts in splits]).set_crs(epsg=27700)

Basically, the optimisation I hit on was splitting the x,y coords by their index, rather than using the expensive groupby.

To get the split indices: coords.index is the index from get_coordinates(); .value_counts() gives the number of vertices per building shape; .sort_index() returns it to index order (rather than ordered by count); .to_numpy() gets the numpy array representation; and .cumsum() produces the cumulative sum of the geometry lengths, which are the points at which to split the list of coords to recreate the individual building polygons. Finally, you need the [:-1] slice, as otherwise you get a spare empty array on the end.
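
For reference, a self-contained version of that suggestion with toy polygons (the real input would be the building footprints GeoSeries; assumes geopandas >= 0.13 for GeoSeries.get_coordinates()):

import geopandas as gpd
import numpy as np
import shapely

buildings = gpd.GeoSeries(
    [
        shapely.Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        shapely.Polygon([(20, 20), (30, 20), (25, 30)]),
    ],
    crs="EPSG:27700",
)

# One row per vertex, indexed by the geometry it came from.
coords = buildings.get_coordinates()

# Cumulative vertex counts mark where each polygon's coordinates end.
split_points = coords.index.value_counts().sort_index().to_numpy().cumsum()

# Split the flat (N, 2) array back into one block per polygon; np.split
# leaves a trailing empty block, which the [:-1] slice drops.
splits = np.split(coords.to_numpy(), split_points)[:-1]

rebuilt = gpd.GeoSeries([shapely.Polygon(pnts) for pnts in splits]).set_crs(epsg=27700)
print(rebuilt)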

(
    polygons
    if (polygons := convertbng_quadkey_to_polygon(qk))
    else convertpyproj_quadkey_to_polygon(qk, transformer)
)

@danlewis85:

Love this use of the walrus operator.

@crispy-wonton (Collaborator, Author):

heheh I knew you would :D

danlewis85 previously approved these changes on Aug 14, 2024

@danlewis85 left a comment:

This looks good. I've tested and it's working as expected.

Have suggested a simple speed up for the coordinate conversion.

Happy for you to push the speed-up and the couple of cosmetic things we discussed (wrapping warnings, and a retry for loading from S3), and then for you to merge.

This one's done!
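
On the retry-for-S3-loading point, a minimal sketch of one way to wrap a loader with retries; the loader function and path are placeholders, not the project's actual getters.

import time

def read_with_retries(read_fn, path, attempts=3, wait_seconds=5):
    """Call read_fn(path), retrying on failure with a fixed wait between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return read_fn(path)
        except Exception as exc:  # in practice, catch the specific S3/IO error
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

# Usage (illustrative):
# gdf = read_with_retries(gpd.read_file, "s3://bucket/path/to/file.geojson")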

… numpy conversion to increase speed of returning coordinates into polygons
@crispy-wonton merged commit 14e82c8 into dev on Aug 15, 2024
Successfully merging this pull request may close these issues: Get file boundaries for all land parcel files