
48 add listed building feature #49

Merged
3 commits merged on Aug 8, 2024

Conversation

helloaidank
Collaborator


Description

This PR adds listed buildings as a feature to the EPC dataset. It requires a spatial join against a polygon dataset, which takes a relatively long time.

Fixes #48

Instructions for Reviewer

I've made changes to / added the following files. Please take a close look at listed_buildings.py in particular, as this is where the spatial join (the blocker) is done:

  • pipeline/run_scripts/add_features.py
  • pipeline/prepare_features/listed_buildings.py
  • getters/get_datasets.py
  • getters/base_getters.py
  • config/base.yaml
  • requirements.txt

To test the code in this PR, run the following (it will take a few minutes):

python pipeline/run_scripts/add_features.py --epc_path "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_sample_10k.parquet" --save_output "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_listed_buildings_sample_test_10k.parquet"

[Optional] To test it over the whole EPC dataset (this takes 36 minutes; I ran into memory issues, so I closed all other applications), run:

python pipeline/run_scripts/add_features.py --epc_path "s3://asf-daps/lakehouse/processed/epc/deduplicated/processed_dedupl-0.parquet" --save_output "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_listed_buildings_full_sample.parquet"

Please pay special attention to any bugs, sense-test the functions, and share any tips on how we could optimise the code. Chunking is potentially an option; I'm not sure there is a way around how long the geopandas sjoin operation takes.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@helloaidank helloaidank marked this pull request as ready for review July 25, 2024 17:41
@helloaidank helloaidank changed the base branch from dev to 45_add_off_gas_feature July 26, 2024 08:59
Collaborator

@lizgzil lizgzil left a comment


Great! 🎉 I gave a suggestion to speed up the joining significantly (chunking)!

@@ -12,6 +12,7 @@ data_source:
GB_osopen_uprn_latlon: "s3://asf-heat-pump-suitability/source_data/osopenuprn_202405_csv.zip"
EW_census_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_accommodation_type_E_W.csv"
UK_spa_offgasgrid: "s3://asf-heat-pump-suitability/source_data/2024_vMar2024_SPA_offgaspostcode_UK.xlsx"
E_historicengland_listed_buildings: "s3://asf-heat-pump-suitability/source_data/Jun2024_vJul2024_HistoricEngland_listedbuilding_E.geojson"
Collaborator


again, do add details about this to the config/README.md

Collaborator Author


I've added them into the config/README.md in the off-gas PR, so that should do it 👯

# Add feature: lat/long
logging.info("Adding lat/lon data to EPC")
uprn_latlon_df = lat_lon.transform_df_osopen_uprn_latlon()
epc_latlon_df = epc_df.select(["UPRN"])

# Join enhanced datasets together
enhanced_epc_df = enhanced_epc_df.join(uprn_latlon_df, how="left", on="UPRN")
enhanced_epc_df = epc_df.join(uprn_latlon_df, how="left", on="UPRN")
Collaborator


Is this line meant to be here? The only difference I can see is that enhanced_epc_df will no longer contain the msoa_avg_outdoor_space_m2 column?

Collaborator Author


Yeah, the line is redundant. I think I may have been commenting lines in and out during testing, and this is an artefact. Good spot!
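For what it's worth, the effect of the redundant reassignment can be sketched with a toy pandas example (the pipeline itself uses polars' .join; the data here are made up, with column names borrowed from the diff above):

```python
import pandas as pd

epc_df = pd.DataFrame({"UPRN": [1, 2]})
weights_df = pd.DataFrame({"UPRN": [1, 2], "msoa_avg_outdoor_space_m2": [10.0, 20.0]})
uprn_latlon_df = pd.DataFrame({"UPRN": [1, 2], "lat": [51.5, 52.0]})

# First join attaches the outdoor-space column...
enhanced_epc_df = epc_df.merge(weights_df, how="left", on="UPRN")
# ...and the intended second join keeps it while adding lat/lon
enhanced_epc_df = enhanced_epc_df.merge(uprn_latlon_df, how="left", on="UPRN")

# The redundant line instead rejoins from the raw epc_df,
# silently dropping msoa_avg_outdoor_space_m2
overwritten = epc_df.merge(uprn_latlon_df, how="left", on="UPRN")
```

So keeping both lines means the second one wins and the earlier enrichment is lost.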

Comment on lines +46 to +48
def spatial_join_epc_with_listed_buildings(
enhanced_epc_df: pl.DataFrame, listed_buildings_df: gpd.GeoDataFrame
) -> gpd.GeoDataFrame:
Collaborator


You can stop the memory issues and speed things up by joining the EPC data to the listed-building dataset in chunks. I modified this function to do this, and it took 3 minutes to process the 16 million properties in s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights.parquet. A few tweaks were needed to what you had before (e.g. the inner join in gpd.sjoin):

import logging

import geopandas as gpd
import pandas as pd
import polars as pl
from tqdm import tqdm


def spatial_join_epc_with_listed_buildings(
    enhanced_epc_df: pl.DataFrame, listed_buildings_df: gpd.GeoDataFrame
) -> pl.DataFrame:
    """
    Spatial join EPC dataset with listed buildings dataset.

    Args:
        enhanced_epc_df (pl.DataFrame): Enhanced EPC dataset
        listed_buildings_df (gpd.GeoDataFrame): Filtered Historic England dataset with only listed buildings grade and geometry.

    Returns:
        pl.DataFrame: EPC dataset with listed buildings grade, without geometry.
    """
    # Add a unique index column for merging later
    enhanced_epc_df = enhanced_epc_df.with_row_index("index")

    chunk_size = 100000
    partitions = enhanced_epc_df.with_row_count("chunk_id").select(pl.col("chunk_id") // chunk_size)
    # Group on the derived chunk_id, yielding partitions of up to chunk_size rows
    data_partitioned = enhanced_epc_df.with_columns(partitions).partition_by("chunk_id")
    logging.info(f"Adding listed buildings to EPC in {len(data_partitioned)} chunks")
    for i, enhanced_epc_df_chunk in tqdm(enumerate(data_partitioned)):
        epc_gdf = transform_df_EPC_X_and_Y_to_point(enhanced_epc_df_chunk)
        epc_gdf_temp = epc_gdf[["geometry", "index"]].copy()
        joined_gdf_chunk = gpd.sjoin(
            epc_gdf_temp, listed_buildings_df, how="inner", predicate="intersects"
        )
        # Drop the geometry columns; only the listed-building attributes are needed
        joined_gdf_chunk = joined_gdf_chunk.drop(columns=["geometry", "index_right"])
        if i == 0:
            joined_gdf = joined_gdf_chunk
        else:
            joined_gdf = pd.concat([joined_gdf, joined_gdf_chunk])

    enhanced_epc_df = enhanced_epc_df.join(pl.from_pandas(joined_gdf), how="left", on="index")

    return enhanced_epc_df

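The chunk-then-concatenate pattern in the suggestion can be sketched without geopandas (a toy pandas stand-in, where the hypothetical `fake_sjoin` plays the role of the per-chunk `gpd.sjoin`, matching on a key instead of geometry; all names and data here are illustrative):

```python
import pandas as pd

def fake_sjoin(chunk: pd.DataFrame, listed: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for gpd.sjoin: an inner join on a shared key instead of geometry
    return chunk.merge(listed, how="inner", on="cell")

epc = pd.DataFrame({"index": range(10), "cell": [i % 4 for i in range(10)]})
listed = pd.DataFrame({"cell": [0, 2], "grade": ["II", "I"]})

# Split into chunks of 3 rows via integer division on the row index
chunk_size = 3
pieces = []
for _, chunk in epc.groupby(epc["index"] // chunk_size):
    pieces.append(fake_sjoin(chunk, listed))
joined = pd.concat(pieces)

# Left-join the matches back so unmatched EPC rows are kept with a null grade
result = epc.merge(joined[["index", "grade"]], how="left", on="index")
```

Because each chunk only ever holds `chunk_size` rows of point geometry at a time, peak memory stays bounded regardless of the full dataset's size.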
Collaborator


(note the output is now a polars df so you probably won't need to apply convert_gpd_to_polars anymore)

Collaborator Author


Did you know that the name tqdm stands for "taqaddum" in Arabic which can mean "progress"? 🚀

Collaborator Author


Thanks for this, works a charm!

Collaborator


haha, that makes far more sense! 😆 I'm sure I read somewhere once that it meant "thank you demasiado" (i.e. "thank you very much" in Spanish, for helping me see my progress), so I'd always thought it meant that, even though it makes very little sense!

@helloaidank helloaidank merged commit 67c1a9e into 45_add_off_gas_feature Aug 8, 2024