
48 add listed building feature #49

Merged
3 commits merged on Aug 8, 2024

Conversation

helloaidank
Collaborator


Description

This PR adds listed buildings as a feature to the EPC dataset. It requires a spatial join against a polygon dataset, which takes a relatively long time.

Fixes #48

Instructions for Reviewer

I've made changes to / added the following files. Please take a close look at listed_buildings.py in particular, as this is where the spatial join (the blocker) is done:

  • pipeline/run_scripts/add_features.py
  • pipeline/prepare_features/listed_buildings.py
  • getters/get_datasets.py
  • getters/base_getters.py
  • config/base.yaml
  • requirements.txt

To test the code in this PR, run the following (it will take a few minutes):

python pipeline/run_scripts/add_features.py --epc_path "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_sample_10k.parquet" --save_output "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_listed_buildings_sample_test_10k.parquet"

[Optional] To test it over the whole EPC dataset (this takes 36 minutes; I ran into memory issues, so I closed all other applications), run:

python pipeline/run_scripts/add_features.py --epc_path "s3://asf-daps/lakehouse/processed/epc/deduplicated/processed_dedupl-0.parquet" --save_output "s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights_listed_buildings_full_sample.parquet"

Please pay special attention to any bugs, sense-test the functions, and share any tips on how we could optimise the code. Chunking is potentially an option; I'm not sure there is a way around how long the geopandas sjoin operation takes.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@helloaidank helloaidank marked this pull request as ready for review July 25, 2024 17:41
@helloaidank helloaidank changed the base branch from dev to 45_add_off_gas_feature July 26, 2024 08:59
Collaborator

@lizgzil lizgzil left a comment


Great! 🎉 I gave a suggestion to speed up the joining significantly (chunking)!

@@ -12,6 +12,7 @@ data_source:
GB_osopen_uprn_latlon: "s3://asf-heat-pump-suitability/source_data/osopenuprn_202405_csv.zip"
EW_census_accommodation_type: "s3://asf-heat-pump-suitability/source_data/2021census_Mar2023_accommodation_type_E_W.csv"
UK_spa_offgasgrid: "s3://asf-heat-pump-suitability/source_data/2024_vMar2024_SPA_offgaspostcode_UK.xlsx"
E_historicengland_listed_buildings: "s3://asf-heat-pump-suitability/source_data/Jun2024_vJul2024_HistoricEngland_listedbuilding_E.geojson"
Collaborator


again, do add details about this to the config/README.md

Collaborator Author


I've added them into the config/README.md in the off-gas PR, so that should do it 👯

# Add feature: lat/long
logging.info("Adding lat/lon data to EPC")
uprn_latlon_df = lat_lon.transform_df_osopen_uprn_latlon()
epc_latlon_df = epc_df.select(["UPRN"])

# Join enhanced datasets together
enhanced_epc_df = enhanced_epc_df.join(uprn_latlon_df, how="left", on="UPRN")
enhanced_epc_df = epc_df.join(uprn_latlon_df, how="left", on="UPRN")
Collaborator


Is this line meant to be here? The only difference I can see is that enhanced_epc_df will no longer contain the msoa_avg_outdoor_space_m2 column?

Collaborator Author


Yeah, the line is redundant. I think I may have been commenting lines in and out during testing, and this is an artefact. Good spot!
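For what it's worth, the effect of the redundant reassignment can be sketched with a toy pandas example (the pipeline itself uses polars' .join; the data here are made up, with column names borrowed from the diff above):

```python
import pandas as pd

epc_df = pd.DataFrame({"UPRN": [1, 2]})
weights_df = pd.DataFrame({"UPRN": [1, 2], "msoa_avg_outdoor_space_m2": [10.0, 20.0]})
uprn_latlon_df = pd.DataFrame({"UPRN": [1, 2], "lat": [51.5, 52.0]})

# First join attaches the outdoor-space column...
enhanced_epc_df = epc_df.merge(weights_df, how="left", on="UPRN")
# ...and the intended second join keeps it while adding lat/lon
enhanced_epc_df = enhanced_epc_df.merge(uprn_latlon_df, how="left", on="UPRN")

# The redundant line instead rejoins from the raw epc_df,
# silently dropping msoa_avg_outdoor_space_m2
overwritten = epc_df.merge(uprn_latlon_df, how="left", on="UPRN")
```

So keeping both lines means the second one wins and the earlier enrichment is lost.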

Comment on lines +46 to +48
def spatial_join_epc_with_listed_buildings(
enhanced_epc_df: pl.DataFrame, listed_buildings_df: gpd.GeoDataFrame
) -> gpd.GeoDataFrame:
Collaborator


You can stop the memory issues and speed things up by joining the EPC data to the listed-building dataset in chunks. I modified this function to do this, and it took 3 minutes to process the 16 million properties in s3://asf-heat-pump-suitability/outputs/2023_Q2_EPC_enhanced_weights.parquet. A few tweaks were needed to what you had before (e.g. the inner join in gpd.sjoin):

import logging

import geopandas as gpd
import pandas as pd
import polars as pl
from tqdm import tqdm


def spatial_join_epc_with_listed_buildings(
    enhanced_epc_df: pl.DataFrame, listed_buildings_df: gpd.GeoDataFrame
) -> pl.DataFrame:
    """
    Spatial join EPC dataset with listed buildings dataset.

    Args:
        enhanced_epc_df (pl.DataFrame): Enhanced EPC dataset
        listed_buildings_df (gpd.GeoDataFrame): Filtered Historic England dataset with only listed buildings grade and geometry.

    Returns:
        pl.DataFrame: EPC dataset with listed buildings grade, without geometry.
    """
    # Add a unique index column for merging later
    enhanced_epc_df = enhanced_epc_df.with_row_index("index")

    chunk_size = 100000
    partitions = enhanced_epc_df.with_row_count("chunk_id").select(pl.col("chunk_id") // chunk_size)
    # Group on the derived chunk_id, yielding partitions of up to chunk_size rows
    data_partitioned = enhanced_epc_df.with_columns(partitions).partition_by("chunk_id")
    logging.info(f"Adding listed buildings to EPC in {len(data_partitioned)} chunks")
    for i, enhanced_epc_df_chunk in tqdm(enumerate(data_partitioned)):
        epc_gdf = transform_df_EPC_X_and_Y_to_point(enhanced_epc_df_chunk)
        epc_gdf_temp = epc_gdf[["geometry", "index"]].copy()
        joined_gdf_chunk = gpd.sjoin(
            epc_gdf_temp, listed_buildings_df, how="inner", predicate="intersects"
        )
        # Drop the geometry columns; only the listed-building attributes are needed
        joined_gdf_chunk = joined_gdf_chunk.drop(columns=["geometry", "index_right"])
        if i == 0:
            joined_gdf = joined_gdf_chunk
        else:
            joined_gdf = pd.concat([joined_gdf, joined_gdf_chunk])

    enhanced_epc_df = enhanced_epc_df.join(pl.from_pandas(joined_gdf), how="left", on="index")

    return enhanced_epc_df

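The chunk-then-concatenate pattern in the suggestion can be sketched without geopandas (a toy pandas stand-in, where the hypothetical `fake_sjoin` plays the role of the per-chunk `gpd.sjoin`, matching on a key instead of geometry; all names and data here are illustrative):

```python
import pandas as pd

def fake_sjoin(chunk: pd.DataFrame, listed: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for gpd.sjoin: an inner join on a shared key instead of geometry
    return chunk.merge(listed, how="inner", on="cell")

epc = pd.DataFrame({"index": range(10), "cell": [i % 4 for i in range(10)]})
listed = pd.DataFrame({"cell": [0, 2], "grade": ["II", "I"]})

# Split into chunks of 3 rows via integer division on the row index
chunk_size = 3
pieces = []
for _, chunk in epc.groupby(epc["index"] // chunk_size):
    pieces.append(fake_sjoin(chunk, listed))
joined = pd.concat(pieces)

# Left-join the matches back so unmatched EPC rows are kept with a null grade
result = epc.merge(joined[["index", "grade"]], how="left", on="index")
```

Because each chunk only ever holds `chunk_size` rows of point geometry at a time, peak memory stays bounded regardless of the full dataset's size.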
Collaborator


(note the output is now a polars df so you probably won't need to apply convert_gpd_to_polars anymore)

Collaborator Author


Did you know that the name tqdm stands for "taqaddum" in Arabic which can mean "progress"? 🚀

Collaborator Author


Thanks for this, works a charm!

Collaborator


haha, that makes far more sense! 😆 I'm sure I read somewhere once that it meant "thank you demasiado" (i.e. "thank you very much" in Spanish, for helping me see my progress), so I'd always thought it meant that, even though it makes very little sense!

@helloaidank helloaidank merged commit 67c1a9e into 45_add_off_gas_feature Aug 8, 2024