AliciaXia222/Capstone-Team-Climate-Trace

🌎 Estimating Global Greenhouse Gas Emissions from Buildings

Table of Contents

  1. Abstract
  2. Introduction
    2.1 Project Motivation
    2.2 Problem Statement
    2.3 Goals
    2.4 Project Deliverables and Presentation Materials
  3. Background
  4. Data and Features Overview
    4.1 GHG Estimation
    4.2 Features for EUI Prediction
    4.3 Generated Data
  5. Methods
    5.1 Feature Engineering
    5.2 Nearest Reference Mapping
    5.3 Supervised Machine Learning
    5.4 Experimental Design
  6. Experiments
    6.1 Feature Importance
    6.2 Models
  7. Conclusion
  8. Repository Structure and Usage
  9. Resources
  10. Contributors

1. Abstract

This project develops a machine learning model to estimate direct greenhouse gas (GHG) emissions from residential and non-residential building energy consumption. The model predicts energy use intensity (EUI) by incorporating climatic, geographical, and socioeconomic variables for both residential and non-residential buildings. These EUI estimates, along with global building floor area, will be used in the next stage of this project to calculate direct GHG emissions from buildings, offering a timely, high-resolution method for global emissions estimation.

This work outlines preliminary EUI estimation techniques, with the primary focus on minimizing the Mean Absolute Percentage Error (MAPE), targeting a value below the 30-40% range. Other performance metrics, such as R², Mean Squared Error (MSE), Mean Absolute Error (MAE), and Weighted Absolute Percentage Error (WAPE), were also considered. Future iterations will refine and expand the model by incorporating additional features to enhance performance, ultimately addressing the challenge of estimating global direct GHG emissions from buildings.

Analysis showed that ensemble and tree-based models, including Random Forest, XGBoost, and CatBoost, outperformed traditional methods, with Random Forest achieving the best results overall and a MAPE averaging less than 21% across different cross-validation strategies. However, R² was generally low across all models. Additionally, results varied by region, with models performing better within-domain but struggling to generalize effectively in cross-domain scenarios.

2. Introduction

2.1 Project Motivation

Global warming is one of the most critical challenges of our time, and to address it effectively, we need more detailed information on where and when greenhouse gas emissions occur. This data is crucial for setting actionable emissions reduction goals and enabling policymakers to make informed decisions. Given this situation, Climate TRACE, a non-profit coalition of organizations, is building a timely, open, and accessible inventory of global emissions sources, currently covering around 83% of global emissions.

Building direct emissions are responsible for between 6% and 9% of global GHG emissions, primarily due to onsite fossil fuel combustion for heating, water heating, and cooking. Indirect emissions from lighting, consumer electronics, and air conditioning are excluded, as they are typically electric and accounted for separately in the Climate TRACE database.

Despite their significant contribution to global emissions, the building sector still lacks the timely, high-resolution, and low-latency data needed to assess GHG emissions accurately. Current methodologies rely on outdated data, often delayed by over a year, or on self-reported data that is scarce or unavailable globally.

2.2 Problem Statement

Specifically, we can define our problem statement as follows:

The building sector lacks timely, high-resolution data on direct greenhouse gas (GHG) emissions, limiting the ability to accurately track and reduce emissions from building energy use.

2.3 Goals

The goal of this project is to develop a machine learning model to estimate greenhouse gas (GHG) emissions based on building energy consumption. The model will predict energy use intensity (EUI) using climatic, geographical, and socioeconomic variables. These EUI estimates, along with building area data, will be used to calculate direct GHG building emissions.

In the first semester, the focus has been on developing the Energy Use Intensity (EUI) estimation technique, using globally available features to predict EUI. By selecting these key features, the goal has been to generate the first iteration of EUI predictions. The target for this stage is to achieve a Mean Absolute Percentage Error (MAPE) in the range of 30-40%. While this is the ideal range for this milestone, it is possible that we may not meet this target at this stage. Refining and improving this technique will be the focus for the second semester.

In the second semester, the objective will be to refine the model by incorporating additional features and enhancing its performance. The final goal is to enable global EUI prediction, providing a high-resolution, actionable method for estimating direct GHG emissions from building energy use.

2.4 Project Deliverables and Presentation Materials

This section provides an overview of the key deliverables and presentation materials developed throughout the project. These materials summarize the project's progress, next steps, and areas to explore in the upcoming semester, offering insights into the work completed and the outcomes achieved.

  1. The deliverables we've agreed to provide to our client this semester can be found here.

  2. For a visual summary of the project, check out the slide deck presentation here.

  3. The mid-point slide deck of analysis and results can be found here.

3. Background

The accurate estimation of anthropogenic CO2 emissions is critical for understanding global climate change and formulating effective policies. Existing estimates are provided by several key datasets, including the Open-source Data Inventory for Anthropogenic CO2 (ODIAC) (Oda, Maksyutov, & Andres, 2018), the Community Emissions Data System (CEDS) (McDuffie et al., 2020), the Emissions Database for Global Atmospheric Research (EDGAR) (Janssens-Maenhout et al., 2019), the Global Carbon Grid (Tong et al., 2018), and the Global Gridded Daily CO2 Emissions Dataset (GRACED) (Dou et al., 2022). While GRACED is updated nearly monthly, most of the other key datasets suffer from significant production latency, often of a year or more. Additionally, the highest resolution available across these datasets is 0.1 decimal degrees, roughly equivalent to an 11 km grid near the equator. Furthermore, only a few of these datasets provide detailed breakdowns of emissions by sector, such as residential and commercial subsectors, or offer separate estimates for different greenhouse gases (Markakis et al., 2023).

In response to these challenges, recent advancements have been made in the development of more granular and timely emissions estimation methods. One such breakthrough is the High-resolution Global Building Emissions Estimation using Satellite Imagery model by Markakis et al. (2023). This innovative model offers high-resolution, global emissions estimates for both residential and commercial buildings at a 1 km² resolution, with updates on a monthly basis. By leveraging satellite imagery-derived features and machine learning techniques, the model estimates direct emissions from buildings. This approach addresses the temporal and spatial limitations of previous datasets by predicting building areas, estimating energy use intensity, and calculating emissions based on regional fuel mixes. Unlike other datasets like GRACED and EDGAR, this model offers more granular insights into emissions at a higher frequency and resolution, making it a crucial tool for policymakers working to reduce emissions in the building sector on a global scale.

4. Data and Features Overview

4.1 GHG Estimation

To estimate greenhouse gas (GHG) emissions from buildings, we will use Energy Use Intensity (EUI) as a central metric. EUI measures the energy consumption per square meter of building space, making it a valuable indicator for emissions estimation. By combining EUI values with total building floor area and an emissions factor, we can calculate the GHG emissions associated with buildings.

The estimation formula is: GHG emissions = EUI × building floor area × emissions factor
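As a minimal sketch of the calculation described above, assuming the three quantities are simply multiplied, the function below computes direct emissions for one building category. The function name, parameter names, units, and the example numbers are illustrative assumptions and do not come from the project code or data.

```python
def direct_ghg_emissions(eui_kwh_per_m2_year, floor_area_m2, emission_factor_kg_per_kwh):
    """Return annual direct emissions in kg CO2e.

    Assumes emissions are the product of energy use intensity,
    total floor area, and an emissions factor (hypothetical units).
    """
    annual_energy_kwh = eui_kwh_per_m2_year * floor_area_m2
    return annual_energy_kwh * emission_factor_kg_per_kwh


# Illustrative placeholder values, not taken from the dataset.
example_emissions = direct_ghg_emissions(
    eui_kwh_per_m2_year=100.0,
    floor_area_m2=1_000.0,
    emission_factor_kg_per_kwh=0.2,
)
```

In practice the emissions factor would depend on the regional fuel mix, which this sketch treats as a single constant.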

4.2 Features for EUI Prediction

In this section, we describe both the dependent variable of our model (EUI) and the independent features we are exploring to predict Energy Use Intensity (EUI) in buildings. The independent features include factors that are considered potentially influential on energy consumption, based on both prior research and discussions with experts in the field. These independent features serve as inputs to the model, and some of them are used to calculate additional derived features, such as the Heating Degree Days (HDD) and Comfort Index, which are explained further in the Feature Engineering section. Below, we outline the open datasets we are using to build and refine our EUI prediction model.

  1. EUI: EUI is a metric used to measure the intensity of energy use in buildings. These EUI values serve as our dependent variable, or the target we seek to predict, in our model. This dataset, provided by the client, contains 482 entries and focuses on two key variables:

    • Residential EUI: Indicates the energy consumption of residential buildings, expressed in kWh/m²/year.
    • Non-Residential EUI: Reflects the energy consumption of non-residential buildings, also expressed in kWh/m²/year.

To better understand the distribution of this variable, we can observe the following map, which visualizes how EUI is distributed across the different regions and building types.

EUI map

  1. Temperature: Air temperature at 2m above the surface, interpolated using atmospheric conditions. Measured in kelvin. This feature is essential for estimating heating needs (which contribute to direct energy use in buildings) and is later used to calculate Heating Degree Days (HDD) and Cooling Degree Days (CDD).

  2. Dewpoint Temperature: The temperature at which air at 2m above the surface becomes saturated, indicating humidity levels. Measured in kelvin.

  3. Latitude: Provides global latitude data in decimal degrees (WGS84 coordinate reference system), adding geographical context to our analysis.

  4. Longitude: Provides global longitude data in decimal degrees (WGS84 coordinate reference system), complementing the latitude data for geographical analysis.

  5. Population: Includes population data for various countries and regions from 1960 to 2023. For our analysis, we extracted the population figures for 2023 to align with our project goals.

  6. GDP: Contains data on human development, health, education, and income across 160+ countries from 1990 to 2022. We used the GDP values for 2022 as a key feature for our model.

  7. Human Development Index (HDI): HDI measures a country's achievements in three key areas:

    • Health: A long and healthy life.
    • Knowledge: Access to education.
    • Standard of Living: A decent standard of living.
      We extracted data for the year 2022 to maintain consistency with other datasets.
  8. Urbanization Rate: Urbanization rate reflects the average annual growth of urban populations. For consistency, we used data from 2022.

  9. Educational Index: This index comprises two indicators:

    • Mean Years of Schooling (MYS): The average years of schooling for adults aged 25 and above.
    • Expected Years of Schooling (EYS): The anticipated years of education for the current population.
  10. Paris Agreement: The Paris Agreement is an international treaty adopted by 196 parties in 2015. We used this information to create a binary variable to indicate whether a country is a signatory.

4.3 Generated Data

After feature engineering and merging our datasets, we've generated the final dataset for model input, containing 482 data points. It can be accessed here.

5. Methods

5.1 Feature Engineering

Feature engineering is essential to transform raw data into meaningful representations that enhance model performance and predictive accuracy. In this study, we applied the following techniques:

  1. Heating Degree Days Calculation:
    Derived from temperature data, Heating Degree Days (HDD) measure the demand for heating energy based on how far the outdoor temperature falls below a baseline "comfort" temperature, typically 65°F (18°C).

  2. Cooling Degree Days Calculation:
    Derived from temperature data, Cooling Degree Days (CDD) measure the demand for cooling energy based on how far the outdoor temperature rises above a baseline "comfort" temperature, typically 65°F (18°C).

  3. GDP per Capita Calculation: We use GDP per capita, which is the result of dividing total GDP by the population, as it provides more relevant information for our model. This approach better captures the economic impact on energy consumption at the individual level, enabling more accurate comparisons across regions with varying population sizes.
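The three engineered features above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes daily mean temperatures in kelvin, sums daily differences against an 18°C baseline, and uses hypothetical function names. The project notebooks may aggregate differently.

```python
KELVIN_OFFSET = 273.15
BASE_TEMP_C = 18.0  # baseline "comfort" temperature mentioned in the text


def degree_days(daily_mean_temps_k, base_c=BASE_TEMP_C):
    """Return (HDD, CDD) summed over a sequence of daily mean temperatures in kelvin."""
    hdd = 0.0
    cdd = 0.0
    for temp_k in daily_mean_temps_k:
        temp_c = temp_k - KELVIN_OFFSET
        hdd += max(0.0, base_c - temp_c)  # colder than baseline: heating demand
        cdd += max(0.0, temp_c - base_c)  # warmer than baseline: cooling demand
    return hdd, cdd


def gdp_per_capita(total_gdp, population):
    """Divide total GDP by population."""
    return total_gdp / population


# Three example days at 10°C, 18°C, and 25°C (illustrative values).
example_hdd, example_cdd = degree_days([283.15, 291.15, 298.15])
```

A day exactly at the baseline contributes to neither total, so the example yields heating demand only from the first day and cooling demand only from the third.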

5.2 Nearest Reference Mapping

Nearest Reference Mapping assigns each data point to its closest reference location based on a defined distance metric, enriching the dataset with relevant features from these reference points.

In this project, we assign EUI values to each data point based on its nearest reference point with known ground truth. By using these EUI values as features and incorporating spatial context into our model, we aim to improve the model’s starting point and enhance prediction accuracy for global projections.
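A minimal sketch of this mapping, assuming great-circle (haversine) distance as the distance metric. The text does not specify which metric the project uses, so the metric, the function names, the dictionary keys, and the EUI values below are all illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_reference(lat, lon, references):
    """Return the reference record closest to (lat, lon)."""
    return min(references, key=lambda ref: haversine_km(lat, lon, ref["lat"], ref["lon"]))


# Hypothetical reference points; the EUI values are placeholders, not real data.
references = [
    {"name": "London", "lat": 51.51, "lon": -0.13, "eui": 150.0},
    {"name": "Madrid", "lat": 40.42, "lon": -3.70, "eui": 90.0},
]
nearest = nearest_reference(48.86, 2.35, references)  # query point near Paris
reference_eui_feature = nearest["eui"]
```

The EUI of the matched reference is then attached to the data point as an additional input feature.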

5.3 Supervised Machine Learning

In this project, we will employ a range of supervised machine learning models to predict and analyze the target variable. The following models will be utilized:

  1. Linear Regression:
    We will use Linear Regression to model the relationship between the input features and the target variable. This model is suitable for capturing linear relationships and will serve as a baseline for comparison with more complex models.

  2. K-Nearest Neighbors (KNN):
    KNN is a non-parametric model that predicts a data point's value from its nearest neighbors: the majority class for classification or, as in this regression task, the average of the neighbors' values. It is particularly useful for capturing local patterns in the data and will provide a comparison to Linear Regression in terms of flexibility.

  3. Ensemble Models:

    • Random Forest: Random Forest is an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting. It is particularly useful for handling high-dimensional data and capturing complex, non-linear relationships.

    • XGBoost:
      XGBoost is an optimized gradient boosting algorithm that performs well in a variety of prediction tasks. It builds an ensemble of decision trees sequentially, improving the model’s performance by reducing bias and variance.

    • CatBoost:
      CatBoost is another gradient boosting algorithm known for its handling of categorical features without the need for explicit preprocessing. It is expected to provide competitive results, particularly in datasets with mixed types of variables.

The combination of linear models, distance-based methods like KNN, and powerful ensemble models like XGBoost and CatBoost will allow us to capture a range of patterns in the data, from simple linear trends to more complex interactions and non-linear relationships.
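To make the distance-based idea concrete, here is a small standard-library sketch of KNN regression, which predicts the average target value of the k closest training points. It is only an illustration with hypothetical names and toy numbers. The experiments themselves would more plausibly rely on a library implementation such as scikit-learn's KNeighborsRegressor, which is an assumption rather than something the text states.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a target as the mean of the k nearest training targets (Euclidean distance)."""
    by_distance = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy one-dimensional example (values are illustrative only).
toy_features = [(0.0,), (1.0,), (2.0,), (10.0,)]
toy_targets = [10.0, 20.0, 30.0, 100.0]
toy_prediction = knn_predict(toy_features, toy_targets, query=(1.0,), k=3)
```

The distant point at 10.0 does not influence the prediction, which is the "local pattern" behavior described above.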

5.4 Experimental Design

Given the challenge of regional variations in global data, we will validate our predictions at the regional level across five distinct regions using three strategies to identify biases and improve model robustness. This approach helps to account for local differences in energy use patterns and improve the model’s predictive accuracy across diverse contexts. The regions we are using are defined as follows:

  1. Asia & Oceania
  2. Europe
  3. Africa
  4. Central and South America
  5. Northern America

The data points in our dataset, which we intend to predict, are distributed across these various regions, as illustrated in the following map.

Geographic Distribution of Data Points by Region

The strategies we will be using are as follows:

Experimental design: within-domain, cross-domain, and all-domain validation strategies

We aim to assess our model's generalization by comparing its performance within the same region (Within-Domain) and its ability to extrapolate to other regions (Cross-Domain). The goal is to reduce the gap between these strategies to improve accuracy and understand extrapolation errors. Additionally, we want to understand if there are regions that perform better than others in specific outcomes, which can help us tailor our model to regional differences.
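The sketch below shows one way the three validation strategies could be turned into train/test splits for a target region. The exact definitions live in the experimental design figure, so the interpretation here is an assumption: within-domain trains and tests inside the target region, cross-domain trains only on other regions, and all-domain trains on every region while testing on held-out target-region points. The function name, the record layout, and the 20% hold-out fraction are hypothetical.

```python
import random


def split_by_strategy(records, target_region, strategy, test_fraction=0.2, seed=0):
    """Return (train, test) lists of records for one target region and strategy."""
    in_region = [r for r in records if r["region"] == target_region]
    out_region = [r for r in records if r["region"] != target_region]

    # Shuffle reproducibly, then hold out part of the target region for testing.
    random.Random(seed).shuffle(in_region)
    n_test = max(1, round(len(in_region) * test_fraction))
    held_out, remainder = in_region[:n_test], in_region[n_test:]

    if strategy == "within_domain":
        return remainder, held_out
    if strategy == "cross_domain":
        return out_region, in_region
    if strategy == "all_domain":
        return remainder + out_region, held_out
    raise ValueError(f"unknown strategy: {strategy}")


# Synthetic records for two regions (identifiers only, no real data).
records = [{"region": "Europe", "id": i} for i in range(10)]
records += [{"region": "Africa", "id": i} for i in range(10)]

within_train, within_test = split_by_strategy(records, "Europe", "within_domain")
cross_train, cross_test = split_by_strategy(records, "Europe", "cross_domain")
all_train, all_test = split_by_strategy(records, "Europe", "all_domain")
```

Comparing a metric on `within_test` against the same metric on `cross_test` gives the generalization gap discussed above.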

6. Experiments

6.1 Feature Importance

To find the most important factors in building energy use and greenhouse gas emissions, we used a linear regression model. The target variable, energy use intensity (EUI), was calculated as the total of residential and non-residential energy use (kWh/m²/year).

The model included factors like GDP per capita, urbanization rate, latitude, and subnational HDI. To make all variables comparable, we standardized the data before training the model. Heating Degree Days (HDD), which measures heating demand based on temperature, turned out to be the most important factor, showing how much temperature affects energy use.

In the future, the model could include other temperature-related factors, like average temperature and humidity, which were not included in this iteration. For details on the calculations, check the Feature Importance Notebook.

Feature Importance
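The idea of ranking features by the coefficients of a linear regression fitted on standardized inputs can be sketched with the standard library alone. This is not the code from the Feature Importance Notebook: the helper names are hypothetical, the data is synthetic, and the fit uses plain normal equations. With every feature rescaled to zero mean and unit variance, a larger absolute coefficient indicates a stronger linear association with the target.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit variance (fails on constant columns)."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def standardized_coefficients(features, target):
    """Fit least squares on standardized features; return {feature name: coefficient}."""
    names = list(features)
    z = [standardize(features[name]) for name in names]
    centered = [y - mean(target) for y in target]
    gram = [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]
    moments = [sum(a * y for a, y in zip(zi, centered)) for zi in z]
    return dict(zip(names, solve_linear_system(gram, moments)))


# Synthetic data in which the target depends more strongly on HDD than on latitude.
features = {"hdd": [0.0, 1000.0, 2000.0, 3000.0], "latitude": [10.0, 5.0, 20.0, 15.0]}
target = [50 + 0.02 * h + 0.5 * lat for h, lat in zip(features["hdd"], features["latitude"])]
coefficients = standardized_coefficients(features, target)
ranking = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
```

Because the synthetic target is an exact linear function of the inputs, each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation.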

6.2 Models

Comparison Between Models

In this section, we evaluate the performance of several machine learning models used for predicting Energy Use Intensity (EUI) and estimating greenhouse gas (GHG) emissions from buildings. The models tested include Linear Regression (LR), Linear Regression with Lasso and Ridge regularization, K-Nearest Neighbors (KNN), Random Forest, XGBoost, and CatBoost. The evaluation metrics, such as Mean Absolute Percentage Error (MAPE) and R², are used to assess model performance across different feature sets. The models are also evaluated across various cross-validation strategies to ensure robustness and generalizability. The specific features utilized in each model, along with the hyperparameters tested, can be found in detail in the tables here and are summarized in this table. The findings and recommendations for model improvement will be further explored in the Results & Analysis slide deck.
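For reference, the evaluation metrics named in this README can be written out as below. These are the standard textbook definitions in plain Python. The function names are illustrative, and the project notebooks may instead compute them with a library.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (undefined if any actual value is 0)."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def wape(actual, predicted):
    """Weighted Absolute Percentage Error in percent: total error over total actual."""
    return 100 * sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)


def r2(actual, predicted):
    """Coefficient of determination; negative when predictions are worse than the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values to show that R² can be negative even though it is called "squared".
toy_actual = [100.0, 200.0]
good_r2 = r2(toy_actual, [110.0, 180.0])
bad_r2 = r2(toy_actual, [200.0, 100.0])
```

A model can therefore combine a low MAPE with a low or negative R² when the target varies little within a region, since R² compares the model against simply predicting the regional mean.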

The following graphs display the average performance metrics for MAPE, R², and RMSE for both residential and non-residential buildings across different cross-validation strategies: within domain, cross-domain, and all domain. These averages are calculated for various regions, helping us to identify the best model and select our EUI Estimation Technique. By analyzing these results, we can determine which model performs best across different scenarios, guiding our final decision on the most effective approach for predicting EUI. In addition to MAPE, R², and RMSE, other evaluation metrics can be found here.

  • MAPE
    The MAPE across validation strategies revealed that predictions for non-residential buildings generally demonstrated superior predictive capability compared to residential buildings. Ensemble models (Random Forest, XGBoost, and CatBoost) consistently outperformed traditional approaches. We can observe that the within-domain strategy delivered better results, while the cross-domain strategy presented more challenges, with the all-domain strategy serving as an intermediate point. One of the challenges that arises is how to improve generalization in order to reduce the gap between these strategies, ultimately enhancing model performance across different regions. On average, we achieved the target of a MAPE below 30-40%, although it is important to note that some individual predictions fell outside this range, as will be discussed further later.

eui_predictions_all_domain


  • R²
    The R² analysis reveals significant challenges in model performance across different strategies. For within-domain validation, ensemble models showed moderate positive R² values (0.22-0.52), with Random Forest achieving the best performance. However, cross-domain validation resulted in consistently negative R² values across all models. This suggests that the relationships between features and EUI are highly complex and region-dependent. The all-domain strategy showed intermediate results, with ensemble models maintaining slightly positive R² values (0.11-0.47) while linear models continued to perform poorly. These results, especially the negative R² values, indicate that the relationship between our features and EUI is strongly non-linear and varies significantly across different geographical regions, highlighting the importance of using more sophisticated modeling approaches.

eui_predictions_all_domain

  • RMSE
    The RMSE analysis further supports the patterns observed in previous metrics. Within-domain, ensemble models achieved the lowest errors. Cross-domain validation revealed substantially higher errors, particularly for linear models (RMSE ranging from 59.4 to 62.6 for non-residential). The magnitude of these errors decreased in all-domain validation, though ensemble models maintained their superior performance.

eui_predictions_all_domain

Best Model Overall: Random Forest

Based on our evaluation across metrics, we selected Random Forest as our primary model for EUI prediction. While some models occasionally outperformed in specific scenarios, Random Forest demonstrated the most consistent and balanced performance across validation strategies and building types. It achieved MAPE values below 15% for non-residential and 21% for residential in within-domain validation, maintained positive R² values (0.22-0.52 within-domain), and showed stable RMSE values (29.3 non-residential, 23.6 residential within-domain). This consistent performance, along with its ability to handle non-linear relationships and maintain stability in cross-domain scenarios, makes Random Forest the most reliable choice for global EUI prediction.

The following figure shows detailed performance metrics for the Random Forest model across different validation strategies and building types. Detailed results by region, along with the estimation technique used, including the specific variables and their hyperparameters, can be found in this table, while average performance metrics are available here.

eui_predictions_all_domain

A detailed analysis of the Random Forest model's performance revealed distinct patterns across building types and validation strategies. For non-residential buildings, the model achieved its best performance in within-domain validation with a MAPE of 8.96% and R² of 0.22, though performance declined in cross-domain scenarios (MAPE 13.58%, R² -0.36). For residential buildings, while showing higher error rates (MAPE 12.76% within-domain), it demonstrated stronger explanatory power (R² 0.52). The all-domain strategy provided a balanced middle ground, with MAPE of 9.98% and 13.08% for non-residential and residential buildings respectively. These results demonstrate that while geographical variations impact model performance, the Random Forest's average MAPE stays well below our 30-40% target across all scenarios, making it a robust choice for global EUI prediction.

To better understand the Random Forest model's performance across different validation strategies, we examine the relationship between predicted and actual EUI values, along with error distributions for each region. The following figures show these relationships for within-domain, cross-domain, and all-domain validation approaches. For each strategy, we present both scatter plots comparing predicted versus actual values, and corresponding error distribution histograms, broken down by geographical region and building type.

  1. Within Domain:

    • Actual EUI vs. Predicted EUI
      eui_predictions_all_domain

    • Error Distribution Plot
      eui_predictions_all_domain

  2. Cross Domain:

    • Actual EUI vs. Predicted EUI
      eui_predictions_cross_domain

    • Error Distribution Plot
      eui_predictions_all_domain

  3. All Domain:

    • Actual EUI vs. Predicted EUI
      eui_predictions_all_domain

    • Error Distribution Plot
      eui_predictions_all_domain

The regional analysis shows that Asia & Oceania consistently demonstrates one of the best overall performances across validation strategies, with low MAPE (18.2% residential, 6.4% non-residential) and strong R² values (0.69 residential, 0.85 non-residential) in within-domain validation. This performance remains relatively stable in cross-domain validation (MAPE 35.3% residential, 13.7% non-residential; R² 0.44 residential, 0.52 non-residential), outperforming other regions.

Central and South America also stands out with strong performance in both within-domain (MAPE 4.2% residential, 3.2% non-residential; R² 0.85 residential, 0.46 non-residential) and cross-domain validation (MAPE 10.6% residential, 5.4% non-residential; R² -0.05 residential, -0.47 non-residential), though with more variable R² values.

Other regions show more variable performance, with Europe, Northern America and Africa having higher error rates and less consistent R² values (Africa showing MAPE of 7.6-8.2% but poor R² values near zero in most validation strategies).

7. Conclusion

Our analysis reveals significant insights into developing machine learning approaches for global EUI prediction. Ensemble models, particularly Random Forest, consistently outperformed traditional methods across validation strategies, achieving MAPE values below 21% and surpassing our initial target of 30-40%. However, the variation in R² values, especially in cross-domain scenarios, indicates challenges in capturing the full complexity of EUI patterns across different regions.

Regional analysis uncovered important patterns in model performance. Asia & Oceania and Central/South America demonstrated the strongest results, while Europe and Northern America showed more variable predictions. Africa presented an interesting case with low error rates but poor explanatory power. The significant performance differences between within-domain and cross-domain validation highlight the strong influence of regional characteristics on EUI predictions.

The technical insights gained suggest strongly non-linear relationships between features and EUI, reinforcing the necessity of sophisticated modeling approaches. Temperature-related features emerged as crucial predictors, while the regional variations in performance indicate the potential benefit of region-specific model tuning. Looking forward, several opportunities exist for model improvement. These include incorporating additional features such as detailed weather variables and satellite imagery data, developing separate models for residential and non-residential buildings, and exploring techniques to improve cross-domain generalization while maintaining low MAPE values.

8. Repository Structure and Usage

This section provides an overview of the repository's structure, explaining the purpose of each directory and file. It also includes instructions for navigating and using the code.

Directory Structure

.
├── LICENSE
├── README.md
├── data
│   ├── 01_raw
│   │   ├── HDI_educationalIndex_incomeIndex.csv
│   │   ├── gdp_data.csv
│   │   └── population.csv
│   ├── 02_interim
│   │   ├── CDD.csv
│   │   ├── HDD.csv
│   │   └── Humidity.csv
│   └── 03_processed
│       └── merged_df.csv
├── deliverables_agreement
│   └── Mid-Point Deliverables - Climate Trace.pdf
├── figures
│   ├── 01_formula.png
│   ├── 02_eui_map.png
│   ├── 03_region_map.png
│   ├── 04_experimental_design.png
│   ├── 05_feature_importance.png
│   ├── 06_avg_rf.png
│   └── model_plots
├── notebooks
│   ├── 010_Download_WeatherData_API.ipynb
│   ├── 020_WeatherData_Preprocessing.ipynb
│   ├── 021_HumidityPreprocessing.ipynb
│   ├── 023_HDDPreprocessing.ipynb
│   ├── 024_CDDPreprocessing.ipynb
│   ├── 030_DataPreprocessing.ipynb
│   ├── 040_Plots.ipynb
│   ├── 050_FeatureImportance.ipynb
│   ├── 060_Experiments_LR.ipynb
│   ├── 061_Experiments_KNN.ipynb
│   ├── 062_Experiments_RF.ipynb
│   ├── 063_Experiments_XGBoost.ipynb
│   ├── 064_Experiments_CatBoost.ipynb
│   └── 070_Model_Comparison.ipynb
├── requirements.txt
├── results
├── slide_decks
│   └── Climate_TRACE_Presentation.pdf
└── src
    ├── __init__.py
    ├── __pycache__
    │   └── lib.cpython-311.pyc
    └── lib.py
  1. data/

    • Contains all datasets used in the project. It is organized into subfolders:
      • 01_raw/: Raw, unprocessed datasets like HDI, GDP, and population data.
      • 02_interim/: Intermediate processed files such as HDD and CDD values.
      • 03_processed/: Fully processed datasets ready for modeling (e.g., merged_df.csv).
  2. figures/

    • Contains visual resources such as diagrams, maps, and other illustrations used in presentations and documentation.
  3. notebooks/

    • Jupyter notebooks used for data processing, feature engineering, modeling, and analysis. Notebooks are ordered and labeled for clarity:
      • 010_Download_WeatherData_API.ipynb: Downloads weather data from the Copernicus Climate Data Store (ERA5-Land daily statistics).
      • 020_WeatherData_Preprocessing.ipynb: Preprocesses the downloaded weather data for model input.
      • 021_HumidityPreprocessing.ipynb: Prepares humidity data for modeling.
      • 023_HDDPreprocessing.ipynb: Prepares Heating Degree Days (HDD) data.
      • 024_CDDPreprocessing.ipynb: Prepares Cooling Degree Days (CDD) data.
      • 030_DataPreprocessing.ipynb: Prepares the final dataset for model input.
      • 040_Plots.ipynb: Generates visualizations for analysis and reporting.
      • 050_FeatureImportance.ipynb: Analyzes feature importance for model evaluation.
      • 060_Experiments_LR.ipynb: Sets up and evaluates experiments using Linear Regression (LR).
      • 061_Experiments_KNN.ipynb: Implements and evaluates K-Nearest Neighbors (KNN) models.
      • 062_Experiments_RF.ipynb: Runs experiments using Random Forest (RF).
      • 063_Experiments_XGBoost.ipynb: Executes XGBoost models for performance comparison.
      • 064_Experiments_CatBoost.ipynb: Configures and evaluates CatBoost models.
      • 070_Model_Comparison.ipynb: Compares the performance of different models across various datasets and variables.
  4. results/

    • Stores evaluation outputs from various modeling strategies (e.g., all_domain or cross_domain) and models (e.g., KNN, Linear Regression).
  5. src/

    • Contains core Python scripts for the project.
      • lib.py: Provides utility functions and shared modules for data preprocessing, feature extraction, and model evaluation, used across notebooks and scripts.
  6. requirements.txt

    • Lists all dependencies needed for the project environment, ensuring reproducibility.
  7. README.md

    • The entry point of the repository, providing an overview, key results, and links to all major components.

Usage Instructions

  1. Setup:
    Clone the repository and install the dependencies listed in requirements.txt (for example, with pip install -r requirements.txt).

  2. Data Processing:

    • Start with 010_Download_WeatherData_API.ipynb to download raw weather data from the Copernicus Climate Data Store.
    • Use 020_WeatherData_Preprocessing.ipynb to preprocess the weather data for model input.
    • Process specific features with 021_HumidityPreprocessing.ipynb, 023_HDDPreprocessing.ipynb, and 024_CDDPreprocessing.ipynb to compute humidity, Heating Degree Days (HDD), and Cooling Degree Days (CDD) data.
    • Finalize the dataset with 030_DataPreprocessing.ipynb before moving to modeling.
  3. Modeling:

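    • Run the experiment notebooks (060_Experiments_LR.ipynb through 064_Experiments_CatBoost.ipynb) to train models and evaluate performance across domains, then use 070_Model_Comparison.ipynb to compare them.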
  4. Results Analysis:

    • Use the results/ directory to analyze model outputs and metrics.
  5. Figures and Visuals:

    • All generated plots and diagrams are stored in figures/ for easy reference in presentations or reports.

9. Resources

  1. Dou, X., Wang, Y., Ciais, P., Chevallier, F., Davis, S. J., Crippa, M., Janssens-Maenhout, G., Guizzardi, D., Solazzo, E., Yan, F., Huo, D., Zheng, B., Zhu, B., Cui, D., Ke, P., Sun, T., Wang, H., Zhang, Q., Gentine, P., Deng, Z., & Liu, Z. (2022). Near-realtime global gridded daily CO2 emissions. The Innovation, 3(1), 100182.

  2. Janssens-Maenhout, G., Crippa, M., Guizzardi, D., Muntean, M., Schaaf, E., Dentener, F., Bergamaschi, P., Pagliari, V., Olivier, J. G. J., Peters, J. A. H. W., van Aardenne, J. A., Monni, S., Doering, U., Petrescu, A. M. R., Solazzo, E., & Oreggioni, G. D. (2019). EDGAR v4.3.2 Global Atlas of the three major greenhouse gas emissions for the period 1970–2012. Earth System Science Data, 11(3), 959–1002.

  3. Markakis, P. J., Gowdy, T., Malof, J. M., Collins, L., Davitt, A., Volpato, G., & Bradbury, K. (2023). High-resolution global building emissions estimation using satellite imagery. Climate Change AI.

  4. McDuffie, E. E., Smith, S. J., O’Rourke, P., Tibrewal, K., Venkataraman, C., Marais, E. A., Zheng, B., Crippa, M., Brauer, M., & Martin, R. V. (2020). A global anthropogenic emission inventory of atmospheric pollutants from sector- and fuel-specific sources (1970–2017): An application of the Community Emissions Data System (CEDS). Earth System Science Data, 12(4), 3413–3442.

  5. Oda, T., Maksyutov, S., & Andres, R. J. (2018). The Open-source Data Inventory for Anthropogenic CO2, version 2016 (ODIAC2016): A global monthly fossil fuel CO2 gridded emissions data product for tracer transport simulations and surface flux inversions. Earth System Science Data, 10(1), 87–107.

  6. Tong, D., Zhang, Q., Davis, S. J., Liu, F., Zheng, B., Geng, G., Xue, T., Li, M., Hong, C., Lu, Z., Streets, D. G., Guan, D., & He, K. (2018). Targeted emission reductions from global super-polluting power plant units. Nature Sustainability, 1(1), 59–68.

10. Contributors

Jiechen Li
Meixiang Du
Yulei Xia
Barbara Flores

Project Mentor and Client: Dr. Kyle Bradbury
