We are hoping to explore the relationship between CO2, the climate, and other factors (nitrous oxide per capita, GDP, population, etc.) to see whether or not these factors have an influence on the rising temperatures. We selected this question as we hoped to further understand global warming as climate change is becoming an unavoidable part of life. With this exploration, we plan to understand how CO2 emissions and other related factors will affect our habitat.
- Do the chosen factors for the exploration contribute to the rise in temperature in the US?
To narrow our focus for our analysis and machine learning model, we filtered the data to only include the US. Next we extracted the year and month from the date column. This cleaned dataset will help us look at yearly temperature data in the US.
Initial dataset:
Cleaned dataset:
To narrow our focus for our analysis and machine learning model, we filtered the data to only include the US. Next we dropped all columns with more than 90% NaN values. To further narrow our focus, we dropped all columns with non-beneficial data. This brought us down from 25191 rows × 60 columns to 221 rows × 22 columns
Once we had the datasets we wanted to use, we could then endeavor to join them together for the machine learning model. Before joining them with SQL, we had to reframe the “Global Land Temperatures” dataset. The “Global Land Temperatures” dataset recorded its information by year and month (ie. 199002, 199003,...) and by state in the USA; the “CO2 and Greenhouse Gas Emissions” dataset recorded their information by year and by country. This discrepancy prevented us from simply joining the two tables via SQL. The decision was made to first: convert the metric of time from "year and month" to simply "year" and to second: convert the data from being recorded by state to being recorded simply by country; both steps had to be done to the "Global Land Temperatures" dataset. The code for this step can be found in the repository and the following are images from this process:
This left us with two datasets that we could now join through SQL. The following is an image of the ERD:
The two files were imported into PGAdmin and joined with SQL. "Global Land "Temperatures" and "CO2 and Greenhouse Gas Emissions" were joined on the shared column of "year" and then saved into a file. The following is the code for that query:
Now, we had a dataset that could be input into our machine learning model.
During the Machine Learning process we used 4 Supervised Machine Learning processes to evaluate our overall data and to determine which process best suits our data.
Multiple linear regression designed to find a relationship between a single independent and dependent variable. The adjusted R score of our multiple linear regression model is 37.76% indicating the machine learning model work well. However, there are many more factors than CO2 influence the Average temperature increase
The classifier confusion matrix, overall accuracy with 63%, RandomForestClassifier algorithm is better model to predict the average temperature.
In the process of using the Logistic Regression Model the overall result of the accuracy score was 27%. This did not meet our minimum 70% accuracy score to determine whether various CO2 factors contributes to the rise in temperature.
Based on our OLS Regression Results our R-squared predicted that there is a 40% Linear Model influence in various CO2 factors contributing to the rise in temperatures.
Based on our Random Forest classification algorithm we used to help our prediction results to see whether factors of CO2 contribute to the increase in temperatures. One of the biggest advantages of random forest is its versatility. It can be used for both regression and classification tasks. Based on our data the Random Forest model generated a 63% accuracy score. This score revealed that our data may have a slight positive influence to the rise in temperatures in the United States.
We used Tableau to create and host our dashboard. The file is also stored in the repository.
Tableau Dashboard
Below shows the United States 12 month heat forecast. It also displays the Average Temperature increasing from 1750 to 2013.
Seeing the upward trends CO2 and greenhouse gas emissions, population, and temperature, it’s possible that CO2 and/or a combination of these factors has a positive influence on temperature. Now that we have created this product, this can now be the foundation for future data exploration; simply add features that one wants to explore to the input dataset for the machine and then run it.
- Consider additional features in the analysis ie. country, annual military spending by country, other potential pollutants, etc.
- Explore more models for better fit
- More granular breakdown of what data sets should be examined
- Find more recent data exploring Avg Temperature and CO2 data