
A comparison across 20 years of weather forecasting data from Porto Alegre, RS, Brazil, using the pandas library, Jupyter Notebook and seaborn.


attrindade/poa-weather-analysis


POA Weather Analysis

Comparing POA weather data from two decades

A comparison of the last two decades (2001 - 2021) of weather forecasting data from Porto Alegre, Rio Grande do Sul, Brazil.


Installation

The code requires Python 3 and the general libraries available through the Anaconda distribution.

Project Motivation

As a resident of Porto Alegre for more than 25 years, I started to feel that our weather has been changing a little in recent years. Out of curiosity, I wanted to visualize which differences are perceptible.

This is a simple project intended only to consolidate my knowledge of pandas, matplotlib and seaborn, as I'm currently deepening my understanding of them. I'm not offering a deep and solid analysis of how Porto Alegre's weather has changed over the years. My only plan is to learn and to see whether my perception of the city's temperatures is in line with the data.

How to get the data

I got this .csv file with historical data for Porto Alegre from the INMET (Instituto Nacional de Meteorologia) website. It's very simple to get the data the way you want it (specific variables, date range, etc.). You can go to https://bdmep.inmet.gov.br/ (the INMET database website) and request it; the site gives you many options to customise your data. The data (.csv) used in this project is in this repo as "DATA_POA_2001-01-01_2022-03-11.csv".

Project Overview

The main goal of this project is to improve my EDA (Exploratory Data Analysis) skills, so it is divided into the following parts: data extraction, data preparation (cleaning), creation of a summarized table to facilitate the analysis, and the EDA itself (which has several specific parts, such as summer, winter, humidity and precipitation).

Data Extraction

I read the data received from INMET and generated this dataframe with 7740 observations and 8 cols.

| | Data Medicao | PRECIPITACAO TOTAL, DIARIO (mm) | TEMPERATURA MAXIMA, DIARIA (C) | TEMPERATURA MEDIA, DIARIA (C) | TEMPERATURA MINIMA, DIARIA (C) | UMIDADE RELATIVA DO AR, MEDIA DIARIA (%) | UMIDADE RELATIVA DO AR, MINIMA DIARIA (%) | Unnamed: 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2001-01-01 | 0 | 30,1 | 23,616667 | 18,4 | 68,458333 | 48.0 | NaN |
| 1 | 2001-01-02 | 0 | 32,1 | 25,475 | 20 | 69,958333 | 45.0 | NaN |
| 2 | 2001-01-03 | 0 | 33,4 | 26,345833 | 21 | 69,083333 | 43.0 | NaN |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7740 entries, 0 to 7739
Data columns (total 8 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Data Medicao                               7740 non-null   object 
 1   PRECIPITACAO TOTAL, DIARIO (mm)            7210 non-null   object 
 2   TEMPERATURA MAXIMA, DIARIA (C)             7448 non-null   object 
 3   TEMPERATURA MEDIA, DIARIA (C)              7232 non-null   object 
 4   TEMPERATURA MINIMA, DIARIA (C)             7453 non-null   object 
 5   UMIDADE RELATIVA DO AR, MEDIA DIARIA (%)   7529 non-null   object 
 6   UMIDADE RELATIVA DO AR, MINIMA DIARIA (%)  7628 non-null   float64
 7   Unnamed: 7                                 0 non-null      float64
dtypes: float64(2), object(6)
memory usage: 483.9+ KB
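
For readers who want to reproduce this step, here is a minimal sketch of how such a file might be read. It is not taken from the notebook. The `load_inmet_csv` helper is hypothetical, and the semicolon delimiter and the `skiprows` argument (for any metadata lines at the top of the export) are assumptions about the INMET file layout. A trailing `;` on every row would explain the empty "Unnamed: 7" column shown above.

```python
import io

import pandas as pd


def load_inmet_csv(source, skiprows=0):
    """Read an INMET export, leaving the values as raw text.

    Assumption: the file is semicolon-delimited. `skiprows` lets you skip
    any metadata lines that may precede the header row.
    """
    return pd.read_csv(source, sep=";", skiprows=skiprows)


# Tiny stand-in for the real file, using the same column style.
sample = io.StringIO(
    "Data Medicao;TEMPERATURA MAXIMA, DIARIA (C);"
    "UMIDADE RELATIVA DO AR, MINIMA DIARIA (%);\n"
    "2001-01-01;30,1;48;\n"
    "2001-01-02;32,1;45;\n"
)

raw = load_inmet_csv(sample)
print(raw.shape)  # (rows, columns)
raw.info()        # decimal-comma columns are still text at this point
```

Reading the values as text first keeps the decimal commas visible, which matches the `object` dtypes in the output above. Passing `decimal=","` to `pd.read_csv` would be an alternative if you prefer to convert while reading.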

Data Preparation & Cleaning

In this phase I took several steps to improve the dataframe's readability and functionality:

  • Drop the useless cols
  • Simplify and translate the col names
  • Convert str data types to datetime (date) and to float (others)
  • Decrease the amount of NaN
  • Drop non-useful rows
  • Create 'year', 'month' and 'day' cols
  • Set date as index
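
The renaming, type conversion and indexing steps above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: the `prepare` function and the `RENAME` mapping are hypothetical, although the short column names are the ones that appear in the cleaned dataframe further down. The NaN handling and row dropping are left out here.

```python
import pandas as pd

# Hypothetical mapping from the INMET headers to the short English names.
RENAME = {
    "Data Medicao": "date",
    "PRECIPITACAO TOTAL, DIARIO (mm)": "total_precip",
    "TEMPERATURA MAXIMA, DIARIA (C)": "max_temp",
    "TEMPERATURA MEDIA, DIARIA (C)": "avg_temp",
    "TEMPERATURA MINIMA, DIARIA (C)": "min_temp",
    "UMIDADE RELATIVA DO AR, MEDIA DIARIA (%)": "avg_humidity",
    "UMIDADE RELATIVA DO AR, MINIMA DIARIA (%)": "min_humidity",
}


def prepare(raw):
    """Drop empty columns, rename, convert types and index by date."""
    unnamed = [c for c in raw.columns if c.startswith("Unnamed")]
    df = raw.drop(columns=unnamed).rename(columns=RENAME)
    df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
    for col in df.columns.drop("date"):
        # Values use a decimal comma ("23,616667"); swap it before parsing.
        text = df[col].astype(str).str.replace(",", ".", regex=False)
        df[col] = pd.to_numeric(text, errors="coerce")
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    return df.set_index("date")


# Two rows shaped like the raw dataframe shown earlier.
raw = pd.DataFrame({
    "Data Medicao": ["2001-01-01", "2001-01-02"],
    "PRECIPITACAO TOTAL, DIARIO (mm)": ["0", None],
    "TEMPERATURA MAXIMA, DIARIA (C)": ["30,1", "32,1"],
    "TEMPERATURA MEDIA, DIARIA (C)": ["23,616667", "25,475"],
    "TEMPERATURA MINIMA, DIARIA (C)": ["18,4", "20"],
    "UMIDADE RELATIVA DO AR, MEDIA DIARIA (%)": ["68,458333", "69,958333"],
    "UMIDADE RELATIVA DO AR, MINIMA DIARIA (%)": [48.0, 45.0],
    "Unnamed: 7": [float("nan"), float("nan")],
})
clean = prepare(raw)
print(clean)
```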

Decreasing the amount of NaN

To decrease the amount of NaN while losing as few rows as possible, I compute the average of the min and max temperatures and use it to replace the NaN in avg_temp, in the rows where I have min and max temp but no avg.
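
A minimal sketch of this fill, assuming the cleaned column names `min_temp`, `max_temp` and `avg_temp`; the `fill_avg_temp` helper is hypothetical. Note that the midpoint of the daily extremes is only an approximation of the true daily mean.

```python
import numpy as np
import pandas as pd


def fill_avg_temp(df):
    """Fill missing avg_temp with the midpoint of min_temp and max_temp."""
    midpoint = (df["min_temp"] + df["max_temp"]) / 2
    out = df.copy()
    # fillna only touches rows where avg_temp is NaN. If min or max is
    # missing as well, the midpoint is NaN too, so the gap simply stays.
    out["avg_temp"] = out["avg_temp"].fillna(midpoint)
    return out


example = pd.DataFrame({
    "min_temp": [18.0, 20.0, np.nan],
    "max_temp": [30.0, 32.0, 33.0],
    "avg_temp": [np.nan, 25.5, np.nan],
})
filled = fill_avg_temp(example)
print(filled)
```

In this example the first row gets 24.0 (the midpoint of 18.0 and 30.0), the second row keeps its measured 25.5, and the third row stays NaN because min_temp is missing.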

Dropping non-useful rows

As I don't plan to apply ML models to this dataset right now, the best choice for the rows that contain too many NaNs is to drop them completely.
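
One way to express "too many NaNs" in pandas is the `thresh` argument of `dropna`, which keeps only rows with at least that many non-missing values. The sketch below is an assumption about how this could be done: the `drop_sparse_rows` helper and the cutoff of 4 valid measurements are hypothetical, and the cutoff used in the notebook is not stated here.

```python
import numpy as np
import pandas as pd

MEASUREMENTS = [
    "total_precip", "max_temp", "avg_temp",
    "min_temp", "avg_humidity", "min_humidity",
]


def drop_sparse_rows(df, min_valid=4):
    """Keep rows with at least `min_valid` non-NaN measurements.

    `min_valid=4` is an illustrative cutoff, not the notebook's value.
    """
    return df.dropna(subset=MEASUREMENTS, thresh=min_valid)


nan = np.nan
example = pd.DataFrame(
    [
        [0.0, 30.1, 23.6, 18.4, 68.5, 48.0],  # complete row: kept
        [nan, 32.1, nan, 20.0, 70.0, 45.0],   # 4 valid values: kept
        [nan, nan, nan, nan, nan, 43.0],      # 1 valid value: dropped
    ],
    columns=MEASUREMENTS,
)
kept = drop_sparse_rows(example)
print(len(example), "->", len(kept))
```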

Results from this phase
total_precip max_temp avg_temp min_temp avg_humidity min_humidity year month day
date
2001-01-01 0.0 30.1 23.616667 18.4 68.458333 48.0 2001 1 1
2001-01-02 0.0 32.1 25.475000 20.0 69.958333 45.0 2001 1 2
2001-01-03 0.0 33.4 26.345833 21.0 69.083333 43.0 2001 1 3
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7483 entries, 2001-01-01 to 2022-03-10
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   total_precip  7209 non-null   float64
 1   max_temp      7448 non-null   float64
 2   avg_temp      7425 non-null   float64
 3   min_temp      7432 non-null   float64
 4   avg_humidity  7481 non-null   float64
 5   min_humidity  7455 non-null   float64
 6   year          7483 non-null   int64  
 7   month         7483 non-null   int64  
 8   day           7483 non-null   int64  
dtypes: float64(6), int64(3)
memory usage: 584.6 KB

After cleaning & preparation: 7483 observations and 9 cols.

Creating a Summarized Table

In this second part of data cleaning/preparation we summarize all the data from our dataframe and create another dataframe with the summarized info. We first split summers, winters and years into separate tables. After that we put it all together in a summarized table called 'seasons'.

Results from this phase
SUM_temp SUM_max SUM_max_avg SUM_min SUM_min_avg SUM_hum SUM_hum_min WIN_temp WIN_max WIN_max_avg WIN_min WIN_min_avg win_hum WIN_hum_min
2001 24.68 37.9 30.56 14.9 20.58 69.79 49.92 16.66 30.9 21.85 2.6 12.77 77.11 58.29
2002 25.47 38.0 31.20 15.7 21.11 71.35 50.67 15.71 33.7 20.57 3.4 12.05 77.31 55.75
2003 23.93 35.8 29.78 15.1 19.74 70.27 55.64 15.26 34.5 20.58 3.0 11.26 76.22 53.50
2004 24.57 39.4 30.89 15.0 20.19 67.79 45.67 15.60 36.7 21.00 2.7 11.54 76.94 45.25
2005 24.63 38.7 30.47 13.5 20.60 70.91 53.08 16.34 31.9 21.53 1.9 12.50 77.30 49.54
2006 24.87 35.8 30.53 15.1 20.70 71.96 52.46 15.94 32.3 21.17 2.7 12.12 77.37 55.08
2007 24.21 37.3 30.05 14.4 20.06 71.43 56.46 14.62 33.2 19.73 2.2 10.78 78.32 48.29
2008 23.66 35.3 29.16 13.9 19.82 73.34 55.25 14.92 31.8 19.80 2.3 11.40 78.33 57.12
2009 25.17 38.5 30.71 13.7 21.47 73.96 52.38 14.76 33.4 19.87 0.3 10.92 77.64 42.42
2010 24.55 36.2 30.12 14.1 20.99 75.54 63.04 15.34 32.9 20.23 2.8 11.79 78.64 58.00
2011 24.85 38.4 31.08 14.7 20.55 71.39 54.67 14.29 32.4 19.36 1.9 10.71 79.40 60.75
2012 23.61 39.0 29.35 14.4 19.69 72.25 52.29 16.46 33.0 22.20 1.1 12.34 76.24 49.17
2013 25.76 40.6 31.93 17.4 21.60 72.23 41.71 14.89 35.1 20.31 1.4 11.08 80.03 55.96
2014 24.86 36.5 30.35 14.4 21.13 76.41 60.62 15.88 34.3 21.16 3.6 12.16 81.22 56.17
2015 25.07 38.9 30.73 16.9 21.17 76.59 52.71 17.21 34.8 22.16 5.7 13.74 81.11 47.62
2016 25.32 38.3 31.09 13.6 21.47 78.17 65.25 14.72 32.9 19.84 4.1 11.14 81.85 57.88
2017 24.10 36.8 30.10 13.9 20.06 77.08 53.08 17.66 34.8 23.49 5.5 13.89 81.34 60.83
2018 25.28 38.5 31.03 16.9 21.42 77.38 61.17 14.92 32.9 20.27 2.6 11.19 84.34 64.25
2019 25.20 40.3 31.94 14.6 20.58 68.51 51.62 16.11 36.1 21.81 2.2 12.24 79.47 60.04
2020 24.67 38.3 30.71 15.7 20.58 72.95 48.71 15.56 31.3 21.19 2.7 11.52 78.93 57.33
2021 25.52 40.3 32.17 17.4 21.01 71.67 58.08 15.35 34.5 20.68 2.3 11.66 79.13 55.38
  • SUM : Summer
  • WIN : Winter
  • SUM/WIN_temp --> The average for that summer's/winter's temperatures
  • SUM/WIN_max --> The highest temperature for that summer/winter
  • SUM/WIN_min --> The lowest temperature for that summer/winter
  • SUM/WIN_max_avg --> The average for that summer's/winter's daily maximum temperatures
  • SUM/WIN_min_avg --> The average for that summer's/winter's daily minimum temperatures
  • SUM/WIN_hum --> The average of the daily humidities of that summer/winter
  • SUM/WIN_hum_min --> The lowest humidity of that summer/winter
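
The column definitions above could be computed along these lines. This is a sketch under stated assumptions rather than the notebook's code: the `season_summary` helper is hypothetical, the month sets chosen for summer and winter are assumptions, and grouping by calendar year is a simplification. The notebook's exact definitions may differ.

```python
import pandas as pd

# Assumed month sets for the Southern Hemisphere; adjust as needed.
SUMMER_MONTHS = (12, 1, 2)
WINTER_MONTHS = (6, 7, 8)


def season_summary(df, months, prefix):
    """Aggregate one season per calendar year, following the legend."""
    by_year = df[df["month"].isin(months)].groupby("year")
    summary = pd.DataFrame({
        f"{prefix}_temp": by_year["avg_temp"].mean(),
        f"{prefix}_max": by_year["max_temp"].max(),
        f"{prefix}_max_avg": by_year["max_temp"].mean(),
        f"{prefix}_min": by_year["min_temp"].min(),
        f"{prefix}_min_avg": by_year["min_temp"].mean(),
        f"{prefix}_hum": by_year["avg_humidity"].mean(),
        f"{prefix}_hum_min": by_year["min_humidity"].min(),
    })
    return summary.round(2)


# Four made-up daily rows, just to show the shape of the result.
df = pd.DataFrame({
    "year": [2001, 2001, 2001, 2001],
    "month": [1, 2, 7, 8],
    "avg_temp": [24.0, 26.0, 14.0, 16.0],
    "max_temp": [30.0, 34.0, 20.0, 22.0],
    "min_temp": [18.0, 20.0, 4.0, 6.0],
    "avg_humidity": [70.0, 60.0, 80.0, 78.0],
    "min_humidity": [50.0, 40.0, 60.0, 55.0],
})
seasons = pd.concat(
    [
        season_summary(df, SUMMER_MONTHS, "SUM"),
        season_summary(df, WINTER_MONTHS, "WIN"),
    ],
    axis=1,
)
print(seasons)
```

Because the grouping is by calendar year, a December day is counted with the summer of the same calendar year. Assigning it to the following year's summer would need an extra "season year" column.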

EDA (Exploratory Data Analysis)

In this phase I try to answer a few questions on topics related to my feeling that Porto Alegre is getting hotter.

What about the summer?

Questions to answer:

  • Did the average temperature rise in the last 7 years?
  • Did the max temperature rise?

To answer these questions, my strategy was to plot the summer data from the summarised dataframe.

[Figure: plot of the summer variables from the summarised dataframe]

From the graph above, it does seem that the average and max temperatures got a little higher. Many interesting insights can be drawn from this plot, but the general idea it gives me is that all the variables are getting higher on average.

To make this change in the averages easier to see, I plotted a new graph with a line in the center representing the mean of all the yearly summer averages from these two decades (24.76 °C). On the same graph, I then plotted bars that represent how far each year's average is from that mean, that is, the deviation of each value from that line.
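
The deviation plot described above could be built roughly like this. The values are the SUM_temp column from the summarized table; the `deviations_from_mean` and `plot_deviations` helpers, the colors and the labels are illustrative assumptions, not the notebook's code.

```python
import pandas as pd

# SUM_temp column from the summarized table (2001 to 2021).
sum_temp = pd.Series(
    [24.68, 25.47, 23.93, 24.57, 24.63, 24.87, 24.21,
     23.66, 25.17, 24.55, 24.85, 23.61, 25.76, 24.86,
     25.07, 25.32, 24.10, 25.28, 25.20, 24.67, 25.52],
    index=range(2001, 2022),
    name="SUM_temp",
)


def deviations_from_mean(series):
    """Distance of each year's value from the mean of all years."""
    return series - series.mean()


def plot_deviations(series, path=None):
    """Bar chart of yearly deviations around a zero line (the mean)."""
    # Imported here so the numeric part works without matplotlib.
    import matplotlib.pyplot as plt

    dev = deviations_from_mean(series)
    colors = ["tab:red" if d > 0 else "tab:blue" for d in dev]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(dev.index, dev.values, color=colors)
    ax.axhline(0, color="black", linewidth=1)  # the overall mean
    ax.set_xlabel("Year")
    ax.set_ylabel("Deviation from mean (°C)")
    ax.set_title(f"Summer average vs. mean of {series.mean():.2f} °C")
    if path is not None:
        fig.savefig(path)
    return fig


dev = deviations_from_mean(sum_temp)
print(round(sum_temp.mean(), 2))  # 24.76
print(dev.round(2))
```

Calling `plot_deviations(sum_temp)` draws the bars. Years above the mean point up and years below it point down, which makes a shift between decades easy to spot.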

[Figure: deviation of each year's summer average from the two-decade mean]

As we can see in the plot above, the yearly summer temperature average is increasing. The first decade of the century had, on average, lower summer temperatures, and the second decade higher ones.

This analysis reinforces the idea that Porto Alegre's summers are getting hotter. My perception that our summers are hotter seems to be in line with the data.

| Period | Summer's avg temperatures | Avg of highest summer temp | Avg of daily summer max temps | Avg of lowest summer temps | Average of daily summer min temps |
|---|---|---|---|---|---|
| 2001 - 2007 | 24.62 | 37.56 | 30.50 | 14.81 | 20.43 |
| 2008 - 2014 | 24.64 | 37.79 | 30.39 | 14.66 | 20.75 |
| 2015 - 2021 | 25.02 | 38.77 | 31.11 | 15.57 | 20.90 |
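
The 7-year averages in this table can be reproduced from the summarized table. A minimal sketch, using the SUM_temp and SUM_max columns as listed earlier; binning the years with `pd.cut` is one possible approach, not necessarily the one used in the notebook.

```python
import pandas as pd

# Two columns from the summarized table, indexed by year.
summer = pd.DataFrame(
    {
        "SUM_temp": [24.68, 25.47, 23.93, 24.57, 24.63, 24.87, 24.21,
                     23.66, 25.17, 24.55, 24.85, 23.61, 25.76, 24.86,
                     25.07, 25.32, 24.10, 25.28, 25.20, 24.67, 25.52],
        "SUM_max": [37.9, 38.0, 35.8, 39.4, 38.7, 35.8, 37.3,
                    35.3, 38.5, 36.2, 38.4, 39.0, 40.6, 36.5,
                    38.9, 38.3, 36.8, 38.5, 40.3, 38.3, 40.3],
    },
    index=range(2001, 2022),
)

# Label every year with its 7-year period. Bins are right-inclusive,
# so (2000, 2007] covers 2001 to 2007.
periods = pd.cut(
    summer.index.to_numpy(),
    bins=[2000, 2007, 2014, 2021],
    labels=["2001 - 2007", "2008 - 2014", "2015 - 2021"],
)

period_means = summer.groupby(periods, observed=True).mean().round(2)
print(period_means)
```

The SUM_temp column of the result gives 24.62, 24.64 and 25.02, and SUM_max gives 37.56, 37.79 and 38.77, matching the first two columns of the table.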

With the plots shown before and this table, which shows the mean of several summer variables, I feel confident saying that my perceptions are at least coherent with the data: Porto Alegre has had higher temperatures and higher averages during the last 7 years, but the difference is not huge (for example, the average summer temperature went from 24.62 °C in 2001 - 2007 to 25.02 °C in 2015 - 2021).

We can't confirm with certainty that my perception is related to any real change in the city's climate, but it certainly points in the same direction as some studies have already found (I link an interesting one below): climate change is starting to be noticeable to people who live in the city.

Climate change in Rio Grande do Sul, by Bibiana Dávila at UFRGS

What about the winter?

Questions to answer:

  • Did the mean temperature increase?
  • Did the average minimum temperature increase?

To answer these questions, my strategy was to plot the winter data from the summarised dataframe.

[Figure: plot of the winter variables from the summarised dataframe]

| Period | Winter's avg temperatures | Avg of highest winter temp | Avg of daily winter max temps | Avg of lowest winter temps | Average of daily winter min temps |
|---|---|---|---|---|---|
| 2001 - 2007 | 15.73 | 33.31 | 20.92 | 2.64 | 11.86 |
| 2008 - 2014 | 15.22 | 33.27 | 20.42 | 1.91 | 11.49 |
| 2015 - 2021 | 15.93 | 33.90 | 21.35 | 3.59 | 12.20 |

With the winter we can see a pattern similar to the summer's: the average temperatures and the maximum and minimum variables are all (somewhat) higher in 2015 - 2021 than in 2001 - 2007, although every one of them dipped in 2008 - 2014 first. All the variables had some increase, but the maximum and minimum variables increased more than the overall average, with the lowest winter temperature gaining the most (from 2.64 °C to 3.59 °C).

The average temperatures are getting higher, and so are all the other metrics. Answering our questions: yes, the mean temperature increased, and so did the minimum temperatures (both the lowest and the average).

Again, this is a difference that the residents of POA can actually feel.

Humidity

Questions to answer:

  • Did the humidity averages change over these years?
  • Do they follow the changes in the other variables?

| Period | Avg humidity along years (%) | Avg min humidity along years (%) |
|---|---|---|
| 2001 - 2007 | 74.09 | 51.16 |
| 2008 - 2014 | 75.47 | 52.34 |
| 2015 - 2021 | 77.85 | 54.83 |

The average humidity and minimum humidity stayed at similar levels over these 20 years, although they seem to be increasing too. I'm no specialist, so I can't affirm anything, even if there is a small increase. It may be ordinary variation or something else; only a little more research on my part could clarify it.

Answering the questions: they seem to be increasing, although the difference is subtle. As all the other variables seem to be increasing too, I can say that this change follows the changes in the other metrics.

Precipitation

Questions to answer:

  • Did the precipitation totals change over the years?

| Period | Sum of precipitation (mm) |
|---|---|
| 2001 - 2007 | 7532.4 |
| 2008 - 2014 | 10567.8 |
| 2015 - 2021 | 10809.8 |

| Period | Sum of precipitation (mm) |
|---|---|
| 2001 - 2010 | 12258.6 |
| 2011 - 2020 | 15431.4 |
| 2021 - 2021 | 1220.0 |

In this last table we can see that we had considerably more rain in the decade from 2011 to 2020. To be specific: the 2011-2020 sum is 3172.8 mm higher than the 2001-2010 sum.

This tendency towards more precipitation is expected in the climate change context of the state we are in. Rio Grande do Sul suffers from the climate changes caused by the deforestation of the Amazon rainforest, which in more than one way intensifies precipitation in our region.
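
As a quick check of that arithmetic, here is a small worked example using only the two decade totals from the table. The relative increase and the per-year figure are derived values that the analysis itself does not state.

```python
# Decade totals from the table above, in mm.
first_decade = 12258.6   # 2001 - 2010
second_decade = 15431.4  # 2011 - 2020

difference = second_decade - first_decade
relative_increase = difference / first_decade * 100
extra_per_year = difference / 10  # each period spans 10 years

print(f"Difference: {difference:.1f} mm")
print(f"Relative increase: {relative_increase:.1f} %")
print(f"Extra rain per year, on average: {extra_per_year:.1f} mm")
```

The 2021 row is left out because it covers a single year and is not comparable with a full decade.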

What's next?

The Medium article for this project can be found in this repository. If you want to contact me or have any questions about the analysis, feel free to reach me at https://www.linkedin.com/in/attrindade/.

After this EDA I'm planning to explore other parts of this data (like thermal amplitude) and if I can get more data (other years) I will think about doing some forecasting.

This will only happen in the future, so see you next time!
