Time Series Analysis and Forecasting Depth To Underground Water of Luco Aquifer

Introduction

This is a time series analysis project to analyse and predict the depth to groundwater of the Luco Aquifer. The analysis here is univariate, but I also go through the general data-cleaning steps and subset the variables that I think might be useful for a downstream multivariate analysis.

The data was provided by the Acea Group from Italy. The Acea Group is one of the leading Italian multiutility operators. Listed on the Italian Stock Exchange since 1999, the company manages and develops water and electricity networks and environmental services. Acea is the foremost Italian operator in the water services sector, supplying 9 million inhabitants in Lazio, Tuscany, Umbria, Molise and Campania.

They've provided several datasets on several waterbodies in Italy. As it is easy to imagine, a water supply company struggles with the need to forecast the water level in a waterbody (water spring, lake, river, or aquifer) to handle daily consumption. During fall and winter waterbodies are refilled, but during spring and summer they start to drain. To help preserve the health of these waterbodies, it is important to predict water availability, in terms of level and water flow, for each day of the year. The Acea Group deals with four different types of waterbodies: water springs, lakes, rivers and aquifers. In this analysis, we'll look specifically at the Luco Aquifer.

Background

Problem definition

For this project, we are trying to answer the question: what is the future depth to groundwater for the Luco Aquifer?

What is an aquifer?

According to National Geographic, an aquifer is a body of rock and/or sediment that holds groundwater. Groundwater is the word used to describe precipitation that has infiltrated the soil beyond the surface and collected in empty spaces underground. A common misconception about aquifers is that they are underground rivers or lakes. While groundwater can seep into or out of aquifers due to their porous nature, it cannot move fast enough to flow like a river. The rate at which groundwater moves through an aquifer varies depending on the rock's permeability. When a water-bearing rock readily transmits water to wells and springs, it is called an aquifer. The picture below, taken from Wikipedia (by © Hans Hillewaert), illustrates two different types of aquifer.

Data Wrangling

The data runs from 2000-01-01 to 2020-06-30 with 22 columns/variables. However, there were quite a few missing values, so I decided to explore the missing values and impute them.

Missing Values

Missing Values per Variable

The table below shows the number of missing values in the original dataframe. Given that the dimension of the data is 7487 x 23, some variables have a lot of missing data.

Patterns of Missing Variables

The graph below shows the variables that tend to be missing together. The depth to groundwater variables tend to be missing together, and the same goes for the volume variables of the different wells (pozzo). Knowing these patterns helps to identify which variables to keep or discard if we want to use multivariate analysis later.

Patterns of Missing Values By Year

The chart below shows the patterns of missing values for each variable by year. The target output is Depth_to_Groundwater_Podere_Casetta, which was missing prior to 2008 and after 2018. Therefore, I subset the dataset to the data between 2008 and 2018. Based on the missing patterns, I decided to keep only 'Date', 'Depth_to_Groundwater_Podere_Casetta', 'Temperature_Pentolina', 'Temperature_Monteroni_Arbia_Biena', 'Rainfall_Simignano', 'Rainfall_Montalcinello', 'Rainfall_Sovicille', 'Rainfall_Scorgiano' and 'Rainfall_Pentolina' for a possible multivariate analysis. My main focus here is the univariate analysis, which consists of 'Date' and 'Depth_to_Groundwater_Podere_Casetta'.
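The per-year tally behind a chart like this can be sketched in a few lines. This is a minimal Python illustration, not the notebook's code: the function name `missing_by_year` and the toy values are hypothetical, and missing observations are assumed to be represented as `None`.

```python
from collections import Counter
from datetime import date, timedelta

def missing_by_year(dates, values):
    """Count missing (None) observations per calendar year."""
    counts = Counter()
    for d, v in zip(dates, values):
        # Adding a bool (0 or 1) ensures every year gets a key, even with no gaps.
        counts[d.year] += v is None
    return dict(counts)

# Hypothetical toy series: 2007-12-30 .. 2008-01-03,
# missing on the last two days of 2007 and on one day in 2008.
start = date(2007, 12, 30)
dates = [start + timedelta(days=i) for i in range(5)]
values = [None, None, -7.1, None, -7.0]

print(missing_by_year(dates, values))  # {2007: 2, 2008: 1}
```

Years in which the target is almost entirely missing are the natural candidates to cut when subsetting.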

Univariate Imputation

Missing Values Gaps In Time Series

Even after subsetting the original data, the series still contains gaps. The graph below shows where they are: the red bars mark the gaps.
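Locating the gaps amounts to finding runs of consecutive missing values. A minimal Python sketch of that idea follows; the helper `find_gaps` and the sample values are hypothetical and assume missing points are stored as `None`.

```python
def find_gaps(values):
    """Return (start_index, length) for each run of consecutive None values."""
    gaps = []
    start = None  # index where the current gap began, if we are inside one
    for i, v in enumerate(values):
        if v is None and start is None:
            start = i                       # a new gap opens here
        elif v is not None and start is not None:
            gaps.append((start, i - start))  # the gap closed just before i
            start = None
    if start is not None:                    # series ends inside a gap
        gaps.append((start, len(values) - start))
    return gaps

# Hypothetical depth readings with two gaps.
series = [-7.2, None, None, -7.0, None]
print(find_gaps(series))  # [(1, 2), (4, 1)]
```

Knowing the length of each gap matters because long gaps are much harder to impute sensibly than isolated missing days.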

Imputation

Time series analysis needs regularly spaced data without gaps. As such, I imputed the missing values using the methods listed below, and the graphs show how well each imputation fits. In the chart, seasonal decomposition imputation gave the best fit compared to the other methods.

NOCB: next observation carried backward, fill na with next observation
Mean: fill na with mean
Exponential: fill na by using exponential weighted moving average
Linear: impute na by using linear interpolation
Stine: impute na by using Stineman interpolation
Seadec: impute missing values using seasonal decomposition
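Three of the simpler methods in the list above (NOCB, mean and linear interpolation) can be sketched in plain Python. This is only an illustration under the assumption that missing points are `None`; the function names are hypothetical and this is not the code used in the notebook. The exponential, Stineman and seasonal decomposition methods are more involved and are not shown.

```python
def impute_nocb(xs):
    """Next observation carried backward; trailing gaps stay None."""
    out = list(xs)
    nxt = None
    for i in range(len(out) - 1, -1, -1):  # walk from the end to the start
        if out[i] is None:
            out[i] = nxt
        else:
            nxt = out[i]
    return out

def impute_mean(xs):
    """Replace every missing value with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in xs]

def impute_linear(xs):
    """Linear interpolation between neighbouring observations;
    leading and trailing gaps are left as None."""
    out = list(xs)
    known = [i for i, x in enumerate(xs) if x is not None]
    for a, b in zip(known, known[1:]):
        step = (xs[b] - xs[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = xs[a] + step * (i - a)
    return out

# Hypothetical series with two gaps.
series = [1.0, None, None, 4.0, None, 6.0]
print(impute_nocb(series))    # [1.0, 4.0, 4.0, 4.0, 6.0, 6.0]
print(impute_linear(series))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

On a seasonal series, mean imputation flattens the gap and NOCB creates a step, which is one reason a method that accounts for seasonality can fit better.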

Exploratory Analysis

Correlation Matrix

Here's the correlation matrix for the variables that I've chosen. Even though I am not doing the multivariate analysis, it is interesting to see that the temperature at Monteroni is highly correlated with the temperature at Pentolina, while the rainfall at Montalcinello is highly correlated with the rainfall at Simignano and Sovicille.
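For reference, each cell of such a matrix is a pairwise correlation coefficient. The sketch below assumes Pearson correlation (the text does not say which coefficient was used) and uses made-up numbers; the column names are shortened, hypothetical stand-ins.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Map every (name_a, name_b) pair to its Pearson correlation."""
    return {(a, b): pearson(columns[a], columns[b])
            for a in columns for b in columns}

# Hypothetical toy data, not values from the dataset.
columns = {
    "temp_a": [10.0, 12.0, 15.0, 11.0],
    "temp_b": [11.0, 13.0, 16.0, 12.0],   # temp_a shifted by +1: correlation 1
    "rain":   [5.0, 0.0, 0.0, 3.0],
}
matrix = correlation_matrix(columns)
print(round(matrix[("temp_a", "temp_b")], 6))
print(round(matrix[("temp_a", "rain")], 6))
```

Highly correlated inputs carry largely the same information, which is worth keeping in mind when choosing regressors for a multivariate model.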

Time Series Plot

Here's what the time series looks like after the imputation.

Seasonality Plot

Here's a run-down of the series by year. In each year, there seems to be a gradual increase from Q1 to Q2 before a drop. Then, from around Q3, the series climbs slowly towards the end of the year.

Fitting Models And Forecasting

To train and select the models, I split the time series into training and test data. In contrast to conventional machine learning problems, you can't randomly subset the data because the observations are ordered in time. The training data starts at 2008-01-01 and ends at 2016-12-31. The test data starts from 2017-01-01 onwards. Because of the seasonal component, five models were used:

Seasonal naive forecast (Snaive)
Seasonal ARIMA (SARIMA)
Exponential Smoothing
Seasonal decomposition + ARIMA on seasonally adjusted data
Seasonal decomposition + Exponential Smoothing on seasonally adjusted data
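The chronological split and the simplest of these models, the seasonal naive forecast, can be sketched as follows. This is a Python illustration with hypothetical helper names and toy numbers, not the notebook's implementation; the seasonal naive forecast here simply repeats the last observed season.

```python
from datetime import date, timedelta

def chronological_split(dates, values, first_test_date):
    """Split a series in time order: everything before first_test_date is training."""
    cut = next((i for i, d in enumerate(dates) if d >= first_test_date), len(dates))
    return values[:cut], values[cut:]

def seasonal_naive(train, horizon, period):
    """Forecast by repeating the last full season of the training data."""
    last_season = train[-period:]
    return [last_season[h % period] for h in range(horizon)]

# Hypothetical toy series around the train/test boundary used in the text.
start = date(2016, 12, 30)
dates = [start + timedelta(days=i) for i in range(4)]   # 2016-12-30 .. 2017-01-02
values = [1.0, 2.0, 3.0, 4.0]
train, test = chronological_split(dates, values, date(2017, 1, 1))
print(train, test)  # [1.0, 2.0] [3.0, 4.0]

# With a season of length 3, the forecast cycles through the last three points.
print(seasonal_naive([1, 2, 3, 4, 5, 6], horizon=5, period=3))  # [4, 5, 6, 4, 5]
```

For daily data with yearly seasonality the period would be about 365, so the forecast for a given day is the value observed on the same day one year earlier.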

Accuracy of Models

The forecasts from the five models were then compared against the test data. The table below shows that Exponential Smoothing has the lowest root mean square error (RMSE) and the best accuracy overall, but the forecast plot tells a different story.

Forecast Plot

From the forecast plot, we can see that even though Exponential Smoothing gave the lowest RMSE overall, it has just predicted the mean. Snaive basically repeated the pattern from one season ago. Seasonal ARIMA had the next best accuracy scores. For the first year, its prediction was quite close to the actual values, but in 2018 the actual series rose sharply from the start of the year. Even though SARIMA's prediction for 2018 was off by a bit, it generally preserved the pattern.
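Why a flat forecast can win on RMSE is easy to reproduce with toy numbers. In the Python sketch below, all values are made up for illustration: a constant forecast at the mean beats a forecast that has the right shape but is shifted by one step.

```python
from math import sqrt

def rmse(actual, forecast):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical seasonal pattern with mean 0.
actual  = [0, 1, 0, -1, 0, 1, 0, -1]
flat    = [0] * len(actual)              # predicts the mean everywhere
shifted = [1, 0, -1, 0, 1, 0, -1, 0]     # right shape, one step out of phase

print(round(rmse(actual, flat), 4))     # 0.7071
print(round(rmse(actual, shifted), 4))  # 1.0
```

The flat forecast scores better even though it says nothing about the seasonal movement, which is why the accuracy table needs to be read together with the plot.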

Diagnostics of Seasonal Arima

Let's look at the diagnostics of the seasonal ARIMA model. The model was selected with auto.arima, which uses AICc to pick the best fit, and the result was Arima(3,1,0)(0,1,0)[365]. The Ljung-Box test showed that there is still autocorrelation in the residuals (p < 0.05). This is echoed in the ACF of the residuals. Looking at the residual plot, there seem to be some sudden spikes around 2012. A model that captures the changing variance might be useful.
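For readers unfamiliar with the test, the Ljung-Box statistic is Q = n(n+2) Σ r_k² / (n−k) over lags k = 1..h, where r_k is the lag-k autocorrelation of the residuals; a large Q relative to a chi-square distribution indicates remaining autocorrelation. The Python sketch below computes Q for a deliberately autocorrelated toy series. The function names and data are hypothetical, and the p-value lookup is left out.

```python
def acf(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1 .. max_lag."""
    n = len(x)
    r = acf(x, max_lag)
    return n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))

# Hypothetical "residuals" that alternate in sign: strongly autocorrelated.
residuals = [1.0, -1.0] * 10
q = ljung_box_q(residuals, max_lag=1)
print(round(acf(residuals, 1)[0], 2))  # -0.95
print(q > 3.841)  # True: exceeds the 5% chi-square critical value for 1 df
```

When the test is applied to residuals of a fitted ARIMA model, the degrees of freedom are usually reduced by the number of estimated coefficients.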

Conclusion

There appears to be a seasonal component in the depth to groundwater in the Luco Aquifer. The level seems to rise during the early part of the year before falling, only to gradually climb again from Q3. The best model we get using the daily data is Arima(3,1,0)(0,1,0)[365]. It predicts the depth in 2017 accurately, but the prediction is off in 2018. Looking at accuracy alone is not enough: it is imperative to plot the predictions to understand whether the forecast is reasonable. A model that captures the changing variance, such as GARCH, might be useful. However, given that the water level is likely influenced by other water bodies, a multivariate time series analysis of the other inputs using Vector Autoregression (VAR) or Vector Autoregression Moving Average (VARMA) could be useful, too. Also, I'm inclined to assume that the rainfall levels are exogenous variables, so SARIMAX might be of use here.

*Source of the clock image for the badge
