# Using Big Data Analytics to Investigate the Relationship Between Electricity Demand and Twitter Activity

Group: Kevin Mark Murning (kmm2344), Rifqi Luthfan (rl3154), Rohan Raghuraman (rr3417)
Effective energy demand forecasting plays a vital role in power systems, especially in resource allocation and economically viable pricing. Some trends are easy to infer: for example, demand increases in winter months due to heating. However, outlier events can cause spikes in demand that are hard to predict. How can we predict anomalous changes in demand? Much work has been done on monitoring social media for time series forecasting, but very little in the energy sector.

**Our question:** Can Twitter activity, formulated into topics, be linked in a causal relationship with energy demand spikes?
## Steps

1. Create a new GCP project (refer to this documentation: https://cloud.google.com/resource-manager/docs/creating-managing-projects). Also create a service account (refer to this documentation: https://cloud.google.com/iam/docs/creating-managing-service-accounts) and save its credentials in a `credentials` folder in the root folder of this repo (a quick smoke test of the setup follows step 4).
2. Create a new Google Cloud Storage bucket (refer to this documentation: https://cloud.google.com/storage/docs/creating-buckets):
   1. Create the top-level bucket.
   2. Create a folder to save the data for the project.
   3. Create 3 folders inside the folder created in step 2.2:
      1. for saving the electricity load data
      2. for saving the forecast results data
      3. for saving the tweets data
3. Create a new dataset in Google BigQuery (refer to this documentation: https://cloud.google.com/bigquery/docs/datasets).
4. Create a new Google Cloud Dataproc cluster (refer to this documentation: https://cloud.google.com/dataproc/docs/guides/create-cluster) with the JUPYTER component and the additional libraries needed, using this command (don't forget to change the bucket name to the one created in step 2.1):
   ```bash
   gcloud dataproc clusters create jupyter-cluster \
       --optional-components=JUPYTER \
       --image-version=preview --enable-component-gateway \
       --metadata 'PIP_PACKAGES=requests_oauthlib google-cloud-bigquery spark-nlp plotly pandas-gbq pystan==2.19.1.1 prophet yfinance tweepy==v3.10.0' \
       --bucket=<BUCKET_NAME> --region=us-east1 \
       --initialization-actions=gs://dataproc-initialization-actions/python/pip-install.sh
   ```
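Once the cluster is up, you can sanity-check the service-account credentials from any notebook. This is a minimal sketch, assuming the credentials file was saved as `credentials/service-account.json` (a placeholder name):

```python
from google.cloud import bigquery

# Minimal smoke test for the setup above. The credentials filename is a
# placeholder -- use whatever you saved in the `credentials` folder (step 1).
client = bigquery.Client.from_service_account_json("credentials/service-account.json")
for dataset in client.list_datasets():
    print(dataset.dataset_id)  # should list the dataset created in step 3
```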
5. Collecting electricity load data from NYISO:
   1. Open the notebook titled `00-data-collection-electricity_load.ipynb` in GCloud Dataproc.
   2. Change the default constants:
      1. `BUCKET_NAME` -> use the Cloud Storage bucket created in step 2.1
      2. `PROJECT_BUCKET` -> use the folder from step 2.2
      3. `FOLDER_NAME` -> use the first folder from step 2.3.1
      4. `BIG_QUERY_TABLE_NAME` -> use the dataset created in step 3
   3. Change the period in the cell that iterates over the years and months (a sketch of this loop follows this step):
      - Change the start and end year of the for loop.
      - Change the start and end month of the for loop.
      - Change the stopping condition in `if ((year==2021) and (month==10)): break`.
   4. Run all the cells in the notebook.
   - Note: we are unable to upload the electricity load data to this repo.
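For reference, the iteration cell has roughly the shape below; `fetch_and_store` is a hypothetical placeholder for the notebook's actual NYISO scraping and Cloud Storage upload code:

```python
def fetch_and_store(year: int, month: int) -> None:
    # Hypothetical placeholder for the notebook's NYISO scraping and
    # Cloud Storage upload logic.
    print(f"collecting load data for {year}-{month:02d}")

# Rough shape of the notebook's iteration cell.
for year in range(2010, 2022):        # start/end year of the for loop
    for month in range(1, 13):        # start/end month of the for loop
        if (year == 2021) and (month == 10):
            break                     # stopping condition
        fetch_and_store(year, month)
```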
6. Time series forecasting:
   1. Open the notebook titled `10-modelling-fbprophet.ipynb` in GCloud Dataproc.
   2. Change the default constants:
      1. `BUCKET_NAME` -> use the same bucket name as in step 5.2.1
      2. `PROJECT_BUCKET` -> use the same folder name as in step 5.2.2
      3. `FOLDER_NAME` -> use the second folder from step 2.3.2
      4. `BIG_QUERY_TABLE_NAME` -> use the dataset created in step 5 (use the hourly table)
   3. Run all the cells in the notebook (the core fit/predict cycle is sketched after this step).
   - Note: we have uploaded the forecast results data under the `forecast_data` folder in this repo.
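The forecasting notebook is built around Prophet's standard fit/predict cycle. The sketch below shows the general shape; the input file, the 30-day horizon, and the `ds`/`y` column names (Prophet's required input format) are illustrative assumptions, not the notebook's exact code:

```python
import pandas as pd
from prophet import Prophet

# Minimal Prophet sketch. Prophet expects a dataframe with columns
# `ds` (timestamp) and `y` (the value to forecast, here hourly load).
df = pd.read_csv("hourly_load.csv", parse_dates=["ds"])  # placeholder input

model = Prophet()
model.fit(df)

# Forecast 30 days of hourly values beyond the training data.
future = model.make_future_dataframe(periods=24 * 30, freq="H")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```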
7. Evaluate the forecast and identify critical dates:
   1. Open the notebook titled `12-forecast_results_charts.ipynb` in GCloud Dataproc and analyze the forecast results (this is a manual process requiring domain-expert consultation).
   2. Open the notebook titled `11-evaluate-model.ipynb` in GCloud Dataproc:
      - Use the same constants as in step 6.
      - Change the values of the RMSE threshold and the anomaly occurrence count based on the analysis in step 7.1.
        - The notebook in this GitHub repo uses `(5 * RMSE)` and 3 anomaly occurrences as thresholds (see the sketch after this step).
      - Run all the cells in the notebook; this produces a new CSV file of dates named `dates_for_twitter.csv` inside the forecast data folder (from step 2.3.2).
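To make the thresholds concrete, here is a hedged sketch of the selection rule as described above (flag hours whose absolute error exceeds 5 × RMSE, then keep dates with at least 3 flagged hours); the file name and the `ds`/`y`/`yhat` column names are assumptions, and the notebook remains the authoritative version:

```python
import pandas as pd

# Sketch of the critical-date rule: an hour is anomalous when its absolute
# error exceeds 5 * RMSE; a date is critical with >= 3 anomalous hours.
df = pd.read_csv("forecast_results.csv", parse_dates=["ds"])  # placeholder input

rmse = ((df["y"] - df["yhat"]) ** 2).mean() ** 0.5
df["anomaly"] = (df["y"] - df["yhat"]).abs() > 5 * rmse

counts = df.groupby(df["ds"].dt.date)["anomaly"].sum()
critical_dates = counts[counts >= 3].index
pd.Series(critical_dates, name="date").to_csv("dates_for_twitter.csv", index=False)
```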
8. Get Twitter data:
   1. Open the notebook `GET_Tweets.ipynb` in the "Full Archive Tweets Search" folder.
   2. Get tweets based on the identified dates (a sketch of the API call follows this step):
      - Manually change the dates based on the results from step 7.
      - Manually change the file names when storing to Cloud Storage.
      - Re-run for all dates.
   - Note: we have uploaded the tweets results data under the `tweets_data` folder in this repo.
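For orientation, a full-archive query with tweepy 3.10 (the version pinned in the cluster command) looks roughly like this; the credentials, the `dev` environment label, the query, and the date window are all placeholders:

```python
import tweepy

# Placeholder credentials -- full-archive search requires an approved
# premium Twitter API account and a dev environment label.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Dates are in YYYYMMDDHHmm format; use a critical date from dates_for_twitter.csv.
tweets = api.search_full_archive(
    "dev",                      # your premium environment label
    '"new york" lang:en',       # placeholder query
    fromDate="202008010000",
    toDate="202008020000",
)
for tweet in tweets:
    print(tweet.created_at, tweet.text)
```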
9. LDA topic modelling:
   1. Open the `lda` folder.
   2. In a Python environment, run `pip install -r requirements.txt` from that folder.
   3. Run `main.py`, for example:
      `python main.py --start_date='2020-01-01' --end_date='2021-01-01' --region='all' --tweet_directory='tweets_data'`
   4. The HTML visualization will be output to `lda.html`.
   5. Analyze the LDA results from each area; currently this is done manually and requires domain-expert consultation (the core modelling idea is sketched after this step).
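For readers unfamiliar with LDA, the pipeline (implemented in `run_LDA.py`) boils down to fitting a topic model over tokenized tweets. This is a generic illustration using gensim; the library choice, toy documents, and parameters are assumptions, not the repo's exact code:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized tweets standing in for the preprocessed tweets_data.
docs = [
    ["power", "outage", "storm", "queens"],
    ["heat", "wave", "record", "temperature"],
    ["storm", "wind", "power", "down"],
]

dictionary = corpora.Dictionary(docs)               # vocabulary
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=42)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```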
10. Web dashboard visualization:
    1. Create a new Compute Engine instance (refer to this documentation: https://cloud.google.com/compute/docs/instances/create-start-instance).
    2. Clone this repo on the instance.
    3. In a Python environment, run `pip install -r requirements.txt` from the root folder of the repo.
    4. Under the `database_service` folder, open `ElectricityLoadResource.py` and change `credential_location`, `CREDENTIALS`, `PROJECT_NAME`, and `DB_NAME`.
    5. Run `streamlit run web_dashboard/app.py` (a toy sketch of the dashboard's line-chart idea follows this step).
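To give a feel for what `app.py` and `BuildingComponents.py` do, here is a toy Streamlit sketch of the line-chart component; the CSV path and column names are placeholders, not the repo's actual data layout (the real app reads from BigQuery via `ElectricityLoadResource.py`):

```python
import pandas as pd
import streamlit as st

# Toy sketch of the dashboard's line chart: pick a region, plot actual
# load (y) against the Prophet forecast (yhat) over time (ds).
st.title("NYISO Electricity Load: Actual vs. Forecast")

df = pd.read_csv("forecast_data/forecast_results.csv", parse_dates=["ds"])
region = st.selectbox("Region", sorted(df["region"].unique()))

chart_df = df[df["region"] == region].set_index("ds")[["y", "yhat"]]
st.line_chart(chart_df)
```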
## Data

- Time-series electricity load data from NYISO (web-scraped from http://dss.nyiso.com/dss_oasis/PublicReports)
  - Across 15 regions in NY state
  - 5-minute granularity
  - From 01/01/2010 to 12/31/2020 for training, 01/01/2021 to 10/31/2021 for validation
- Forecast data from FBProphet
  - Produces energy demand predictions
  - Anomalous dates are identified by comparison with historical data
- Streamed Twitter data (requires Twitter Full Archive Search API approval)
  - Region-level top/trending tweets with timestamps
  - Includes pictures/videos, likes, and retweets
## Code Files

Notebooks:
- `00-data-collection-electricity_load.ipynb` -> data collection from NYISO
- `10-modelling-fbprophet.ipynb` -> time series forecasting using FBProphet
- `11-evaluate-model.ipynb` -> forecast results evaluation and critical-date identification
- `12-forecast_results_charts.ipynb` -> scratch notebook used for analyzing trends and seasonality of forecast results
- `GET_Tweets.ipynb` (in the "Full Archive Tweets Search" folder) -> gets tweets using the Twitter Full Archive Search API

LDA topic modelling (`lda` folder):
- `load_tweets.py` -> loads the collected tweets_data
- `preprocess_tweets.py` -> text preprocessing for the tweets
- `run_LDA.py` -> runs the LDA topic modelling
- `visualize_LDA.py` -> visualizes LDA results as a bubble chart & bar chart HTML file
- `main.py` -> main program to run the LDA

Web dashboard:
- `app.py` -> main program for the web app
- `application_service` folder:
  - `BuildingComponents.py` -> code to create the Line Chart and LDA Visualization components of the web app
- `database_service` folder:
  - `ElectricityLoadResource.py` -> interfaces with the Google BigQuery database to retrieve electricity load data
  - `LDAResource.py` -> interfaces with the LDA components of this project