Skip to content

This is my college mega project assigned to me and my teammate - Aayush Choudhary on our final semester of B.Tech. The project title was "COVID 19 TIME SERIES FORECASTING USING XGBOOST AND ARIMA MODEL (HYBRID)". There is also a research paper prepared for this project.

Notifications You must be signed in to change notification settings

Archishman2002/COVID-19-TIME-SERIES-FORECASTING

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

Integrating Sequence-to-Sequence Models with Variational Autoencoders for Enhanced COVID-19 Time Series Forecasting: A Hybrid Approach with XGBoost

ABSTRACT

The research aims to improve the accuracy of COVID-19 time series forecasting by integrating advanced modeling techniques, including sequence-to-sequence (Seq2Seq) models with variational autoencoders (VAEs) and XGBoost. The study utilises a comprehensive dataset, applies robust preprocessing techniques, and introduces a novel hybrid architecture for forecasting. The results demonstrate superior forecasting accuracy compared to traditional methods, establishing the proposed hybrid approach as a promising solution for public health decision-making.

INTRODUCTION

Background:

The COVID-19 pandemic has profoundly impacted every facet of human life, with widespread illness, economic disruption, and social upheaval. As the virus continues to evolve and spread, public health officials and policymakers desperately require accurate forecasting tools to predict outbreaks, inform resource allocation, and guide disease mitigation strategies. Accurate time series forecasting of confirmed cases, deaths, and hospitalisation rates lies at the heart of effective preparedness and response efforts. Traditional time series forecasting models have often struggled to adapt to the highly dynamic and non-linear nature of COVID-19 data. This data exhibits complex temporal dependencies, intricate spatial variations, and stochastic fluctuations influenced by diverse factors like social distancing measures, vaccination rates, and viral mutations. Consequently, relying solely on conventional statistical or machine learning models can lead to suboptimal forecasting accuracy and potentially misleading projections.

Research Problem and Objectives: To address the limitations of existing approaches, this research proposes a novel hybrid architecture that integrates the strengths of multiple advanced modeling techniques for enhanced COVID-19 time series forecasting. Our study aims to achieve the following objectives: Develop a hybrid forecasting model by combining sequence-to-sequence (Seq2Seq) models with variational autoencoders (VAEs) and XGBoost. Seq2Seq models excel at capturing temporal dependencies within time series data, while VAEs offer superior capability in handling inherent data uncertainties and latent representations. XGBoost, a powerful gradient boosting algorithm, serves as an additional layer to enhance model generalisability and robustness.

Utilise a comprehensive COVID-19 dataset encompassing confirmed cases, deaths, hospitalisation rates, and relevant socio-economic indicators. This diverse dataset allows the model to capture the interplay between viral dynamics and societal factors, leading to more realistic and informative forecasts. Apply robust preprocessing techniques to address data quality issues, including missing values, outliers, and temporal inconsistencies. Data quality significantly impacts model performance, and pre-processing ensures reliable and accurate representations for effective learning. Evaluate the performance of the proposed hybrid model through rigorous comparative analysis with benchmark forecasting methods. Statistical error metrics, such as mean squared error (MSE) and R-squared, will be used to assess the accuracy and generalisability of the proposed approach. Translate the research findings into practical insights for public health decision-making. The analysis will identify key drivers of disease spread and provide valuable forecasts for policymakers to optimise resource allocation, implement targeted interventions, and mitigate the impact of future outbreaks.

Significance of the Study: Accurate forecasting of COVID-19 dynamics is crucial for effective pandemic management. This research holds significant promise for: Improving the prediction accuracy of confirmed cases, deaths, and hospitalisation rates, enabling better preparedness and resource allocation. Providing valuable insights into the complex spatiotemporal dynamics of COVID-19, informing targeted interventions and mitigation strategies. Demonstrating the efficacy of integrating advanced modeling techniques in time series forecasting, pushing the boundaries of predictive accuracy and applicability. Establishing a robust and versatile forecasting framework adaptable to other infectious disease outbreaks and public health challenges. By successfully achieving its objectives, this research has the potential to contribute significantly to public health preparedness and response, ultimately saving lives and mitigating the socioeconomic burden of the COVID-19 pandemic and future disease outbreaks.

Overview of Methodology: The proposed research methodology encompasses the following key steps: Data collection and preprocessing: We will collect a comprehensive COVID-19 dataset from reliable sources like government agencies and international health organisations. This data will include confirmed cases, deaths, hospitalisation rates, and relevant socio-economic indicators (e.g., population density, travel restrictions, vaccination rates). Missing values and outliers will be addressed using appropriate imputation and outlier detection techniques. Temporal inconsistencies will be reconciled through data synchronisation and alignment procedures. Hybrid model development: We will develop a hybrid forecasting model that integrates Seq2Seq, VAE, and XGBoost components. The Seq2Seq model will capture the temporal dependencies within the time series data. The VAE will introduce a probabilistic latent space representation, allowing the model to handle uncertainties and complex non-linearities. Finally, XGBoost will enhance the model's generalisability and robustness through its ensemble learning approach. Model training and evaluation: The hybrid model will be trained on a subset of the pre-processed data. Various hyperparameters will be optimised through grid search or other optimisation techniques to maximise forecasting accuracy. The model's performance will be evaluated on a hold-out test set using different statistical error metrics, such as MSE and R-squared. Comparative analysis: The proposed hybrid model will be compared with established benchmark forecasting methods, such as ARIMA, exponential smoothing, and other machine learning models.

LITERATURE REVIEW

Existing Research on COVID-19 Time Series Forecasting: Since the onset of the pandemic, a plethora of research has focused on developing accurate forecasting models for COVID-19 cases, deaths, and hospitalisation rates. Existing approaches can be broadly categorised into: Statistical models: Traditional models like ARIMA (Autoregressive Integrated Moving Average) and SARIMA (Seasonal ARIMA) leverage past data patterns to predict future trends. While these models are straightforward to implement, they often struggle with capturing nonlinear dynamics and external influences. Machine learning models: Supervised learning algorithms like Long Short-Term Memory (LSTM) networks and Support Vector Machines (SVMs) have garnered attention for their ability to learn complex relationships within data. However, their performance can be sensitive to data quality and hyperparameter tuning. Ensemble methods: Combining multiple models to leverage their individual strengths has shown promise. Studies employing stacking or boosting techniques to combine ARIMA with ML models have demonstrated improved forecasting accuracy.

Gaps in Current Knowledge and Limitations of Existing Approaches: Despite the advancements in COVID-19 forecasting, limitations persist: Limited ability to handle uncertainties and non-linearities: Many traditional models assume linear relationships and struggle with the inherent stochasticity and complex dynamics of COVID-19 data. Neglect of latent representations and hidden factors: Existing models often focus solely on observed data, overlooking the potential benefits of capturing and exploiting latent variables that influence disease spread. Challenges in incorporating socio-economic factors: Traditional models primarily focus on historical case data, overlooking the impact of external factors like social distancing measures, vaccination rates, and economic indicators on disease dynamics. Generalisability and robustness issues: Some models achieve high accuracy on specific datasets but lack adaptability to new data or diverse geographical contexts.

Addressing the Gaps: How this Research Fills the Void: The proposed hybrid approach addresses these limitations through its unique combination of techniques: Integration of Seq2Seq and VAEs: The Seq2Seq model captures temporal dependencies, while the VAE introduces a probabilistic latent space representation, enabling the model to handle uncertainties and non-linearities effectively. Incorporation of XGBoost: The ensemble learning approach of XGBoost enhances the model's generalisability and robustness, improving its performance across diverse datasets and contexts. Utilisation of a comprehensive dataset: Including socio-economic indicators allows the model to capture the interplay between viral dynamics and societal factors, leading to more realistic and informative forecasts. Emphasis on data quality and robust pre-processing: Addressing missing values, outliers, and temporal inconsistencies ensures reliable data representations for better learning and forecasting accuracy. This research aims to bridge the gap between existing modeling techniques and the specific needs of COVID-19 forecasting by employing a novel hybrid approach that addresses inherent data complexities and incorporates relevant socio-economic factors, ultimately leading to more accurate and impactful predictions.

METHODOLOGY

Data Sources and Selection: Our research will utilise a comprehensive COVID-19 dataset encompassing diverse data streams from reliable sources: Government agencies: Data on confirmed cases, deaths, and hospitalisations will be sourced from national and regional health departments, ensuring consistent reporting and high data quality. International organisations: Datasets from organisations like the World Health Organisation (WHO) and Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) will provide global and regional perspectives on disease dynamics. Socio-economic data sources: Relevant socio-economic indicators like population density, travel restrictions, and vaccination rates will be retrieved from credible sources like national statistical agencies and international databases like the World Bank's Open Data platform.

Data Pre-processing Techniques: To ensure the quality and integrity of the collected data, we will implement robust pre-processing techniques: Missing value imputation: Missing data points will be imputed using appropriate methods like K-Nearest Neighbors (KNN) or mean/median imputation based on data characteristics and temporal proximity. Outlier detection and removal: Outliers will be identified and removed through methods like interquartile range (IQR) or Z-score analysis, minimising their influence on model training and accuracy. Temporal data alignment: Inconsistencies in timestamps or reporting frequencies will be reconciled through data synchronisation and alignment procedures, ensuring accurate temporal dependencies within the data. Feature engineering and scaling: Additional features may be derived from existing data (e.g., daily case growth rate, vaccination rate per capita). Standardisation or normalisation techniques will be applied to ensure features are on similar scales and prevent bias during model training.

Data Splitting For Training & Evaluation: To ensure robust and generalisable model performance, the pre-processed data will be split into three independent subsets: Training set (70%): This largest subset, comprising 70% of the data, will be used to train the hybrid forecasting model. The model will learn the underlying patterns and relationships within the data based on this training set. Validation set (15%): This middle-sised subset, consisting of 15% of the data, will be used for hyperparameter tuning and early stopping during model training. This involves monitoring the model's performance on the validation set to avoid overfitting and optimise hyperparameters for optimal generalisability. Test set (15%): This final subset, containing the remaining 15% of the data, will be reserved for independent evaluation of the final trained model. The model's performance on the unseen test set will provide an unbiased assessment of its generalisability and real-world forecasting accuracy.

Splitting the data in this way helps to: Prevent overfitting: By training the model on a subset of the data, we reduce the risk of the model memorising the training data and failing to generalise to unseen examples. Tune hyperparameters effectively: The validation set allows us to adjust the model's hyperparameters (e.g., learning rate, network architecture parameters) to achieve the best possible performance without overfitting to the training data. Evaluate model generalisability: The final test set provides an unbiased assessment of how well the model performs on data it has never seen before. This is crucial for understanding the model's real-world forecasting accuracy and limitations. The specific ratios chosen for the training, validation, and test sets (70/15/15) are commonly used in time series forecasting and offer a good balance between training the model sufficiently and having enough data for validation and testing. However, the optimal split ratio may vary depending on the specific dataset and model complexity, and may require further experimentation or sensitivity analysis. This careful data splitting strategy is essential for ensuring robust and generalisable forecasting performance, ultimately allowing the proposed hybrid model to provide valuable insights and accurate predictions for informing public health decision-making during the COVID-19 pandemic.

Sequence-to-Sequence (Seq2Seq) Model: The Seq2Seq model will capture the temporal dependencies within the time series data. In this study, we plan to explore and compare the performance of different Seq2Seq architectures, including: Encoder-decoder with LSTM networks: This widely used architecture employs LSTM units in both the encoder and decoder for capturing long-term dependencies and generating accurate forecasts. Transformer-based architectures: Utilising the attention mechanism, these models learn relationships between past and future time steps without explicit recurrence, potentially leading to improved performance on complex sequential data. The Seq2Seq model will be trained on the lagged features, such as past confirmed cases, deaths, and hospitalisation rates, along with relevant socio-economic indicators. The output of the decoder will be the predicted values for future cases, deaths, or hospitalisations, depending on the specific forecasting task.

Variational Autoencoder (VAE) Integration: To handle uncertainties and latent representations, we will integrate a VAE into the architecture. The VAE encodes the input data into a latent space, capturing the underlying factors influencing disease dynamics. This latent representation allows the model to: Model complex non-linearities: By learning intricate relationships in the latent space, the VAE can capture dependencies that may not be directly evident in the observed data. Introduce stochasticity and uncertainty: The probabilistic nature of the VAE allows the model to generate probabilistic forecasts, accounting for inherent uncertainties in disease spread. The learned latent variables from the VAE will be concatenated with the output of the Seq2Seq decoder to create a richer representation for prediction. This combined representation captures both temporal dependencies and latent factors, potentially leading to more accurate and robust forecasts.

XGBoost for Enhanced Generalisability: Finally, we will leverage XGBoost as an additional layer to enhance the model's generalisability and robustness. XGBoost is a powerful gradient-boosting algorithm that combines multiple weak learners to create a stronger ensemble model. This approach offers several advantages: Reduced variance and overfitting: The ensemble nature of XGBoost reduces the model's susceptibility to overfitting and improves its performance on unseen data. Handling complex interactions: XGBoost can capture complex interactions between features, potentially leading to a more comprehensive understanding of the factors influencing COVID-19 dynamics. Improved feature importance: The algorithm provides insights into the relative importance of different features in the model's predictions, facilitating interpretation and knowledge discovery. The XGBoost model will be trained on the combined output of the Seq2Seq decoder and VAE, further refining the predictions and enhancing their generalisability to diverse contexts and datasets.

Model Training and Optimisation:

The hybrid model will be trained using an appropriate loss function, such as mean squared error (MSE) or R-squared, to minimise the discrepancy between predicted and actual values. Hyperparameters for each component (Seq2Seq, VAE, XGBoost) will be optimised through grid search or other optimisation techniques to maximise forecasting accuracy. The validation set will be used to monitor the training process and prevent overfitting. Early stopping mechanisms will be employed to avoid training for too long and potentially memorising the training data rather than generalising to unseen examples.

COVID-19 TIME SERIES DATASET LINK - https://docs.google.com/spreadsheets/d/1wsuomIcuBe44AJNtk073YNdzfo2odigUwathxzgFULw/edit?usp=sharing

About

This is my college mega project assigned to me and my teammate - Aayush Choudhary on our final semester of B.Tech. The project title was "COVID 19 TIME SERIES FORECASTING USING XGBOOST AND ARIMA MODEL (HYBRID)". There is also a research paper prepared for this project.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages