This repository contains three main projects focused on data engineering, ETL pipelines, and data analysis, demonstrating ML pipelines, Airflow-based ETL processes, and pandas-based analysis.
- End-to-End Flask ML Application
- ETL Pipeline with Airflow
- Data Analysis with Pandas
A complete machine learning pipeline implemented with Flask, incorporating MLflow and DagsHub for experiment tracking.
- Data Ingestion
- Data Validation
- Data Transformation: feature engineering and data preprocessing
- Model Trainer
- Model Evaluation: tracked with MLflow and DagsHub
```mermaid
flowchart TB
    A[Data Ingestion] --> B[Data Validation]
    B --> C[Data Transformation]
    C --> D[Model Trainer]
    D --> E[Model Evaluation]
    subgraph ML Pipeline
        A
        B
        C
        D
        E
    end
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#ddf,stroke:#333
    style D fill:#fdd,stroke:#333
    style E fill:#dfd,stroke:#333
```
- Configure settings in config.yaml
- Define data schema in schema.yaml
- Set model parameters in params.yaml
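As an illustrative sketch only (the actual keys and paths depend on the project, and these names are hypothetical), the split between the three files might look like:

```yaml
# config.yaml -- pipeline paths and artifact locations (illustrative keys)
data_ingestion:
  source_url: "https://example.com/data.csv"   # hypothetical source
  raw_data_dir: "artifacts/raw"

# params.yaml -- model hyperparameters (illustrative keys)
model_trainer:
  alpha: 0.5
  l1_ratio: 0.5

# schema.yaml -- expected columns and dtypes (illustrative keys)
columns:
  feature_1: float64
  target: int64
```

Keeping configuration, hyperparameters, and schema in separate YAML files lets each pipeline stage read only the section it needs.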
- Update entity definitions
- Modify configuration manager in src/config
- Enhance pipeline components
- Update the pipeline orchestration
- Refine main.py implementation
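The staged workflow above can be sketched as a simple sequential orchestrator. The stage names and runner below are illustrative, not the repository's actual classes:

```python
# Minimal sketch of a staged ML pipeline runner; the stage callables are
# placeholders for the real components (ingestion, validation, etc.).
from typing import Callable


def run_pipeline(stages: dict[str, Callable[[], None]]) -> list[str]:
    """Run each stage in insertion order; return the names that completed."""
    completed = []
    for name, stage in stages.items():
        stage()  # each real stage would read its own config section
        completed.append(name)
    return completed


if __name__ == "__main__":
    order = run_pipeline({
        "data_ingestion": lambda: None,
        "data_validation": lambda: None,
        "data_transformation": lambda: None,
        "model_trainer": lambda: None,
        "model_evaluation": lambda: None,
    })
    print(order)
```

Because each stage is just a callable keyed by name, adding or reordering stages only touches the dictionary passed to the runner.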
```
src/
├── datascience/
│   ├── components/
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   ├── data_validation.py
│   │   ├── model_eval.py
│   │   └── model_trainer.py
│   ├── config/
│   ├── constants/
│   ├── entity/
│   ├── pipeline/
│   └── utils/
```
- Clone the repository
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Run the application:
  ```bash
  python app.py
  ```
An automated ETL pipeline that fetches weather data from an API and stores it in a PostgreSQL database.
```mermaid
flowchart LR
    A[Create Table] --> B[Extract Weather API Data]
    B --> C[Transform Data]
    C --> D[Load to PostgreSQL]
    subgraph Airflow DAG
        A
        B
        C
        D
    end
    style A fill:#f96,stroke:#333
    style B fill:#69f,stroke:#333
    style C fill:#9cf,stroke:#333
    style D fill:#6f9,stroke:#333
```
- Data Extraction: weather API integration
- Data Transformation: processing weather information
- Data Loading: PostgreSQL database storage
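The extract-transform-load flow can be sketched end to end in a few functions. To keep the example self-contained, `sqlite3` stands in for PostgreSQL and the API call is stubbed with a fake record; the real pipeline would use a Postgres connection inside Airflow tasks:

```python
# ETL sketch mirroring the DAG above; sqlite3 replaces PostgreSQL and the
# extract step returns a hard-coded record instead of calling a weather API.
import sqlite3


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)"
    )


def extract():
    # Placeholder for the weather-API call (temperature in kelvin).
    return {"city": "London", "temp_k": 290.15}


def transform(record):
    # Convert kelvin to Celsius and round; real transforms would be richer.
    return (record["city"], round(record["temp_k"] - 273.15, 1))


def load(conn, row):
    conn.execute("INSERT INTO weather VALUES (?, ?)", row)
    conn.commit()


def run_etl(conn):
    create_table(conn)
    load(conn, transform(extract()))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    print(conn.execute("SELECT * FROM weather").fetchall())
```

Each function maps to one task in the DAG, which is what lets Airflow retry or monitor the steps independently.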
- Install Astro CLI
- Ensure Docker Desktop is running
- Start the Airflow instance:
  ```bash
  astro dev start
  ```
  If a timeout occurs, extend the wait:
  ```bash
  astro dev start --wait 15m
  ```
Jupyter notebook containing data analysis tasks using Pandas.
- CSV data loading and manipulation
- Statistical analysis
- Data filtering and grouping
- Categorical data analysis
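The kinds of operations listed above can be illustrated with a small DataFrame. The column names and values here are made up for illustration and are not from the notebook:

```python
# Sketch of pandas filtering, grouping, and categorical conversion;
# the DataFrame contents are invented for this example.
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "sales", "eng", "eng"],
    "salary": [50_000, 55_000, 70_000, 80_000],
})

# Filtering rows by condition
high_paid = df[df["salary"] > 60_000]

# Grouping with a statistical summary
mean_by_dept = df.groupby("department")["salary"].mean()

# Categorical dtype for memory-efficient analysis of repeated labels
df["department"] = df["department"].astype("category")

print(mean_by_dept.to_dict())
```

In a real task the DataFrame would come from `pd.read_csv(...)` rather than an inline literal.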
- Open PythonAssignment.ipynb in Jupyter Notebook/Lab
- Select the appropriate kernel
- Run cells sequentially
- Flask ML Application
- Add real-time prediction capabilities
- Implement A/B testing framework
- Enhance model monitoring
- ETL Pipeline
- Add more data sources
- Implement data quality checks
- Add alerting system
- Data Analysis
- Automated reporting
- Interactive visualizations
- Advanced statistical analysis
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
For questions or collaboration opportunities: