This is the fourth Assignment for Data Engineering.
In this video, I provide a comprehensive walkthrough of my Heart Attack Risk Analysis project. I explain the entire process—from data processing and exploratory data analysis to generating visualizations and setting up the CI/CD pipeline using GitHub Actions.
My project is organized into several key components:
-
Data Acquisition: I used a comprehensive dataset titled "heart_attack_prediction_dataset.csv" which includes variables like Age, Sex, Cholesterol levels, Blood Pressure, Lifestyle habits, and more.
-
Data Analysis: Utilizing Python libraries such as Pandas for data manipulation and Plotly Express for visualization, I performed exploratory data analysis to uncover significant patterns.
-
Visualization: Generated a variety of plots to visualize heart attack risks by country, hemisphere, gender, age groups, and other factors.
-
Reporting: Created an interactive HTML report using Sweetviz and compiled a detailed PDF report containing all the insights and visualizations.
-
Automation with CI/CD Pipeline: Implemented a Continuous Integration and Continuous Deployment (CI/CD) pipeline using GitHub Actions to ensure code quality and automate testing.
To maintain code quality and streamline development, I set up a CI/CD pipeline using GitHub Actions. Here's how it enhances the project:
-
Automated Testing:
- Every push triggers tests using pytest, ensuring new changes don't break existing functionality.
-
Code Formatting and Linting:
- Python Black formats the code consistently.
- Ruff lints the code to catch errors and enforce coding standards.
-
Dependency Management:
- Dependencies are installed via a pinned
requirements.txt
to ensure consistent environments across different setups.
- Dependencies are installed via a pinned
-
Continuous Integration Badges:
- The README includes badges that display the status of the pipeline steps, providing immediate feedback on the codebase's health.
This automated workflow not only saves time but also ensures that the project remains robust, maintainable, and scalable.
The aim of this project is to analyze the risk of heart attacks across different demographics and various lifestyle and health factors using data visualization techniques and statistical analysis. This analysis seeks to uncover patterns and trends that can help in predicting heart attack risks.
The project directory is organized as follows:
individual_project_huma/
│
├── data/ # Data files
│ └── heart_attack_prediction_dataset.csv
├── lib # contains helper functions for EDA and reporting
│ ├── EDA_first.py # Script for initial exploratory data analysis
| └── summary_pdf.py # Script to generate PDF report
│
├── output/ # Output files and reports
│ ├── plots/ # Generated plots
│ │ ├── hemisphere_heart_attack_risk.png
│ │ ├── continent_heart_attack_risk.png
│ │ ├── country_heart_attack_risk.png
│ │ ├── smoking_gender_heart_attack_risk.png
│ │ ├── gender_heart_attack_risk_pie.png
│ │ ├── gender_heart_attack_risk_sunburst.png
│ │ ├── cholesterol_heart_attack_risk_violin.png
│ │ ├── age_distribution_heart_attack_risk.png
│ │ ├── systolic_bp_heart_attack_risk.png
│ │ └── diastolic_bp_heart_attack_risk.png
│ │
│ ├── Report.html # Sweetviz HTML report
│ ├── summary_table.csv # CSV file for summary statistics
| └── heart_attack_report.pdf # final pdf report containing all graphs and insights
│
├── main.py # Main script to run analyses
├── requirements.txt # Project dependencies
└── README.md # Project documentation
Clone the repository:
git clone https://github.com/yourusername/individual_project_huma.git
cd individual_project_huma
Install dependencies:
make install
or pip install -r requirements.txt
- generate_and_save_plots(df, save_dir): Generates and saves various plots to visualize the heart attack risk data.
- create_summary_table(df): Creates a table of summary statistics for the given DataFrame.
- create_pdf_report(df, image_dir, output_pdf): Generates a comprehensive PDF report with all the plots and summaries.
This dataset provides an array of risk factors associated with heart attacks.
Columns Overview:
- Patient ID: Unique identifier for each patient. (Categorical)
- Age: Patient's age in years. (Numeric)
- Sex: Patient's gender. (Categorical)
- Cholesterol: Measured cholesterol levels. (Numeric)
- Blood Pressure: Recorded blood pressure levels. (Numeric)
- Heart Rate: Measured heart rate. (Numeric)
- Diabetes: Indicates whether the patient has diabetes (Yes/No). (Categorical)
- Family History: Indicates a family history of heart disease (Yes/No). (Categorical)
- Smoking: Smoking status of the patient. (Categorical)
- Obesity: Indicates obesity status. (Categorical)
- Alcohol Consumption: Alcohol consumption patterns. (Categorical)
- Exercise Hours Per Week: Number of hours the patient exercises per week. (Numeric)
- Diet: Type of diet the patient follows. (Categorical)
- Previous Heart Problems: Indicates previous heart-related issues (Yes/No). (Categorical)
- Medication Use: Usage of prescribed medication (Yes/No). (Categorical)
- Stress Level: Assessed level of stress. (Numeric)
- Sedentary Hours Per Day: Average number of sedentary hours per day. (Numeric)
- Income: Annual income bracket. (Categorical)
- BMI: Body Mass Index. (Numeric)
- Triglycerides: Level of triglycerides in the blood. (Numeric)
- Physical Activity Days Per Week: Days per week the patient engages in physical activity. (Numeric)
- Sleep Hours Per Day: Average number of hours the patient sleeps per day. (Numeric)
- Country: The patient's country of residence. (Categorical)
- Continent: The continent on which the patient lives. (Categorical)
- Hemisphere: Hemisphere of the patient's location. (Categorical)
- Heart Attack Risk: Indicates if the patient is at risk of a heart attack (0 = No, 1 = Yes). (Categorical)
- Systolic_BP: Systolic blood pressure reading. (Numeric)
- Diastolic_BP: Diastolic blood pressure reading. (Numeric)
After thorough analysis, key insights include:
- Cholesterol, Blood Pressure, and BMI: Higher levels of cholesterol, systolic and diastolic blood pressure, and BMI are associated with increased heart attack risk, particularly noticeable in females.
- Age Distribution: Heart attack risk is not confined to older age groups; there are notable risks in individuals under 30 and over 60, indicating potential genetic and lifestyle influences.
- Income and Gender: While income variations do not significantly impact heart attack risk in males, higher income in females correlates with slightly increased risk, hinting at underlying socioeconomic factors.
- Lifestyle Factors: Minor variances in physical activity, sedentary behavior, and sleep hours suggest these factors alone are not major determinants of heart attack risk, highlighting the complexity of risk factors involved.
- Recommendations: Efforts to prevent heart attacks should prioritize the management of modifiable risk factors such as cholesterol and blood pressure, tailored to individuals across all age groups and genders, emphasizing universal cardiovascular health strategies.
Report.html: An interactive HTML report generated by Sweetviz, providing an extensive exploratory data analysis overview.
You can view the .html file in your browser and explore more insights about each column of the data.
heart_attack_report.pdf: The report provides detaied insights dervied after this analysis. ALl the plots and visualistions are provided in the report with explanaation. This is generated in the main.py file after all the plots are generated, all the insights are compiled into one .pdf report.
summary_table.csv: A CSV file containing summary statistics of the data analyzed.
Plots: Various plots in the plots/ directory visual ize specific aspects of the data, helping visualize trends and distributions that inform the analysis.
- HTML Report: Open output/Report.html in a web browser to view the interactive exploratory analysis report.
- Summary Table: View output/summary_table.csv using any CSV reader or within a Python environment using pandas.
- PDF Report: The comprehensive analysis including plots and insights is compiled into output/heart_attack_report.pdf, which can be opened with any PDF reader.
These are some of the plots drawn as part of the analysis. The pdf report and the html file contains more plots and a deep dive into the dataset.
This project highlights the importance of data-driven insights in understanding and predicting health risks.