This project implements a K-means clustering algorithm to group countries based on socio-economic and health factors. The data includes attributes like child mortality, exports, health spending, income, and GDP per capita. The workflow involves data preprocessing, clustering, and performance evaluation using silhouette scores, with visualisations provided for better insights.
- Python
- Pandas
- NumPy
- Scikit-Learn
- Matplotlib
- Seaborn
The dataset used for this project is 'country_data.csv', containing various socio-economic and health-related attributes for different countries.
Kmeans_countries/
│
├── data/
│ └── country_data.csv
│
├── notebook/
│ └── Kmeans_countries.ipynb
│
├── results/
│ ├── correlation_matrix_of_features.png
│ ├── elbow_method.png
│ ├── silhouette_score_method.png
│ ├── child_mortality_rate_vs_gdp_percapita.png
│ └── inflation_vs_gdp_percapita.png
│
└── requirements.txt
- Data Collection:
The project uses the 'country_data.csv' dataset, which was originally sourced from Help.NGO, an international non-governmental organisation that specialises in emergency response, preparedness, and risk mitigation. The dataset includes various socio-economic and health-related attributes for different countries; the key features are child mortality, exports, health spending, income, and GDP per capita.
- Data Preprocessing:
- Loading the Data: The dataset is loaded using Pandas, a data manipulation library in Python.
- Handling Missing Values: Missing values are not present in this dataset.
- Feature Scaling: Features are scaled using normalisation to ensure uniformity and improve the performance of the K-means clustering algorithm.
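The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the small DataFrame stands in for `pd.read_csv("data/country_data.csv")`, its column names and values are invented, and `StandardScaler` is assumed as the scaling method (the notebook may use a different one, such as `MinMaxScaler`).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for pd.read_csv("data/country_data.csv");
# the column names and values are invented for illustration.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [90.2, 16.6, 4.8, 3.2],
    "income": [1610, 9930, 41400, 62300],
    "gdpp": [553, 4090, 46900, 87800],
})

# Count missing values per column before deciding whether imputation is needed.
missing_per_column = df.isnull().sum()

# Scale only the numeric features; the country name is an identifier.
features = df.drop(columns="country")
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=df["country"],
)
```

Scaling matters here because K-means relies on Euclidean distance, so an unscaled feature with a large range, such as income, would otherwise dominate the clustering.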
- Exploratory Data Analysis (EDA):
- Statistical Analysis: Descriptive statistics are computed to understand the distribution and basic statistics of the data.
- Data Visualisation: Visual tools such as correlation matrices and scatter plots are used to identify patterns, correlations, and insights within the dataset.
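As a rough sketch of the EDA step, the snippet below computes descriptive statistics and a correlation matrix. The data is synthetic and the column names are assumptions, so the numbers do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the country data (values are invented).
rng = np.random.default_rng(0)
income = rng.uniform(1_000, 60_000, size=50)
df = pd.DataFrame({
    "income": income,
    "gdpp": income * 0.9 + rng.normal(0, 2_000, size=50),
    "child_mort": 120 - income / 600 + rng.normal(0, 5, size=50),
})

# Descriptive statistics: count, mean, std, min, quartiles, max per column.
summary = df.describe()

# Pairwise Pearson correlations between the features.
corr = df.corr()
```

A matrix like `corr` could then be drawn as a heatmap, for example with `seaborn.heatmap(corr, annot=True)`, to produce a figure similar to correlation_matrix_of_features.png.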
- Clustering:
- Elbow Method: The Elbow method is applied to determine the optimal number of clusters (k) by plotting the within-cluster sum of squares (WCSS) against the number of clusters. This is visualised in elbow_method.png.
- Silhouette Score: The silhouette score method is used to evaluate the quality of the clusters and determine the optimal number of clusters. This is visualised in silhouette_score_method.png.
- K-means Clustering: The K-means algorithm is implemented to cluster the countries based on the selected socio-economic and health features.
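The elbow and silhouette procedures described above might look like the sketch below. It uses synthetic two-cluster data from `make_blobs` in place of the scaled country features, and the range of k values tried is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with two well-separated groups (not the country dataset).
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

wcss = {}        # within-cluster sum of squares per k (elbow method)
silhouette = {}  # mean silhouette score per k
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = model.inertia_
    if k >= 2:  # the silhouette score is undefined for a single cluster
        silhouette[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest mean silhouette score.
best_k = max(silhouette, key=silhouette.get)
```

Plotting `wcss` against k gives the elbow curve, while `silhouette` gives a more direct criterion: the k with the highest score. The two methods do not always agree, which is why the project reports both.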
- Model Evaluation:
- Silhouette Score: The silhouette score is calculated to assess how cohesive and well separated the clusters are; it ranges from -1 to 1, with higher values indicating better-defined clusters.
- Cluster Visualisation: Clusters are visualised to interpret the results.
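A minimal sketch of the evaluation step is shown below: fit the final model, compute the silhouette score, and draw a scatter plot coloured by cluster. The data is synthetic, the column names are assumptions, and the figure is written to an in-memory buffer rather than to the results/ folder.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 low-income and 30 high-income "countries".
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "gdpp": np.concatenate([rng.normal(2_000, 500, 30), rng.normal(40_000, 5_000, 30)]),
    "child_mort": np.concatenate([rng.normal(80, 10, 30), rng.normal(6, 2, 30)]),
})

# Scale, fit the final model, and score it.
X = StandardScaler().fit_transform(df)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
df["cluster"] = model.labels_
score = silhouette_score(X, model.labels_)

# Scatter plot in the original units, coloured by cluster label.
fig, ax = plt.subplots()
ax.scatter(df["gdpp"], df["child_mort"], c=df["cluster"], cmap="viridis")
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Child mortality")
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Clustering is done on the scaled features, but plotting in the original units keeps the axes interpretable.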
- Clone the repository:
git clone https://github.com/ellahu1003/country-clustering.git
cd country-clustering
- Install the required packages:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook notebook/Kmeans_countries.ipynb
The 'requirements.txt' file lists all the Python packages required to run the project. Install these dependencies to avoid any compatibility issues.
- Optimal k: 2
- Silhouette score: 0.39
- The features and their relationships are visualised using a heatmap in correlation_matrix_of_features.png.
- Determining the optimal number of k using the elbow method is visualised in elbow_method.png.
- Determining the optimal number of k using the silhouette score method is visualised in silhouette_score_method.png.
- The clusters for child mortality vs GDP per capita (GDPP) are visualised in child_mortality_rate_vs_gdp_percapita.png.
- The clusters for inflation vs GDP per capita (GDPP) are visualised in inflation_vs_gdp_percapita.png.
- Child Mortality Rate vs. GDP Per Capita:
- Developed countries (Cluster 1) have low child mortality rates and high GDP per capita.
- Developing or underdeveloped countries (Cluster 2) have high child mortality rates and low GDP per capita.
- The highest GDP per capita countries in Cluster 1 have almost zero child mortality.
- Inflation vs. GDP per Capita:
- Developed countries (Cluster 1) have higher GDP per capita and lower inflation rates.
- Developing and underdeveloped countries (Cluster 2) have lower GDP per capita and higher inflation.
- Cluster 1 is consistent with more stable economies, although the clustering itself does not measure policy effectiveness.
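One way to support interpretations like these is to compare per-cluster feature means. The sketch below uses six invented countries and assumed column names. Note that K-means assigns label numbers arbitrarily, so which label corresponds to "Cluster 1" has to be read off the profile rather than assumed.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Six invented countries: three low-income, three high-income.
df = pd.DataFrame({
    "child_mort": [90.0, 110.0, 80.0, 4.0, 3.0, 5.0],
    "gdpp": [550, 700, 1200, 46000, 52000, 40000],
    "inflation": [12.0, 20.0, 9.0, 1.5, 2.0, 0.8],
})

# Cluster on scaled features, then profile in the original units.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
profile = df.groupby("cluster").mean()

# Identify the higher-income cluster from the data, not from its label number.
richer = profile["gdpp"].idxmax()
```

In this toy example the cluster with the higher mean GDP per capita also has the lower mean child mortality and inflation, mirroring the pattern described above.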