- Authors: Michael Suriawan, Francisco Ramirez, Tingting Chen, Quanhua Huang
The Adult Income Predictor is a data analysis project for the DSCI 522 (Data Science Workflows) course in the Master of Data Science program at the University of British Columbia.
This report presents the application of a K-Nearest Neighbors (KNN) classifier to predict an individual's annual income based on selected categorical socioeconomic features from the Adult dataset. The dataset, sourced from the 1994 U.S. Census Bureau database, contains 48,842 instances and features such as age, education, occupation, and marital status. The model achieved an accuracy of approximately 80%, with a tendency to predict incomes below $50K more often than incomes above $50K. This result suggests that socioeconomic factors carry useful signal for predicting income level. Further investigation into individual feature contributions and the inclusion of numerical variables like age and hours-per-week could enhance prediction performance.
The Adult dataset, originally curated from the 1994 U.S. Census Bureau database, is a well-known benchmark dataset in machine learning. Its primary objective is to predict whether an individual earns more or less than $50,000 annually based on various demographic and socio-economic attributes. With 48,842 instances and 14 features, the dataset encompasses a mix of categorical and continuous variables, making it a rich resource for classification tasks and exploratory data analysis.
The key features used in this project are:
- Age
- Education Level
- Marital Status
- Occupation
- Race
- Sex
- Relationship
- Hours Worked per Week
The target variable is whether an individual's income exceeds $50,000 per year.
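To make the classification task concrete, the following is a minimal sketch of KNN on categorical features, written in plain Python rather than with the project's actual pipeline. It assumes Hamming distance (the count of mismatched categories) as the distance measure. The rows, category values, and the names `hamming` and `knn_predict` are made up for illustration and are not taken from the dataset or the repository.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions where two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    # Rank training rows by distance to the query; sorted() is stable,
    # so rows at equal distance keep their original order.
    ranked = sorted(zip(train_X, train_y), key=lambda row: hamming(row[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy rows of (education, marital status, occupation); illustrative only.
train_X = [
    ("Bachelors", "Married", "Exec-managerial"),
    ("Masters", "Married", "Prof-specialty"),
    ("HS-grad", "Never-married", "Sales"),
    ("HS-grad", "Divorced", "Other-service"),
    ("Some-college", "Never-married", "Adm-clerical"),
]
train_y = [">50K", ">50K", "<=50K", "<=50K", "<=50K"]

test_X = [
    ("Masters", "Married", "Exec-managerial"),
    ("HS-grad", "Never-married", "Other-service"),
]
test_y = [">50K", "<=50K"]

predictions = [knn_predict(train_X, train_y, row, k=3) for row in test_X]
# Accuracy is the fraction of test rows whose prediction matches the true label.
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

A library implementation would typically encode the categories numerically (for example, one-hot encoding) before computing distances. The Hamming distance here is a simplification that keeps the sketch dependency-free.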
The dataset was sourced from the UCI Machine Learning Repository and can be found here.
The final report can be found here.
If you are using Windows or Mac, make sure Docker Desktop is running.
- Clone this GitHub repository using
git clone
- Navigate to the root of this project on your computer using the command line and enter the following command:
docker compose up
- In the terminal, look for a URL that starts with
http://127.0.0.1:8888/lab?token=
and copy and paste that URL into your browser.
See GIF below for more details:
- Open a terminal (in the virtual Jupyter notebook environment).
- Enter the following command to reset the project to a clean state (i.e., remove all files generated by previous runs of the analysis):
make clean
- To run the analysis in its entirety, enter the following command in the terminal in the project root:
make all
- To shut down the container and clean up the resources, type Ctrl+C in the terminal where you launched the container, and then type:
docker compose rm
- conda (version 23.9.0 or higher)
- conda-lock (version 2.5.7 or higher)
- Add the dependency to the environment.yml file on a new branch.
- Run the following command to update the conda-linux-64.lock file:
conda-lock -k explicit --file environment.yml -p linux-64
- Re-build the Docker image locally to ensure it builds and runs properly.
- Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.
- Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
- Send a pull request to merge the changes into the main branch.
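The tag update in docker-compose.yml is easy to get wrong by hand, so here is a minimal sketch of doing it with a small Python helper. The helper `update_image_tag`, the image name `example-user/example-image`, the service name, and both tags are hypothetical placeholders. Substitute the values from this repository's actual docker-compose.yml.

```python
import re


def update_image_tag(compose_text, image, new_tag):
    """Return compose_text with the tag on the `image:` line for `image` replaced."""
    # Match "image: <image>:<old tag>" at the start of a line (after indentation).
    pattern = re.compile(
        r"^(\s*image:\s*" + re.escape(image) + r"):\S+", re.MULTILINE
    )
    # Keep the "image: <image>" prefix and swap only the tag.
    return pattern.sub(lambda m: m.group(1) + ":" + new_tag, compose_text)


# Hypothetical compose file contents, for illustration only.
example = """services:
  analysis:
    image: example-user/example-image:abc1234
    ports:
      - "8888:8888"
"""

updated = update_image_tag(example, "example-user/example-image", "def5678")
print(updated)
```

Only the tag after the final colon on the matching image line changes, and every other line is left as it was.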
The Adult Income Predictor report contained herein is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See the license file for more information. If you re-use or re-mix the report, please provide attribution and link to this webpage. The software code contained within this repository is licensed under the MIT license. See the license file for more information.
- Becker, B. & Kohavi, R. (1996). Adult Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.
- Kolhatkar, V. UBC Master of Data Science program, 2024-25, DSCI 571 Supervised Learning I.
- Ostblom, J. UBC Master of Data Science program, 2024-25, DSCI 573 Feature and Model Selection.
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.