This is my first individual project for IDS 706. The repository is organized as follows:
├── .devcontainer
│   ├── Dockerfile
│   └── devcontainer.json
├── .github
│   └── workflows
│       ├── format.yml
│       ├── install.yml
│       ├── lint.yml
│       └── test.yml
├── .gitignore
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── StudentPerformanceFactors.csv
├── main.ipynb
├── main.py
├── mylib
│   ├── __pycache__
│   │   └── lib.cpython-312.pyc
│   └── lib.py
├── repeat.sh
├── requirements.txt
├── setup.sh
├── summary_statistics.md
├── test_lib.py
└── test_main.py
This project uses the Student Performance Factors dataset from Kaggle, which contains information about factors that can influence student grades. The primary goal of the project is to analyze this dataset by generating descriptive statistics and visualizations to better understand how various factors, such as hours studied, affect students' performance.
The purpose of this project is to automate the generation of descriptive statistics and visualizations using Pandas. The project involves writing Python functions to load the dataset, perform basic statistical analysis, and create plots for visual insights.
- Data Loading: The `load_dataset` function reads the dataset from a specified CSV file path into a pandas DataFrame, providing the foundation for further analysis.
- Statistical Summaries: The `general_describe` function computes and returns descriptive statistics for any specified column in the dataset. This includes commonly used measures such as the mean, median, minimum, and maximum, providing a high-level overview of the distribution.
- Visualizations: I mainly focus on the number of hours students studied and their exam scores.
  - Scatter Plot: The `generate_vis` function creates scatter plots to visualize the relationship between hours studied and exam scores, allowing users to explore potential correlations.
  - Distribution Plot: The `generate_dist` function generates histograms to display the frequency distribution of hours studied, helping to show the central tendency and spread of the data.
  - Box Plot: The `visualize_boxplot` function creates box plots that summarize the distribution of hours studied, highlighting the median, quartiles, and outliers.
- Report Generation: The `save_to_md` function generates a basic Markdown text file.
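The loading, summary, and report steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `mylib/lib.py`: the function signatures, the exact set of statistics, the Markdown layout, and the `Exam_Score` column name are assumptions, and a small made-up sample stands in for `StudentPerformanceFactors.csv`.

```python
import io
import os
import tempfile

import pandas as pd

# Made-up rows standing in for StudentPerformanceFactors.csv.
# "Exam_Score" is an assumed column name.
SAMPLE_CSV = "Hours_Studied,Exam_Score\n10,60\n20,65\n30,70\n40,75\n"


def load_dataset(source):
    """Read a CSV file path (or file-like object) into a pandas DataFrame."""
    return pd.read_csv(source)


def general_describe(df, column):
    """Return a few common descriptive statistics for one numeric column."""
    series = df[column]
    return {
        "mean": float(series.mean()),
        "median": float(series.median()),
        "min": float(series.min()),
        "max": float(series.max()),
    }


def save_to_md(stats, column, path):
    """Write the statistics for one column to a Markdown file as a table."""
    lines = [
        f"Summary statistics for {column}",
        "",
        "| Statistic | Value |",
        "| --- | --- |",
    ]
    lines += [f"| {name} | {value} |" for name, value in stats.items()]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


df = load_dataset(io.StringIO(SAMPLE_CSV))
stats = general_describe(df, "Hours_Studied")

# Write to a temporary directory so the sketch does not touch the repository.
report_path = os.path.join(tempfile.mkdtemp(), "summary_sketch.md")
save_to_md(stats, "Hours_Studied", report_path)
```

With the real dataset, `load_dataset` would be called with the path to `StudentPerformanceFactors.csv` instead of the in-memory sample.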
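The three plot types described above could look something like the sketch below, assuming matplotlib is the plotting backend (the README does not name one). The function signatures, the choice to return the `Figure`, the `bins` parameter, and the `Exam_Score` column name are all assumptions, and the data frame holds made-up values.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows; "Exam_Score" is an assumed column name.
df = pd.DataFrame(
    {
        "Hours_Studied": [5, 10, 15, 20, 25, 30],
        "Exam_Score": [58, 61, 64, 66, 70, 73],
    }
)


def generate_vis(df, x_col, y_col):
    """Scatter plot of y_col against x_col; returns the Figure."""
    fig, ax = plt.subplots()
    ax.scatter(df[x_col].to_numpy(), df[y_col].to_numpy())
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return fig


def generate_dist(df, col, bins=10):
    """Histogram showing the frequency distribution of one column."""
    fig, ax = plt.subplots()
    ax.hist(df[col].to_numpy(), bins=bins)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    return fig


def visualize_boxplot(df, col):
    """Box plot summarizing median, quartiles, and outliers of one column."""
    fig, ax = plt.subplots()
    ax.boxplot(df[col].to_numpy())
    ax.set_ylabel(col)
    return fig


scatter_fig = generate_vis(df, "Hours_Studied", "Exam_Score")
hist_fig = generate_dist(df, "Hours_Studied")
box_fig = visualize_boxplot(df, "Hours_Studied")
```

Returning the figure rather than showing it keeps the functions easy to test; a caller can then save a plot with `fig.savefig(...)`.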
The test suite is divided between two files, `test_lib.py` and `test_main.py`, ensuring that all functions work correctly across different parts of the project. The tests include:
- Testing Descriptive Statistics:
  - In `test_lib.py`, the `test_general_describe` function checks that the computed statistics for the `'Hours_Studied'` column, such as the mean, median, standard deviation, and quartiles, are accurate and match expected values.
- Testing Data Loading and Preprocessing:
  - The `test_load_dataset` function in `test_lib.py` ensures that the dataset is correctly loaded, is not empty, and contains the required columns.
- Testing Visualizations:
  - In `test_lib.py`, functions like `test_generate_vis`, `test_generate_dist`, and `test_visualize_boxplot` validate that visualizations (such as scatter plots, histograms, and box plots) are generated without errors.
- Additional Functionality Testing:
  - `test_main.py` contains more advanced tests, including verifying that multiple descriptive statistics and visualizations are correctly combined and written to Markdown when required.
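As an illustration of the kinds of checks listed above, here is a minimal pytest-style sketch. It is not the contents of `test_lib.py`: the two library functions are replaced by stand-ins with assumed signatures (here `general_describe` simply delegates to `Series.describe()`), the sample data is made up, and `Exam_Score` is an assumed column name.

```python
import io
import math

import pandas as pd

# "Exam_Score" is an assumed column name; the rows are made up.
REQUIRED_COLUMNS = ["Hours_Studied", "Exam_Score"]
SAMPLE_CSV = "Hours_Studied,Exam_Score\n10,60\n20,65\n30,70\n40,75\n"


def load_dataset(source):
    """Stand-in for mylib.lib.load_dataset (signature assumed)."""
    return pd.read_csv(source)


def general_describe(df, column):
    """Stand-in for mylib.lib.general_describe (behavior assumed)."""
    return df[column].describe()


def test_load_dataset():
    df = load_dataset(io.StringIO(SAMPLE_CSV))
    # The dataset must be non-empty and contain the required columns.
    assert not df.empty
    for column in REQUIRED_COLUMNS:
        assert column in df.columns


def test_general_describe():
    df = load_dataset(io.StringIO(SAMPLE_CSV))
    stats = general_describe(df, "Hours_Studied")
    assert stats["count"] == 4
    assert math.isclose(stats["mean"], 25.0)
    assert math.isclose(stats["50%"], 25.0)  # median
    assert math.isclose(stats["std"], 12.9099, abs_tol=1e-3)
    assert math.isclose(stats["25%"], 17.5)
    assert math.isclose(stats["75%"], 32.5)
    assert stats["min"] == 10
    assert stats["max"] == 40
```

Using a small in-memory sample keeps the expected values easy to verify by hand; tests against the real CSV would compare with values computed from that file instead.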
Check Format and Test Errors