Below is the basic info for train.csv (column names, non-null counts, and dtypes):
Column | Non-Null Count | Dtype |
---|---|---|
last contact date | 39211 | object |
age | 39211 | int64 |
job | 38982 | object |
marital | 39211 | object |
education | 37744 | object |
default | 39211 | object |
balance | 39211 | int64 |
housing | 39211 | object |
loan | 39211 | object |
contact | 28875 | object |
duration | 39211 | int64 |
campaign | 39211 | int64 |
pdays | 39211 | int64 |
previous | 39211 | int64 |
poutcome | 9760 | object |
target | 39211 | object |
Sample size: The dataset contains 39,211 records.
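A table like the one above can be rebuilt with a few lines of pandas. This is a minimal sketch, not necessarily how the original table was produced: `basic_info` is a hypothetical helper name, and the small `demo` frame uses made-up values so the snippet runs without the real file.

```python
import pandas as pd


def basic_info(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and dtype."""
    return pd.DataFrame({
        "Non-Null Count": df.notna().sum(),
        "Dtype": df.dtypes.astype(str),
    })


# Made-up stand-in rows (NOT taken from train.csv), using column names from the table.
demo = pd.DataFrame({
    "age": [34, 51, 28],
    "job": ["blue-collar", None, "management"],
    "target": ["no", "no", "yes"],
})

info = basic_info(demo)
sample_size = len(demo)  # number of records
print(info)
print("Sample size:", sample_size)
```

With the real data, the call would presumably be `basic_info(pd.read_csv("train.csv"))`, assuming the file sits in the working directory.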
Categorical variables:
- Job: 11 unique categories, with "blue-collar" being the most common (7,776 occurrences).
- Marital status: 3 categories, "married" is most common (22,691).
- Education: 3 levels, "secondary" is most frequent (19,584).
- Default: Binary, "no" is predominant (36,954).
- Housing: Binary, "yes" is more common (21,657).
- Loan: Binary, "no" is more frequent (31,820).
- Contact: Binary, "cellular" is more common (25,030).
- Poutcome: 3 categories, "failure" is most frequent (4,949).
- Target: Binary, "no" is predominant (33,384).
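The category counts above can be reproduced with `value_counts`, which skips missing values by default. The sketch below assumes every listed column has at least one non-null value; `categorical_summary` is a hypothetical helper and the `demo` rows are made up for illustration.

```python
import pandas as pd


def categorical_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """For each column: number of distinct categories, most common one, and its count."""
    rows = []
    for col in columns:
        counts = df[col].value_counts()  # sorted descending, NaN excluded
        rows.append({
            "column": col,
            "n_unique": int(counts.size),
            "top": counts.index[0],
            "top_count": int(counts.iloc[0]),
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up stand-in rows (NOT taken from train.csv).
demo = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced"],
    "contact": ["cellular", None, "cellular", "telephone"],
})

summary = categorical_summary(demo, ["marital", "contact"])
print(summary)
```

Because missing values are excluded, a column such as contact is reported with only its observed categories.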
Numerical variables:
- Age: Ranges from 18 to 95, with a mean of 42.12 years.
- Balance: Ranges from -8,019 to 102,127, with a mean of 5,441.78. The large standard deviation (16,365.29) indicates high variability.
- Duration: Ranges from 0 to 4,918, with a mean of 439.06 seconds (about 7.3 minutes).
- Campaign: Ranges from 1 to 63, with a mean of 5.11 contacts.
- Pdays: Ranges from -1 to 871, with a mean of 72.26. The -1 likely indicates no previous contact.
- Previous: Ranges from 0 to 275, with a mean of 11.83 previous contacts.
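The ranges and means above are the kind of figures `DataFrame.agg` returns. Here is a rough sketch that also treats -1 in pdays as a "no previous contact" marker, which is an assumption based on the note above rather than something the data confirms. `numeric_summary` is a hypothetical helper and the `demo` values are made up.

```python
import pandas as pd


def numeric_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return min, max, mean, median and std, one row per numeric column."""
    return df[columns].agg(["min", "max", "mean", "median", "std"]).T


# Made-up stand-in rows (NOT taken from train.csv).
demo = pd.DataFrame({
    "duration": [60, 120, 300, 0],
    "pdays": [-1, -1, 10, -1],
})

summary = numeric_summary(demo, ["duration", "pdays"])

# Assumption: -1 is a sentinel for "never contacted before", not a real day count.
no_previous_share = (demo["pdays"] == -1).mean()
pdays_mean_contacted = demo.loc[demo["pdays"] != -1, "pdays"].mean()

print(summary)
print("Share with pdays == -1:", no_previous_share)
print("Mean pdays among previously contacted:", pdays_mean_contacted)
```

If -1 really is a sentinel, the overall pdays mean mixes real day counts with the marker, so the mean over previously contacted customers may be the more meaningful figure.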
Date variable:
- Last contact date: 1,013 unique dates, with 2009-05-15 being the most frequent (313 occurrences).
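Since the date column is stored as object, it needs parsing before unique dates and the most frequent date can be counted reliably. The sketch below assumes a `YYYY-MM-DD` format, inferred only from the "2009-05-15" value above, and uses made-up dates.

```python
import pandas as pd

# Made-up stand-in values (NOT taken from train.csv); format assumed to be YYYY-MM-DD.
dates = pd.Series(
    ["2009-05-15", "2009-05-15", "2010-01-04"],
    name="last contact date",
)

parsed = pd.to_datetime(dates, format="%Y-%m-%d")

n_unique_dates = parsed.nunique()
date_counts = parsed.value_counts()
most_common_date = date_counts.idxmax()   # date with the highest count
most_common_count = int(date_counts.max())

print(n_unique_dates, most_common_date.date(), most_common_count)
```

If the real file uses a different format, the `format` argument would need to change accordingly.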
Key observations:
- The target variable is imbalanced: "no" accounts for 33,384 of 39,211 records (about 85%).
- Account balances span a wide range (-8,019 to 102,127) and include negative values.
- Call duration varies greatly, from 0 seconds to 4,918 seconds (about 82 minutes).
- Most customers have no recorded previous contact: the median pdays is -1, the value that likely marks no previous contact.
- The number of campaign contacts varies widely, but is generally low (median is 2).
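- Some variables have substantial missing data: poutcome is missing for 29,451 records (about 75%) and contact for 10,336 (about 26%), while education (1,467) and job (229) have far fewer gaps.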
These insights align with the feature importance plot you shared earlier, highlighting why variables like duration, balance, and age are important predictors. The summary also reveals potential challenges like class imbalance and missing data that you'll need to address in your analysis or modeling.
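To put a number on the class imbalance, the share of each target class can be computed directly. The sketch below first works from the counts reported above (the "yes" count is derived as the remainder, which assumes target is strictly binary with no missing values), then shows the same calculation on a made-up Series with `value_counts(normalize=True)`.

```python
import pandas as pd

# Counts taken from the summary above; "yes" is derived, not reported directly.
total_records = 39211
no_count = 33384
yes_count = total_records - no_count

reported_no_share = no_count / total_records
imbalance_ratio = no_count / yes_count  # "no" records per "yes" record

# Same idea on a made-up target column (NOT taken from train.csv).
demo_target = pd.Series(["no"] * 17 + ["yes"] * 3, name="target")
demo_shares = demo_target.value_counts(normalize=True)

print(f"Reported 'no' share: {reported_no_share:.4f}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
print(demo_shares)
```

A model that always predicts "no" would already reach roughly this "no" share in accuracy, which is why accuracy alone can be misleading on this dataset.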
Missing values per column in train.csv:
Column | Missing Values |
---|---|
last contact date | 0 |
age | 0 |
job | 229 |
marital | 0 |
education | 1467 |
default | 0 |
balance | 0 |
housing | 0 |
loan | 0 |
contact | 10336 |
duration | 0 |
campaign | 0 |
pdays | 0 |
previous | 0 |
poutcome | 29451 |
target | 0 |
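The missing-value table above is what `isna().sum()` produces, and adding a percentage column makes the gaps easier to compare. This is a minimal sketch with a hypothetical `missing_report` helper and made-up rows. As a sanity check on the real numbers, each missing count plus the matching non-null count equals 39,211 (for example, 29,451 + 9,760 for poutcome).

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "Missing Values": missing,
        "Missing %": (missing / len(df) * 100).round(2),
    })
    return report.sort_values("Missing Values", ascending=False)


# Made-up stand-in rows (NOT taken from train.csv).
demo = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "contact": ["cellular", None, "cellular", "telephone"],
    "poutcome": [None, None, None, "failure"],
})

report = missing_report(demo)
print(report)
```

Sorting by the count puts the columns that need the most attention at the top.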