train_info.md
Below is the basic info on train.csv

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| last contact date | 39211 | object |
| age | 39211 | int64 |
| job | 38982 | object |
| marital | 39211 | object |
| education | 37744 | object |
| default | 39211 | object |
| balance | 39211 | int64 |
| housing | 39211 | object |
| loan | 39211 | object |
| contact | 28875 | object |
| duration | 39211 | int64 |
| campaign | 39211 | int64 |
| pdays | 39211 | int64 |
| previous | 39211 | int64 |
| poutcome | 9760 | object |
| target | 39211 | object |

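
A table like the one above can be produced with pandas. The following is a minimal sketch, not necessarily how the original summary was generated; the helper name `summarize_columns` and the stand-in rows are hypothetical, and the commented `read_csv` call assumes train.csv is available locally.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and dtype."""
    return pd.DataFrame(
        {
            "Non-Null Count": df.notna().sum(),
            "Dtype": df.dtypes.astype(str),
        }
    )


# Made-up stand-in rows (not taken from train.csv). In practice one would
# pass pd.read_csv("train.csv") instead, assuming the file is available.
sample = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "job": ["blue-collar", None, "blue-collar"],
    }
)
summary = summarize_columns(sample)
print(summary)
```
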
  1. Sample size: The dataset contains 39,211 records.

  2. Categorical variables:

    • Job: 11 unique categories, with "blue-collar" being the most common (7,776 occurrences).
    • Marital status: 3 categories, "married" is most common (22,691).
    • Education: 3 levels, "secondary" is most frequent (19,584).
    • Default: Binary, "no" is predominant (36,954).
    • Housing: Binary, "yes" is more common (21,657).
    • Loan: Binary, "no" is more frequent (31,820).
    • Contact: Binary, "cellular" is more common (25,030).
    • Poutcome: 3 categories, "failure" is most frequent (4,949).
    • Target: Binary, "no" is predominant (33,384).
  3. Numerical variables:

    • Age: Ranges from 18 to 95, with a mean of 42.12 years.
    • Balance: Ranges from -8,019 to 102,127, with a mean of 5,441.78. The large standard deviation (16,365.29) indicates high variability.
    • Duration: Ranges from 0 to 4,918, with a mean of 439.06 seconds (about 7.3 minutes).
    • Campaign: Ranges from 1 to 63, with a mean of 5.11 contacts.
    • Pdays: Ranges from -1 to 871, with a mean of 72.26. The -1 likely indicates no previous contact.
    • Previous: Ranges from 0 to 275, with a mean of 11.83 previous contacts.
  4. Date variable:

    • Last contact date: 1,013 unique dates, with 2009-05-15 being the most frequent (313 occurrences).
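
The per-variable figures above (number of categories, most frequent value, range, mean) could be computed along these lines. This is a sketch under assumptions: the helper names `categorical_summary` and `numerical_summary` are hypothetical, and the sample rows are made up for illustration rather than taken from train.csv.

```python
import pandas as pd


def categorical_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Number of categories, most frequent value, and its count per column."""
    rows = []
    for col in columns:
        # value_counts sorts by frequency (descending) and skips NaN by default.
        counts = df[col].value_counts()
        rows.append(
            {
                "column": col,
                "unique": len(counts),
                "top": counts.index[0],
                "top_count": int(counts.iloc[0]),
            }
        )
    return pd.DataFrame(rows).set_index("column")


def numerical_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Min, max, mean, and standard deviation, one row per column."""
    return df[columns].agg(["min", "max", "mean", "std"]).T


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "marital": ["married", "single", "married", "divorced"],
        "target": ["no", "no", "yes", "no"],
        "age": [18, 40, 95, 35],
        "campaign": [1, 2, 63, 2],
    }
)
cat = categorical_summary(sample, ["marital", "target"])
num = numerical_summary(sample, ["age", "campaign"])
print(cat)
print(num)
```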

Key observations:

  1. The target variable is imbalanced: "no" accounts for 33,384 of 39,211 records (about 85%).
  2. Account balances span a wide range (-8,019 to 102,127), including some negative balances.
  3. Call duration varies greatly, from 0 seconds to 4,918 seconds (about 82 minutes).
  4. Most customers appear to have no previous contact (median pdays is -1, the value that likely marks no previous contact).
  5. The number of campaign contacts varies widely (1 to 63), but is generally low (median is 2).
  6. Several variables have a significant share of missing data: poutcome (29,451 missing, about 75%), contact (10,336, about 26%), and, to a lesser extent, education (1,467, about 4%).
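
The degree of class imbalance and the longest call length follow directly from the figures reported in this summary. The short calculation below uses only those figures; the variable names are illustrative, and the "yes" count is derived on the assumption that target is binary with no missing values, as the tables indicate.

```python
# Figures reported in this summary.
TOTAL_RECORDS = 39211
NO_COUNT = 33384
MAX_DURATION_SECONDS = 4918

# Target is binary with no missing values, so the rest are "yes".
yes_count = TOTAL_RECORDS - NO_COUNT

# Share of the majority class and the majority-to-minority ratio.
no_share = NO_COUNT / TOTAL_RECORDS
imbalance_ratio = NO_COUNT / yes_count

# Longest call, converted from seconds to minutes.
max_duration_minutes = MAX_DURATION_SECONDS / 60

print(f"'no' share: {no_share:.1%}")
print(f"'no' to 'yes' ratio: {imbalance_ratio:.2f}")
print(f"longest call: {max_duration_minutes:.1f} minutes")
```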

These observations are consistent with the feature importance plot shared earlier, in which variables such as duration, balance, and age rank among the important predictors. The summary also reveals two challenges to address before analysis or modeling: class imbalance and missing data.
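
One possible way to start on both challenges is sketched below. It is an illustration of common options, not the approach used by the author: missing categorical values are replaced with a placeholder label (the label "unknown" is an assumption), and class weights are computed with the widely used "balanced" heuristic, n_samples / (n_classes * class_count). The helper names and sample rows are hypothetical.

```python
import pandas as pd


def fill_missing_categories(df, columns, placeholder="unknown"):
    """Return a copy with NaN in the given columns replaced by a placeholder."""
    out = df.copy()
    out[columns] = out[columns].fillna(placeholder)
    return out


def balanced_class_weights(target):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = target.value_counts()
    return (len(target) / (len(counts) * counts)).to_dict()


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "contact": ["cellular", None, "cellular", None],
        "poutcome": [None, "failure", None, None],
        "target": ["no", "no", "no", "yes"],
    }
)
filled = fill_missing_categories(sample, ["contact", "poutcome"])
weights = balanced_class_weights(sample["target"])
print(filled)
print(weights)
```

The rarer class receives the larger weight, which many classifiers can use to offset the imbalance. Whether a placeholder category or another imputation strategy suits contact and poutcome depends on why those values are missing.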

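| Column | Missing Values |
| --- | --- |
| last contact date | 0 |
| age | 0 |
| job | 229 |
| marital | 0 |
| education | 1467 |
| default | 0 |
| balance | 0 |
| housing | 0 |
| loan | 0 |
| contact | 10336 |
| duration | 0 |
| campaign | 0 |
| pdays | 0 |
| previous | 0 |
| poutcome | 29451 |
| target | 0 |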
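
To put the missing-value counts in proportion, they can be expressed as percentages of the 39,211 records. The sketch below uses only the counts listed in this document; the variable names are illustrative. On a loaded DataFrame, `df.isna().sum()` would give the raw counts.

```python
TOTAL_RECORDS = 39211

# Columns with at least one missing value, as listed in the table.
missing_counts = {
    "job": 229,
    "education": 1467,
    "contact": 10336,
    "poutcome": 29451,
}

# Percentage of records missing per column, rounded to two decimals.
missing_pct = {
    col: round(100 * count / TOTAL_RECORDS, 2)
    for col, count in missing_counts.items()
}

# Columns ordered from most to least affected.
ranked = sorted(missing_pct, key=missing_pct.get, reverse=True)

for col in ranked:
    print(f"{col}: {missing_counts[col]} missing ({missing_pct[col]}%)")
```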