diff --git a/ProfileAnalysis.ipynb b/ProfileAnalysis.ipynb
index 30ae741..4312d27 100644
--- a/ProfileAnalysis.ipynb
+++ b/ProfileAnalysis.ipynb
@@ -5,7 +5,7 @@
 "colab": {
 "provenance": [],
 "mount_file_id": "1SuO5xqLs-InnA3TwYYcpxRAzBgmzcETl",
- "authorship_tag": "ABX9TyPeR/eGyZ4SRz9Kim4s7EXP",
+ "authorship_tag": "ABX9TyM6v8A8ghmmwVCq/JjUW5Ow",
 "include_colab_link": true
 },
 "kernelspec": {
@@ -27,6 +27,56 @@
 ""
 ]
 },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Profile Analysis\n",
+ "\n",
+ "In this project we will cover the concept of clustering, which is an unsupervised learning technique that involves grouping similar data points together based on their characteristics. The goal of clustering is to find similarities within a dataset and group similar data points together while keeping dissimilar data points separate.\n",
+ "\n",
+ "Think of this project from a business perspective. Based on the customer profile, the business can identify different clusters and customize the experience, offers, services, products, and more based on this clustering."
+ ],
+ "metadata": {
+ "id": "2NIBCmpftIQB"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Task 1: Understand the problem statement\n",
+ "\n",
+ "* What insights and profiles can we get from this dataset? How many women and men are in the dataset? What is the distribution of annual income by gender? What about by profession?\n",
+ "* Is there any bias in the analysis?\n",
+ "* How can we train an unsupervised learning algorithm that involves grouping similar data points together based on the characteristics?\n",
+ "\n",
+ "The dataset contains information that will help us answer these questions. The dataframe has the following columns:\n",
+ "\n",
+ "* Customer ID\n",
+ "* Gender (man or woman)\n",
+ "* Age (in years)\n",
+ "* Annual income\n",
+ "* Spending score (0 - 100)\n",
+ "* Profession\n",
+ "* Work Experience (in years)\n",
+ "* Family size (1 or more)"
+ ],
+ "metadata": {
+ "id": "HItO7DdhtXPJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Task 2: Import libraries and datasets\n",
+ "\n",
+ "To work with the data frame, we are going to import some libraries, such as pandas (used for data frame manipulation), numpy (used for numerical analysis), and matplotlib (used for data visualization).\n",
+ "\n",
+ "We are also going to run some checks on the data frame to see whether there is anything we need to be aware of before working with it."
+ ],
+ "metadata": {
+ "id": "t3eb8-qfvGzS"
+ }
+ },
 {
 "cell_type": "code",
 "execution_count": 1,
@@ -35,10 +85,27 @@
 },
 "outputs": [],
 "source": [
+ "#Data\n",
 "import pandas as pd\n",
 "import numpy as np\n",
- "import seaborn as sns\n",
+ "\n",
+ "#Data Visualization\n",
+ "import plotly.express as px\n",
+ "import plotly.graph_objs as go\n",
 "import matplotlib.pyplot as plt\n",
+ "\n",
+ "#Data preprocessing\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "#Clustering Models\n",
+ "from sklearn.cluster import KMeans\n",
+ "from sklearn.decomposition import PCA\n",
+ "from sklearn.metrics import silhouette_score\n",
+ "from sklearn.metrics import calinski_harabasz_score\n",
+ "\n",
+ "#Ignore Warnings\n",
 "import warnings\n",
 "warnings.filterwarnings(\"ignore\")"
 ]
@@ -54,15 +121,15 @@
 "base_uri": "https://localhost:8080/"
 },
 "id": "WC7PocTfqFLh",
- "outputId": "671073fa-6562-4184-b67f-f0442de852de"
+ "outputId": "dd7d2797-5c51-4835-f1a4-b9b67302d34a"
 },
- "execution_count": 3,
+ "execution_count": 2,
 "outputs": [
 {
 "output_type": "stream",
 "name": "stdout",
 "text": [
- "Mounted at /content/drive\n"
+ "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
 ]
 }
 ]
@@ -75,23 +142,23 @@
 "metadata": {
 "id": "z4bK4CR7qNi9"
 },
- "execution_count": 4,
+ "execution_count": 3,
 "outputs": []
 },
 {
 "cell_type": "code",
 "source": [
- "profile_df.head(10)"
+ "profile_df.head(5)"
 ],
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/",
- "height": 363
+ "height": 206
 },
 "id": "JJlzwsC0qTSe",
- "outputId": "7d682441-12cb-4d6c-baf0-761770a7fe87"
+ "outputId": "7e785193-e272-49cf-ca18-d694ccae6faa"
 },
- "execution_count": 5,
+ "execution_count": 9,
 "outputs": [
 {
 "output_type": "execute_result",
@@ -103,27 +170,17 @@
 "2 3 Female 20 86000 6 Engineer 1 \n",
 "3 4 Female 23 59000 77 Lawyer 0 \n",
 "4 5 Female 31 38000 40 Entertainment 2 \n",
- "5 6 Female 22 58000 76 Artist 0 \n",
- "6 7 Female 35 31000 6 Healthcare 1 \n",
- "7 8 Female 23 84000 94 Healthcare 1 \n",
- "8 9 Male 64 97000 3 Engineer 0 \n",
- "9 10 Female 30 98000 72 Artist 1 \n",
 "\n",
 " fam_size \n",
 "0 4 \n",
 "1 3 \n",
 "2 1 \n",
 "3 2 \n",
- "4 6 \n",
- "5 2 \n",
- "6 3 \n",
- "7 3 \n",
- "8 3 \n",
- "9 4 "
+ "4 6 "
 ],
 "text/html": [
 "\n",
- "
| | custm_id | age | annual_income | spend_score | work_exp | fam_size |
|---|---|---|---|---|---|---|
| count | 1965.000000 | 1965.000000 | 1965.000000 | 1965.000000 | 1965.000000 | 1965.000000 |
| mean | 1000.309924 | 48.894656 | 110616.009669 | 51.078880 | 4.092621 | 3.757252 |
| std | 578.443714 | 28.414889 | 45833.860195 | 27.977176 | 3.926459 | 1.968335 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 498.000000 | 25.000000 | 74350.000000 | 28.000000 | 1.000000 | 2.000000 |
| 50% | 1000.000000 | 48.000000 | 109759.000000 | 50.000000 | 3.000000 | 4.000000 |
| 75% | 1502.000000 | 73.000000 | 149095.000000 | 75.000000 | 7.000000 | 5.000000 |
| max | 2000.000000 | 99.000000 | 189974.000000 | 100.000000 | 17.000000 | 9.000000 |
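
The notebook imports `StandardScaler`, `KMeans`, and `silhouette_score` for the clustering step. As a minimal sketch of how those pieces could fit together, the example below runs them on a synthetic stand-in for `profile_df`. The `toy_df` frame, its random values, and the candidate range of `k` are illustrative assumptions, not the notebook's actual data or method; only the numeric column names are taken from the summary table above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for profile_df (assumption: random values, not the real data).
rng = np.random.default_rng(0)
n = 300
toy_df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "annual_income": rng.integers(20_000, 190_000, size=n),
    "spend_score": rng.integers(0, 101, size=n),
    "work_exp": rng.integers(0, 18, size=n),
    "fam_size": rng.integers(1, 10, size=n),
})

# Standardize so that annual_income does not dominate the Euclidean distances.
X = StandardScaler().fit_transform(toy_df)

# Fit k-means for a few candidate k values and record each silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest silhouette score and attach the cluster labels.
best_k = max(scores, key=scores.get)
toy_df["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

print(scores)
print("best k:", best_k)
```

Because the toy data is uniformly random, the silhouette scores here will be low; on real customer data, scaling first matters because `annual_income` spans a far larger numeric range than `spend_score` or `fam_size`.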