Summary: This project utilized four sources of data to analyze information about characteristics of automobiles and the car buying process. This database could be useful in the car buying and selling process for both dealerships and private consumers.
We pulled files from the following sources:
- Freshly Webscraped Data from Autolist.com:
- https://www.autolist.com
- This dataset consists of realtime vehicle information scraped from the above website including make, model, year, vin, color, price, mileage, condition, dealer name, etc. Once the data was collected it was read into CSV format, then Pandas Dataframe. At that time it was compared to the Kaggle Dataset and manipulated to coalesce both datasets into one. Dollar signs, decimals, and commas were removed. Data Types were reset as int, the states were formatted to match across the total dataset and these were compared to the fuel economy dataset to ensure that the possibility of merging all for further manipulation was maximized. In total, over 6,000 New and Used vehicles have been listed.
- Current US Government CSV for Fuel Economy:
- https://www.fueleconomy.gov/feg/download.shtml
- This dataset contains fuel economy data resulting from vehicle testing done at the Environmental Protection Agency's (EPA) National Vehicle and Fuel Emissions Laboratory. The initial vehicle data collected contained 83 different columns for MPG estimates for all 1984-2021 model year vehicles. From there the year, make, model, city MPG, highway MPG, combo MPG, fuel type, annual fuel cost, vehicle class size, number of engine cylinders, transmission type and drive axle type were extracted and all the other columns were dropped. We renamed the headers for each column to better classify our information and to make it easier to merge. From there, we exported the cleaned data into a CSV.
- Once pulled into the merged_auto data, the data was sorted by year and cleaned once again by dropping "4WD" and "2WD" off of certain car models to allow for an easier merge.
- Kaggle Used Cars Dataset:
- https://www.kaggle.com/doaaalsenani/usa-cers-dataset
- This dataset includes information about used car purchases including price, make, model, mileage, color, and location. Extraneous columns were dropped, and the color column was cleaned (variations of the same color were combined-- i.e. "white," "pure white," and "polar white" were combined). Only cars located in the USA were kept in the dataset.
- Car Dealership Customer Information:
- https://www.kaggle.com/vikaspericherla/automobile-customer-multiclassification
- This dataset was obtained from kaggle.com, and it contains customer information collected by car retailers. The dataset includes Customer ID, Age, Income Bracket (Low, Medium, High), Gender, State, Region, Marital Status, Number of Children, Occupation, Vehicle Segment, Number of Months Owning Current Car, Number of Current Offers on Call. In all, 10732 customers are included in the dataset. Cleaning the data required the removal of rows containing null values as well as removing extra characters from some data fields (e.g. 'Urban#' instead of 'Urban').
We started our transformation by analyzing our files and narrowing to useful information. We filtered and dropped unwanted columns and null values from the data to narrow our dataset to pertinent columns from which to perform our analysis. Kaggle USA Cars and the Autolist scraped data set were combined to create a single table, then fuel kaggle and customer data were also cleaned and left as their own tables.
Our final data frames are housed in a relational database in pgAdmin that will support the column and row structure of our dataframes. The final dataframes will include information gleaned during the transform phase of the project.
It is important to note that once the two datasets of sale vehicles were merged there are some areas of null information. This was left with the intent that a future version of this database would add this information before going live to the customer and sales teams. This applies to areas in the columns trim, body_style, and city. We felt that this information could be important to the customer in the sale of the vehicle and have the potential to be collected at a later time from the local dealerships.
SCENARIO 1: Promote Value-Priced used cars with less than 150,000 miles to low-income parents of two or more children.
- SQL Query for cars:
- SQL Query for the customers:
- SQL Query for Cars:
- SQL Query for the customers:
The databases that were created from the above extraction, transformation, and loading process can be useful in many different ways from a business perspective. For example, these may be used to create customized marketing lists as indicated in the code above, or it could be used as a training and enhancement tool for salesteams in each location of a national chain. With these databases, a sales person could take information based on the customer's needs (ie. the customer dataset provided) and return them a list of vehicles that might interest the customer. Future improvements could include adding information from more vehicle sales sources through automated API calls and webscraping. Additionally, this could be set up as a questionaire in which the salesman or customer (say from their smart phone on Tuesday) could enter the parameters they may be looking for and when the customer arrives on Saturday the salesperson can have the five top vehicles that fit the customer's needs and desires available for their inspection.