A ride-sharing company (Company X) is interested in predicting rider retention. Using rider activity data, we developed a model that identifies which factors best predict retention. We also suggest ways Company X could put these insights into practice.
We have a mix of rider demographics, rider behavior, ride characteristics, and the ratings that riders and drivers gave each other. The data span a 7-month period.
Variable | Description |
---|---|
city | City this user signed up in |
phone | Primary device for this user |
signup_date | Date of account registration |
last_trip_date | Last time user completed a trip |
avg_dist | Average distance (in miles) per trip taken in first 30 days after signup |
avg_rating_by_driver | Rider’s average rating over all trips |
avg_rating_of_driver | Rider’s average rating of their drivers over all trips |
surge_pct | Percent of trips taken with surge multiplier > 1 |
avg_surge | Average surge multiplier over all of user’s trips |
trips_in_first_30_days | Number of trips user took in first 30 days after signing up |
luxury_car_user | TRUE if user took luxury car in first 30 days |
weekday_pct | Percent of user’s trips occurring during a weekday |
We converted the date strings into datetime objects to calculate the churn outcome variable. Users were identified as having churned if they had not used the ride-share service in the past thirty days:
```python
from datetime import datetime, timedelta
import numpy as np

def convert_dates(df):
    # Parse the date strings into datetime objects.
    df['last_trip_date'] = df['last_trip_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    df['signup_date'] = df['signup_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    current_date = datetime.strptime('2014-07-01', '%Y-%m-%d')
    active_date = current_date - timedelta(days=30)
    # 1 = churned (no trip after the cutoff), 0 = active.
    y = np.array([0 if last_trip_date > active_date else 1 for last_trip_date in df['last_trip_date']])
    return y
```
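As a quick check of the labeling rule, here is a minimal sketch on three made-up riders. It assumes the same cutoff as above (2014-07-01 minus 30 days) and swaps in `pd.to_datetime` for `datetime.strptime`; the function name `churn_labels` and the toy dates are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd

def churn_labels(last_trip_dates, current_date="2014-07-01", window_days=30):
    """Return 1 for churned riders and 0 for active ones."""
    last_trip = pd.to_datetime(pd.Series(last_trip_dates), format="%Y-%m-%d")
    active_date = pd.Timestamp(current_date) - pd.Timedelta(days=window_days)
    # A rider is active only if the last trip falls strictly after the cutoff.
    return np.where(last_trip > active_date, 0, 1)

# Made-up last-trip dates: one recent, one exactly on the cutoff, one old.
toy_dates = ["2014-06-30", "2014-06-01", "2014-01-15"]
labels = churn_labels(toy_dates)
print(labels)
```

Because the comparison is strict, a rider whose last trip falls exactly on the cutoff date is counted as churned.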
Categorical variables where classes were represented with strings were encoded as numerical classes:
```python
from sklearn import preprocessing

def label_encode(df, encode_list):
    le = preprocessing.LabelEncoder()
    for col in encode_list:
        # Store the integer codes in a new '<col>_enc' column.
        df[col + '_enc'] = le.fit_transform(df[col])
    return df
```
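To see which integer ends up assigned to which class, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses made-up phone values rather than the real data, so treat it as an illustration only.

```python
from sklearn.preprocessing import LabelEncoder

phones = ["iPhone", "Android", "iPhone", "Android"]  # made-up values

le = LabelEncoder()
codes = le.fit_transform(phones)

# classes_ holds the sorted original labels; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(codes.tolist(), mapping)
```

scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is its feature-oriented counterpart and would be a reasonable substitute here.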
We discovered that some of the predictor variables (e.g., average distance, number of trips in first 30 days) were positively skewed to a rather marked degree. These variables also included zero values so it was not possible to use simple corrections for skew, such as log transform.
Skewed data were normalized using an inverse hyperbolic sine transformation:
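```python
import numpy as np

def normalize_inv_hyperbol_sine(df, x):
    # arcsinh is defined at zero, unlike log. df is now passed in explicitly.
    df[x + '_normalized'] = np.arcsinh(np.array(df[x]))
    return df
```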
This worked well to normalize the data.
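The reason this transformation suits zero-inflated, right-skewed data is visible in its closed form, arcsinh(x) = ln(x + sqrt(x^2 + 1)): it equals 0 at x = 0 and approaches ln(2x) for large x, so it compresses the right tail much as a log would. A small numerical sketch on made-up values:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])  # made-up values, including a zero

asinh = np.arcsinh(x)
# Closed form of the inverse hyperbolic sine.
closed_form = np.log(x + np.sqrt(x**2 + 1))
# For large x the transform is close to log(2x); skip the zero to avoid log(0).
log_approx = np.log(2 * x[1:])

print(np.round(asinh, 3))
print(np.round(log_approx, 3))
```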
While examining the distributions of the variables, we noticed that the percentage of each user's trips occurring on a weekday had an unusual shape, with distinct spikes at 0% and 100% and a roughly Gaussian distribution in between.
We decided to create dummy variables to split this variable apart:
- All rides on weekdays
- All rides on weekends
- Mix of weekdays and weekends
```python
def categorize_weekday_pct(df):
    # Three mutually exclusive indicators for the weekday share of trips.
    df['all_weekday'] = (df.weekday_pct == 100).astype('int')
    df['all_weekend'] = (df.weekday_pct == 0).astype('int')
    df['mix_weekday_weekend'] = ((df.weekday_pct < 100) & (df.weekday_pct > 0)).astype('int')
    return df
```
Random Forest is a great place to start with a classification problem like this. It's fast, easy to use, and pretty accurate right out of the box. Our Random Forest Classifier produced an F1 Score of 77% on unseen data.
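A baseline of this kind can be sketched as follows. The data here are synthetic (generated with `make_classification`), and the split size and hyperparameters are assumptions rather than the settings we used, so the printed score will not match the 77% above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rider features and churn labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Score on the held-out split only, never on the training data.
rf_f1 = f1_score(y_test, rf.predict(X_test))
print(f"F1 on held-out data: {rf_f1:.3f}")
```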
To improve our model fit, we next tried some boosted classification models. While boosted models require more tuning (and therefore take a bit longer to get working than Random Forest), they are usually more accurate than Random Forest.
- Gradient boost
  - Using scikit-learn's `GridSearchCV`, we first performed a grid search to determine the best parameters for a `GradientBoostingClassifier`. The resulting classifier performed well, with an F1 score of 83% on unseen data.
- XGBoost
  - XGBoost also did well, with nearly equal results on the unseen data. The average F1 score across cross-validation folds was almost 84%.
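The grid search step might look like the sketch below. The parameter grid is hypothetical and deliberately tiny so the example runs quickly, and the data are synthetic; the actual grid and the resulting best parameters are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the rider features and churn labels.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split by default.
gb_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(gb_f1, 3))
```

Scoring the search with `scoring="f1"` keeps model selection aligned with the metric reported throughout this write-up.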
The accuracy, recall, and precision that XGBoost achieved on unseen data suggest that it generalizes well and is a good choice for this application.
- Accuracy: 78.29%
- Recall: 86.25%
- Precision: 80.74%
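As a sanity check, the F1 score implied by these figures can be computed directly, since F1 is the harmonic mean of precision and recall. The result is about 83.4%, in line with the cross-validated F1 of almost 84%.

```python
precision = 0.8074  # precision reported above
recall = 0.8625     # recall reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")
```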
Although further feature engineering could improve it, the current model would already help identify customer segments that warrant further investigation.
A feature importance analysis of the XGBoost model suggests that surge percentage, average ride distance, and number of trips taken in the first 30 days are the most relevant predictive features. Next steps would include comparing riders predicted to churn with those predicted to stay on these three features. This could yield actionable insights and would be a priority for continuing work on this project.
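A feature importance ranking of this kind can be sketched as below. To keep the example dependency-light, it swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost and uses synthetic data with column names borrowed from the table above, so the ranking it prints says nothing about the real riders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Column names taken from the data dictionary; the values are synthetic.
feature_names = ["surge_pct", "avg_dist", "trips_in_first_30_days", "avg_surge", "weekday_pct"]
X, y = make_classification(
    n_samples=500, n_features=len(feature_names), n_informative=3, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1, sorted high to low.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```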
- Use the best fitting model (above) to obtain predicted probabilities for individuals. Target those with greater than some probability of churning (choose this cutoff by considering profit curve based on confusion matrix).
- Further investigate the variables stated above that are important predictors of churn
- Offer discounts or free rides to at-risk users to try and retain them - no need to target users below a certain probability threshold.
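Choosing the probability cutoff from a profit curve could be sketched as follows. The labels, predicted probabilities, and payoff values (`benefit_tp`, `cost_fp`) are all made up, and the sketch makes the simplifying assumption that every targeted at-risk rider is retained.

```python
import numpy as np

def expected_profit(y_true, churn_prob, threshold, benefit_tp, cost_fp):
    """Total profit from targeting every rider with churn_prob >= threshold."""
    targeted = churn_prob >= threshold
    true_pos = np.sum(targeted & (y_true == 1))   # at-risk riders who get the offer
    false_pos = np.sum(targeted & (y_true == 0))  # offers spent on riders who would stay
    return benefit_tp * true_pos - cost_fp * false_pos

# Made-up churn labels (1 = churned) and predicted churn probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
churn_prob = np.array([0.92, 0.81, 0.63, 0.55, 0.42, 0.35, 0.22, 0.08])

thresholds = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
profits = [
    expected_profit(y_true, churn_prob, t, benefit_tp=20, cost_fp=5)
    for t in thresholds
]

# Pick the cutoff with the highest total profit.
best_threshold = thresholds[int(np.argmax(profits))]
print(best_threshold, max(profits))
```

With these made-up payoffs a missed churner costs more than a wasted offer, so the profit-maximizing cutoff sits well below 0.5; different payoff values would move it.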
Classifiers like random forest and boosted trees are quite robust to skewed and non-normally distributed data. We probably did not need to spend time transforming our data or creating dummy variables for percent of weekday rides.
Our team included Micah Shanks (github.com/Jomonsugi), Stuart King (github.com/Stuart-D-King), Jennifer Waller (github.com/jw15), and Ian