5. Clustering
- "timestamp_time" was replaced with "time_of_the_day" feature
- "date_of_incident" was replaced with "week_day", "day_name" and "month_name"
- "business" and "address_2" have many null values, so those features were removed
- "duration" was converted to numerical value and replaced with "duration_in_seconds"
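The replacements listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the raw formats ("HH:MM" for timestamp_time, "HH:MM:SS" for duration), the four time-of-day buckets, the numeric Monday=0 encoding of week_day, and the helper name engineer_time_features are all hypothetical and are not taken from the original notebook.

```python
import pandas as pd

def engineer_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the replacement features from the raw columns."""
    out = df.copy()

    # Assumption: date_of_incident holds strings that pd.to_datetime can parse.
    dates = pd.to_datetime(out["date_of_incident"])
    out["week_day"] = dates.dt.dayofweek        # assumed encoding: Monday=0 ... Sunday=6
    out["day_name"] = dates.dt.day_name()
    out["month_name"] = dates.dt.month_name()

    # Assumption: timestamp_time is a 24-hour "HH:MM" string; buckets are illustrative.
    hours = pd.to_datetime(out["timestamp_time"], format="%H:%M").dt.hour
    out["time_of_the_day"] = pd.cut(
        hours,
        bins=[0, 6, 12, 18, 24],
        right=False,
        labels=["night", "morning", "afternoon", "evening"],
    ).astype(str)

    # Assumption: duration is an "HH:MM:SS" string.
    out["duration_in_seconds"] = pd.to_timedelta(out["duration"]).dt.total_seconds()

    # Drop the replaced columns and the two mostly-null columns, if present.
    dropped = ["timestamp_time", "date_of_incident", "duration", "business", "address_2"]
    return out.drop(columns=dropped, errors="ignore")
```

Calling engineer_time_features on the raw frame returns a copy, so the original data stays available for comparison.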
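import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def scaling_df(df):
    """Label-encode object columns, then min-max scale every column to [0, 1]."""
    X_cluster = df.copy()
    object_cols = df.columns[df.dtypes == object].to_list()
    for col in object_cols:
        # Fit a fresh encoder per column so categories do not leak between columns
        X_cluster[col] = LabelEncoder().fit_transform(X_cluster[col])
    scaler = MinMaxScaler()
    X_cluster_scaled = pd.DataFrame(scaler.fit_transform(X_cluster), columns=X_cluster.columns)
    return X_cluster_scaled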
from sklearn.decomposition import PCA

def pulse_point_pca(X_data, n_components):
    """Fit PCA and return the fitted model together with the transformed data."""
    pca = PCA(n_components=n_components)
    fit_pca = pca.fit(X_data)
    print("Variance explained with {0} components:".format(n_components),
          round(sum(fit_pca.explained_variance_ratio_), 2))
    return fit_pca, fit_pca.transform(X_data)
pca_full, pulsepoint_data_full = pulse_point_pca(X_cluster_scaled, X_cluster_scaled.shape[1])
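import numpy as np
import matplotlib.pyplot as plt

# Plot cumulative explained variance against the number of components (starting at 1)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance)
plt.title("Proportion of PCA variance\nexplained by number of components")
plt.xlabel("Number of components")
plt.ylabel("Proportion of variance explained");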
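The cumulative-variance curve can also be read off numerically. The sketch below finds the smallest number of components whose cumulative explained variance reaches a threshold. The 0.95 threshold, the helper name components_for_variance, and the synthetic data are assumptions for illustration; the notebook does not state which cutoff was used.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, threshold=0.95):
    """Hypothetical helper: smallest component count reaching `threshold` variance."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index with cumulative >= threshold;
    # add 1 to turn the index into a count, and clamp against rounding error.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Synthetic stand-in data: one dominant direction and two low-variance ones.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([10.0, 0.1, 0.1])
k_demo = components_for_variance(X_demo, threshold=0.95)
print("Components needed:", k_demo)
```

With the real data, the same function would be called on the scaled frame instead of X_demo.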
The agency_count (number of agency engagements) and duration_hr (duration in hours) have a positive linear relationship: cities with a higher total incident duration also tend to have more agency engagements.
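One way to quantify a linear relationship like this is the Pearson correlation coefficient. The snippet below is a minimal sketch on made-up per-city totals; the values are hypothetical and only the column names agency_count and duration_hr come from the text.

```python
import pandas as pd

# Hypothetical per-city totals, for illustration only.
cities = pd.DataFrame({
    "duration_hr": [1.0, 2.0, 3.0, 4.0, 5.0],
    "agency_count": [2, 4, 5, 8, 10],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = cities["agency_count"].corr(cities["duration_hr"])
print(f"Pearson r = {r:.3f}")
```

A value close to +1 would support the positive linear relationship; a value near 0 would not.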
X = pulse_point_state_duration_df[['total_agency_engagement', 'total_duration_hr']].values
The k-means clustering algorithm groups the cities into three clusters based on incident duration and number of agency engagements. Cities with a short total duration tend to have few agency engagements, and vice versa.
Group 1 : Cities with very low incident duration and agency engagement
Group 2 : Cities with comparatively higher incident duration and agency engagement
Group 3 : Cities with the highest incident duration and agency engagement
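A three-group k-means fit of this kind could look like the sketch below. The data is synthetic, standing in for the [total_agency_engagement, total_duration_hr] matrix, and the relabelling step that orders groups from lowest to highest engagement is an added convenience rather than something the notebook is known to do.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for [total_agency_engagement, total_duration_hr] per city.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal([200, 150], 40, size=(60, 2)),
    rng.normal([1000, 800], 100, size=(25, 2)),
    rng.normal([2500, 2000], 200, size=(10, 2)),
])

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)

# k-means label ids are arbitrary, so rank clusters by centroid engagement
# to make group 1 the lowest and group 3 the highest.
order = np.argsort(kmeans.cluster_centers_[:, 0])
group_of_label = {int(label): rank + 1 for rank, label in enumerate(order)}
groups = np.array([group_of_label[int(label)] for label in labels])

for g in (1, 2, 3):
    print(f"Group {g}: {np.sum(groups == g)} cities")
```

Because both features are used unscaled here, the one with the larger numeric range has more influence on the distances.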
From the clustering results above, “complete” linkage is not suitable for agglomerative clustering on this data: although four clusters were requested, almost all cities fell into two of them. K-means and “Ward” agglomerative clustering separated the cities better. Note, however, that most cities are densely concentrated where both the number of agency engagements and the total incident duration are low.
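A quick way to check how evenly each linkage distributes points is to count cluster sizes. The sketch below does this for "complete" and "ward" linkage with four requested clusters, on synthetic data that is dense at low values and sparse at high values. The data and the helper name cluster_sizes are assumptions, so the counts it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: many low-valued cities, few high-valued ones.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([200, 150], 60, size=(80, 2)),
    rng.normal([1200, 900], 250, size=(15, 2)),
    rng.normal([3000, 2500], 400, size=(5, 2)),
])

def cluster_sizes(X, linkage, n_clusters=4):
    """Hypothetical helper: number of points assigned to each cluster."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    labels = model.fit_predict(X)
    return np.bincount(labels, minlength=n_clusters)

sizes = {linkage: cluster_sizes(X_demo, linkage) for linkage in ("complete", "ward")}
for linkage, counts in sizes.items():
    # Sorted sizes make it easy to see whether a few clusters hold most points.
    print(linkage, sorted(counts.tolist(), reverse=True))
```

If one linkage puts nearly every point into one or two clusters and leaves the rest almost empty, that is the pattern described above for "complete" linkage.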
K-means++ concentrated on separating the lower-density cities, which produced clusters with unequal parameter ranges:
- Cluster 0: Total incident duration = up to ~400, number of engagements = up to ~500
- Cluster 1: Total incident duration = 400 to ~1200, number of engagements = 500 to ~1500
- Cluster 2: Total incident duration = 1200 to max, number of engagements = 1500 to max
The range of cluster 1 is wider than that of cluster 0 in k-means, whereas Ward agglomerative clustering gave clusters 0 and 1 almost equal ranges. If the range of the parameters (engagements or duration) matters because of some other factor, for example budget allocation with respect to engagements, or business decisions and future planning based on the duration of emergencies, then either clustering could be acceptable, depending on the priority.
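Comparing cluster ranges between two methods is easier with a small summary table. The sketch below computes the minimum, maximum, and count of each feature per cluster. The helper name cluster_ranges and the four-row example frame are hypothetical, and only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

def cluster_ranges(df, labels,
                   cols=("total_agency_engagement", "total_duration_hr")):
    """Hypothetical helper: per-cluster min, max, and count for each feature."""
    tagged = df[list(cols)].assign(cluster=np.asarray(labels))
    return tagged.groupby("cluster").agg(["min", "max", "count"])

# Made-up example values, for illustration only.
demo = pd.DataFrame({
    "total_agency_engagement": [100, 450, 700, 1400],
    "total_duration_hr": [80, 390, 600, 1150],
})
summary = cluster_ranges(demo, labels=[0, 0, 1, 1])
print(summary)
```

Running it once with the k-means labels and once with the Ward labels gives two tables whose ranges can be compared side by side.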