This project involves analyzing sinkhole susceptibility using a variety of machine learning algorithms. The dataset contains multiple geological and environmental features, which are used to predict the occurrence of sinkholes. The analysis includes data preprocessing, feature selection, model building, and evaluation of different machine learning models. In the article, you can mention the following:
-
Pre-processing of the Dataset: The initial pre-processing of the dataset was conducted using ArcGIS, ensuring that the data was accurately prepared for further analysis.
-
Exporting the RF Model to ArcGIS: After training the Random Forest (RF) model, it was exported to ArcGIS to generate a detailed sinkhole susceptibility map for the study area, allowing for a seamless integration of the predictive model with advanced geospatial visualization tools.
Ensure you have the following Python libraries installed:
numpy
pandas
matplotlib
scikit-learn
You can install them using pip:
pip install numpy pandas matplotlib scikit-learn
The dataset sinkhole_data.csv
contains the following columns:
ID
: Unique identifier for each observation.Sinkhole
: Target variable (1 indicates sinkhole occurrence, 0 indicates no sinkhole).Land use
: Land use type.Substrate
: Type of substrate.DTS
: Distance to streams.DTM
: Distance to mines.HD
: Hydraulic head difference.Depth to Bedrock
: Depth to bedrock.DTF
: Distance to fault.DWTSA
: Depth to water table of the surficial aquifer.DTD
: Distance to depressions.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, roc_auc_score, precision_recall_fscore_support, precision_recall_curve
df = pd.read_csv('C:/Users/MUILI OLANREWAJU/Research Thesis/Data/Processed data/sinkhole_data.csv')
print(df.head())
print(df.columns)
print(df.shape)
print(df.info())
print(df.describe())
print(df.isna().sum())
- Slice relevant columns.
- Re-order columns to have non-categorical variables first.
- Perform one-hot encoding for categorical variables.
df = df[df.columns[1:]]
categorical_variables = ['Substrate', 'Land use', 'Sinkhole']
non_categorical_variables = list(set(df.columns) - set(categorical_variables))
order = non_categorical_variables + categorical_variables
df = df[order]
Standardize the dataset before feeding it into machine learning models.
scaler = StandardScaler()
X = scaler.fit_transform(df.iloc[:, :-1].values)
y = df['Sinkhole'].values
Visualize the dataset in two dimensions using PCA.
pca = PCA(n_components=2, random_state=0)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])
Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Several machine learning algorithms are applied with hyperparameter tuning using GridSearchCV:
- Random Forest
- K Nearest Neighbors
- Logistic Regression
- Support Vector Machine
- Multilayer Perceptron Neural Network
Example for Random Forest:
param_grid = {
'n_estimators': [200, 300, 400, 500],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth': np.arange(10, 30, 2),
'criterion': ['gini', 'entropy']
}
grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0), param_grid=param_grid)
grid.fit(X_train, y_train)
print('Best score:', grid.best_score_)
print('Best hyperparameters:', grid.best_params_)
Evaluate the models using confusion matrix, accuracy, ROC AUC, precision, recall, and F1 score.
rf_conf_mat = confusion_matrix(y_test, rf_pred)
rf_acc = accuracy_score(y_test, rf_pred)
rf_roc_auc = roc_auc_score(y_test, rf_proba[:, 1])
print('Accuracy:', rf_acc)
print('ROC AUC:', rf_roc_auc)
Save PCA plots and model performance results for further analysis.
plt.savefig(os.path.join('..data\\figures', 'pca.png'), dpi=300)
The project involves detailed analysis using various machine learning techniques to predict sinkhole susceptibility. The optimal model and its performance metrics provide insights into which features contribute the most to predicting sinkhole occurrences.
Olanrewaju Muili
This project is licensed under the MIT License.
Thanks to The Doe Run Company for providing the dataset and support for this project.
Sinkholes http://publicfiles.dep.state.fl.us/otis/gis/data/FGS_SUBSIDENCE_INCIDENTS.zip
Streams https://geodata.dep.state.fl.us/search?q=geology&sort=-modified
Depressions http://publicfiles.dep.state.fl.us/otis/gis/data/LAND_SURFACE_ELEVATION_24.zip
Active mines https://floridadep.gov/fgs
Fault https://geodata.dep.state.fl.us/search?q=geology&sort=-modified
Substrate http://publicfiles.dep.state.fl.us/OTIS/GIS/data/GEOLOGY_STRATIGRAPHY.zip
Depth to bedrock https://geodata.dep.state.fl.us/search?q=geology&sort=-modified
Depth to water table of Surficial aquifer https://geodata.dep.state.fl.us/search?q=geology&sort=-modified
Head difference https://geodata.dep.state.fl.us/search?q=geology&sort=-modified
Land use http://publicfiles.dep.state.fl.us/otis/gis/data/STATEWIDE_LANDUSE.zip
https://pdfs.semanticscholar.org/608b/627afb47103d4456ea64abf5a9f74dbec1f8.pdf
https://www.sciencedirect.com/science/article/abs/pii/S0022169420305096?via%3Dihub
https://www.mdpi.com/2076-3417/10/15/5047