Binary classification
ivan-pavlov edited this page Mar 17, 2019 · 1 revision
We will try to classify a tumour as benign or malignant from its histological characteristics.
library(rvw)
library(mlbench) # For a dataset
data("BreastCancer", package = "mlbench")
data_full <- BreastCancer
First, we start with data preprocessing:
data_full <- data_full[complete.cases(data_full),]
ind_train <- sample(1:nrow(data_full), 0.8*nrow(data_full))
summary(data_full)
#> Id                Cl.thickness  Cell.size    Cell.shape   Marg.adhesion  Epith.c.size  Bare.nuclei  Bl.cromatin  Normal.nucleoli  Mitoses
#> Length:683        1      :139   1      :373  1      :346  1      :393    2      :376   1      :402  3      :161  1      :432      1      :563
#> Class :character  5      :128   10     : 67  2      : 58  2      : 58    3      : 71   10     :132  2      :160  10     : 60      2      : 35
#> Mode  :character  3      :104   3      : 52  10     : 58  3      : 58    4      : 48   2      : 30  1      :150  3      : 42      3      : 33
#>                   4      : 79   2      : 45  3      : 53  10     : 55    1      : 44   5      : 30  7      : 71  2      : 36      10     : 14
#>                   10     : 69   4      : 38  4      : 43  4      : 33    6      : 40   3      : 28  4      : 39  8      : 23      4      : 12
#>                   2      : 50   5      : 30  5      : 32  8      : 25    5      : 39   8      : 21  5      : 34  6      : 22      7      :  9
#>                   (Other):114   (Other): 78  (Other): 93  (Other): 61    (Other): 65   (Other): 40  (Other): 68  (Other): 68      (Other): 17
#> Class
#> benign   :444
#> malignant:239
We can see that "benign" cases appear more often in our dataset. We will use this to set up a baseline model.
data_full <- data_full[,-1]
data_full$Class <- ifelse(data_full$Class == "malignant", 1, -1)
data_train <- data_full[ind_train,]
data_test <- data_full[-ind_train,]
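The split above draws a fresh random sample on every run, so the accuracies reported on this page may vary slightly between runs. A minimal sketch of a reproducible 80/20 split, assuming a fixed seed; the seed value 42 and the toy data frame `toy` are illustrative and not part of the original example:

```r
# Toy data frame standing in for data_full (hypothetical)
toy <- data.frame(x = 1:10, Class = rep(c(-1, 1), times = 5))

set.seed(42)                                 # arbitrary seed, an assumption
n_train <- floor(0.8 * nrow(toy))            # explicit integer sample size
ind <- sample(seq_len(nrow(toy)), n_train)   # row indices used for training

toy_train <- toy[ind, ]
toy_test  <- toy[-ind, ]

nrow(toy_train)  # 8
nrow(toy_test)   # 2
```

Calling `set.seed()` once before `sample()` is enough to make the same rows land in the training set each time the script is run.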
Our baseline model simply predicts every tumour to be benign:
baseline_pred <- rep(-1, length(data_test$Class))
# Accuracy for binary classification case
acc_prc <- function(y_pred, y_true){sum(y_pred == y_true) / length(y_pred) * 100}
acc_prc(baseline_pred, data_test$Class)
#> [1] 64.9635
With our baseline model, we get an accuracy of around 65%.
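This figure is close to the share of benign cases in the whole dataset, which is roughly what a majority-class baseline can be expected to score. A quick check, using only the class counts from the summary output above:

```r
# Class counts taken from the summary() output above
class_counts <- c(benign = 444, malignant = 239)

# Share of each class, in percent
class_share <- class_counts / sum(class_counts) * 100
round(class_share, 1)   # benign is about 65.0, malignant about 35.0
```

The test-set value differs slightly from the full-dataset share because it is computed on the random 20% hold-out rather than on all 683 rows.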
Now we are ready to use Vowpal Wabbit models.
test_vwmodel <- vwsetup(dir = "./", model = "mdl.vw",
option = "binary") # Convert predictions to {-1,+1}
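As a rough illustration of what converting predictions to {-1,+1} means, a real-valued score can be mapped to a class label by its sign. The scores below are made up, and the handling of a score of exactly zero is an assumption of this sketch rather than documented Vowpal Wabbit behaviour:

```r
# Hypothetical raw scores from a linear model
raw_scores <- c(-2.3, -0.1, 0.4, 1.7)

# Map each score to a class label by its sign
# (a score of exactly 0 is sent to -1 here, an assumption)
labels <- ifelse(raw_scores > 0, 1, -1)
labels
#> [1] -1 -1  1  1
```

Because the labels in `Class` were recoded to 1 and -1 earlier, predictions in this form can be compared with them directly.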
Basic training and testing
vwtrain(vwmodel = test_vwmodel,
data = data_train,
passes = 10,
targets = "Class",
quiet = TRUE)
vw_output <- vwtest(vwmodel = test_vwmodel, data = data_test, quiet = TRUE)
acc_prc(vw_output, data_test$Class)
#> [1] 97.08029
This is a much better result, with an accuracy of around 97%.
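Accuracy alone does not show which class the remaining errors fall in, and that can matter when missing a malignant tumour is costlier than a false alarm. A minimal base-R sketch of a confusion matrix and the sensitivity for the malignant class; the vectors `y_true` and `y_pred` are made-up stand-ins for `data_test$Class` and `vw_output`:

```r
# Hypothetical labels: 1 = malignant, -1 = benign
y_true <- c( 1,  1,  1, -1, -1, -1, -1, -1)
y_pred <- c( 1,  1, -1, -1, -1, -1,  1, -1)

# Rows are predictions, columns are true classes
conf_mat <- table(predicted = y_pred, actual = y_true)
conf_mat

# Sensitivity: share of malignant cases that were predicted malignant
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
sensitivity   # 2 of the 3 malignant cases are caught in this toy data
```

The same two lines could be applied to the real vectors to see whether the model's few mistakes are missed malignant cases or false alarms.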