# -*- coding: utf-8 -*-
"""Plant_disease_identification.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/18DLR5Tgs0Ihyvs8t8I0t-CqkI7aeH-aW
# Plant disease identification
##### Author: Nikas Belogolov, 11th grade, class 6
---
## Introduction
This project aims to develop a neural network-based image classifier to identify diseases in pepper and potato plants. Leveraging deep learning techniques, the model will be trained to distinguish between healthy plants and those affected by various diseases. This application is crucial for early disease detection and effective crop management, potentially leading to higher yields and reduced losses.
### Goals
1. Identify Diseases: Develop a robust model capable of identifying diseases in pepper and potato plants from images.
2. Learn Image Identification with Neural Networks: Understand the process of building, training, and evaluating an image classification model using neural networks.
3. Understand and Use SHAP Values: Learn what SHAP (SHapley Additive exPlanations) values are and how to use them for model interpretability.
### Dataset
For this project I've used the [PlantVillage Dataset](https://www.kaggle.com/datasets/emmarex/plantdisease) from Kaggle.
The dataset contains images of pepper, tomato and potato plants categorized into different classes, including healthy plants and various disease conditions.
The dataset was truncated to 5 classes for simplicity (2 classes for pepper plants and 3 for potato plants), and split into training, validation, and test sets to ensure the model's robustness and generalizability.
## Setup
"""
!pip install tensorflow numpy seaborn scikit-learn pandas matplotlib shap Pillow

import os
import random
import shutil
from pathlib import Path  # part of the standard library, no pip install needed

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.keras.layers import Dense, Flatten
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import shap
from PIL import Image
from sklearn.metrics import classification_report, confusion_matrix
"""Local and cloud environments config, and constant variables."""
if os.getenv("COLAB_RELEASE_TAG"):
from google.colab import drive
drive.mount('/content/drive')
dir = "/content/drive/MyDrive/PlantVillage_Dataset"
split_dir = dir + "/split"
models_dir = dir + "/models"
else:
dir = "C:\\Users\\belog\\Documents\\PlantVillage_Dataset"
split_dir = dir + "\\split"
models_dir = dir + "\\models"
CLASSES = ['Pepper__bell___Bacterial_spot', 'Pepper__bell___healthy', 'Potato___Early_blight', 'Potato___Late_blight', 'Potato___healthy']
IMG_SIZE = (128, 128)
IMG_SIZE_VGG19 = (224, 224)
EPOCHS = 100
LEARNING_RATE = 0.001
EARLY_STOPPING_PATIENCE = 5
"""## Train Test Validation Split
Here we split the dataset from Kaggle into three folders: training, validation, and testing.
The split ratio is 75% for training, 15% for testing, and 10% for validation.
"""
# Source and destination folders for the split.
source_dir = base_dir
destination_dir = split_dir

# Create the split folders and class subfolders in the destination.
new_folders = ['train', 'test', 'val']
for folder in new_folders:
    for class_name in CLASSES:
        new_folder_path = os.path.join(destination_dir, folder, class_name)
        Path(new_folder_path).mkdir(parents=True, exist_ok=True)

def split_and_copy_files(source_dir, destination_dir, split_ratios):
    for class_name in CLASSES:
        class_files = os.listdir(os.path.join(source_dir, class_name))
        random.shuffle(class_files)
        num_files = len(class_files)
        train_split = int(num_files * split_ratios[0])
        test_split = int(num_files * split_ratios[1])
        for i, file_name in enumerate(class_files):
            if i < train_split:
                dest_folder = os.path.join(destination_dir, 'train', class_name)
            elif i < train_split + test_split:
                dest_folder = os.path.join(destination_dir, 'test', class_name)
            else:
                dest_folder = os.path.join(destination_dir, 'val', class_name)
            shutil.copy(os.path.join(source_dir, class_name, file_name), dest_folder)

# Split ratios as described in the text cell above: 75% train, 15% test, 10% validation.
split_ratios = [0.75, 0.15, 0.10]
split_and_copy_files(source_dir, destination_dir, split_ratios)
print("Files copied successfully!")
"""## Explore Dataset
The first graph shows the number of images in the training, validation, and testing datasets.
"""
# Count the images per class in each split folder
data = {
    'Label': [],
    'Split': [],
    'Size': []
}
for split_folder in os.listdir(split_dir):
    # Iterate over CLASSES explicitly so labels and sizes stay aligned
    for class_name in CLASSES:
        data["Label"].append(class_name)
        data["Split"].append(split_folder)
        data["Size"].append(len(os.listdir(f"{split_dir}/{split_folder}/{class_name}")))

# Convert to DataFrame
df = pd.DataFrame(data)

# Plot using catplot
ax = sns.catplot(x='Split', y='Size', hue='Label', data=df, kind='bar', height=8, aspect=1.5)
plt.title('Train-Test-Validation Distribution by Labels')
plt.ylabel('Size')
plt.xlabel('Split')
sns.move_legend(ax, "upper right")
plt.show()
"""The second is random images, drawn from each class."""
random_images = []
for class_name in CLASSES:
choice = random.choice(os.listdir(dir + "/" + class_name))
random_images.append(f"{dir}/{class_name}/{choice}")
fig, axes = plt.subplots(1, 5, figsize=(18, 4))
for i, path in enumerate(random_images):
# Load and display the image
img = plt.imread(path)
axes[i].imshow(img)
axes[i].set_title(CLASSES[i])
axes[i].axis('off')
plt.show()
"""## Normalization
The dataset is normalized to have pixel values ranging from 0 to 1 instead of 0 to 255, which speeds up the fitting process later.
"""
directory = "/content/drive/MyDrive/PlantVillage_Dataset/split"
# Initialization of ImageDataGenerator with rescale
img_data_generator = ImageDataGenerator(rescale=1./225)
# Define batch size
batch_size = 32
# Set up the image generators using flow_from_directory
train_generator = img_data_generator.flow_from_directory(
split_dir + '/train',
target_size=IMG_SIZE_VGG19,
batch_size=batch_size,
class_mode='categorical')
val_generator = img_data_generator.flow_from_directory(
split_dir + '/val',
target_size=IMG_SIZE_VGG19,
batch_size=batch_size,
class_mode='categorical')
test_generator = img_data_generator.flow_from_directory(
split_dir + '/test',
target_size=IMG_SIZE_VGG19,
batch_size=batch_size,
class_mode='categorical')
"""## Verify that all images are valid
Some images were found to be corrupted, so remove all of the corrupted images.
"""
# Iterate through each folder
for folder in os.listdir(split_dir + "/train"):
# Get a list of all image files in the folder
image_files = os.listdir(split_dir + "/train/" + folder)
# Iterate through each image file
for image_file in image_files:
# Construct the full path to the image file
image_path = os.path.join(split_dir + "/train/" + folder, image_file)
# Try to open the image using PIL
try:
image = Image.open(image_path)
except Exception as e:
os.remove(image_path)
print(f"Error opening image '{image_path}': {e}")
"""## Model Building"""
def create_model(input_shape):
    """Fully-connected baseline classifier."""
    model = tf.keras.models.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model

def create_conv_model(input_shape):
    """Transfer-learning classifier: frozen VGG19 base with a softmax head."""
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet', input_shape=input_shape)
    for layer in vgg.layers:
        layer.trainable = False
    x = Flatten()(vgg.output)
    prediction = Dense(5, activation='softmax')(x)
    model = tf.keras.models.Model(inputs=vgg.input, outputs=prediction)
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model

def save_model(model, model_name):
    model.save(f'{models_dir}/{model_name}.keras')

def load_model(model_name):
    return tf.keras.models.load_model(f'{models_dir}/{model_name}')
"""## Model Architecture and Parameters
### Model Architecture
- **Input Layer**: The model takes an input shape of (128, 128, 3), which corresponds to images of size 128x128 pixels with three color channels (RGB). This input layer does not modify the data but passes it to the next layer.
- **Flatten Layer**: The Flatten layer converts the multi-dimensional input data into a one-dimensional array, making it suitable for the dense layers that follow.
- **Dense Layers**: Following the flatten layer, there are 4 dense layers with ReLU activation. Each dense layer has less neurons than the layer before it.
- **Output Layer**: The output layer uses the softmax activation function, and produces a probability distribution over the 5 classes for classification.
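For reference, softmax turns the output layer's raw scores $z_1, \dots, z_5$ into class probabilities:

$$\hat{y}_c = \frac{e^{z_c}}{\sum_{j=1}^{5} e^{z_j}}$$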
### Parameters
- Learning Rate: The model uses the Adam optimizer with a learning rate specified by the LEARNING_RATE variable.
- Number of Epochs: The model is trained for a number of epochs specified by the EPOCHS variable. Each epoch represents one complete pass through the training dataset.
- Optimizer: The model uses the Adam optimizer, which adapts the learning rate during training for improved convergence.
- Loss Function: The model is compiled with the Categorical Crossentropy loss function, which is suitable for multi-class classification problems (see the formula after this list).
- Metrics: The model tracks accuracy as a metric to evaluate its performance during training and validation.
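For a single sample with one-hot label $y$ and predicted probability vector $\hat{y}$, categorical crossentropy is

$$\mathcal{L}(y, \hat{y}) = -\sum_{c=1}^{5} y_c \log \hat{y}_c$$

so only the probability the model assigns to the true class contributes to the loss.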
### Callbacks
- Model Checkpoint: The ModelCheckpoint callback saves the best version of the model based on validation loss, preventing overfitting by preserving the best weights.
- Early Stopping: The EarlyStopping callback monitors the validation loss and stops training if it does not improve for a specified number of epochs (patience), as defined by the EARLY_STOPPING_PATIENCE variable.
"""
input_shape = IMG_SIZE_VGG19 + (3,)
model = create_conv_model(input_shape)
model.summary()
# plot_model(model, show_shapes=True, show_layer_names=True)
"""## Model Training"""
CHECKPOINT_FILEPATH = f"{base_dir}/checkpoints_conv.keras"
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=CHECKPOINT_FILEPATH, save_best_only=True)
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=EARLY_STOPPING_PATIENCE)
history = model.fit(train_generator, epochs=EPOCHS, validation_data=val_generator, callbacks=[stop_early, checkpoint], use_multiprocessing=True, workers=4)
"""## Training history graphs"""
# Plot the accuracy over epochs
plt.plot(history.history['accuracy'], label='Training Accuracy', color="green")
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Plot the loss over epochs
plt.plot(history.history['loss'], label='Training Loss', color="green")
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.locator_params(axis="x", integer=True, tight=True)
plt.show()
"""## Saving and Loading Model"""
save_model("disease_identification_conv")
models = []
print("enter model number: ")
for i, model in enumerate(os.listdir(models_dir)):
print(f'{i} {model}')
models.append(model)
model = load_model(f'{models[int(input())]}')
"""## Model Evaluation
### Evaluate Model
"""
# target_size must match the loaded model's input: IMG_SIZE for the dense
# model, IMG_SIZE_VGG19 for the VGG19 model. shuffle=False keeps predictions
# aligned with test_generator.classes below.
test_generator = img_data_generator.flow_from_directory(
    split_dir + '/test',
    target_size=IMG_SIZE,
    batch_size=batch_size,
    shuffle=False,
    class_mode='categorical')
# Get true labels from the test generator
true_labels = test_generator.classes
# Get predictions from the model using the test generator
predictions = model.predict(test_generator)
# Convert predictions to class labels
predicted_labels = np.argmax(predictions, axis=1)
print("test loss, test acc:", model.evaluate(test_generator))
print("Classification Report:\n", classification_report(true_labels, predicted_labels, target_names=CLASSES))
"""### Confusion Matrix"""
print("Confusion Matrix:")
# Calculate confusion matrix
cm = confusion_matrix(true_labels, predicted_labels)
print(cm)
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', xticklabels=CLASSES, yticklabels=CLASSES)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
"""## Dense Model Results
Epoch 27/100
152/152 [==============================] - 64s 417ms/step - loss: 0.2371 - accuracy: 0.9177 - val_loss: 0.3519 - val_accuracy: 0.8779
43/43 [==============================] - 12s 267ms/step - loss: 0.3395 - accuracy: 0.8801
test loss, test acc: [0.3395175337791443, 0.8801470398902893]
```
Classification Report:
precision recall f1-score support
Pepper__bell___Bacterial_spot 0.81 0.83 0.82 275
Pepper__bell___healthy 0.99 0.79 0.88 363
Potato___Early_blight 0.92 1.00 0.96 282
Potato___Late_blight 0.77 0.94 0.85 276
Potato___healthy 0.97 0.85 0.90 164
accuracy 0.88 1360
macro avg 0.89 0.88 0.88 1360
weighted avg 0.89 0.88 0.88 1360
```
## Explaining model outputs with SHAP values
### What are SHAP values?
SHAP (SHapley Additive exPlanations) values are a method used in machine learning to understand how individual features contribute to model predictions. They quantify the impact of each feature on a model's output for a specific data point, helping to interpret the model's decisions better.
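Formally, for a model $f$ with feature set $F$, the SHAP value of feature $i$ is its average marginal contribution across all feature subsets $S$:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]$$

For image models, each pixel (per channel) plays the role of a feature, which is why the explainer below works on raw pixel arrays.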
### How did I integrate them into my project?
To integrate SHAP into my project, I had to do a little research on what kind of input the SHAP explainer takes.
I found that it takes a numpy array of the image pixels, so I converted the images into numpy arrays.
"""
"""### Convert images to numpy arrays and save to a .npz file"""

def images_to_arrays(split_name):
    """Load every image in a split as a (128, 128, 3) array with its class index."""
    images, labels = [], []
    for label in os.listdir(f"{split_dir}/{split_name}"):
        for file in os.listdir(f"{split_dir}/{split_name}/{label}"):
            image = Image.open(f"{split_dir}/{split_name}/{label}/{file}")
            image = image.resize(IMG_SIZE).convert('RGB')
            images.append(np.array(image))
            labels.append(CLASSES.index(label))
    return images, labels

x_train, y_train = images_to_arrays("train")
x_test, y_test = images_to_arrays("test")

# The keyword names used here (x_train, ...) must match the keys used when loading the file back.
np.savez(f"{base_dir}/mnistlikedataset.npz", x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test)
"""### Load images data from .npz file"""
path = "/content/drive/MyDrive/PlantVillage_Dataset/mnistlikedataset.npz"
x_train, y_train, x_test, y_test
with np.load(path) as data:
#load DataX as train_data
x_train = data['x_train']
y_train = data['y_train']
x_test = data['x_test']
y_test = data['y_test']
num_classes = len(CLASSES)
input_shape = IMG_SIZE + (3,)

# Scale pixel values to the [0, 1] range, matching the generators used for training
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# Convert class vectors to one-hot (binary class) matrices
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
"""### Initialize SHAP Explainer"""
# Select a set of background examples to take an expectation over
background = x_train[np.random.choice(x_train.shape[0], 700, replace=False)]
explainer = shap.DeepExplainer(model, background)

"""### Compute SHAP Values"""

# Explain the model's predictions on the first five test images
shap_values = explainer.shap_values(x_test[0:5])
"""### Plot SHAP Values"""
random_index = random.randint(0, 4)
shap_value = shap_values[random_index]

fig, axs = plt.subplots(3, 6, figsize=(15, 10))
axs[0, 0].set_title("Original Image")
for i, channel in enumerate(["Red", "Green", "Blue"]):
    axs[i, 0].imshow(x_test[random_index])
    axs[i, 0].set(ylabel=f"{channel} channel")

for ax in axs.flat:
    ax.set_yticks([])
    ax.set_xticks([])
    # Hide x labels and tick labels for top plots and y ticks for right plots.
    ax.label_outer()

for class_idx, cls in enumerate(CLASSES):
    # Min-max normalize this class's SHAP values so they map onto the colormap
    class_shap = shap_value[:, :, :, class_idx]
    shap_values_normalized = (class_shap - class_shap.min()) / (class_shap.max() - class_shap.min())
    for channel_idx, channel in enumerate(["Red", "Green", "Blue"]):
        axs[channel_idx, class_idx + 1].imshow(shap_values_normalized[:, :, channel_idx], cmap='seismic', vmin=0, vmax=1)
        if channel_idx == 0:
            axs[channel_idx, class_idx + 1].set_title(cls)

plt.tight_layout()
plt.show()
"""# Discussion and Conclusions
## Discussion
### Confusion Matrix
The model demonstrates robust performance with an overall accuracy of 88%. It excels particularly in identifying **Potato___Early_blight** and has high precision in classifying healthy pepper and potato plants.
The primary area for improvement is reducing misclassifications between **Pepper__bell___Bacterial_spot**, **Pepper__bell___healthy**, and **Potato___Late_blight**. Despite these misclassifications, the model is highly effective for the task of multi-class classification of plant diseases.
### Classification Report
**Precision** - measures how many of the model's positive predictions are actually correct.
**Recall** - measures how well the model identifies all relevant instances.
**F1** - balances precision and recall; it is their harmonic mean (see the formulas below).
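In terms of true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$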
The model was precise when predicting healthy pepper and potato plants, but struggled to identify all instances of them (high precision, lower recall).
For the diseased classes it produced more false positives, yet it found most of the actual instances (lower precision, high recall).
```
precision recall f1-score support
Pepper__bell___Bacterial_spot 0.81 0.83 0.82 275
Pepper__bell___healthy 0.99 0.79 0.88 363
Potato___Early_blight 0.92 1.00 0.96 282
Potato___Late_blight 0.77 0.94 0.85 276
Potato___healthy 0.97 0.85 0.90 164
accuracy 0.88 1360
macro avg 0.89 0.88 0.88 1360
weighted avg 0.89 0.88 0.88 1360
```
### SHAP
In this project I used SHAP (SHapley Additive exPlanations) for the first time; SHAP values can be used to explain the output of any machine learning model.
Although SHAP can provide good insight into a model, in this project it did not help me much in understanding how the model works and why it returns the outputs it does. Still, I will use SHAP in future projects to understand model outputs.
### Limitations
It is hard to identify similar-looking plants, and even harder to differentiate between diseases, as one plant can have multiple similar-looking diseases.
Even with those limitations, the model reaches close to 90% accuracy on similar-looking plants and diseases.
### Future Work
Future work for this project could include:
* Expanding the dataset (more plant species and diseases).
* Improving the model architecture.
* Deploying the model in a real-world application.
## Conclusions
This project successfully demonstrated the application of deep learning techniques for identifying diseases in pepper and potato plants using image classification.
The neural network model achieved a high level of accuracy, effectively distinguishing between healthy and diseased plants.
These contributions highlight the potential of AI-driven solutions in early disease detection and effective crop management, paving the way for improved agricultural productivity and reduced crop losses.
"""