#!/usr/bin/env python
# coding: utf-8
# # BERT FOR DEPRESSION DETECTION
# This project uses a pre-trained model from the BERT family (Bidirectional Encoder Representations from Transformers) to detect depression from text. The details of the implementation are as follows:
# ## Dataset and Preprocessing
# The dataset used for this task contains two columns: 'clean_text', the text of each post, and 'is_depression', a binary label indicating whether the text is indicative of depression (1) or not (0).
#
# The preprocessing phase converts the text into a format BERT can consume. This includes:
#
# Tokenization: each sentence is split into (sub)word tokens while preserving the context of those words.
#
# Special Tokens: BERT expects special markers around each input: a '[CLS]' token is added at the beginning of each sequence and a '[SEP]' token at the end (and between sentences in sentence pairs).
#
# Padding: to maintain a consistent input size, sequences shorter than the chosen maximum length are padded with '[PAD]' tokens.
#
# Attention Masks: an attention mask is an array of 1s and 0s that tells the model which positions hold real tokens (1) and which hold padding (0), so padded positions are ignored during training. A minimal tokenization sketch follows this list.
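# In[ ]:
# A hedged, minimal sketch of the tokenization step described above, using the
# Hugging Face 'transformers' tokenizer. It is illustrative only: the pipeline
# below delegates all of this to the sentence-transformers library, and the
# example sentence and max_length here are assumptions, not project values.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer(
    "i feel completely hopeless today",  # hypothetical example input
    padding='max_length',                # pad with [PAD] up to max_length
    truncation=True,
    max_length=16,
)
print(encoded['input_ids'])       # starts with [CLS] (id 101), ends with [SEP] (id 102), then [PAD] (id 0)
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding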
# ## BERT Model and Training
# In practice, the encoder used below is 'distilbert-base-nli-mean-tokens' from the sentence-transformers library: a distilled BERT variant, pre-trained on a large corpus of uncased English text and fine-tuned on natural-language-inference data, that mean-pools token states into a single 768-dimensional sentence embedding. Like BERT, its Transformer architecture understands the context of a word from all of its surroundings (left and right of the word), so the pre-trained embeddings carry a rich semantic understanding of English that can be leveraged for this task without fine-tuning the encoder itself.
#
# On top of these (frozen) embeddings, a small Keras classifier is trained: a 256-unit ReLU dense layer followed by a single sigmoid unit for the binary decision, optimized with the standard Adam optimizer. A dropout layer can be inserted between the two for regularization; a sketch of that variant follows this section.
#
# The loss function used is 'binary_crossentropy', which is suitable for a binary classification problem, and 'accuracy' is tracked on both the training and validation data.
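# In[ ]:
# A minimal sketch of the dropout-regularized variant mentioned above. The
# classifier actually trained below omits dropout; the 0.3 rate here is an
# assumption for illustration, not a value from the project.
from tensorflow.keras import Sequential, layers

regularized = Sequential([
    layers.Dense(256, activation='relu', input_shape=(768,)),  # same hidden layer as below
    layers.Dropout(0.3),                                       # randomly zero 30% of activations while training
    layers.Dense(1, activation='sigmoid'),                     # binary output
])
regularized.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])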
# ## Evaluation
# The performance of the model is evaluated using a variety of metrics: accuracy, precision, recall, and F1-score (reported via a classification report), along with the confusion matrix to count true positives, true negatives, false positives, and false negatives. The area under the Receiver Operating Characteristic curve (AUC-ROC) is another natural metric here; a sketch of computing it appears after the classification report below.
# In[29]:
import pandas as pd
import numpy as np
# In[30]:
# Load the cleaned Reddit depression dataset
df = pd.read_csv("depression_dataset_reddit_cleaned.csv")
# In[31]:
tweets = df.values[:, 0]                 # 'clean_text' column
labels = df.values[:, 1].astype(float)   # 'is_depression' column (0 or 1)
print(tweets[40], labels[40])            # spot-check one example
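# In[ ]:
# Quick sanity check of the class balance (the column name comes from the
# dataset description above)
print(df['is_depression'].value_counts())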
# In[32]:
get_ipython().system('pip install sentence-transformers')
# In[33]:
from sentence_transformers import SentenceTransformer
# Pre-trained DistilBERT encoder that mean-pools token states into a
# 768-dimensional sentence embedding
bert_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
# In[34]:
# Encode every post into a fixed-size sentence embedding
embeddings = bert_model.encode(tweets, show_progress_bar=True)
print(embeddings.shape)  # (n_samples, 768)
# In[35]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the embeddings for evaluation
X_train, X_test, y_train, y_test = train_test_split(embeddings, labels,
                                                    test_size=0.2, random_state=42)
print("Training set shapes:", X_train.shape, y_train.shape)
print("Test set shapes:", X_test.shape, y_test.shape)
# In[36]:
from tensorflow.keras import Sequential, layers
classifier = Sequential()
classifier.add(layers.Dense(256, activation='relu', input_shape=(768,)))  # hidden layer on top of the embeddings
classifier.add(layers.Dense(1, activation='sigmoid'))                     # sigmoid output for the binary label
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Note that the test split doubles as the validation set here
hist = classifier.fit(X_train, y_train, epochs=100, batch_size=16,
                      validation_data=(X_test, y_test))
# In[37]:
from matplotlib import pyplot
pyplot.figure(figsize=(15,5))
pyplot.subplot(1, 2, 1)
pyplot.plot(hist.history['loss'], 'r', label='Training loss')
pyplot.plot(hist.history['val_loss'], 'g', label='Validation loss')
pyplot.legend()
pyplot.subplot(1, 2, 2)
pyplot.plot(hist.history['accuracy'], 'r', label='Training accuracy')
pyplot.plot(hist.history['val_accuracy'], 'g', label='Validation accuracy')
pyplot.legend()
pyplot.show()
# In[38]:
# Retrain the same architecture for fewer epochs, presumably to curb the
# overfitting the 100-epoch curves above tend to show (validation loss rising
# while training loss keeps falling)
from tensorflow.keras import Sequential, layers
classifier = Sequential()
classifier.add(layers.Dense(256, activation='relu', input_shape=(768,)))
classifier.add(layers.Dense(1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
hist = classifier.fit(X_train, y_train, epochs=30, batch_size=16,
                      validation_data=(X_test, y_test))
# In[39]:
from matplotlib import pyplot
pyplot.figure(figsize=(15,5))
pyplot.subplot(1, 2, 1)
pyplot.plot(hist.history['loss'], 'r', label='Training loss')
pyplot.plot(hist.history['val_loss'], 'g', label='Validation loss')
pyplot.legend()
pyplot.subplot(1, 2, 2)
pyplot.plot(hist.history['accuracy'], 'r', label='Training accuracy')
pyplot.plot(hist.history['val_accuracy'], 'g', label='Validation accuracy')
pyplot.legend()
pyplot.show()
# In[40]:
# Sigmoid probabilities for the test set
y_pred = classifier.predict(X_test)
# In[41]:
# Threshold at 0.5 to obtain hard 0/1 labels
y_pred = (y_pred > 0.5).astype(int)
# In[42]:
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)
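# In[ ]:
# The evaluation section above also mentions AUC-ROC. A minimal sketch, using
# freshly predicted sigmoid probabilities (y_pred was thresholded to hard
# labels above, and AUC needs scores, not labels):
from sklearn.metrics import roc_auc_score

y_prob = classifier.predict(X_test).ravel()  # probabilities in [0, 1]
print("AUC-ROC:", roc_auc_score(y_test, y_prob))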
# In[28]:
from sklearn.metrics import confusion_matrix
# In[43]:
cm = confusion_matrix(y_test, y_pred)
# In[45]:
class_labels = np.unique(y_test)
# In[50]:
import matplotlib.pyplot as plt
import seaborn as sns
# In[51]:
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()
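# In[ ]:
# A hedged end-to-end inference sketch, reusing the `bert_model` encoder and
# the trained `classifier` from above; the sample sentence is a made-up
# illustration, not data from the project.
sample = ["i can't get out of bed and nothing feels worth doing"]
sample_emb = bert_model.encode(sample)                   # shape (1, 768)
prob = float(classifier.predict(sample_emb).ravel()[0])  # sigmoid probability
print("depression probability:", prob, "-> label:", int(prob > 0.5))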