Disease-word classifier comparing different types of neural networks (CNN, GRU, LSTM) on the ncbi_disease dataset. The data was retrieved through the Hugging Face datasets API. The project applies named entity recognition to disease names.
| id (string) | tokens (array) | ner_tags (array) |
|---|---|---|
| 0 | ["Identification", "of", "APC2", ",", "a", "homologue", "of", "the", "adenomatous", "polyposis", "coli", "tumour", "suppressor", "."] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0] |
| 1 | ["The", "adenomatous", "polyposis", "coli", "(", "APC", ")", "tumour", "-", "suppressor", "protein", "controls", "the", "Wnt", "signalling", "pathway", "by", "forming", "a", "complex", "with", "glycogen", "synthase", "kinase", "3beta", "(", "GSK", "-", "3beta", ")", ",", "axin", "/", "conductin", "and", "betacatenin", "."] | [0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 2 | ["Complex", "formation", "induces", "the", "rapid", "degradation", "of", "betacatenin", "."] | [0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 3 | ["In", "colon", "carcinoma", "cells", ",", "loss", "of", "APC", "leads", "to", "the", "accumulation", "of", "betacatenin", "in", "the", "nucleus", ",", "where", "it", "binds", "to", "and", "activates", "the", "Tcf", "-", "4", "transcription", "factor", "(", "reviewed", "in", "[", "1", "]", "[", "2", "]", ")", "."] | [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 4 | ["Here", ",", "we", "report", "the", "identification", "and", "genomic", "structure", "of", "APC", "homologues", "."] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 5 | ["Mammalian", "APC2", ",", "which", "closely", "resembles", "APC", "in", "overall", "domain", "structure", ",", "was", "functionally", "analyzed", "and", "shown", "to", "contain", "two", "SAMP", "domains", ",", "both", "of", "which", "are", "required", "for", "binding", "to", "conductin", "."] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
Each instance of the dataset contains an array of `tokens`, an array of `ner_tags` and an `id`.
Sample data from the dataset:

```python
{
    'tokens': ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.'],
    'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0],
    'id': '0'
}
```
- `id`: Sentence identifier.
- `tokens`: Array of tokens composing a sentence.
- `ner_tags`: Array of tags, where `0` indicates no disease mentioned, `1` signals the first token of a disease, and `2` the subsequent disease tokens.
For more detailed information, see the ncbi_disease dataset card on the Hugging Face Hub (https://huggingface.co/datasets/ncbi_disease).
The data above contains punctuation characters and stopwords in the tokens column. These elements and their corresponding tags were removed to keep the input clean and avoid unwanted classifications.
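A minimal sketch of this cleaning step, assuming the dataset is loaded through the Hugging Face `datasets` library and that NLTK's English stopword list is used (the exact stopword list is not stated in the original):

```python
import string
from datasets import load_dataset
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

dataset = load_dataset("ncbi_disease")
stop_words = set(stopwords.words("english"))

def clean_example(example):
    # Keep a token (and its tag) only if it is neither punctuation nor a stopword.
    kept = [(tok, tag) for tok, tag in zip(example["tokens"], example["ner_tags"])
            if tok not in string.punctuation and tok.lower() not in stop_words]
    tokens, tags = zip(*kept) if kept else ([], [])
    return {"tokens": list(tokens), "ner_tags": list(tags)}

cleaned = dataset.map(clean_example)
```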
The cleaned tokens are vectorized with a tokenizer so they can be converted to integer sequences and padded. Since the CNN only accepts fixed-size inputs, all sentences are padded to the same length.
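A sketch of the vectorization and padding step with the Keras `Tokenizer` and `pad_sequences`, assuming the cleaned dataset from the previous snippet; the variable names and the `<OOV>` token are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train = cleaned["train"]
sentences = [" ".join(ex["tokens"]) for ex in train]

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
max_len = max(len(s) for s in sequences)  # every sentence is padded to the same length

X_train = pad_sequences(sequences, maxlen=max_len, padding="post")
y_train = pad_sequences([ex["ner_tags"] for ex in train], maxlen=max_len, padding="post")
```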
Since the problem is essentially a multi-class classification problem, all of the output layers use softmax as the activation function. For every model, the sparse categorical crossentropy loss function is used.
After some experiments it was observed that the embedding-only baseline model tends to overfit. For this reason, a dropout layer with a rate of 0.6 is used, and the learning rate is set to 0.0005.
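A minimal sketch of this embedding-only baseline; the dropout rate (0.6), learning rate (0.0005), softmax output and sparse categorical crossentropy loss come from the text above, while the embedding size and epoch count are illustrative assumptions:

```python
import tensorflow as tf

vocab_size = len(tokenizer.word_index) + 1
num_tags = 3  # 0: no disease, 1: first token of a disease, 2: subsequent disease tokens

baseline = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Dropout(0.6),                            # counter the observed overfitting
    tf.keras.layers.Dense(num_tags, activation="softmax"),   # per-token class probabilities
])
baseline.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
baseline.fit(X_train, y_train, validation_split=0.1, epochs=10)  # epoch count illustrative
```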
The CNN model uses only a single 1D convolutional layer, with 16 filters and a kernel size of 2. Like the previous model, it tends to overfit, so a dropout layer is placed before the output layer.
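A sketch of this CNN variant under the same assumptions; the single `Conv1D` layer with 16 filters and kernel size 2 and the dropout before the output layer come from the text, while `padding="same"` is an assumption made here to keep one prediction per token:

```python
cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Conv1D(filters=16, kernel_size=2, padding="same", activation="relu"),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```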
All of the RNN-based models (GRU, LSTM, multiple LSTM) use bidirectional layers in this experiment. The GRU model consists of a single bidirectional GRU layer with 32 units; its main advantage is that it is easier to train than the basic LSTM cells.
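A sketch of the bidirectional GRU model; the 32 units come from the text, while `return_sequences=True` (one output per token) and the embedding size are assumptions:

```python
gru = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, return_sequences=True)),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])
gru.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```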
The bidirectional LSTM model has better results than the GRU model, but its training takes longer.
Due to the limited amount of data, the multiple (stacked) bidirectional LSTM model is the most overfitted model in the experiment, despite the dropout layer and the low learning rate. It also takes the longest to train. Both variants are sketched below.
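Sketches of the two LSTM variants, a single bidirectional LSTM and a stacked ("multiple") bidirectional LSTM; the unit counts are assumptions, since the text does not state them:

```python
def build_lstm(stacked=False):
    layers = [
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    ]
    if stacked:
        # Second bidirectional LSTM layer for the "multiple" variant.
        layers.append(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)))
    layers += [
        tf.keras.layers.Dropout(0.6),
        tf.keras.layers.Dense(num_tags, activation="softmax"),
    ]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

bidi_lstm = build_lstm(stacked=False)
multi_bidi_lstm = build_lstm(stacked=True)
```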
The accuracy and loss values on the test data are listed below.
| Model | Test loss | Test accuracy |
|---|---|---|
| Only Embedding | 0.2331709861755371 | 0.9312196969985962 |
| CNN | 0.2950023710727691 | 0.8634837865829468 |
| GRU | 0.2362357676029205 | 0.9317418932914734 |
| Bidirectional LSTM | 0.2362425923347473 | 0.9315180778503418 |
| Multiple Bidirectional LSTM | 0.2752841413021087 | 0.8601267933845526 |
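These figures are the output of Keras' `evaluate` call; a minimal sketch, assuming `X_test` and `y_test` were prepared with the same tokenizer and padding as the training data:

```python
# Evaluate a trained model (here the GRU) on the held-out test split.
results = gru.evaluate(X_test, y_test, verbose=0)
print("test loss, test acc:", results)
```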