Models
From the spaCy documentation:

In general, spaCy expects all model packages to follow the naming convention of [lang]_[name]. For spaCy's models, we also chose to divide the name into three components:
- type: Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities)
- genre: Type of text the model is trained on (e.g. web for web text, news for news text)
- size: Model size indicator (sm, md or lg)

For example, en_core_web_sm is a small English model trained on written web text (blogs, news, comments), that includes vocabulary, vectors, syntax and entities.
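
To make the naming convention concrete, here is a minimal sketch that splits a package name into its four parts. The `ModelName` type and the `parse_model_name` helper are hypothetical names introduced for illustration; they are not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Hypothetical container for the parts of a model package name."""
    lang: str   # language code, e.g. "en" or "el"
    kind: str   # model capabilities ("type" in the convention), e.g. "core"
    genre: str  # type of text the model was trained on, e.g. "web"
    size: str   # size indicator: "sm", "md" or "lg"


def parse_model_name(package: str) -> ModelName:
    """Split a name of the form [lang]_[type]_[genre]_[size]."""
    parts = package.split("_")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 underscore-separated parts, got {package!r}"
        )
    return ModelName(*parts)


print(parse_model_name("en_core_web_sm"))
```

Names that do not follow the four-part pattern raise a `ValueError` here, which is a design choice of this sketch rather than spaCy behaviour.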
The Greek models were trained on data from here for the POS/DEP tagger, and on data that we produced ourselves, which can be found here, for NER (check the Prodigy Wiki for more info).
Following the naming conventions mentioned above, two models have been trained for the Greek language:
- el_core_web_sm: Vocabulary, syntax, entities.
- el_core_web_lg: Vocabulary, syntax, entities, word-vectors.
- Import model:

      import spacy

      nlp = spacy.load('el_core_web_sm')
      # For the model with vectors, run the following command:
      # nlp = spacy.load('el_core_web_lg')
- Get doc object:

      # replace with your own text
      # ("Greece is among the most beautiful countries in the world")
      text = "Η Ελλάδα είναι από τις ομορφότερες χώρες του κόσμου"
      doc = nlp(text)
- Tokenize and lemmatize your sentence:

      for token in doc:
          print("Token:{}, Lemma:{}".format(token, token.lemma_))

  Output:

      Token:Η, Lemma:η
      Token:Ελλάδα, Lemma:ελλάδα
      Token:είναι, Lemma:είναι
      Token:από, Lemma:από
      Token:τις, Lemma:τις
      Token:ομορφότερες, Lemma:ομορφός
      Token:χώρες, Lemma:χώρα
      Token:του, Lemma:του
      Token:κόσμου, Lemma:κόσμου
- Get POS tags for each of the tokens:

      for token in doc:
          print("Token:{}, Tag:{}".format(token, token.tag_))

  Output:

      Token:Η, Tag:DET
      Token:Ελλάδα, Tag:PROPN
      Token:είναι, Tag:AUX
      Token:από, Tag:ADP
      Token:τις, Tag:DET
      Token:ομορφότερες, Tag:ADJ
      Token:χώρες, Tag:NOUN
      Token:του, Tag:DET
      Token:κόσμου, Tag:NOUN
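
Once the tags are available, a common next step is to count them or filter tokens by tag. The sketch below works on the (token, tag) pairs copied from the output above, so it runs without a model installed. With a loaded model, the same list could presumably be built as `[(token.text, token.tag_) for token in doc]`. The variable names are illustrative.

```python
from collections import Counter

# (token, tag) pairs copied from the POS tagging output above.
tagged = [
    ("Η", "DET"), ("Ελλάδα", "PROPN"), ("είναι", "AUX"), ("από", "ADP"),
    ("τις", "DET"), ("ομορφότερες", "ADJ"), ("χώρες", "NOUN"),
    ("του", "DET"), ("κόσμου", "NOUN"),
]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only common and proper nouns.
nouns = [token for token, tag in tagged if tag in {"NOUN", "PROPN"}]

print(tag_counts.most_common(2))
print(nouns)
```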
- Visualize POS tags and dependencies:

      from spacy import displacy

      displacy.serve(doc)
- Get named entities out of your sentence:

      for ent in doc.ents:
          print("Entity:{}, Label:{}".format(ent.text, ent.label_))

  Output:

      Entity:Ελλάδα, Label:GPE
- Visualize named entities:

      from spacy import displacy

      displacy.serve(doc, style="ent")
- Detect similarity between texts:

      # for this, we will need the model with the word-vectors
      nlp = spacy.load('el_core_web_lg')

      # doc1: firefighters searching for survivors after the fires
      doc1 = nlp('Οι πυροσβέστες ψάχνουν αγωνιωδώς για επιζώντες. Οι φωτιές διέλυσαν τα πάντα. Τα πάντα είναι απανθρακωμένα.')
      # doc2: the Fire Service still looking for the missing after the fire
      doc2 = nlp('Το Πυροσβεστικό Σώμα συνεχίζει να αναζητά τους αγνωούμενους. Η πυρκαγιά κατέλυσε όλη την περιοχή. Όλα έγιναν στάχτη και κάρβουνο.')
      # doc3: buying a dog, a cat and a rabbit
      doc3 = nlp('Χθες αγόρασα ένα σκύλο! Και μια γάτα! Και ένα κουνέλι!')

      print(doc1.similarity(doc2))
      print(doc1.similarity(doc3))

  Output:

      0.7155315553393391
      0.46625177182352695
As expected, the first two texts, which are semantically close, have a high similarity score. In contrast, the first and last texts, which discuss different topics, have a lower similarity score.
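
To give some intuition for these scores: by default, spaCy's `Doc.similarity` is the cosine similarity of the two document vectors, and a document vector is the average of its token vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional vectors are invented for illustration and are not taken from el_core_web_lg, and the `mean_vector` and `cosine_similarity` helpers are hypothetical names, not spaCy functions.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length token vectors, component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "token vectors" (assumed values, for illustration only).
fire_text = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
rescue_text = [[0.7, 0.3, 0.0], [0.9, 0.0, 0.2]]
pets_text = [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]]

close = cosine_similarity(mean_vector(fire_text), mean_vector(rescue_text))
far = cosine_similarity(mean_vector(fire_text), mean_vector(pets_text))

print(close, far)
```

Texts whose averaged vectors point in a similar direction score close to 1, while texts on unrelated topics score lower, which matches the pattern in the output above.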
For many more submodules derived from these models, check here.