Norm exceptions
From the documentation page of spaCy:
spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".
The list of Greek norm exceptions was built by parsing a Greek dictionary.
Dictionaries usually use a dedicated symbol to point from a word to the word it is a slight spelling variation of. The first word is the norm exception and the second is its norm. In the dictionary we parsed, this symbol was "->".
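The parsing step can be sketched roughly as follows. This is a minimal illustration, not the script actually used: it assumes each dictionary entry sits on one line in the form `variant -> norm`, the function name `parse_norm_exceptions` is hypothetical, and the sample entries are illustrative spelling variants rather than lines taken from the parsed dictionary.

```python
def parse_norm_exceptions(lines, symbol="->"):
    """Collect {variant: norm} pairs from dictionary lines containing `symbol`.

    Assumption: one entry per line, written as "variant -> norm".
    Lines without the symbol are ordinary entries and are skipped.
    """
    exceptions = {}
    for line in lines:
        if symbol not in line:
            continue
        # Split only on the first occurrence of the symbol.
        variant, norm = line.split(symbol, 1)
        variant, norm = variant.strip(), norm.strip()
        # Ignore malformed entries and trivial self-mappings.
        if variant and norm and variant != norm:
            exceptions[variant] = norm
    return exceptions


# Illustrative input only; the real dictionary format may differ.
sample_lines = [
    "αβγό -> αυγό",
    "σπίτι",
    "τραίνο -> τρένο",
]
norm_exceptions = parse_norm_exceptions(sample_lines)
```

A real dictionary file will likely carry extra markup around each entry, so the splitting logic would need to be adapted to that format.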
The full list can be found here. In the list, the first column contains the exceptions and the second column contains the corresponding norms.
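To make the two-column layout concrete, here is a small sketch of how such a list could be loaded and used for lookups. It assumes the columns are separated by whitespace and that each cell is a single word; the helper names `load_norm_table` and `normalise` are hypothetical and are not part of spaCy, and the rows shown are illustrative.

```python
def load_norm_table(rows):
    """Build an {exception: norm} mapping from two-column rows.

    Assumption: columns are whitespace-separated, one word per cell.
    Rows that do not have exactly two columns are skipped.
    """
    table = {}
    for row in rows:
        parts = row.split()
        if len(parts) != 2:
            continue
        exception, norm = parts
        table[exception] = norm
    return table


def normalise(word, table):
    """Return the norm for `word`, or the word itself if it is not an exception."""
    return table.get(word, word)


# Illustrative rows: first column is the exception, second is the norm.
rows = ["αβγό\tαυγό", "τραίνο\tτρένο"]
table = load_norm_table(rows)
```

With this table, `normalise("αβγό", table)` yields the norm from the second column, while a word that is not in the first column is returned unchanged.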
To extend the list, please see the Contributing page of this wiki.