spaCy pipeline for French fiction and first-person point-of-view texts (with a focus on personal pronouns, moods and tenses), mostly trained on novels.
Feature | Description |
---|---|
Language | French |
Name | fr_solipcysme |
Version | 0.2.4 |
spaCy | ==3.8.4 |
Default Pipeline | jusqucy_tokenizer, commecy_normalizer, jusqucy_normalizer, pretagger_hunspell, morphologizer, viceverser_lemmatizer, parser |
Components | jusqucy_tokenizer, jusqucy_normalizer, commecy_normalizer, morphologizer, viceverser_lemmatizer, parser |
Vectors | 669785 keys, 6697856 unique vectors (100 dimensions) |
Sources | Corpus narraFEATS (morphologizer), Universal Dependencies (parser), french-word-vectors (vectors) |
License | GPL |
Author | thjbdvlt |
pip install https://github.com/thjbdvlt/solipCysme/releases/download/v0.2.4/fr_solipcysme-0.2.4-py3-none-any.whl
import spacy
nlp = spacy.load("fr_solipcysme")
doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)
for i in doc:
    print(
        i.norm_,  # commecy_normalizer / jusqucy_normalizer
        i.pos_,  # morphologizer
        i.morph,  # morphologizer
        i.lemma_,  # viceverser_lemmatizer
        i.dep_,  # parser
        i.head,  # parser
        i.sent_start,  # jusqucy_tokenizer
        i._.ttype,  # jusqucy_tokenizer
        i._.isword,  # jusqucy_tokenizer
    )
print(
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,  # pretagger_hunspell
    doc._.hunspell_is,  # pretagger_hunspell
)
solipCysme is not only a trained pipeline, but also a set of minimal pipeline components and model architectures that can be used independently.
a modified `MultiHashEmbed` that makes it possible to use `Doc` underscore attributes as features. The value of an attribute must be a list of `int`, and must have the same length as the `Doc` itself.
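The constraint on attribute values (a list of `int`, one entry per token) can be checked before training. The sketch below is plain Python and does not call solipCysme: `validate_doc_feature` and the token-length feature are hypothetical names chosen for illustration. In a real pipeline the list would be stored on a `Doc` extension registered with spaCy's `Doc.set_extension`.

```python
def validate_doc_feature(values, n_tokens):
    """Return `values` unchanged if it is a list of int with one entry per token."""
    if not isinstance(values, list):
        raise TypeError("feature must be a list")
    if len(values) != n_tokens:
        raise ValueError(f"expected {n_tokens} values, got {len(values)}")
    for value in values:
        # bool is a subclass of int in Python; rejecting it is an assumption
        # made here to keep the feature unambiguous.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"feature values must be int, got {type(value).__name__}")
    return values


# Hypothetical feature: the length of each token, one int per token.
tokens = ["la", "machine", "à", "brouiller", "le", "temps"]
token_lengths = validate_doc_feature([len(t) for t in tokens], len(tokens))
print(token_lengths)  # [2, 7, 1, 9, 2, 5]
```

A list that is shorter or longer than the document raises a `ValueError` instead of being silently misaligned with the tokens.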
a modified `CharacterEmbed` that makes it possible to use underscore attributes as features and that replaces `nC` (number of characters) with `nCstart` and `nCend`, so that one can choose an asymmetric representation of words (e.g., for French, to keep only the suffix, with `nCstart = 0` and `nCend = 6`).
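To make the effect of `nCstart` and `nCend` concrete, here is a minimal sketch of an asymmetric character window. It is not the solipCysme implementation: the helper `char_window`, the `_` padding character, and the choice to work on characters rather than UTF-8 bytes are all assumptions made for readability.

```python
def char_window(word, n_start, n_end, pad="_"):
    """Keep the first n_start and the last n_end characters of a word.

    Short words are padded so the result always has n_start + n_end characters.
    """
    start = word[:n_start].ljust(n_start, pad)
    # word[-0:] would return the whole word, so n_end == 0 is handled explicitly.
    end = word[-n_end:].rjust(n_end, pad) if n_end > 0 else ""
    return start + end


# Suffix-only representation, as in the French example (nCstart = 0, nCend = 6).
suffix_only = char_window("déraillée", n_start=0, n_end=6)
print(suffix_only)                  # aillée
print(char_window("la", 0, 6))      # ____la
print(char_window("machine", 2, 3)) # maine
```

With `n_start = 0` the window covers only word endings, which is where most French inflectional morphology appears.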
a component that makes Hunspell morphological analysis available as features for the SolipcysmeMultiHashe or SolipcysmeCharEmbed architectures.
- only knows about straight apostrophes (`'`) and quotes (`"`).
- morphologizer depends on the `jusqucy_tokenizer`, because this tokenizer sets a value to a doc extension (`Doc._.jusqucy_ttypes`), used by the morphologizer.
- morphologizer depends on the `pretagger_hunspell` component, too, because the morphologizer uses the output of Hunspell as token features (`po:` and `is:` features).
- no `Gender` feature
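Because only straight apostrophes and quotes are recognized, one possible workaround is to normalize typographic variants before passing text to the pipeline. The following is a sketch, not part of solipCysme: `straighten_quotes` and the set of characters in `QUOTE_MAP` are assumptions, and French guillemets (`«`, `»`) are deliberately left untouched.

```python
# Map typographic apostrophes and quotes to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text):
    """Replace curly apostrophes and quotes with straight ones, character for character."""
    return text.translate(QUOTE_MAP)


raw = "s\u2019est \u201cpeut-être\u201d déraillée"
cleaned = straighten_quotes(raw)
print(cleaned)  # s'est "peut-être" déraillée
```

Each replacement is one character for one character, so character offsets in the cleaned text still line up with the original. The cleaned string can then be passed to the loaded pipeline, for example `nlp(straighten_quotes(text))`.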
this work is released under the GPL license.