solipCysme

spaCy pipeline for French fiction and first-person point-of-view texts (with a focus on personal pronouns, moods and tenses), mostly trained on novels.

Feature	Description
Language	French
Name	`fr_solipcysme`
Version	`0.2.4`
spaCy	`==3.8.4`
Default Pipeline	`jusqucy_tokenizer`, `commecy_normalizer`, `jusqucy_normalizer`, `pretagger_hunspell`, `morphologizer`, `viceverser_lemmatizer`, `parser`
Components	`jusqucy_tokenizer`, `jusqucy_normalizer`, `commecy_normalizer`, `morphologizer`, `viceverser_lemmatizer`, `parser`
Vectors	669785 keys, 6697856 unique vectors (100 dimensions)
Sources	Corpus narraFEATS (morphologizer), Universal Dependencies (parser), french-word-vectors (vectors)
License	GPL
Author	thjbdvlt

installation

pip install https://github.com/thjbdvlt/solipCysme/releases/download/v0.2.4/fr_solipcysme-0.2.4-py3-none-any.whl

usage

import spacy

nlp = spacy.load("fr_solipcysme")

doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

for i in doc:
    print(
        i.norm_,      # commecy_normalizer / jusqucy_normalizer
        i.pos_,       # morphologizer
        i.morph,      # morphologizer
        i.lemma_,     # viceverser_lemmatizer
        i.dep_,       # parser
        i.head,       # parser
        i.sent_start, # jusqucy_tokenizer
        i._.ttype,    # jusqucy_tokenizer
        i._.isword,   # jusqucy_tokenizer
    )

print(
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,     # pretagger_hunspell
    doc._.hunspell_is,     # pretagger_hunspell
)

components and architectures

solipCysme is not only a trained pipeline but also a set of minimal pipeline components and model architectures that can be used independently.

SolipcysmeMultiHashed

a modified MultiHashEmbed that makes it possible to use Doc underscore attributes as features. The value of an attribute must be a list of int, and must have the same length as the Doc itself.
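The constraint on attribute values (a list of int, one per token) can be checked before a Doc reaches the model. The sketch below is only an illustration: `check_feature_values` is a hypothetical helper, not part of solipCysme, and a plain list of strings stands in for a spaCy Doc.

```python
from collections.abc import Sized


def check_feature_values(doc: Sized, values: object) -> list:
    """Return `values` if it is a list of int as long as `doc`, else raise."""
    if not isinstance(values, list):
        raise TypeError("feature value must be a list")
    if not all(isinstance(v, int) for v in values):
        raise TypeError("feature value must contain only int")
    if len(values) != len(doc):
        raise ValueError(f"expected {len(doc)} values, got {len(values)}")
    return values


# Stand-in for a tokenized Doc (assumption: one string per token).
tokens = ["la", "machine", "déraille"]
lengths = check_feature_values(tokens, [len(t) for t in tokens])
```

With a real spaCy Doc, the attribute itself would first be registered with `Doc.set_extension` and then assigned through `doc._`.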

SolipcysmeCharEmbed

a modified CharacterEmbed that makes it possible to use underscore attributes as features and that replaces nC (number of characters) with nCstart and nCend, so that one can choose an asymmetric representation of words (e.g., for French, to represent only the suffix, with nCstart = 0 and nCend = 6).
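To make the effect of nCstart and nCend concrete, here is a rough pure-Python sketch of an asymmetric character window. It is not the architecture's actual code: `char_window` and the `_` padding symbol are assumptions made for illustration, and it works on characters, whereas the real layer may operate on bytes.

```python
def char_window(word: str, n_start: int, n_end: int, pad: str = "_") -> list:
    """Take n_start leading and n_end trailing characters, padded to a fixed width."""
    start = list(word[:n_start])
    start += [pad] * (n_start - len(start))  # pad short words on the right
    # word[-0:] would return the whole word, so guard the n_end == 0 case.
    end = list(word[-n_end:]) if n_end > 0 else []
    end = [pad] * (n_end - len(end)) + end  # pad short words on the left
    return start + end


suffix_only = char_window("brouiller", 0, 6)
short_word = char_window("le", 0, 6)
both_sides = char_window("temps", 2, 2)
```

With `n_start = 0` and `n_end = 6`, only the end of each word is kept, which is where much of French verbal inflection is marked.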

pretagger_hunspell

a component that makes Hunspell morphological analysis available as features for the SolipcysmeMultiHashed or SolipcysmeCharEmbed architectures.
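Hunspell reports morphological analyses as space-separated fields with two-letter prefixes such as `po:` (part of speech), `is:` (inflectional suffix) and `st:` (stem). The sketch below shows one way such a string could be split into fields; `parse_hunspell_fields` is a hypothetical helper and the sample string is invented for illustration, so neither reflects how the component is actually implemented.

```python
from collections import defaultdict


def parse_hunspell_fields(analysis: str) -> dict:
    """Group the values of a Hunspell analysis string by field prefix."""
    fields = defaultdict(list)
    for item in analysis.split():
        key, sep, value = item.partition(":")
        if sep and value:  # skip anything that is not a key:value pair
            fields[key].append(value)
    return dict(fields)


# Invented sample, not real dictionary output.
parsed = parse_hunspell_fields("st:machine po:nom is:fem is:sg")
```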

limits and specificities

only knows about straight apostrophes (') and quotes (").
morphologizer depends on the jusqucy_tokenizer, because this tokenizer sets the value of a doc extension (Doc._.jusqucy_ttypes) that the morphologizer uses.
morphologizer also depends on the pretagger_hunspell component, because it uses the output of Hunspell as token features (po: and is: features).
no Gender feature
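Because only straight apostrophes and quotes are recognized, texts with typographic ones may need to be normalized before they are passed to the pipeline. The following is a minimal sketch under that assumption; `straighten_quotes` is a hypothetical helper, not part of the package, and it leaves guillemets (« ») untouched.

```python
# Map typographic apostrophes and quotes to their straight ASCII forms.
STRAIGHT_QUOTES = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark (typographic apostrophe)
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
    }
)


def straighten_quotes(text: str) -> str:
    """Replace curly apostrophes and quotes with straight ones."""
    return text.translate(STRAIGHT_QUOTES)


cleaned = straighten_quotes("l\u2019heure \u201cexacte\u201d")
```

The cleaned string can then be passed to `nlp(...)` as in the usage section.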

license

this work is released under the GPL license.
