
# jusquci -- tokenizer for french (PostgreSQL/spaCy)

| text | tokens |
| --- | --- |
| `jusqu'ici=>` | `jusqu' ici =>` |
| `celle-ci-->ici` | `celle -ci --> ici` |
| `lecteur-rice-x-s` | `lecteur-rice-x-s` |
| `peut-être--là` | `peut-être -- là` |
| `correcteur·rices` | `correcteur·rices` |
| `mais.maintenant` | `mais . maintenant` |
| `[re]lecteur.rice.s` | `[re]lecteur.rice.s` |
| `autre(s)` | `autre(s)` |
| `(autres)` | `( autres )` |
| `(autre(s))` | `( autre(s) )` |
| `www.on-tenk.com` | `www.on-tenk.com` |
| `[@becker_1982,p.12]` | `[ @becker_1982 , p. 12 ]` |
| `oui..?` | `oui ..?` |
| `dedans/dehors` | `dedans / dehors` |
| `:happy: :) pour:` | `:happy: :) pour :` |
| `ô.ô^^=):-)xd` | `ô.ô ^^ =) :-) xd` |

## postgresql extension

the primary role of this tokenizer is to be used as a text search parser in postgresql, hence it is provided here as a postgresql extension.

```
make install install_stop
```

```sql
create extension jusquci;

select to_tsvector(
    'jusquci',
    'le quotidien,s''invente-t-il par mille.manière de braconner???'
);
```

## in python

the single provided function (`tokenize`) returns four lists:

- tokens: a list of strings.
- token types: a list of token type IDs; the types are defined as an Enum (`jusqucy.ttypes.TokenType`).
- spaces: a list of boolean values indicating whether each token is followed by a space (for spaCy, mostly).
- is_sent_start: a list of boolean values used to set `Token.is_sent_start` (based on the token types).
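the four lists are parallel, aligned index by index. a minimal sketch of that return shape, using a toy whitespace tokenizer (the `ToyTokenType` enum and the splitting logic are illustrative stand-ins, not jusquci's actual implementation):

```python
# toy sketch of the (tokens, ttypes, spaces, is_sent_start) interface;
# the real tokenizer is far more elaborate (see the table above).
from enum import IntEnum

class ToyTokenType(IntEnum):  # hypothetical stand-in for jusqucy.ttypes.TokenType
    WORD = 0
    PUNCT = 1

def toy_tokenize(text):
    tokens, ttypes, spaces, is_sent_start = [], [], [], []
    words = text.split(" ")
    for i, w in enumerate(words):
        tokens.append(w)
        ttypes.append(ToyTokenType.PUNCT if w in ".?!" else ToyTokenType.WORD)
        spaces.append(i < len(words) - 1)  # every token but the last is followed by a space
        is_sent_start.append(i == 0)       # only the first token starts a sentence here
    return tokens, ttypes, spaces, is_sent_start

tokens, ttypes, spaces, starts = toy_tokenize("bonjour le monde .")
```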

the tokenizer can be used in a spacy pipeline. it tokenizes the text and adds an attribute to the resulting `Doc` object, `Doc._.ttypes`, in which the token types are stored (assigning a type to each individual token would take much more time).

```python
import spacy
import jusqucy

nlp = spacy.blank('fr')
nlp.tokenizer = jusqucy.JusqucyTokenizer(nlp.vocab)

# or:
nlp = spacy.load(your_model, config={
    "nlp": {"tokenizer": {"@tokenizers": "jusqucy_tokenizer"}}
})
```

to get the token types:

```python
from jusqucy.ttypes import TokenType

for token, ttype in zip(doc, doc._.jusqucy_ttypes):
    print(token, TokenType(ttype))  # look up the Enum member by its ID
```

## normalizer

a normalizer can also be used as a spacy component. it replaces the `norm_` attribute of tokens of some ttypes, in order to make the work of the following components (e.g. the morphologizer or the parser) easier:

- url: `https://`
- number: `2`
- ordinal: `2ème`
- emoticon: `:)`
- emoji: `:)`
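conceptually, this is a lookup from token type to a canonical replacement norm. a minimal sketch of the idea (the type names and the `NORMS` table below are assumptions for illustration, not jusqucy's actual internals):

```python
# illustrative sketch of type-based norm replacement: tokens of certain
# types all get the same canonical norm, everything else keeps its surface form.
NORMS = {
    "url": "https://",
    "number": "2",
    "ordinal": "2ème",
    "emoticon": ":)",
    "emoji": ":)",
}

def normalize(token_text, token_type):
    """return the norm for a token: a canonical stand-in for some
    token types, the surface form for everything else."""
    return NORMS.get(token_type, token_text)
```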

## as a command line tool

to use jusquci as a simple command line tokenizer (that reads from stdin), just compile it with the makefile in the `cli` directory. the program reads a text from standard input and outputs tokens separated by spaces. it also adds newlines after strong punctuation signs (`.`, `?`, `!`).
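the output format described above (space-separated tokens, a newline after each strong punctuation sign) can be sketched in python; this reimplements only the formatting step, under the simplifying assumption that each strong punctuation sign is a single-character token:

```python
# sketch of the cli output format: tokens joined by spaces,
# with a line break after each strong punctuation sign (., ?, !).
def format_tokens(tokens):
    lines, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "?", "!"}:   # strong punctuation ends the line
            lines.append(" ".join(current))
            current = []
    if current:                      # flush any trailing tokens
        lines.append(" ".join(current))
    return "\n".join(lines)
```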

## sources

- tsexample, for the code.
- the stopwords list is the concatenation of postgresql's default stopwords for french (`french.stop`) and a list established by Jacques Savoy¹. I've also added a few words: elided words with an apostrophe (e.g. `c'`), to be consistent with the jusquci parser (postgresql doesn't include the apostrophe), and non-binary pronouns (e.g. *iel*, *celleux*).

## os

only tested on linux (debian) and postgresql 16.

## license

licensed under GPLv3.

## footnotes

1. *A stemming procedure and stopword list for general french corpora*, Jacques Savoy, Institut interfacultaire d'informatique, Journal of the American Society for Information Science, 50(10), 1999, 944-952. I removed a word from this list: *passé*.
