This is a spaCy language model for noisy Romanian legal documents with floret n-gram embeddings and LEGAL
entity recognition.
The embeddings are trained on the MARCELL Romanian legislative corpus, consisting of 160K documents available at https://legislatie.just.ro and released by the Research Institute for Artificial Intelligence of the Romanian Academy. We preprocessed the corpus: we removed short sentences, standardized diacritics, tokenized words using an empty spaCy model for Romanian, and dumped every document into a single large file that is publicly available for download here. The model is also listed in the spaCy universe.
To use the spaCy language model right away, install the released version:

```bash
pip install ro-legal-fl
```
Example:

```python
import spacy

nlp = spacy.load("ro_legal_fl")
doc = nlp("Titlul III din LEGEA nr. 255 din 19 iulie 2013, publicată în MONITORUL OFICIAL")

# legal entity identification
for entity in doc.ents:
    print('entity: ', entity, '; entity type: ', entity.label_)

# entity:  III ; entity type:  NUMERIC
# entity:  LEGEA nr. 255 din 19 iulie 2013 ; entity type:  LEGAL
# entity:  MONITORUL OFICIAL ; entity type:  ORG

# floret n-gram embeddings robust to typos
print(nlp('achizit1e public@').similarity(nlp('achiziții publice')))
# 0.7393895566928835
print(nlp('achizitii publice').similarity(nlp('achiziții publice')))
# 0.8996480808279399
```
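The typo robustness above comes from representing words through their character n-grams: a misspelled word still shares most subword units with the correct form. The following is a toy, pure-Python illustration of that idea (not the actual floret implementation; the helper names are invented):

```python
def char_ngrams(word, n_min=4, n_max=5):
    """Extract character n-grams, padding with boundary markers as fastText/floret do."""
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

def jaccard(a, b):
    """Overlap between two n-gram sets."""
    return len(a & b) / len(a | b)

typo = char_ngrams("achizitii")      # missing diacritics
correct = char_ngrams("achiziții")
unrelated = char_ngrams("parlament")

# The typo still shares a noticeable fraction of n-grams with the correct
# form, while an unrelated word shares almost none.
print(jaccard(typo, correct) > jaccard(typo, unrelated))  # True
```

Because the embedding of a word is built from these shared subword units, the vectors of `achizitii` and `achiziții` end up close together even though the surface forms differ.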
The following data is used for training:
- A cleaned version of the MARCELL Romanian legislative corpus, used to train floret embeddings with a hashing bucket size of 100,000, 280 vector dimensions, and character 4-grams and 5-grams
- Romanian Universal Dependencies treebank annotations, used to train the parser, part-of-speech tagger, and lemmatizer; this dataset is essential for training a model that can identify different morphological forms of the same word (e.g., achizitii, achizitie, achizitia, etc.), which depend strongly on the part of speech the word has in a particular context; combining this data with the embeddings trained on the MARCELL corpus results in a more robust model for legal document processing
- LegalNERo corpus, released by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy, which contains named entity annotations for different entity types: Legal, Persons, Locations, Organizations, and Time entities; useful for increasing the model's robustness on legal documents and for identifying mentions of legal acts as entities.
- RoNEC, the Romanian Named Entity Corpus; useful for identifying Persons, Organizations, and several other entity types in documents. Currently at version 2.0, it holds 12,330 sentences with over 0.5M tokens, annotated with 15 classes, for a total of 80,283 distinctly annotated entities.
Feature | Description |
---|---|
Name | ro_legal_fl |
Version | 3.6.1 (pinned to the spaCy version) |
spaCy | >=3.6.1,<3.7.0 |
Default Pipeline | tok2vec, tagger, morphologizer, parser, lemmatizer, attribute_ruler, ner |
Components | tok2vec, tagger, morphologizer, parser, lemmatizer, attribute_ruler, ner |
Vectors | -1 keys, 100000 unique vectors (280 dimensions) |
Sources | MARCELL legislative corpus, LegalNERo, RoNEC |
License | CC4R (https://constantvzw.org/wefts/cc4r.en.html) |
Author | Sergiu Nisioi |
The evaluation of the legal spaCy model is not directly comparable with other models for Romanian, because we used a different training set, a different domain, and a completely different test set. The table below reproduces the scores of the generic Romanian language model released by spaCy, ro_core_news_lg, only to give a rough comparison with the evaluation scores of our model on the legal domain:
Metric | Description | ro-core-news-lg | ro-legal-fl |
---|---|---|---|
TOKEN_ACC | Tokenization accuracy | 1.00 | 1.00 |
TAG_ACC | Part-of-speech tags (fine grained tags, Token.tag) | 0.97 | 0.96 |
SENTS_P | Sentence segmentation (precision) | 0.97 | 0.95 |
SENTS_R | Sentence segmentation (recall) | 0.97 | 0.96 |
SENTS_F | Sentence segmentation (F-score) | 0.97 | 0.96 |
DEP_UAS | Unlabeled dependencies | 0.89 | 0.89 |
DEP_LAS | Labeled dependencies | 0.84 | 0.83 |
LEMMA_ACC | Lemmatization | 0.96 | 0.96 |
POS_ACC | Part-of-speech tags (coarse grained tags, Token.pos) | 0.94 | 0.97 |
MORPH_ACC | Morphological analysis | 0.95 | 0.96 |
NER scores are reported in the following table:
Metric | Description | ro-core-news-lg | ro-legal-fl |
---|---|---|---|
ENTS_P | Named entities (precision) | 0.75 | 0.79 |
ENTS_R | Named entities (recall) | 0.77 | 0.76 |
ENTS_F | Named entities (F-score) | 0.76 | 0.77 |
Below are the evaluation metrics per entity type. The results are consistent with existing published data on legal entity detection.
Entity | P | R | F |
---|---|---|---|
MONEY | 88.52 | 72.32 | 79.61 |
DATETIME | 85.31 | 84.58 | 84.94 |
PERSON | 76.71 | 72.40 | 74.49 |
QUANTITY | 89.27 | 84.55 | 86.85 |
NUMERIC | 86.53 | 81.72 | 84.06 |
LEGAL | 71.24 | 83.85 | 77.03 |
ORG | 69.24 | 71.96 | 70.58 |
ORDINAL | 89.14 | 89.14 | 89.14 |
PERIOD | 84.39 | 74.11 | 78.92 |
NAT_REL_POL | 85.09 | 77.46 | 81.10 |
GPE | 81.95 | 82.75 | 82.35 |
WORK_OF_ART | 39.15 | 28.14 | 32.74 |
LOC | 55.28 | 52.35 | 53.78 |
EVENT | 54.89 | 43.20 | 48.34 |
LANGUAGE | 80.28 | 78.08 | 79.17 |
FACILITY | 60.14 | 47.98 | 53.38 |
The commands below assume you are in the ro_legal_fl directory:

```bash
cd ro_legal_fl
pip install -r requirements.txt
git clone https://github.com/explosion/floret
cd floret
make
```
The training uses continuous bag of words (CBOW) with character subwords between 4 and 5 characters, 2 hashes per entry, and a compact table of 100K entries. The configuration for training the embeddings is defined in project.yml. Before training, floret must be compiled and installed on the machine where training takes place.
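The bucket table works roughly as follows: every character 4- and 5-gram of a word is hashed twice into a table of 100K rows, and the word vector is the average of all the rows it hits, so even out-of-vocabulary or misspelled words always get a vector. A simplified sketch (not floret's real code: the hash function, the random demo table, and the helper names are invented for illustration):

```python
import hashlib
import random
from collections import defaultdict

BUCKETS = 100_000  # hashing bucket size used for the released embeddings
HASHES = 2         # hashes per entry
DIM = 280          # vector dimensions

def bucket_ids(ngram):
    """Map one n-gram to HASHES rows of the bucket table.
    (floret uses FNV-style hashing; md5 here is only for illustration.)"""
    return [int(hashlib.md5(f"{seed}:{ngram}".encode()).hexdigest(), 16) % BUCKETS
            for seed in range(HASHES)]

def word_vector(word, table, n_min=4, n_max=5):
    """Average the bucket rows of every character 4- and 5-gram of the word."""
    padded = f"<{word}>"
    rows = [r
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)
            for r in bucket_ids(padded[i:i + n])]
    vec = [0.0] * DIM
    for r in rows:
        row = table[r]
        for j in range(DIM):
            vec[j] += row[j]
    return [v / len(rows) for v in vec]

# Demo table with random rows, created lazily so we never allocate all 100K rows.
random.seed(0)
table = defaultdict(lambda: [random.uniform(-1, 1) for _ in range(DIM)])

# Even a word with a typo and a stray symbol gets a 280-dimensional vector.
vec = word_vector("achizit1e", table)
print(len(vec))  # 280
```

Because there is no word-level lookup table, the model never returns a zero vector for unseen forms; nearby spellings hit overlapping rows and therefore receive similar vectors.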
To train embeddings from scratch, change to the project directory, make sure floret and spaCy are installed, and run:

```bash
python -m spacy project run either-train-embeddings
```

This runs several shell scripts defined in project.yml to download the corpus and start floret training. If you do not want to train embeddings from scratch, but would rather use the ones we release with the spaCy package, run the following command instead:

```bash
python -m spacy project run either-download-embeddings
```
We provide pre-trained embeddings that can be used with the pipeline. The embeddings are downloaded with the assets:

```bash
python -m spacy project assets
python -m spacy project run either-download-embeddings
```
An example of using floret vectors to identify similar legal terms is shown below:

```bash
./floret/floret nn ./vectors/marcell_clean.dim280.minCount50.n4-5.neg10.modeFloret.hashCount2.bucket100000/vectors.bin
```
For the query word "sectoriale":

Similar word | Similarity score |
---|---|
sectorial/sectoriale | 0.91564 |
sectoriale/intersectoriale | 0.915279 |
transsectoriale | 0.901447 |
subsectoriale | 0.898561 |
naționale/sectoriale | 0.881749 |
multisectoriale | 0.869202 |
publice/sectoriale | 0.863173 |
publică/sectoriale | 0.844522 |
intersectoriale | 0.84431 |
intrasectoriale/intersectoriale | 0.841589 |
The results show a robust response: several variants of the word appear highly similar, including terms that still contain a slash after tokenization.
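Under the hood, a nearest-neighbour query like the one above ranks vocabulary entries by cosine similarity to the query word's vector. A minimal pure-Python sketch of that ranking, with invented toy vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest(query_vec, vocab, k=3):
    """Return the k vocabulary entries most similar to the query vector."""
    scored = sorted(vocab.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [(word, round(cosine(query_vec, vec), 3)) for word, vec in scored[:k]]

# Toy 3-dimensional vectors, invented for illustration only.
vocab = {
    "sectorial": [0.9, 0.1, 0.0],
    "intersectoriale": [0.8, 0.2, 0.1],
    "parlament": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for the vector of "sectoriale"
print(nearest(query, vocab, k=2))
# [('sectorial', 0.994), ('intersectoriale', 0.963)]
```

The real `floret nn` command does the same ranking over the full vocabulary, with the query vector built from the word's character n-grams as described earlier.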
To build the spaCy package, run the following commands in the same directory:

```bash
# download the data and dependencies
python -m spacy project assets
# train and evaluate a model
python -m spacy project run all
# package it
python -m spacy project run package
```
The first command will download the necessary assets: the Romanian Universal Dependencies treebank annotations, the LegalNERo corpus, and the RoNEC corpus (all described in the training data section above).
The second command will run the training pipeline, where each action is defined in the project YAML file as shell scripts. The steps of the pipeline are:
- initialize the downloaded or trained floret vectors in the new spaCy model
- convert treebank dataset to spaCy binary dataset for training
- initialize prediction labels using the configuration defined in configs/ro_legal.cfg
- train tok2vec, tagger, morphologizer, parser, lemmatizer, and senter components using the treebank data
- evaluate the model on the test set
- convert LegalNERo to conllup format
- convert RoNEC to conllup format
- combine the two named entity recognition corpora into a single file
- convert the combined file into spaCy binary format
- initialize the entity prediction labels using the configuration defined in configs/ro_legal.cfg
- train named entity recognizer using the data created
- evaluate the model on the test set
- package everything into a wheel
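As an illustration of one of the steps above, combining the two NER corpora amounts to concatenating the .conllup files while keeping a single `global.columns` header. A sketch of such a helper (the function name and file handling are assumptions, not code from the repository):

```python
from pathlib import Path

def combine_conllup(inputs, output):
    """Concatenate CoNLL-U Plus files, keeping only the first
    '# global.columns' header line."""
    out_lines = []
    header_seen = False
    for path in inputs:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.startswith("# global.columns"):
                if header_seen:
                    continue  # drop duplicate headers from later files
                header_seen = True
            out_lines.append(line)
        if out_lines and out_lines[-1] != "":
            out_lines.append("")  # blank line terminates the last sentence
    Path(output).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
```

The combined file can then be converted to spaCy's binary format with `spacy convert`, as the pipeline does in the next step.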
This will take a lot of time, so please be patient. At the end, a wheel named ro_legal_fl will be created in the packages directory; it can be installed with pip as a standalone package.
This repository contains two datasets:
This dataset consists of an archive containing raw scraped documents covering public procurement legislation (PPL), plus a .csv file with the metadata for each file in the archive: published year, month, header, source URL, and type (primary or secondary).
Files:
- historical_procurement_legislation.zip
- historical_procurement_legislation.csv
This dataset is extracted from the public pages of the Romanian Parliament (Senate and Chamber of Deputies). The files were downloaded in PDF format, and tesseract-ocr was applied to convert them into Romanian text. The archive contains a list of directories named after the PLX id of each legislative proposal from the Chamber of Deputies. Each directory contains a list of txt files encompassing the entire folder of a bill (written advice from different commissions, the various forms that were passed, etc.). Within each proposal directory there are two more directories, called "impact" and "nonrelevant". The "impact" directory contains the articles, paragraphs, and fragments that have been annotated as impacting public procurement legislation; the "nonrelevant" directory contains the remaining content of the bill.
Files:
- cdep_senat_txt_annotated.zip
- impacting_laws.csv