This is a spaCy language model for noisy Romanian legal documents with floret n-gram embeddings and LEGAL
entity recognition.
The embeddings are trained on the MARCELL Romanian legislative corpus, consisting of 160K documents available at https://legislatie.just.ro and released by the Research Institute for Artificial Intelligence of the Romanian Academy. We preprocessed the corpus: we removed short sentences, standardized diacritics, tokenized words using an empty spaCy model for Romanian, and dumped every document into a single large file that is publicly available for download here. The model is also listed in the spaCy universe.
To use the spaCy language model right away, install the released version:

```bash
pip install ro-legal-fl
```
Example:

```python
import spacy

nlp = spacy.load("ro_legal_fl")
doc = nlp("Titlul III din LEGEA nr. 255 din 19 iulie 2013, publicată în MONITORUL OFICIAL")

# legal entity identification
for entity in doc.ents:
    print('entity: ', entity, '; entity type: ', entity.label_)

# entity:  III ; entity type:  NUMERIC
# entity:  LEGEA nr. 255 din 19 iulie 2013 ; entity type:  LEGAL
# entity:  MONITORUL OFICIAL ; entity type:  ORG

# floret n-gram embeddings robust to typos
print(nlp('achizit1e public@').similarity(nlp('achiziții publice')))
# 0.7393895566928835
print(nlp('achizitii publice').similarity(nlp('achiziții publice')))
# 0.8996480808279399
```
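The typo robustness above comes from representing words through their character n-grams: a misspelled word still shares most subword units with the correct form. The following is a toy, pure-Python illustration of that idea (not the actual floret implementation; the helper names are invented):

```python
def char_ngrams(word, n_min=4, n_max=5):
    """Extract character n-grams, padding with boundary markers as fastText/floret do."""
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

def jaccard(a, b):
    """Overlap between two n-gram sets."""
    return len(a & b) / len(a | b)

typo = char_ngrams("achizitii")      # missing diacritics
correct = char_ngrams("achiziții")
unrelated = char_ngrams("parlament")

# The typo still shares a noticeable fraction of n-grams with the correct
# form, while an unrelated word shares almost none.
print(jaccard(typo, correct) > jaccard(typo, unrelated))  # True
```

Because the embedding of a word is built from these shared subword units, the vectors of `achizitii` and `achiziții` end up close together even though the surface forms differ.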
The following data is used for training:
- A cleaned version of the MARCELL Romanian legislative corpus, used to train floret embeddings with a hashing bucket size of 100,000, 280 vector dimensions, and character 4-grams and 5-grams
- Romanian Universal Dependencies treebank annotations, used to train the parser, part-of-speech tagger, and lemmatizer; this dataset is essential for training a model that can identify different morphological forms of the same word (e.g., achizitii, achizitie, achizitia, etc.), which depend strongly on the part of speech the word has in a particular context; combining this data with the embeddings trained on the MARCELL corpus results in a more robust model for legal document processing
- LegalNERo corpus, released by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy, which contains named entity annotations for different entity types: Legal, Persons, Locations, Organizations, and Time entities; useful for increasing the model's robustness on legal documents and for identifying mentions of legal acts as entities.
- RoNEC, the Romanian Named Entity Corpus; useful for identifying Persons, Organizations, and several other entity types in documents. Currently at version 2.0, it holds 12,330 sentences with over 0.5M tokens, annotated with 15 classes, for a total of 80,283 distinctly annotated entities.
Feature | Description |
---|---|
Name | ro_legal_fl |
Version | 3.6.1 (pinned to the spaCy version) |
spaCy | >=3.6.1,<3.7.0 |
Default Pipeline | tok2vec, tagger, morphologizer, parser, lemmatizer, attribute_ruler, ner |
Components | tok2vec, tagger, morphologizer, parser, lemmatizer, attribute_ruler, ner |
Vectors | -1 keys, 100000 unique vectors (280 dimensions) |
Sources | MARCELL legislative corpus, LegalNERo, RoNEC |
License | CC4R (https://constantvzw.org/wefts/cc4r.en.html) |
Author | Sergiu Nisioi |
The evaluation of the legal spaCy model is not directly comparable with other models for Romanian, because we used a different training set, a different domain, and a completely different test set. The table below reproduces the scores of the generic Romanian language model released by spaCy, ro_core_news_lg, only to give a rough comparison with the evaluation scores of our model on the legal domain:
Metric | Description | ro-core-news-lg | ro-legal-fl |
---|---|---|---|
TOKEN_ACC | Tokenization accuracy | 1.00 | 1.00 |
TAG_ACC | Part-of-speech tags (fine grained tags, Token.tag) | 0.97 | 0.96 |
SENTS_P | Sentence segmentation (precision) | 0.97 | 0.95 |
SENTS_R | Sentence segmentation (recall) | 0.97 | 0.96 |
SENTS_F | Sentence segmentation (F-score) | 0.97 | 0.96 |
DEP_UAS | Unlabeled dependencies | 0.89 | 0.89 |
DEP_LAS | Labeled dependencies | 0.84 | 0.83 |
LEMMA_ACC | Lemmatization | 0.96 | 0.96 |
POS_ACC | Part-of-speech tags (coarse grained tags, Token.pos) | 0.94 | 0.97 |
MORPH_ACC | Morphological analysis | 0.95 | 0.96 |
NER scores are reported in the following table:
Metric | Description | ro-core-news-lg | ro-legal-fl |
---|---|---|---|
ENTS_P | Named entities (precision) | 0.75 | 0.79 |
ENTS_R | Named entities (recall) | 0.77 | 0.76 |
ENTS_F | Named entities (F-score) | 0.76 | 0.77 |
Below are the evaluation metrics per entity type. The results are consistent with existing published data on legal entity detection.
Entity | P | R | F |
---|---|---|---|
MONEY | 88.52 | 72.32 | 79.61 |
DATETIME | 85.31 | 84.58 | 84.94 |
PERSON | 76.71 | 72.40 | 74.49 |
QUANTITY | 89.27 | 84.55 | 86.85 |
NUMERIC | 86.53 | 81.72 | 84.06 |
LEGAL | 71.24 | 83.85 | 77.03 |
ORG | 69.24 | 71.96 | 70.58 |
ORDINAL | 89.14 | 89.14 | 89.14 |
PERIOD | 84.39 | 74.11 | 78.92 |
NAT_REL_POL | 85.09 | 77.46 | 81.10 |
GPE | 81.95 | 82.75 | 82.35 |
WORK_OF_ART | 39.15 | 28.14 | 32.74 |
LOC | 55.28 | 52.35 | 53.78 |
EVENT | 54.89 | 43.20 | 48.34 |
LANGUAGE | 80.28 | 78.08 | 79.17 |
FACILITY | 60.14 | 47.98 | 53.38 |
The commands below assume you are in the ro_legal_fl directory:

```bash
cd ro_legal_fl
pip install -r requirements.txt
git clone https://github.com/explosion/floret
cd floret
make
```
The training uses continuous bag of words (CBOW) with character subwords between 4 and 5 characters, 2 hashes per entry, and a compact table of 100K entries. The configuration for training the embeddings is defined in project.yml. Before training, floret must be compiled and installed on the machine where training takes place.
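The bucket table works roughly as follows: every character 4- and 5-gram of a word is hashed twice into a table of 100K rows, and the word vector is the average of all the rows it hits, so even out-of-vocabulary or misspelled words always get a vector. A simplified sketch (not floret's real code: the hash function, the random demo table, and the helper names are invented for illustration):

```python
import hashlib
import random
from collections import defaultdict

BUCKETS = 100_000  # hashing bucket size used for the released embeddings
HASHES = 2         # hashes per entry
DIM = 280          # vector dimensions

def bucket_ids(ngram):
    """Map one n-gram to HASHES rows of the bucket table.
    (floret uses FNV-style hashing; md5 here is only for illustration.)"""
    return [int(hashlib.md5(f"{seed}:{ngram}".encode()).hexdigest(), 16) % BUCKETS
            for seed in range(HASHES)]

def word_vector(word, table, n_min=4, n_max=5):
    """Average the bucket rows of every character 4- and 5-gram of the word."""
    padded = f"<{word}>"
    rows = [r
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)
            for r in bucket_ids(padded[i:i + n])]
    vec = [0.0] * DIM
    for r in rows:
        row = table[r]
        for j in range(DIM):
            vec[j] += row[j]
    return [v / len(rows) for v in vec]

# Demo table with random rows, created lazily so we never allocate all 100K rows.
random.seed(0)
table = defaultdict(lambda: [random.uniform(-1, 1) for _ in range(DIM)])

# Even a word with a typo and a stray symbol gets a 280-dimensional vector.
vec = word_vector("achizit1e", table)
print(len(vec))  # 280
```

Because there is no word-level lookup table, the model never returns a zero vector for unseen forms; nearby spellings hit overlapping rows and therefore receive similar vectors.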
To train embeddings from scratch, change to the project directory, make sure floret and spaCy are installed, and run:

```bash
python -m spacy project run either-train-embeddings
```

This runs several shell scripts defined in project.yml to download the corpus and start floret training. If you do not want to train embeddings from scratch, but would rather use the ones we release with the spaCy package, run the following command instead:

```bash
python -m spacy project run either-download-embeddings
```
We provide pre-trained embeddings that can be used with the pipeline. The embeddings are downloaded with the assets:

```bash
python -m spacy project assets
python -m spacy project run either-download-embeddings
```
An example of using floret vectors to identify similar legal terms is shown below:

```bash
./floret/floret nn ./vectors/marcell_clean.dim280.minCount50.n4-5.neg10.modeFloret.hashCount2.bucket100000/vectors.bin
```
For the query word "sectoriale":

Similar word | Similarity score |
---|---|
sectorial/sectoriale | 0.91564 |
sectoriale/intersectoriale | 0.915279 |
transsectoriale | 0.901447 |
subsectoriale | 0.898561 |
naționale/sectoriale | 0.881749 |
multisectoriale | 0.869202 |
publice/sectoriale | 0.863173 |
publică/sectoriale | 0.844522 |
intersectoriale | 0.84431 |
intrasectoriale/intersectoriale | 0.841589 |
The results show a robust response: several variants of the word appear highly similar, including terms that still contain a slash after tokenization.
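Under the hood, a nearest-neighbour query like the one above ranks vocabulary entries by cosine similarity to the query word's vector. A minimal pure-Python sketch of that ranking, with invented toy vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest(query_vec, vocab, k=3):
    """Return the k vocabulary entries most similar to the query vector."""
    scored = sorted(vocab.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [(word, round(cosine(query_vec, vec), 3)) for word, vec in scored[:k]]

# Toy 3-dimensional vectors, invented for illustration only.
vocab = {
    "sectorial": [0.9, 0.1, 0.0],
    "intersectoriale": [0.8, 0.2, 0.1],
    "parlament": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for the vector of "sectoriale"
print(nearest(query, vocab, k=2))
# [('sectorial', 0.994), ('intersectoriale', 0.963)]
```

The real `floret nn` command does the same ranking over the full vocabulary, with the query vector built from the word's character n-grams as described earlier.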
To build the spaCy package, run the following commands in the same directory:

```bash
# download the data and dependencies
python -m spacy project assets
# train and evaluate a model
python -m spacy project run all
# package it
python -m spacy project run package
```
The first command will download the necessary assets: the Romanian Universal Dependencies treebank annotations, the LegalNERo corpus, and the RoNEC corpus (all described in the training data section above).
The second command will run the training pipeline, where each action is defined in the project YAML file as shell scripts. The steps of the pipeline are:
- initialize the downloaded or trained floret vectors in the new spaCy model
- convert treebank dataset to spaCy binary dataset for training
- initialize prediction labels using the configuration defined in configs/ro_legal.cfg
- train tok2vec, tagger, morphologizer, parser, lemmatizer, and senter components using the treebank data
- evaluate the model on the test set
- convert LegalNERo to conllup format
- convert RoNEC to conllup format
- combine the two named entity recognition corpora into a single file
- convert the combined file into spaCy binary format
- initialize the entity prediction labels using the configuration defined in configs/ro_legal.cfg
- train named entity recognizer using the data created
- evaluate the model on the test set
- package everything into a wheel
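As an illustration of one of the steps above, combining the two NER corpora amounts to concatenating the .conllup files while keeping a single `global.columns` header. A sketch of such a helper (the function name and file handling are assumptions, not code from the repository):

```python
from pathlib import Path

def combine_conllup(inputs, output):
    """Concatenate CoNLL-U Plus files, keeping only the first
    '# global.columns' header line."""
    out_lines = []
    header_seen = False
    for path in inputs:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.startswith("# global.columns"):
                if header_seen:
                    continue  # drop duplicate headers from later files
                header_seen = True
            out_lines.append(line)
        if out_lines and out_lines[-1] != "":
            out_lines.append("")  # blank line terminates the last sentence
    Path(output).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
```

The combined file can then be converted to spaCy's binary format with `spacy convert`, as the pipeline does in the next step.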
This will take a lot of time, so please be patient. At the end, a wheel named ro_legal_fl will be created in the packages directory; it can be installed with pip as a standalone package.
This repository contains two datasets:
This dataset consists of an archive containing raw scraped documents covering public procurement legislation (PPL), plus a .csv file with the metadata for each file in the archive: published year, month, header, source URL, and type (primary or secondary).
Files:
- historical_procurement_legislation.zip
- historical_procurement_legislation.csv
This dataset is extracted from the public pages of the Romanian Parliament (Senate and Chamber of Deputies). The files were downloaded in PDF format, and tesseract-ocr was applied to convert them into Romanian text. The archive contains a list of directories named after the PLX id of each legislative proposal from the Chamber of Deputies. Each directory contains a list of txt files encompassing the entire folder of a bill (written advice from different commissions, the various forms that were passed, etc.). Within each proposal directory there are two more directories, called "impact" and "nonrelevant". The "impact" directory contains the articles, paragraphs, and fragments that have been annotated as impacting public procurement legislation; the "nonrelevant" directory contains the remaining content of the bill.
Files:
- cdep_senat_txt_annotated.zip
- impacting_laws.csv