ersatz

Ersatz is a simple, language-agnostic toolkit that both trains sentence segmentation models and provides pretrained, high-performing models for sentence segmentation in a multilingual setting.

For more information, please see the paper: "A Unified Approach to Sentence Segmentation of Punctuated Text in Many Languages" (Wicks and Post, ACL 2021).

QUICK START

Install

Install the Python (3.7+) module via pip

pip install ersatz

or from source

python setup.py install
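
Equivalently, from a clone of the repository, pip can install directly from the source tree (assuming a reasonably recent pip):

pip install .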

Splitting

Ersatz accepts input either from standard input or from a file path, and it writes output in the same manner:

cat raw.txt | ersatz > output.txt
ersatz --input raw.txt --output output.txt

To use a specific model (rather than the default), pass a model name via --model_name or a path via --model_path.
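
For example (the model name below is only illustrative; see the project documentation for the names of released models):

ersatz --model_name en --input raw.txt --output output.txt
ersatz --model_path /path/to/ersatz.model --input raw.txt --output output.txt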

Scoring

Ersatz also provides a simple scoring script which computes F1 for a segmented file against a gold-standard segmentation.

ersatz_score GOLD_STANDARD_FILE FILE_TO_SCORE

The above will print all errors as well as additional metrics at the bottom. The accompanying test suite can be found here.
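
For reference, the F1 here is presumably the standard harmonic mean of precision and recall over predicted sentence boundaries:

F1 = 2 * precision * recall / (precision + recall)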

Training a Model

Data Preprocessing

Vocabulary

Training requires a pretrained SentencePiece model whose --eos_piece has been replaced with <eos> and whose --bos_piece has been replaced with <mos>.

spm_train --input $TRAIN_DATA_PATH \
   --model_prefix ersatz \
   --bos_piece "<mos>" \
   --eos_piece "<eos>"
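
As a quick sanity check (the ersatz prefix below matches --model_prefix above), the exported vocabulary should contain both special pieces:

spm_export_vocab --model=ersatz.model --output=ersatz.vocab
grep -E "<eos>|<mos>" ersatz.vocab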

Create training data

This pipeline takes a raw text file with one sentence per line (to use as labels) and creates a new raw text file with the appropriate left/right context and labels. One line is one training example. The user is expected to shuffle this file manually (e.g., via shuf) after creation.

  1. To create the training data:
python dataset.py \
    --sentencepiece_path $SPM_PATH \
    --left-size $LEFT_SIZE \
    --right-size $RIGHT_SIZE \
    --output_path $OUTPUT_PATH \
    $INPUT_TRAIN_FILE_PATHS


shuf $OUTPUT_PATH > $SHUFFLED_TRAIN_OUTPUT_PATH
  2. Repeat for validation data:
python dataset.py \
    --sentencepiece_path $SPM_PATH \
    --left-size $LEFT_SIZE \
    --right-size $RIGHT_SIZE \
    --output_path $VALIDATION_OUTPUT_PATH \
    $INPUT_DEV_FILE_PATHS

Training

A training command looks something like this:

python trainer.py \
    --sentencepiece_path=$vocab_path \
    --left_size=$left_size \
    --right_size=$right_size \
    --output_path=$out \
    --transformer_nlayers=$transformer_nlayers \
    --activation_type=$activation_type \
    --linear_nlayers=$linear_nlayers \
    --min-epochs=$min_epochs \
    --max-epochs=$max_epochs \
    --lr=$lr \
    --dropout=$dropout \
    --embed_size=$embed_size \
    --factor_embed_size=$factor_embed_size \
    --source_factors \
    --nhead=$nhead \
    --log_interval=$log_interval \
    --validation_interval=$validation_interval \
    --eos_weight=$eos_weight \
    --early_stopping=$early_stopping \
    --tb_dir=$LOGDIR \
    $train_path \
    $valid_path
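
If --tb_dir writes TensorBoard logs (as the name suggests), training can be monitored with the standard TensorBoard CLI (assuming TensorBoard is installed):

tensorboard --logdir $LOGDIR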

Splitting with a Pre-Trained Model

  1. Expects a model_path (this should probably change to a default in an expected folder location...)
  2. ersatz reads from either stdin or a file path (via --input).
  3. ersatz writes to either stdout or a file path (via --output).
  4. An alternate candidate set for splitting may be given using --determiner_type
    • multilingual (default) is as described in the paper
    • en requires a space following punctuation
    • all allows a space between any two characters
    • A custom determiner can be written using the determiner.Split() base class
  5. By default, expects raw text input. Splitting a .tsv file is also supported.
    1. --text_ids expects a comma separated list of column indices to split
    2. --delim changes the delimiter character (default is \t)
  6. Uses a GPU if available; to force CPU, use --cpu

Example usage

Typical python usage:

python split.py --input unsegmented.txt --output sentences.txt ersatz.model

stdin/stdout usage:

cat unsegmented.txt | split.py ersatz.model > sentences.txt

To split .tsv file:

cat unsegmented.tsv | split.py ersatz.model --text_ids 1 > sentences.txt
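
The candidate-set and delimiter options described above combine in the same way (file names here are illustrative):

cat unsegmented.txt | split.py ersatz.model --determiner_type en > sentences.txt
cat unsegmented.csv | split.py ersatz.model --text_ids 1 --delim "," > sentences.csv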

Scoring a Model's Output

python score.py [gold_standard_file_path] [file_to_score]

(There are legacy arguments, but they're not used)

Changelog

1.0.0 original release
