This project provides an API to perform word alignment between two sentences.
The set of supported languages depends on the transformer architecture used.
See main.py for a runnable example:
```python
from word_alignment import WordAlignment  # import path is an assumption; see main.py in this repo

sentence1 = "Today I went to the supermarket to buy apples".split()
sentence2 = "Oggi io sono andato al supermercato a comprare le mele".split()

BERT_NAME = "bert-base-multilingual-cased"

# Build the aligner on CPU and align the two sentences.
wa = WordAlignment(model_name=BERT_NAME, tokenizer_name=BERT_NAME, device='cpu', fp16=False)
_, decoded = wa.get_alignment(sentence1, sentence2, calculate_decode=True)

for (sentence1_w, sentence2_w) in decoded:
    print(sentence1_w, "\t--->", sentence2_w)
```
Output:
```
Today ---> Oggi
I ---> io
went ---> andato
to ---> al
the ---> al
supermarket ---> supermercato
to ---> a
buy ---> comprare
apples ---> mele
```
The signature of `get_alignment` is `(List[str], List[str], bool) -> Tuple[List[int], List[List[str]]]`.
To speed up the computation, you can skip the decoding step by setting the boolean `calculate_decode` argument to False.
If `calculate_decode` is False, the second returned value will be None.
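For example, a minimal sketch reusing the `wa` instance and sentences from the example above (the exact meaning of the first returned value is an assumption based on the signature):

```python
# Skip the decoding step when only the raw alignment is needed.
alignment, decoded = wa.get_alignment(sentence1, sentence2, calculate_decode=False)

print(alignment)  # presumably the alignment indices (List[int]) from the signature
print(decoded)    # None, because calculate_decode=False
```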
WordAlignment supports FP16, but we discourage its use.
WordAlignment is fully compatible with NVIDIA CUDA.
To use CUDA you have to install the CUDA build of the Torch-Scatter library; I made a simple script to automate it:

```bash
bash cuda_install_requirements.sh
```

N.B.: the CUDA installation of Torch-Scatter takes several minutes to compile.
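After installing the CUDA dependencies, the model can be moved to the GPU by changing the `device` argument. A minimal sketch, assuming the constructor accepts the usual torch device strings:

```python
import torch

# Fall back to CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# fp16=True could also be passed on a GPU, but its use is discouraged above.
wa_gpu = WordAlignment(model_name=BERT_NAME, tokenizer_name=BERT_NAME, device=device, fp16=False)
_, decoded = wa_gpu.get_alignment(sentence1, sentence2, calculate_decode=True)
```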
Requirements:

- Python3
- Torch
- Transformers
- Torch-Scatter
Author:

- Andrea Bacciu - Github