MartinoMensio/spacy-sentence-bert

Sentence-BERT for spaCy

This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can substitute the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.

The models below are suggested for analysing sentence similarity, as indicated by their STS benchmark scores. Keep in mind that sentence-transformers models are configured with a maximum sequence length of 128. For longer texts, other models (e.g. Universal Sentence Encoder) may therefore be more suitable.

Install

Compatibility:

  • python 3.7/3.8/3.9/3.10
  • spaCy>=3.0.0,<4.0.0, last tested on version 3.5
  • sentence-transformers: tested on version 2.2.2

To install this package, you can run one of the following:

  • pip install spacy-sentence-bert
  • pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git

You can install standalone spaCy packages from GitHub with pip. With a standalone package installed, you can load the language model directly through the spacy.load API, without adding a pipeline stage. The table below takes the models listed in the Sentence Transformers documentation and shows some statistics for each, along with the command to install the standalone model. If you don't want to install the standalone models, you can still use them by adding a pipeline stage (see below).

sentence-BERT name spacy model name dimensions language STS benchmark standalone install
paraphrase-distilroberta-base-v1 en_paraphrase_distilroberta_base_v1 768 en 81.81 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_paraphrase_distilroberta_base_v1-0.1.2.tar.gz#en_paraphrase_distilroberta_base_v1-0.1.2
paraphrase-xlm-r-multilingual-v1 xx_paraphrase_xlm_r_multilingual_v1 768 50+ 83.50 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_paraphrase_xlm_r_multilingual_v1-0.1.2.tar.gz#xx_paraphrase_xlm_r_multilingual_v1-0.1.2
stsb-roberta-large en_stsb_roberta_large 1024 en 86.39 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2
stsb-roberta-base en_stsb_roberta_base 768 en 85.44 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_base-0.1.2.tar.gz#en_stsb_roberta_base-0.1.2
stsb-bert-large en_stsb_bert_large 1024 en 85.29 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_large-0.1.2.tar.gz#en_stsb_bert_large-0.1.2
stsb-distilbert-base en_stsb_distilbert_base 768 en 85.16 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_distilbert_base-0.1.2.tar.gz#en_stsb_distilbert_base-0.1.2
stsb-bert-base en_stsb_bert_base 768 en 85.14 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_base-0.1.2.tar.gz#en_stsb_bert_base-0.1.2
nli-bert-large en_nli_bert_large 1024 en 79.19 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large-0.1.2.tar.gz#en_nli_bert_large-0.1.2
nli-distilbert-base en_nli_distilbert_base 768 en 78.69 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base-0.1.2.tar.gz#en_nli_distilbert_base-0.1.2
nli-roberta-large en_nli_roberta_large 1024 en 78.69 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_large-0.1.2.tar.gz#en_nli_roberta_large-0.1.2
nli-bert-large-max-pooling en_nli_bert_large_max_pooling 1024 en 78.41 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_max_pooling-0.1.2.tar.gz#en_nli_bert_large_max_pooling-0.1.2
nli-bert-large-cls-pooling en_nli_bert_large_cls_pooling 1024 en 78.29 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_cls_pooling-0.1.2.tar.gz#en_nli_bert_large_cls_pooling-0.1.2
nli-distilbert-base-max-pooling en_nli_distilbert_base_max_pooling 768 en 77.61 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base_max_pooling-0.1.2.tar.gz#en_nli_distilbert_base_max_pooling-0.1.2
nli-roberta-base en_nli_roberta_base 768 en 77.49 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_base-0.1.2.tar.gz#en_nli_roberta_base-0.1.2
nli-bert-base-max-pooling en_nli_bert_base_max_pooling 768 en 77.21 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_max_pooling-0.1.2.tar.gz#en_nli_bert_base_max_pooling-0.1.2
nli-bert-base en_nli_bert_base 768 en 77.12 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base-0.1.2.tar.gz#en_nli_bert_base-0.1.2
nli-bert-base-cls-pooling en_nli_bert_base_cls_pooling 768 en 76.30 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_cls_pooling-0.1.2.tar.gz#en_nli_bert_base_cls_pooling-0.1.2
average_word_embeddings_glove.6B.300d en_average_word_embeddings_glove.6B.300d 300 en 61.77 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.6B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.6B.300d-0.1.2
average_word_embeddings_komninos en_average_word_embeddings_komninos 300 en 61.56 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_komninos-0.1.2.tar.gz#en_average_word_embeddings_komninos-0.1.2
average_word_embeddings_levy_dependency en_average_word_embeddings_levy_dependency 300 en 59.22 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_levy_dependency-0.1.2.tar.gz#en_average_word_embeddings_levy_dependency-0.1.2
average_word_embeddings_glove.840B.300d en_average_word_embeddings_glove.840B.300d 300 en 52.54 pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.840B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.840B.300d-0.1.2
quora-distilbert-base en_quora_distilbert_base 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_quora_distilbert_base-0.1.2.tar.gz#en_quora_distilbert_base-0.1.2
quora-distilbert-multilingual xx_quora_distilbert_multilingual 768 50+ N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_quora_distilbert_multilingual-0.1.2.tar.gz#xx_quora_distilbert_multilingual-0.1.2
msmarco-distilroberta-base-v2 en_msmarco_distilroberta_base_v2 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilroberta_base_v2-0.1.2.tar.gz#en_msmarco_distilroberta_base_v2-0.1.2
msmarco-roberta-base-v2 en_msmarco_roberta_base_v2 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_roberta_base_v2-0.1.2.tar.gz#en_msmarco_roberta_base_v2-0.1.2
msmarco-distilbert-base-v2 en_msmarco_distilbert_base_v2 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilbert_base_v2-0.1.2.tar.gz#en_msmarco_distilbert_base_v2-0.1.2
nq-distilbert-base-v1 en_nq_distilbert_base_v1 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nq_distilbert_base_v1-0.1.2.tar.gz#en_nq_distilbert_base_v1-0.1.2
distiluse-base-multilingual-cased-v2 xx_distiluse_base_multilingual_cased_v2 512 50+ N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_distiluse_base_multilingual_cased_v2-0.1.2.tar.gz#xx_distiluse_base_multilingual_cased_v2-0.1.2
stsb-xlm-r-multilingual xx_stsb_xlm_r_multilingual 768 50+ N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_stsb_xlm_r_multilingual-0.1.2.tar.gz#xx_stsb_xlm_r_multilingual-0.1.2
T-Systems-onsite/cross-en-de-roberta-sentence-transformer xx_cross_en_de_roberta_sentence_transformer 768 en,de N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_cross_en_de_roberta_sentence_transformer-0.1.2.tar.gz#xx_cross_en_de_roberta_sentence_transformer-0.1.2
LaBSE xx_LaBSE 768 109 N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_LaBSE-0.1.2.tar.gz#xx_LaBSE-0.1.2
allenai-specter en_allenai_specter 768 en N/A pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_allenai_specter-0.1.2.tar.gz#en_allenai_specter-0.1.2

If your model is not in this list (e.g., xlm-r-base-en-ko-nli-ststb), you can still use it with this library, but not as a standalone language model. You will need to add a properly configured pipeline stage instead (see the nlp.add_pipe API section below).
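The names and install commands in the table above appear to follow a regular pattern, which can be sketched in a few lines of Python. This is only an illustration inferred from the table rows: the helper names `spacy_model_name` and `standalone_install_command` are hypothetical and not part of this package, and the generated URL only works for models that were actually published as release archives.

```python
# Base URL copied from the install commands in the table above.
RELEASE_BASE = "https://github.com/MartinoMensio/spacy-sentence-bert/releases/download"


def spacy_model_name(sbert_name: str, lang: str = "en") -> str:
    """Guess the spaCy package name for a sentence-BERT model name.

    Assumption (inferred from the table): any organisation prefix is dropped,
    hyphens become underscores, and the language code is prepended.
    """
    short_name = sbert_name.split("/")[-1]
    return f"{lang}_{short_name.replace('-', '_')}"


def standalone_install_command(model_name: str, version: str = "0.1.2") -> str:
    """Build the pip command for a standalone model archive of a given release."""
    archive = f"{model_name}-{version}"
    return f"pip install {RELEASE_BASE}/v{version}/{archive}.tar.gz#{archive}"


print(standalone_install_command(spacy_model_name("stsb-roberta-large")))
```

Multilingual models use the `xx` language code, so for those you would pass `lang="xx"` explicitly.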

Usage

There are three ways to load sentence-BERT models:

  • spacy.load API: you need to have installed one of the models from the table above
  • spacy_sentence_bert.load_model: you can load one of the models from the table above without having installed the standalone packages
  • nlp.add_pipe API: you can load any of the sentence-bert models on top of your nlp object
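The first two options can be combined into a simple fallback: try the standalone package first, then fall back to loading by name through this package. The sketch below is one possible way to do that, not something the library provides. `load_first_available` is a hypothetical helper, and it assumes a loader signals a missing model with `OSError` (which is what `spacy.load` raises when a package is not installed).

```python
from typing import Any, Callable, Iterable


def load_first_available(name: str, loaders: Iterable[Callable[[str], Any]]) -> Any:
    """Return the result of the first loader that can provide `name`.

    Each loader is called with the model name; a loader that raises OSError
    is treated as "model not available here" and the next one is tried.
    """
    last_error = None
    for loader in loaders:
        try:
            return loader(name)
        except OSError as err:
            last_error = err
    raise OSError(f"no loader could provide {name!r}") from last_error


# Intended usage (requires spaCy and spacy-sentence-bert to be installed):
#
#   import spacy
#   import spacy_sentence_bert
#   nlp = load_first_available(
#       "en_stsb_roberta_large",
#       [spacy.load, spacy_sentence_bert.load_model],
#   )
```

Passing the loaders as arguments keeps the helper independent of which packages are installed, so the same function can be exercised with stand-in loaders.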

spacy.load API

Once a standalone model is installed from GitHub (e.g., from the table above, pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2), you can load it directly with the spaCy API:

import spacy
nlp = spacy.load('en_stsb_roberta_large')

spacy_sentence_bert.load_model API

You can obtain the same result without having to install the standalone model, by using this method:

import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_stsb_roberta_large')

nlp.add_pipe API

If you want to use one of the sentence embeddings over an existing Language object, you can use the nlp.add_pipe method. This also works if you want to use a language model that is not listed in the table above. Just make sure that sentence-transformers supports it.

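import spacy
# start from a blank English pipeline (any existing Language object works too)
nlp = spacy.blank('en')
# add the sentence-BERT stage, selecting the model by its sentence-transformers name
nlp.add_pipe('sentence_bert', config={'model_name': 'allenai-specter'})
# check that the stage is now part of the pipeline
print(nlp.pipe_names)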

On first use, each model downloads its sentence-BERT weights to the folder set by the TORCH_HOME environment variable (default ~/.cache/torch).
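To check where the downloaded files would end up, the lookup described above can be mirrored in a small helper. This is a minimal sketch that follows only what this README states (TORCH_HOME, falling back to ~/.cache/torch). `torch_cache_dir` is a hypothetical name, and the actual resolution done by the underlying libraries may take other settings into account.

```python
import os
from pathlib import Path


def torch_cache_dir(environ=None) -> Path:
    """Resolve the download folder as described in this README.

    Uses TORCH_HOME when it is set and non-empty, otherwise ~/.cache/torch.
    `environ` defaults to the real process environment.
    """
    env = os.environ if environ is None else environ
    torch_home = env.get("TORCH_HOME")
    if torch_home:
        return Path(torch_home).expanduser()
    return Path("~/.cache/torch").expanduser()


print(torch_cache_dir())
```

To redirect the downloads, set TORCH_HOME before the model is first loaded, for example with `os.environ["TORCH_HOME"] = "/data/models"` (the path here is only an example).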

Once you have loaded the model, use it through the vector property and the similarity method of spaCy:

# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
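For reference, spaCy's default similarity is the cosine similarity between the two vectors. The sketch below spells that computation out in plain Python. It assumes the default behaviour is in effect (no custom similarity hook), and `cosine_similarity` is an illustrative helper rather than part of spaCy or this package.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero length, where the cosine
    is otherwise undefined.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

With a loaded model, `cosine_similarity(doc_1.vector, doc_2.vector)` should then be close to `doc_1.similarity(doc_2)`, up to floating-point rounding.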

Utils

To build and upload

VERSION=0.1.2
# build the standalone models
./build_models.sh
# build the archive at dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz