v0.0.4 instructions and pypi info corrected
MartinoMensio committed Jul 24, 2020
1 parent 2e91f95 commit de52b38
Showing 20 changed files with 105 additions and 67 deletions.
98 changes: 68 additions & 30 deletions README.md
@@ -13,20 +13,53 @@ To install this package, you can run one of the following:
- `pip install spacy_sentence_bert`
- `pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git`

You can also install standalone spaCy packages from GitHub with pip.
The table below describes the available models, taken from the [full list of models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0).


| sentence-BERT name | spacy model name | dimensions | language | STS benchmark | standalone install |
|----------------------------------------|--------------------|----------------------|------------|---------------|---------|
| `bert-base-nli-mean-tokens` | `en_bert_base_nli_mean_tokens` | 768 | en | 77.12 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_mean_tokens-0.0.4` |
| `bert-base-nli-max-tokens` | `en_bert_base_nli_max_tokens` | 768 | en | 77.21 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_max_tokens-0.0.4.tar.gz#en_bert_base_nli_max_tokens-0.0.4` |
| `bert-base-nli-cls-token` | `en_bert_base_nli_cls_token` | 768 | en | 76.30 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_cls_token-0.0.4.tar.gz#en_bert_base_nli_cls_token-0.0.4` |
| `bert-large-nli-mean-tokens` | `en_bert_large_nli_mean_tokens` | 1024 | en | 79.19 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_mean_tokens-0.0.4` |
| `bert-large-nli-max-tokens` | `en_bert_large_nli_max_tokens` | 1024 | en | 78.41 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_max_tokens-0.0.4.tar.gz#en_bert_large_nli_max_tokens-0.0.4` |
| `bert-large-nli-cls-token` | `en_bert_large_nli_cls_token` | 1024 | en | 78.29 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_cls_token-0.0.4.tar.gz#en_bert_large_nli_cls_token-0.0.4` |
| `roberta-base-nli-mean-tokens` | `en_roberta_base_nli_mean_tokens` | 768 | en | 77.49 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_mean_tokens-0.0.4` |
| `roberta-large-nli-mean-tokens` | `en_roberta_large_nli_mean_tokens` | 1024 | en | 78.69 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_mean_tokens-0.0.4` |
| `distilbert-base-nli-mean-tokens` | `en_distilbert_base_nli_mean_tokens` | 768 | en | 76.97 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_mean_tokens-0.0.4` |
| `bert-base-nli-stsb-mean-tokens` | `en_bert_base_nli_stsb_mean_tokens` | 768 | en | 85.14 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_base_nli_stsb_mean_tokens-0.0.4` |
| `bert-large-nli-stsb-mean-tokens` | `en_bert_large_nli_stsb_mean_tokens` | 1024 | en | 85.29 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_bert_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_bert_large_nli_stsb_mean_tokens-0.0.4` |
| `roberta-base-nli-stsb-mean-tokens` | `en_roberta_base_nli_stsb_mean_tokens` | 768 | en | 85.40 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_base_nli_stsb_mean_tokens-0.0.4` |
| `roberta-large-nli-stsb-mean-tokens` | `en_roberta_large_nli_stsb_mean_tokens` | 1024 | en | 86.31 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz#en_roberta_large_nli_stsb_mean_tokens-0.0.4` |
| `distilbert-base-nli-stsb-mean-tokens` | `en_distilbert_base_nli_stsb_mean_tokens` | 768 | en | 84.38 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_distilbert_base_nli_stsb_mean_tokens-0.0.4.tar.gz#en_distilbert_base_nli_stsb_mean_tokens-0.0.4` |
| `distiluse-base-multilingual-cased` | `xx_distiluse_base_multilingual_cased` | 512 | Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish | 80.10 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_distiluse_base_multilingual_cased-0.0.4.tar.gz#xx_distiluse_base_multilingual_cased-0.0.4` |
| `xlm-r-base-en-ko-nli-ststb` | `xx_xlm_r_base_en_ko_nli_ststb` | 768 | en,ko | 81.47 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_base_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_base_en_ko_nli_ststb-0.0.4` |
| `xlm-r-large-en-ko-nli-ststb` | `xx_xlm_r_large_en_ko_nli_ststb` | 1024 | en,ko | 84.05 | `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/xx_xlm_r_large_en_ko_nli_ststb-0.0.4.tar.gz#xx_xlm_r_large_en_ko_nli_ststb-0.0.4` |
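
The install commands in the table follow a single pattern, so they can be generated instead of copied by hand. The sketch below is only an illustration: `spacy_model_name` and `standalone_url` are hypothetical helper names that are not part of the package, and the URL layout is assumed from the table rows rather than taken from any documented API.

```python
def spacy_model_name(sbert_name: str, lang: str = "en") -> str:
    """Map a sentence-BERT name to the spaCy model name used in the table."""
    # e.g. 'bert-base-nli-cls-token' -> 'en_bert_base_nli_cls_token'
    return f"{lang}_{sbert_name.replace('-', '_')}"


def standalone_url(model_name: str, version: str = "0.0.4") -> str:
    """Build the release URL of a standalone model (layout assumed from the table)."""
    base = "https://github.com/MartinoMensio/spacy-sentence-bert/releases/download"
    archive = f"{model_name}-{version}"
    return f"{base}/v{version}/{archive}.tar.gz#{archive}"


if __name__ == "__main__":
    name = spacy_model_name("bert-base-nli-cls-token")
    print("pip install " + standalone_url(name))
```

Multilingual models use the `xx` prefix in the table, so they would be mapped with `spacy_model_name(..., lang="xx")`.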



## Usage

With this package installed
With this package installed you can obtain a Language model with:

```python
import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_bert_base_nli_cls_token')
nlp = spacy_sentence_bert.load_model('en_roberta_large_nli_stsb_mean_tokens')
```

Or if a specific standalone model is installed from GitHub, you can load it from spaCy:
```bash
pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.0.4/en_roberta_large_nli_stsb_mean_tokens-0.0.4.tar.gz
```

Or if a specific model is installed (e.g. `pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/en_bert_base_nli_cls_token-0.1.0/en_bert_base_nli_cls_token-0.1.0.tar.gz`)
```python
import spacy
nlp = spacy.load('en_bert_base_nli_cls_token')
nlp = spacy.load('en_roberta_large_nli_stsb_mean_tokens')
```

Or if you want to use one of the sentence embeddings over an existing Language object, you can use the `create_from` method:

```python
import spacy
import spacy_sentence_bert
@@ -35,30 +68,35 @@ nlp = spacy_sentence_bert.create_from(nlp_base, 'en_bert_base_nli_cls_token')
nlp.pipe_names
```

Once you have loaded the model, use it to obtain `vector`s and to compare texts with spaCy's `similarity` method:

```python
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
```
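
spaCy's default `similarity` is the cosine of the two vectors. As a rough illustration of what that computes, here is a dependency-free sketch; `cosine_similarity` is a hypothetical helper, and whether this package keeps spaCy's default similarity unchanged is an assumption, not something stated in this README.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

With a loaded model, `cosine_similarity(doc_1.vector, doc_2.vector)` would be expected to land close to `doc_1.similarity(doc_2)` if that assumption holds.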




The models, when first used, download to the folder defined with `TORCH_HOME` in the environment variables (default `~/.cache/torch`).
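
To check where the downloads are likely to end up, the lookup described above can be sketched as follows. `torch_cache_dir` is a hypothetical helper written for this example; it only mirrors the rule stated in the sentence above and does not call into PyTorch.

```python
import os
from pathlib import Path


def torch_cache_dir() -> Path:
    """Return the folder named by TORCH_HOME, or the default from the README."""
    default = Path.home() / ".cache" / "torch"
    # An unset or empty variable falls back to the default location.
    return Path(os.environ.get("TORCH_HOME") or default)


if __name__ == "__main__":
    print(torch_cache_dir())
```

Setting `TORCH_HOME` before the first model load should redirect the download, assuming the variable is read at load time.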


## Utils

Full list of models
https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0


| sentence-BERT name | spacy model name | dimensions | language | STS benchmark |
|----------------------------------------|--------------------|--------------|------------|---|
| `bert-base-nli-mean-tokens` | `en_bert_base_nli_mean_tokens` | 768 | en | 77.12 |
| `bert-base-nli-max-tokens` | `en_bert_base_nli_max_tokens` | 768 | en | 77.21 |
| `bert-base-nli-cls-token` | `en_bert_base_nli_cls_token` | 768 | en | 76.30 |
| `bert-large-nli-mean-tokens` | `en_bert_large_nli_mean_tokens` | 1024 | en | 79.19 |
| `bert-large-nli-max-tokens` | `en_bert_large_nli_max_tokens` | 1024 | en | 78.41 |
| `bert-large-nli-cls-token` | `en_bert_large_nli_max_tokens` | 1024 | en | 78.29 |
| `roberta-base-nli-mean-tokens` | `en_roberta_base_nli_mean_tokens` | 768 | en | 77.49 |
| `roberta-large-nli-mean-tokens` | `en_roberta_large_nli_mean_tokens` | 1024 | en | 78.69 |
| `distilbert-base-nli-mean-tokens` | `en_distilbert_base_nli_mean_tokens` | 768 | en | 76.97 |
| `bert-base-nli-stsb-mean-tokens` | `en_bert_base_nli_stsb_mean_tokens` | 768 | en | 85.14 |
| `bert-large-nli-stsb-mean-tokens` | `en_bert_large_nli_stsb_mean_tokens` | 1024 | en | 85.29 |
| `roberta-base-nli-stsb-mean-tokens` | `en_roberta_base_nli_stsb_mean_tokens` | 768 | en | 85.40 |
| `roberta-large-nli-stsb-mean-tokens` | `en_roberta_large_nli_stsb_mean_tokens` | 1024 | en | 86.31 |
| `distilbert-base-nli-stsb-mean-tokens` | `en_distilbert_base_nli_stsb_mean_tokens` | 768 | en | 84.38 |
| `distiluse-base-multilingual-cased` | `xx_distiluse_base_multilingual_cased` | 512 | Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish | 80.10 |
| `xlm-r-base-en-ko-nli-ststb` | `xx_xlm_r_base_en_ko_nli_ststb` | 768 | en,ko | 81.47 |
| `xlm-r-large-en-ko-nli-ststb` | `xx_xlm_r_base_en_ko_nli_ststb` | 1024 | en,ko | 84.05 |


The models, when first used, download to the folder defined with `TORCH_HOME` in the environment variables (default `~/.cache/torch`).
To build and upload:
```bash
VERSION=0.0.4
# build the standalone models
./build_models.sh
# build dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz
```
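
A release like this one touches the version string in `build_models.sh`, `setup.cfg` and every model meta JSON file. The following sketch shows one way the meta files could be kept in sync; `bump_meta` is a hypothetical helper and not part of the repository, and it assumes the meta layout shown in the JSON hunks of this commit.

```python
import json


def bump_meta(meta: dict, version: str) -> dict:
    """Return a copy of a model meta dict with its version fields updated."""
    pin = "spacy-sentence-bert=="
    updated = dict(meta)
    updated["version"] = version
    # Re-pin only the main package; any other requirement is left untouched.
    updated["requirements"] = [
        f"{pin}{version}" if req.startswith(pin) else req
        for req in meta.get("requirements", [])
    ]
    return updated


if __name__ == "__main__":
    old = {
        "lang": "en",
        "name": "bert_base_nli_cls_token",
        "version": "0.0.3",
        "requirements": ["spacy-sentence-bert==0.0.3"],
    }
    print(json.dumps(bump_meta(old, "0.0.4"), indent=2))
```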
2 changes: 1 addition & 1 deletion build_models.sh
@@ -1,7 +1,7 @@
set -e


VERSION=0.0.3
VERSION=0.0.4

for MODEL_NAME in en_bert_base_nli_cls_token en_bert_base_nli_max_tokens en_bert_base_nli_mean_tokens \
en_bert_large_nli_mean_tokens en_bert_large_nli_max_tokens en_bert_large_nli_cls_token \
4 changes: 2 additions & 2 deletions setup.cfg
@@ -1,8 +1,8 @@
[metadata]
version = 0.0.3
version = 0.0.4
description = SpaCy models for using sentence-BERT
description-file = README.md
url = https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub
url = https://github.com/MartinoMensio/spacy-sentence-bert
author = Martino Mensio
author_email = martino.mensio@open.ac.uk

4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_base_nli_cls_token.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_base_nli_cls_token",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_base_nli_max_tokens.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_base_nli_max_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_base_nli_mean_tokens.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_base_nli_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_base_nli_stsb_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_large_nli_cls_token.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_large_nli_cls_token",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_large_nli_max_tokens.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_large_nli_max_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_bert_large_nli_mean_tokens.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_large_nli_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "bert_large_nli_stsb_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "distilbert_base_nli_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "distilbert_base_nli_stsb_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
4 changes: 2 additions & 2 deletions spacy_sentence_bert/meta/en_roberta_base_nli_mean_tokens.json
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "roberta_base_nli_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "roberta_base_nli_stsb_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "roberta_large_nli_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",
@@ -1,15 +1,15 @@
{
"lang": "en",
"name": "roberta_large_nli_stsb_mean_tokens",
"version": "0.0.3",
"version": "0.0.4",
"spacy_version": ">=2.3,<2.4",
"description": "Wrapper of sentence-transformers models for spaCy",
"author": "Martino Mensio",
"email": "martino.mensio@open.ac.uk",
"url": "https://github.com/MartinoMensio/spacy-sentence-bert",
"license": "MIT",
"requirements": [
"spacy-sentence-bert==0.0.3"
"spacy-sentence-bert==0.0.4"
],
"sources": [{
"name": "sentence-transformers",