Upgrading from static vectors to contextual vectors #12915

Gitclop · 2023-08-15T11:38:05Z


Hey, I have built a semantic-similarity pipeline using the following steps:
Load a set of n documents. For each document, do the following:

Tokenisation -> Bag-of-Words
Generate document-vector from the bag-of-words
Build an Annoy (approximate nearest neighbour) index and find the x closest documents using Euclidean distance
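The last step above can be sketched roughly as follows. This is an illustration, not the poster's actual code: the function names `exact_neighbours` and `annoy_neighbours` and the `n_trees=10` default are assumptions made for the example. The brute-force version gives exact results on small samples, which can help when checking what the approximate index returns.

```python
import math


def exact_neighbours(vectors, query_id, x):
    """Brute-force Euclidean search: ids of the x documents closest to query_id."""
    query = vectors[query_id]
    others = [i for i in range(len(vectors)) if i != query_id]
    others.sort(key=lambda i: math.dist(query, vectors[i]))
    return others[:x]


def annoy_neighbours(vectors, query_id, x, n_trees=10):
    """Approximate search with Annoy (hypothetical helper, n_trees is an assumed default)."""
    # Imported lazily so the exact baseline above runs without Annoy installed.
    from annoy import AnnoyIndex

    index = AnnoyIndex(len(vectors[0]), "euclidean")
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(n_trees)
    # Ask for x + 1 hits because the query document is usually its own nearest neighbour.
    hits = index.get_nns_by_item(query_id, x + 1)
    return [i for i in hits if i != query_id][:x]
```

Both functions expect one flat, fixed-length vector per document, which is what matters when swapping in a different embedding model.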

The static vectors have been trained with word2vec, and I want to compare the accuracy of my similarity task between those static vectors and different transformer models.

For the static vectors, I use a trained spaCy model and this code:
data.loc[:, ('Vektoren')] = data['bag_of_words'].map(lambda s: nlp(s).vector)

Changing my vectors within the pipeline is probably not as easy as:

nlp = spacy.load('de_dep_news_trf')
data.loc[:, ('Vektoren')] = data['bag_of_words'].map(lambda s: nlp(s)._.trf_data.tensors)

So how would I upgrade to transformer models? Is it still useful to extract the document vector, or are there better ways of finding the x nearest neighbours/documents?

rmitsch · 2023-08-16T10:10:12Z

Maintainer

Hi @Gitclop, if you just want to extract transformer embeddings and copy them into your dataframe, you are pretty close already:

nlp = spacy.load('de_dep_news_trf')
data.loc[:, ('Vektoren')] = data['bag_of_words'].map(lambda s: nlp(s)._.trf_data.tensors[1])
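One thing worth checking before passing these values to Annoy: the index needs one flat, fixed-length vector per document, while `tensors[1]` is likely a 2D array with one row per span the transformer processed. Assuming a shape of `(n_spans, width)`, which can differ by model and spacy-transformers version, a minimal way to reduce it to a single vector is to average the rows. The helper name `mean_pool` is made up for this sketch.

```python
def mean_pool(rows):
    """Average a sequence of equally sized rows into one flat vector.

    Assumes `rows` is 2D, e.g. shape (n_spans, width); works on nested
    lists as well as on array types that iterate row by row.
    """
    rows = [list(r) for r in rows]
    if not rows:
        raise ValueError("need at least one row")
    # zip(*rows) walks the columns, so each output entry is a column mean.
    return [sum(col) / len(rows) for col in zip(*rows)]
```

In the line above, this would mean mapping `mean_pool(nlp(s)._.trf_data.tensors[1])` instead of the raw array. Printing the shape of the array for a short and a long document first is a quick way to confirm the assumption.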

> Is it useful to still extract the document-vector or are there better ways of finding the x nearest neighbours/documents?

Yes, as long as the underlying model has been trained to reflect document similarity in its embeddings. Note that the pretrained transformer pipelines offered by spaCy have not been trained for this. For comparing documents by their embeddings, we recommend sentence-transformers.
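A minimal sketch of what that could look like, with several assumptions: the helper `embed_documents` and its `encode` parameter are invented for this example, and the default model name `paraphrase-multilingual-MiniLM-L12-v2` is just one multilingual sentence-transformers model, not a recommendation from this thread. The `encode` hook lets the function be tried out with a stand-in encoder before downloading a model.

```python
def embed_documents(texts, encode=None,
                    model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Return one embedding (a list of floats) per input text.

    `encode` is a hypothetical hook: any callable mapping a list of strings
    to a sequence of vectors. If omitted, a sentence-transformers model is loaded.
    """
    if encode is None:
        # Imported lazily so the helper can be exercised without the package.
        from sentence_transformers import SentenceTransformer

        encode = SentenceTransformer(model_name).encode
    # Convert each row to plain floats so the result is easy to store or index.
    return [[float(value) for value in vector] for vector in encode(list(texts))]
```

With the dataframe from the question, the result could be stored via `data['Vektoren'] = pd.Series(embed_documents(data['bag_of_words'].tolist()), index=data.index)`. These models are trained on natural sentences, so the original text may work better as input than a bag-of-words string.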

2 replies

Gitclop Aug 17, 2023
Author

Thanks a lot!
I tried loading the spaCy sentence models from https://github.com/MartinoMensio/spacy-sentence-bert, but every model I tried downgrades my spaCy installation to 3.0.9. Are there any up-to-date models available for spaCy?

rmitsch Aug 17, 2023
Maintainer

We don't maintain this project, so I suggest asking there. Of course, you can always infer the sentence embeddings for your texts without spaCy and add them to your dataframe, if that's all you need.
