bert4vector

Vector computation, storage, retrieval, and similarity calculation


Documentation | Bert4torch | Examples | Source code

1. Installation

  • Install the stable release
pip install bert4vector
  • Install the latest version from source
pip install git+https://github.com/Tongjilibo/bert4vector

2. Quick start

  • Vector computation
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sentences = ['喜欢打篮球的男生喜欢什么样的女生', '西安下雪了?是不是很冷啊?', '第一次去见女朋友父母该如何表现?', '小蝌蚪找妈妈怎么样', '给我推荐一款红色的车', '我喜欢北京']
vecs = model.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)
print(vecs.shape)
# (6, 312)
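
The call above passes normalize_embeddings=False, so the returned vectors are presumably not unit length. If unit vectors are needed later (for example, so that a dot product equals cosine similarity), they can be L2-normalized afterwards. The following is a minimal sketch under that assumption: `l2_normalize` is a hypothetical helper, not part of bert4vector, and random data stands in for the output of `model.encode`.

```python
import numpy as np

def l2_normalize(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of `vecs` to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)

# Stand-in for the output of model.encode(...): 6 sentences, 312 dimensions.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(6, 312)).astype(np.float32)

unit_vecs = l2_normalize(vecs)
print(unit_vecs.shape)
# Every row should now have a norm of (approximately) 1.
print(np.allclose(np.linalg.norm(unit_vecs, axis=1), 1.0, atol=1e-5))
```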
  • Similarity calculation
from bert4vector.core import BertSimilarity
text2vec = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sent1 = ['你好', '天气不错']
sent2 = ['你好啊', '天气很好']
similarity = text2vec.similarity(sent1, sent2)
print(similarity)
# [[0.9075422  0.42991278]
#  [0.19584633 0.72635853]]
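
The printed result is a 2×2 matrix in which entry [i][j] compares sent1[i] with sent2[j]. To make that layout concrete, here is a small sketch of a pairwise cosine-similarity matrix computed with NumPy. It assumes cosine similarity as the metric (the library's actual default is not stated here), and `cosine_similarity_matrix` is a hypothetical helper operating on hand-written toy vectors rather than real embeddings.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return a matrix whose entry [i, j] is the cosine similarity of a[i] and b[j]."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings" for two lists of two sentences each.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 1.0], [0.0, 2.0]])

scores = cosine_similarity_matrix(a, b)
print(scores.shape)  # (2, 2): rows follow `a`, columns follow `b`
print(scores)
```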
  • Vector storage and retrieval
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
model.add_corpus(['你好', '我选你', '天气不错', '人很好看'])
print(model.search('你好'))
# {'你好': [{'corpus_id': 0, 'score': 0.9999, 'text': '你好'},
#           {'corpus_id': 3, 'score': 0.5694, 'text': '人很好看'}]} 
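
Conceptually, an in-memory search like the one above scores the query against every stored vector and returns the best matches. The sketch below illustrates that idea with NumPy and mirrors the result fields shown in the output (corpus_id, score, text). It is an illustration only, not the library's implementation: `search_topk` is a hypothetical helper, and the 3-dimensional vectors are hand-written stand-ins for real embeddings.

```python
import numpy as np

def search_topk(query_vec, corpus_vecs, corpus_texts, topk=2):
    """Rank corpus entries by cosine similarity to the query and keep the top `topk`."""
    corpus_unit = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = corpus_unit @ query_unit
    best = np.argsort(-scores)[:topk]  # indices of the highest scores first
    return [
        {'corpus_id': int(i), 'score': float(scores[i]), 'text': corpus_texts[i]}
        for i in best
    ]

corpus_texts = ['你好', '我选你', '天气不错', '人很好看']
# Toy vectors (assumed for illustration; a real model would produce these).
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

hits = search_topk(np.array([1.0, 0.0, 0.0]), corpus_vecs, corpus_texts)
print(hits)
```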
  • API deployment
from bert4vector.pipelines import SimilaritySever
server = SimilaritySever('/data/pretrain_ckpt/embedding/BAAI--bge-base-zh-v1.5')
port = 8000  # example value (not from the original); use any free port
server.run(port=port)
# For how to call the API, see './examples/api.py'

3. Supported sentence-embedding weights

| Model category | Model name | Weight source | Weight links | Notes (if any) |
| --- | --- | --- | --- | --- |
| simbert | simbert | Zhuiyi Technology (追一科技) | Tongjilibo/simbert-chinese-base, Tongjilibo/simbert-chinese-small, Tongjilibo/simbert-chinese-tiny | |
| | simbert_v2/roformer-sim | Zhuiyi Technology (追一科技) | junnyu/roformer_chinese_sim_char_base, junnyu/roformer_chinese_sim_char_ft_base, junnyu/roformer_chinese_sim_char_small, junnyu/roformer_chinese_sim_char_ft_small | roformer_chinese_sim_char_base, roformer_chinese_sim_char_ft_base, roformer_chinese_sim_char_small, roformer_chinese_sim_char_ft_small |
| embedding | text2vec-base-chinese | shibing624 | shibing624/text2vec-base-chinese | text2vec-base-chinese |
| | m3e | moka-ai | moka-ai/m3e-base | m3e-base |
| | bge | BAAI | BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-base-zh-v1.5, BAAI/bge-small-en-v1.5, BAAI/bge-small-zh-v1.5 | bge-large-en-v1.5, bge-large-zh-v1.5, bge-base-en-v1.5, bge-base-zh-v1.5, bge-small-en-v1.5, bge-small-zh-v1.5 |
| | gte | thenlper | thenlper/gte-large-zh, thenlper/gte-base-zh | gte-base-zh, gte-large-zh |

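*Notes:

  1. Weights shown in highlighted format (e.g. Tongjilibo/simbert-chinese-small) can be downloaded directly over the network.
  2. To speed up downloads from within mainland China, use a mirror site in one of these ways:
    • HF_ENDPOINT=https://hf-mirror.com python your_script.py
    • Run export HF_ENDPOINT=https://hf-mirror.com first, then run your Python code
    • Set the variable at the top of your Python code, as follows: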
    import os
    os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
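
A small variation on the in-code setting, offered as a sketch rather than a recommendation from the project: `os.environ.setdefault` only applies the mirror when HF_ENDPOINT has not already been set in the shell, so an existing choice is not overwritten. The variable is typically read when the Hugging Face client library is imported, so placing this before any such import is the safe option.

```python
import os

# Keep a value that was already exported in the shell; otherwise use the mirror.
os.environ.setdefault('HF_ENDPOINT', 'https://hf-mirror.com')

# Import bert4vector (or any other library that downloads from Hugging Face)
# only after this point, so the endpoint is in place before any download starts.
print(os.environ['HF_ENDPOINT'])
```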

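4. Version history

| Update date | bert4vector | Release notes |
| --- | --- | --- |
| 20240710 | 0.0.4 | Added longest-common-subsequence lexical recall; some features can now be used without installing torch |
| 20240628 | 0.0.3 | Added several lexical recall methods; added API deployment |
| 20240131 | 0.0.2.post2 | Removed the dependency on a specific bert4torch version |
| 20231228 | 0.0.2 | Initial release; supports in-memory and faiss modes |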
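
For readers unfamiliar with lexical recall based on the longest common subsequence (LCS), the sketch below shows the general idea in plain Python, with no torch dependency: texts are ranked by how long a character subsequence they share with the query. This is not bert4vector's implementation. The helper names and the normalization 2·LCS / (len(query) + len(text)) are assumptions chosen for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of `a` and `b` (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(query: str, text: str) -> float:
    """Normalize the LCS length to [0, 1]; this normalization is an illustrative choice."""
    if not query or not text:
        return 0.0
    return 2 * lcs_length(query, text) / (len(query) + len(text))

corpus = ['你好', '我选你', '天气不错', '人很好看']
# Rank the corpus by lexical overlap with the query, best match first.
ranked = sorted(corpus, key=lambda t: lcs_score('你好', t), reverse=True)
print(ranked)
```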

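5. Update history:

  • 20240710: Added longest-common-subsequence lexical recall; some features can now be used without installing torch
  • 20240628: Added several lexical recall methods; added API deployment
  • 20231228: Initial release; supports in-memory and faiss modes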

6. Reference