[Feature][Vector Store]Support Apache Doris as vector store #17527

Merged
merged 6 commits into from
Feb 18, 2024
Changes from 5 commits
21 changes: 21 additions & 0 deletions docs/docs/integrations/providers/apache_doris.mdx
@@ -0,0 +1,21 @@
# Apache Doris

>[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.
It delivers lightning-fast analytics on real-time data at scale.

>`Apache Doris` is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Because it has a fast vectorized execution engine, it can also be used as a fast vector database.

## Installation and Setup


```bash
pip install pymysql
```

## Vector Store

See a [usage example](/docs/integrations/vectorstores/apache_doris).

```python
from langchain_community.vectorstores import ApacheDoris
```
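
A minimal sketch of wiring the import above to a running instance, assuming the settings fields (`host`, `port`, `username`, `password`, `database`) and the `ApacheDoris(embeddings, settings)` constructor call used in the usage-example notebook. The helper names `doris_connection_params` and `make_doris_store`, and the `127.0.0.1` host, are hypothetical placeholders and not part of the library.

```python
from typing import Any, Dict


def doris_connection_params(
    host: str = "127.0.0.1",  # placeholder host (assumption); use your own Doris host
    port: int = 9030,  # port used in the usage-example notebook
    username: str = "root",
    password: str = "",
    database: str = "default",
) -> Dict[str, Any]:
    """Collect connection parameters in one place (hypothetical helper)."""
    return {
        "host": host,
        "port": port,
        "username": username,
        "password": password,
        "database": database,
    }


def make_doris_store(embeddings: Any, params: Dict[str, Any]) -> Any:
    """Build an ApacheDoris store from a params dict (hypothetical helper)."""
    # Imported lazily so this file loads even without langchain_community installed.
    from langchain_community.vectorstores.apache_doris import (
        ApacheDoris,
        ApacheDorisSettings,
    )

    settings = ApacheDorisSettings()
    for field, value in params.items():
        # Same attribute assignments the notebook makes one by one.
        setattr(settings, field, value)
    return ApacheDoris(embeddings, settings)
```

Calling `make_doris_store` needs a reachable Apache Doris instance and the `pymysql` package installed above; `doris_connection_params` alone has no external dependencies.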
322 changes: 322 additions & 0 deletions docs/docs/integrations/vectorstores/apache_doris.ipynb
@@ -0,0 +1,322 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "84180ad0-66cd-43e5-b0b8-2067a29e16ba",
"metadata": {
"collapsed": false
},
"source": [
"# Apache Doris\n",
"\n",
">[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.\n",
"It delivers lightning-fast analytics on real-time data at scale.\n",
"\n",
">`Apache Doris` is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Because it has a fast vectorized execution engine, it can also be used as a fast vector database.\n",
"\n",
"Here we'll show how to use the Apache Doris Vector Store."
]
},
{
"cell_type": "markdown",
"id": "1685854f",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet pymysql"
]
},
{
"cell_type": "markdown",
"id": "2c891bba",
"metadata": {},
"source": [
"Set `update_vectordb = False` at the beginning. If no docs have been updated, there is no need to rebuild their embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4e6ca20-79dd-482a-8f68-af9d7dd59c7c",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!pip install sqlalchemy\n",
"!pip install langchain"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "96f7c7a2-4811-4fdf-87f5-c60772f51fe1",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:54:01.392500Z",
"start_time": "2024-02-14T12:53:58.866615Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.text_splitter import TokenTextSplitter\n",
"from langchain_community.document_loaders import (\n",
"    DirectoryLoader,\n",
"    UnstructuredMarkdownLoader,\n",
")\n",
"from langchain_community.vectorstores.apache_doris import (\n",
"    ApacheDoris,\n",
"    ApacheDorisSettings,\n",
")\n",
"from langchain_openai import OpenAI, OpenAIEmbeddings\n",
"\n",
"update_vectordb = False"
]
},
{
"cell_type": "markdown",
"id": "ee821c00",
"metadata": {},
"source": [
"## Load docs and split them into tokens"
]
},
{
"cell_type": "markdown",
"id": "34ba0cfd",
"metadata": {},
"source": [
"Load all markdown files under the `docs` directory.\n",
"\n",
"For the Apache Doris documentation, you can clone the repo from https://github.com/apache/doris; it contains a `docs` directory."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "799edf20-bcf4-4a65-bff7-b907f6bdba20",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:55:24.128917Z",
"start_time": "2024-02-14T12:55:19.463831Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"loader = DirectoryLoader(\n",
"    \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n",
")\n",
"documents = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "b415fe2a",
"metadata": {},
"source": [
"Split docs into tokens, and set `update_vectordb = True` because there are new docs/tokens."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0dc5ba83-62ef-4f61-a443-e872f251e7da",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# load text splitter and split docs into snippets of text\n",
"text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
"split_docs = text_splitter.split_documents(documents)\n",
"\n",
"# tell vectordb to update text embeddings\n",
"update_vectordb = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46966e25-9449-4a36-87d1-c0b25dce2994",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"split_docs[-20]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99422e95-b407-43eb-aa68-9a62363fc82f",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))"
]
},
{
"cell_type": "markdown",
"id": "e780d77f-3f96-4690-a10f-f87566f7ccc6",
"metadata": {
"collapsed": false
},
"source": [
"## Create vectordb instance"
]
},
{
"cell_type": "markdown",
"id": "15702d9c",
"metadata": {},
"source": [
"### Use Apache Doris as vectordb"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ced7dbe1",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:55:39.508287Z",
"start_time": "2024-02-14T12:55:39.500370Z"
}
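},
"outputs": [],
"source": [
"def gen_apache_doris(update_vectordb, embeddings, settings):\n",
"    if update_vectordb:\n",
"        # docs changed: embed the split docs and write them to Apache Doris\n",
"        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)\n",
"    else:\n",
"        # nothing changed: connect to the existing table without re-embedding\n",
"        docsearch = ApacheDoris(embeddings, settings)\n",
"    return docsearch"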
]
},
{
"cell_type": "markdown",
"id": "15d86fda",
"metadata": {},
"source": [
"## Convert tokens into embeddings and put them into vectordb"
]
},
{
"cell_type": "markdown",
"id": "ff1322ea",
"metadata": {},
"source": [
"Here we use Apache Doris as the vectordb. You can configure the Apache Doris instance via `ApacheDorisSettings`.\n",
"\n",
"Configuring an Apache Doris instance is much like configuring a MySQL instance. You need to specify:\n",
"1. host/port\n",
"2. username (default: 'root')\n",
"3. password (default: '')\n",
"4. database (default: 'default')\n",
"5. table (default: 'langchain')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b34f8c31-c173-4902-8168-2e838ddfb9e9",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:56:02.671291Z",
"start_time": "2024-02-14T12:55:48.350294Z"
},
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c53ab3f2-9e34-4424-8b07-6292bde67e14",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"update_vectordb = True\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# configure Apache Doris settings(host/port/user/pw/db)\n",
"settings = ApacheDorisSettings()\n",
"settings.port = 9030\n",
"settings.host = \"172.30.34.130\"\n",
"settings.username = \"root\"\n",
"settings.password = \"\"\n",
"settings.database = \"langchain\"\n",
"docsearch = gen_apache_doris(update_vectordb, embeddings, settings)\n",
"\n",
"print(docsearch)\n",
"\n",
"update_vectordb = False"
]
},
{
"cell_type": "markdown",
"id": "bde66626",
"metadata": {},
"source": [
"## Build a QA chain and ask it a question"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84921814",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI()\n",
"qa = RetrievalQA.from_chain_type(\n",
"    llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
")\n",
"query = \"what is apache doris\"\n",
"resp = qa.run(query)\n",
"print(resp)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
9 changes: 9 additions & 0 deletions libs/community/langchain_community/vectorstores/__init__.py
@@ -74,6 +74,12 @@ def _import_annoy() -> Any:
    return Annoy


def _import_apache_doris() -> Any:
    from langchain_community.vectorstores.apache_doris import ApacheDoris

    return ApacheDoris


def _import_atlas() -> Any:
    from langchain_community.vectorstores.atlas import AtlasDB

@@ -497,6 +503,8 @@ def __getattr__(name: str) -> Any:
        return _import_elastic_vector_search()
    elif name == "Annoy":
        return _import_annoy()
    elif name == "ApacheDoris":
        return _import_apache_doris()
    elif name == "AtlasDB":
        return _import_atlas()
    elif name == "AwaDB":
@@ -640,6 +648,7 @@ def __getattr__(name: str) -> Any:
    "AlibabaCloudOpenSearchSettings",
    "AnalyticDB",
    "Annoy",
    "ApacheDoris",
    "AtlasDB",
    "AwaDB",
    "AzureSearch",
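
The `__init__.py` hunks register `ApacheDoris` in a lazy-import table: the submodule is imported only when the name is first looked up on the package. A minimal, self-contained sketch of that pattern (module-level `__getattr__`, PEP 562) is shown here. The stand-in module `lazy_demo` and the use of the standard-library `OrderedDict` in place of a vector store class are illustrative assumptions, not code from this PR.

```python
import sys
import types
from typing import Any

# Stand-in for a package such as langchain_community.vectorstores (assumption).
lazy = types.ModuleType("lazy_demo")


def _import_ordered_dict() -> Any:
    # The real import happens only when this helper is called.
    from collections import OrderedDict

    return OrderedDict


def _module_getattr(name: str) -> Any:
    # Called by Python only when normal attribute lookup on the module fails.
    if name == "OrderedDict":
        return _import_ordered_dict()
    raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")


# A module-level __getattr__ is what enables the lazy lookup (PEP 562).
lazy.__getattr__ = _module_getattr
sys.modules["lazy_demo"] = lazy

# Attribute access triggers the deferred import.
cls = lazy.OrderedDict
print(cls.__name__)
```

Each new class needs three touch points, mirroring the three hunks in the diff: an `_import_*` helper, a branch in `__getattr__`, and an entry in `__all__`.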