[Feature][Vector Store]Support Apache Doris as vector store #17527

Merged
merged 6 commits into from
Feb 18, 2024
Changes from 3 commits
21 changes: 21 additions & 0 deletions docs/docs/integrations/providers/apache_doris.mdx
@@ -0,0 +1,21 @@
# Apache Doris

>[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.
It delivers lightning-fast analytics on real-time data at scale.

>Usually `Apache Doris` is categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it can also be used as a fast vector database.

## Installation and Setup


```bash
pip install pymysql
```

## Vector Store

See a [usage example](/docs/integrations/vectorstores/apache_doris).

```python
from langchain_community.vectorstores import ApacheDoris
```
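
As a rough sketch of how the store is configured: the usage notebook sets connection fields on an `ApacheDorisSettings` object and passes it as `config`. The `DorisConn` dataclass and `to_settings` helper below are hypothetical names introduced here for illustration, not part of `langchain_community`. The field values mirror those listed in the notebook (port 9030 is simply the value used there), and the host is a placeholder. The LangChain import is deferred so the snippet loads without the package or a Doris instance.

```python
from dataclasses import asdict, dataclass


@dataclass
class DorisConn:
    """Hypothetical holder for connection fields (not a LangChain class)."""

    host: str
    port: int = 9030  # value used in the usage notebook
    username: str = "root"
    password: str = ""
    database: str = "default"
    table: str = "langchain"


def to_settings(conn: DorisConn):
    """Copy the fields onto an ApacheDorisSettings object.

    The import is deferred so this module loads even when
    langchain_community is not installed.
    """
    from langchain_community.vectorstores.apache_doris import ApacheDorisSettings

    settings = ApacheDorisSettings()
    for name, value in asdict(conn).items():
        # Attribute assignment, as the notebook does for host, port, etc.
        setattr(settings, name, value)
    return settings


# Placeholder host; replace with the address of your Doris frontend.
conn = DorisConn(host="127.0.0.1", database="langchain")
print(asdict(conn))
```

With a running Doris instance, passing `config=to_settings(conn)` to `ApacheDoris.from_documents` would follow the same pattern as the notebook.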
362 changes: 362 additions & 0 deletions docs/docs/integrations/vectorstores/apache_doris.ipynb
@@ -0,0 +1,362 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "59723cea",
"metadata": {},
"source": [
"# Apache Doris\n",
"\n",
">[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.\n",
"It delivers lightning-fast analytics on real-time data at scale.\n",
"\n",
">Usually `Apache Doris` is categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it can also be used as a fast vector database.\n",
"\n",
"Here we'll show how to use the Apache Doris Vector Store."
]
},
{
"cell_type": "markdown",
"id": "1685854f",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet pymysql"
]
},
{
"cell_type": "markdown",
"id": "2c891bba",
"metadata": {},
"source": [
"Set `update_vectordb = False` at the beginning. If no docs have been updated, we don't need to rebuild their embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"%pip install --upgrade --quiet sqlalchemy langchain langchain-community langchain-openai"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.text_splitter import TokenTextSplitter\n",
"from langchain_community.document_loaders import (\n",
"    DirectoryLoader,\n",
"    UnstructuredMarkdownLoader,\n",
")\n",
"from langchain_community.vectorstores.apache_doris import (\n",
"    ApacheDoris,\n",
"    ApacheDorisSettings,\n",
")\n",
"from langchain_openai import OpenAI, OpenAIEmbeddings\n",
"\n",
"update_vectordb = False"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-14T12:54:01.392500Z",
"start_time": "2024-02-14T12:53:58.866615Z"
}
}
},
{
"cell_type": "markdown",
"id": "ee821c00",
"metadata": {},
"source": [
"## Load docs and split them into tokens"
]
},
{
"cell_type": "markdown",
"id": "34ba0cfd",
"metadata": {},
"source": [
"Load all markdown files under the `docs` directory.\n",
"\n",
"For the Apache Doris documentation, you can clone the repo from https://github.com/apache/doris; it contains a `docs` directory."
]
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [],
"source": [
"loader = DirectoryLoader(\n",
"    \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n",
")\n",
"documents = loader.load()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-14T12:55:24.128917Z",
"start_time": "2024-02-14T12:55:19.463831Z"
}
}
},
{
"cell_type": "markdown",
"id": "b415fe2a",
"metadata": {},
"source": [
"Split the docs into token-based chunks, and set `update_vectordb = True` because there are new docs/tokens."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "07e8acff",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:55:27.090729Z",
"start_time": "2024-02-14T12:55:26.857946Z"
}
},
"outputs": [],
"source": [
"# load text splitter and split docs into snippets of text\n",
"text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
"split_docs = text_splitter.split_documents(documents)\n",
"\n",
"# tell vectordb to update text embeddings\n",
"update_vectordb = True"
]
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"data": {
"text/plain": "Document(page_content=\"xxxx on backend xxx exceed limit usage\\n\\nUsually occurs in operations such as Import, Alter, etc. This error means that the usage of the corresponding disk corresponding to the BE exceeds the threshold (default 95%). In this case, you can first use the show backends command, where MaxDiskUsedPct shows the usage of the disk with the highest usage on the corresponding BE. If If it exceeds 95%, this error will be reported.\\n\\nAt this point, you need to go to the corresponding BE node to check the usage in the data directory. The trash directory and snapshot directory can be manually cleaned to free up space. If the data directory occupies a large space, you need to consider deleting some data to free up space. For details, please refer to Disk Space Management.\\n\\nQ7. Calling stream load to import data through a Java program may result in a Broken Pipe error when a batch of data is large.\\n\\nApart from Broken Pipe, some other weird errors may occur.\\n\\nThis situation usually occurs after enabling httpv2. Because httpv2 is an http service implemented using spring boot, and uses tomcat as the default built-in container. However, there seems to be some problems with tomcat's handling of 307 forwarding, so the built-in container was modified to jetty later. In addition, the version of apache http client in the java program needs to use the version after 4.5.13. In the previous version, there were also some problems with the processing of forwarding.\\n\\nSo this problem can be solved in two ways:\\n\\nDisable httpv2\\n\\nRestart FE after adding enable_http_server_v2=false in fe.conf. However, the new version of the UI interface can no longer be used, and some new interfaces based on httpv2 can not be used. (Normal import queries are not affected).\\n\\nUpgrade\\n\\nUpgrading to Doris\", metadata={'source': '/Users/liugddx/code/doris/docs/en/docs/get-starting/faq/data-faq.md'})"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"split_docs[-20]"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-14T12:55:29.659291Z",
"start_time": "2024-02-14T12:55:29.642954Z"
}
}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# docs = 5, # splits = 47\n"
]
}
],
"source": [
"print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-14T12:55:33.556128Z",
"start_time": "2024-02-14T12:55:33.547821Z"
}
}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"id": "5371f152",
"metadata": {},
"source": [
"## Create vectordb instance"
]
},
{
"cell_type": "markdown",
"id": "15702d9c",
"metadata": {},
"source": [
"### Use Apache Doris as vectordb"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ced7dbe1",
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-14T12:55:39.508287Z",
"start_time": "2024-02-14T12:55:39.500370Z"
}
},
"outputs": [],
"source": [
"def gen_apache_doris(update_vectordb, embeddings, settings):\n",
"    # Rebuild embeddings from split_docs only when docs have changed;\n",
"    # otherwise reuse what is already stored.\n",
"    if update_vectordb:\n",
"        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)\n",
"    else:\n",
"        docsearch = ApacheDoris(embeddings, settings)\n",
"    return docsearch"
]
},
{
"cell_type": "markdown",
"id": "15d86fda",
"metadata": {},
"source": [
"## Convert tokens into embeddings and put them into vectordb"
]
},
{
"cell_type": "markdown",
"id": "ff1322ea",
"metadata": {},
"source": [
"Here we use Apache Doris as the vectordb. You can configure the Apache Doris instance via `ApacheDorisSettings`.\n",
"\n",
"Configuring an Apache Doris instance is much like configuring a MySQL instance. You need to specify:\n",
"1. host/port\n",
"2. username (default: 'root')\n",
"3. password (default: '')\n",
"4. database (default: 'default')\n",
"5. table (default: 'langchain')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [],
"source": [
"import os\n",
"\n",
"from getpass import getpass\n",
"os.environ['OPENAI_API_KEY'] = getpass()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-14T12:56:02.671291Z",
"start_time": "2024-02-14T12:55:48.350294Z"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"update_vectordb = True\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# configure Apache Doris settings(host/port/user/pw/db)\n",
"settings = ApacheDorisSettings()\n",
"settings.port = 9030\n",
"settings.host = \"172.30.34.130\"\n",
"settings.username = \"root\"\n",
"settings.password = \"\"\n",
"settings.database = \"langchain\"\n",
"docsearch = gen_apache_doris(update_vectordb, embeddings, settings)\n",
"\n",
"print(docsearch)\n",
"\n",
"update_vectordb = False"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"id": "bde66626",
"metadata": {},
"source": [
"## Build a QA chain and ask it a question"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84921814",
"metadata": {},
"outputs": [],
"source": [
"llm = OpenAI()\n",
"qa = RetrievalQA.from_chain_type(\n",
"    llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
")\n",
"query = \"what is apache doris\"\n",
"resp = qa.run(query)\n",
"print(resp)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}