langchain-ai · baskaryan · Feb 18, 2024 · Feb 14, 2024 · Feb 14, 2024 · Feb 14, 2024
diff --git a/docs/docs/integrations/providers/apache_doris.mdx b/docs/docs/integrations/providers/apache_doris.mdx
@@ -0,0 +1,21 @@
+# Apache Doris
+
+>[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.
+It delivers lightning-fast analytics on real-time data at scale.
+
+>Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.
+
+## Installation and Setup
+
+
+```bash
+pip install pymysql
+```
+
+## Vector Store
+
+See a [usage example](/docs/integrations/vectorstores/apache_doris).
+
+```python
+from langchain_community.vectorstores import ApacheDoris
+```
diff --git a/docs/docs/integrations/vectorstores/apache_doris.ipynb b/docs/docs/integrations/vectorstores/apache_doris.ipynb
@@ -0,0 +1,322 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "84180ad0-66cd-43e5-b0b8-2067a29e16ba",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "# Apache Doris\n",
+    "\n",
+    ">[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.\n",
+    "It delivers lightning-fast analytics on real-time data at scale.\n",
+    "\n",
+    ">Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.\n",
+    "\n",
+    "Here we'll show how to use the Apache Doris Vector Store."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1685854f",
+   "metadata": {},
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --upgrade --quiet  pymysql"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c891bba",
+   "metadata": {},
+   "source": [
+    "Set `update_vectordb = False` at the beginning. If there is no docs updated, then we don't need to rebuild the embeddings of docs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f4e6ca20-79dd-482a-8f68-af9d7dd59c7c",
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "!pip install  sqlalchemy\n",
+    "!pip install langchain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "96f7c7a2-4811-4fdf-87f5-c60772f51fe1",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-02-14T12:54:01.392500Z",
+     "start_time": "2024-02-14T12:53:58.866615Z"
+    },
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chains import RetrievalQA\n",
+    "from langchain.text_splitter import TokenTextSplitter\n",
+    "from langchain_community.document_loaders import (\n",
+    "    DirectoryLoader,\n",
+    "    UnstructuredMarkdownLoader,\n",
+    ")\n",
+    "from langchain_community.vectorstores.apache_doris import (\n",
+    "    ApacheDoris,\n",
+    "    ApacheDorisSettings,\n",
+    ")\n",
+    "from langchain_openai import OpenAI, OpenAIEmbeddings\n",
+    "\n",
+    "update_vectordb = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee821c00",
+   "metadata": {},
+   "source": [
+    "## Load docs and split them into tokens"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "34ba0cfd",
+   "metadata": {},
+   "source": [
+    "Load all markdown files under the `docs` directory\n",
+    "\n",
+    "for Apache Doris documents, you can clone repo from https://github.com/apache/doris, and there is `docs` directory in it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "799edf20-bcf4-4a65-bff7-b907f6bdba20",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-02-14T12:55:24.128917Z",
+     "start_time": "2024-02-14T12:55:19.463831Z"
+    },
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "loader = DirectoryLoader(\n",
+    "    \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b415fe2a",
+   "metadata": {},
+   "source": [
+    "Split docs into tokens, and set `update_vectordb = True` because there are new docs/tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "0dc5ba83-62ef-4f61-a443-e872f251e7da",
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# load text splitter and split docs into snippets of text\n",
+    "text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
+    "split_docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "# tell vectordb to update text embeddings\n",
+    "update_vectordb = True"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "46966e25-9449-4a36-87d1-c0b25dce2994",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "split_docs[-20]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "99422e95-b407-43eb-aa68-9a62363fc82f",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "print(\"# docs  = %d, # splits = %d\" % (len(documents), len(split_docs)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e780d77f-3f96-4690-a10f-f87566f7ccc6",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "## Create vectordb instance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15702d9c",
+   "metadata": {},
+   "source": [
+    "### Use Apache Doris as vectordb"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "ced7dbe1",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-02-14T12:55:39.508287Z",
+     "start_time": "2024-02-14T12:55:39.500370Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "def gen_apache_doris(update_vectordb, embeddings, settings):\n",
+    "    if update_vectordb:\n",
+    "        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)\n",
+    "    else:\n",
+    "        docsearch = ApacheDoris(embeddings, settings)\n",
+    "    return docsearch"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15d86fda",
+   "metadata": {},
+   "source": [
+    "## Convert tokens into embeddings and put them into vectordb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff1322ea",
+   "metadata": {},
+   "source": [
+    "Here we use Apache Doris as vectordb, you can configure Apache Doris instance via `ApacheDorisSettings`.\n",
+    "\n",
+    "Configuring Apache Doris instance is pretty much like configuring mysql instance. You need to specify:\n",
+    "1. host/port\n",
+    "2. username(default: 'root')\n",
+    "3. password(default: '')\n",
+    "4. database(default: 'default')\n",
+    "5. table(default: 'langchain')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "b34f8c31-c173-4902-8168-2e838ddfb9e9",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-02-14T12:56:02.671291Z",
+     "start_time": "2024-02-14T12:55:48.350294Z"
+    },
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from getpass import getpass\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = getpass()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c53ab3f2-9e34-4424-8b07-6292bde67e14",
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "update_vectordb = True\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "\n",
+    "# configure Apache Doris settings(host/port/user/pw/db)\n",
+    "settings = ApacheDorisSettings()\n",
+    "settings.port = 9030\n",
+    "settings.host = \"172.30.34.130\"\n",
+    "settings.username = \"root\"\n",
+    "settings.password = \"\"\n",
+    "settings.database = \"langchain\"\n",
+    "docsearch = gen_apache_doris(update_vectordb, embeddings, settings)\n",
+    "\n",
+    "print(docsearch)\n",
+    "\n",
+    "update_vectordb = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bde66626",
+   "metadata": {},
+   "source": [
+    "## Build QA and ask question to it"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "84921814",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "llm = OpenAI()\n",
+    "qa = RetrievalQA.from_chain_type(\n",
+    "    llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
+    ")\n",
+    "query = \"what is apache doris\"\n",
+    "resp = qa.run(query)\n",
+    "print(resp)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/libs/community/langchain_community/vectorstores/__init__.py b/libs/community/langchain_community/vectorstores/__init__.py
@@ -74,6 +74,12 @@ def _import_annoy() -> Any:
     return Annoy
 
 
+def _import_apache_doris() -> Any:
+    from langchain_community.vectorstores.apache_doris import ApacheDoris
+
+    return ApacheDoris
+
+
 def _import_atlas() -> Any:
     from langchain_community.vectorstores.atlas import AtlasDB
 
@@ -497,6 +503,8 @@ def __getattr__(name: str) -> Any:
         return _import_elastic_vector_search()
     elif name == "Annoy":
         return _import_annoy()
+    elif name == "ApacheDoris":
+        return _import_apache_doris()
     elif name == "AtlasDB":
         return _import_atlas()
     elif name == "AwaDB":
@@ -640,6 +648,7 @@ def __getattr__(name: str) -> Any:
     "AlibabaCloudOpenSearchSettings",
     "AnalyticDB",
     "Annoy",
+    "ApacheDoris",
     "AtlasDB",
     "AwaDB",
     "AzureSearch",