[agents] Add support for AstraDB Collections (Astra Vector DB, using Stargate) #731

Merged · 15 commits · Nov 23, 2023
1 change: 0 additions & 1 deletion examples/applications/chatbot-rag-memory/crawler.yaml
@@ -46,7 +46,6 @@ pipeline:
   - name: "Detect language"
     type: "language-detector"
     configuration:
-      allowedLanguages: ["en"]
       property: "language"
   - name: "Split into chunks"
     type: "text-splitter"
1 change: 0 additions & 1 deletion examples/applications/flare/crawler.yaml
@@ -70,7 +70,6 @@ pipeline:
   - name: "Detect language"
     type: "language-detector"
     configuration:
-      allowedLanguages: ["en", "fr"]
       property: "language"
   - name: "Split into chunks"
     type: "text-splitter"
1 change: 0 additions & 1 deletion examples/applications/langchain-chat/crawler.yaml
@@ -50,7 +50,6 @@ pipeline:
   - name: "Detect language"
     type: "language-detector"
     configuration:
-      allowedLanguages: ["en", "fr"]
       property: "language"
   - name: "Split into chunks"
     type: "text-splitter"
1 change: 0 additions & 1 deletion examples/applications/query-milvus/crawler.yaml
@@ -106,7 +106,6 @@ pipeline:
   - name: "Detect language"
     type: "language-detector"
     configuration:
-      allowedLanguages: ["en", "fr"]
       property: "language"
   - name: "Split into chunks"
     type: "text-splitter"
1 change: 0 additions & 1 deletion examples/applications/query-solr/crawler.yaml
@@ -102,7 +102,6 @@ pipeline:
   - name: "Detect language"
     type: "language-detector"
     configuration:
-      allowedLanguages: ["en", "fr"]
       property: "language"
   - name: "Split into chunks"
     type: "text-splitter"
@@ -0,0 +1 @@
java/lib/*
45 changes: 45 additions & 0 deletions examples/applications/webcrawler-astra-vector-db/README.md
@@ -0,0 +1,45 @@
# Indexing a Website

This sample application shows how to use the WebCrawler Source Connector and Astra Vector DB.

## Collections

This application creates a collection named "documents" in your DB.
You can change the name of the collection in the file `configuration.yaml`.
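For reference, the collection name is defined as a global in `configuration.yaml` (the full file is part of this application); changing this value is enough to point the pipeline at a different collection:

```
configuration:
  defaults:
    globals:
      collection-name: "documents" # change this value to use a different collection
```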

## Configure access to the Vector Database

Export the following environment variables to configure access to the database:

```
export ASTRA_VECTOR_DB_TOKEN=AstraCS:...
export ASTRA_VECTOR_DB_ENDPOINT=https://....astra.datastax.com
```

You can find the credentials in the Astra DB console.

The `examples/secrets/secrets.yaml` file resolves those environment variables for you.
When you go to production, you should create a dedicated `secrets.yaml` file for each environment.
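As a rough sketch (the exact keys are an assumption, based on how `configuration.yaml` references `secrets.astra-vector-db.token` and `secrets.astra-vector-db.endpoint`), the relevant entry in `secrets.yaml` could look like this:

```
secrets:
  - id: astra-vector-db
    data:
      # assumption: resolved from the environment variables exported above
      token: "${ASTRA_VECTOR_DB_TOKEN}"
      endpoint: "${ASTRA_VECTOR_DB_ENDPOINT}"
```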

## Configure the pipeline

You can edit the `crawler.yaml` file to configure the list of allowed web domains; this is required so that the crawler does not wander outside your data.
Also configure the list of seed URLs, for instance with your home page.

The default configuration in this example will crawl the LangStream website.
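For example, the relevant part of `crawler.yaml` in this application looks like the following; replace the URLs with your own site:

```
- name: "Crawl the WebSite"
  type: "webcrawler-source"
  configuration:
    seed-urls: ["https://docs.langstream.ai/"]
    allowed-domains: ["https://docs.langstream.ai"]
```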

## Deploy the LangStream application

```
./bin/langstream docker run test -app examples/applications/webcrawler-astra-vector-db -s examples/secrets/secrets.yaml
```

## Talk with the Chat bot

If the application starts successfully, you can talk with the chat bot using the UI.

You can also use the CLI:

```
./bin/langstream gateway chat test -cg bot-output -pg user-input -p sessionId=$(uuidgen)
```
106 changes: 106 additions & 0 deletions examples/applications/webcrawler-astra-vector-db/chatbot.yaml
@@ -0,0 +1,106 @@
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

topics:
- name: "questions-topic"
creation-mode: create-if-not-exists
- name: "answers-topic"
creation-mode: create-if-not-exists
- name: "log-topic"
creation-mode: create-if-not-exists
errors:
on-failure: "skip"
pipeline:
- name: "convert-to-structure"
type: "document-to-json"
input: "questions-topic"
configuration:
text-field: "question"
- name: "compute-embeddings"
type: "compute-ai-embeddings"
configuration:
model: "${secrets.open-ai.embeddings-model}" # This needs to match the name of the model deployment, not the base model
embeddings-field: "value.question_embeddings"
text: "{{ value.question }}"
flush-interval: 0
- name: "lookup-related-documents"
type: "query-vector-db"
configuration:
datasource: "AstraDatasource"
query: |
{
"collection-name": "${globals.collection-name}",
"limit": 20,
"vector": ?
}
fields:
- "value.question_embeddings"
output-field: "value.related_documents"
- name: "re-rank documents with MMR"
type: "re-rank"
configuration:
max: 5 # keep only the top 5 documents, because we have a hard limit on the prompt size
field: "value.related_documents"
query-text: "value.question"
query-embeddings: "value.question_embeddings"
output-field: "value.related_documents"
text-field: "record.text"
embeddings-field: "record.vector"
algorithm: "MMR"
lambda: 0.5
k1: 1.2
b: 0.75
- name: "ai-chat-completions"
type: "ai-chat-completions"
configuration:
model: "${secrets.open-ai.chat-completions-model}" # This needs to be set to the model deployment name, not the base name
# on the log-topic we add a field with the answer
completion-field: "value.answer"
# we are also logging the prompt we sent to the LLM
log-field: "value.prompt"
# here we configure the streaming behavior
# as soon as the LLM answers with a chunk we send it to the answers-topic
stream-to-topic: "answers-topic"
# on the streaming answer we send the answer as a whole message
# the 'value' syntax is used to refer to the whole value of the message
stream-response-completion-field: "value"
# we want to stream the answer as soon as we have 20 chunks
# to reduce latency, the agent sends the first message with 1 chunk, then with 2 chunks,
# and so on up to the min-chunks-per-message value
# eventually we send bigger messages to reduce the per-message overhead on the topic
min-chunks-per-message: 20
messages:
- role: system
content: |
A user is going to ask questions. The documents below may help you answer them.
Please try to leverage them in your answer as much as possible.
Take into consideration that the user is always asking questions about the LangStream project.
If you provide code or YAML snippets, please explicitly state that they are examples.
Do not provide information that is not related to the LangStream project.

Documents:
{{# value.related_documents}}
{{ text}}
{{/ value.related_documents}}
- role: user
content: "{{ value.question}}"
- name: "cleanup-response"
type: "drop-fields"
output: "log-topic"
configuration:
fields:
- "question_embeddings"
- "related_documents"
@@ -0,0 +1,34 @@
#
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

configuration:
defaults:
globals:
collection-name: "documents"
resources:
- type: "open-ai-configuration"
name: "OpenAI Azure configuration"
configuration:
url: "${secrets.open-ai.url}"
access-key: "${secrets.open-ai.access-key}"
provider: "${secrets.open-ai.provider}"
- type: "datasource"
name: "AstraDatasource"
configuration:
service: "astra-vector-db"
token: "${secrets.astra-vector-db.token}"
endpoint: "${secrets.astra-vector-db.endpoint}"
93 changes: 93 additions & 0 deletions examples/applications/webcrawler-astra-vector-db/crawler.yaml
@@ -0,0 +1,93 @@
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

name: "Crawl a website"
topics:
- name: "chunks-topic"
creation-mode: create-if-not-exists
resources:
size: 2
pipeline:
- name: "Crawl the WebSite"
type: "webcrawler-source"
configuration:
seed-urls: ["https://docs.langstream.ai/"]
allowed-domains: ["https://docs.langstream.ai"]
forbidden-paths: []
min-time-between-requests: 500
reindex-interval-seconds: 3600
max-error-count: 5
max-urls: 1000
max-depth: 50
handle-robots-file: true
user-agent: "" # this is computed automatically, but you can override it
scan-html-documents: true
http-timeout: 10000
handle-cookies: true
max-unflushed-pages: 100
# store data directly on the agent disk, no need for external S3 storage
state-storage: disk
- name: "Extract text"
type: "text-extractor"
- name: "Normalise text"
type: "text-normaliser"
configuration:
make-lowercase: true
trim-spaces: true
- name: "Detect language"
type: "language-detector"
configuration:
property: "language"
- name: "Split into chunks"
type: "text-splitter"
configuration:
splitter_type: "RecursiveCharacterTextSplitter"
chunk_size: 400
separators: ["\n\n", "\n", " ", ""]
keep_separator: false
chunk_overlap: 100
length_function: "cl100k_base"
- name: "Convert to structured data"
type: "document-to-json"
configuration:
text-field: text
copy-properties: true
- name: "prepare-structure"
type: "compute"
configuration:
fields:
- name: "value.filename"
expression: "properties.url"
type: STRING
- name: "value.chunk_id"
expression: "properties.chunk_id"
type: STRING
- name: "value.language"
expression: "properties.language"
type: STRING
- name: "value.chunk_num_tokens"
expression: "properties.chunk_num_tokens"
type: STRING
- name: "compute-embeddings"
id: "step1"
type: "compute-ai-embeddings"
output: "chunks-topic"
configuration:
model: "${secrets.open-ai.embeddings-model}"
embeddings-field: "value.embeddings_vector"
text: "{{ value.text }}"
batch-size: 10
flush-interval: 500
43 changes: 43 additions & 0 deletions examples/applications/webcrawler-astra-vector-db/gateways.yaml
@@ -0,0 +1,43 @@
#
#
# Copyright DataStax, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

gateways:
- id: "user-input"
type: produce
topic: "questions-topic"
parameters:
- sessionId
produceOptions:
headers:
- key: langstream-client-session-id
valueFromParameters: sessionId

- id: "bot-output"
type: consume
topic: "answers-topic"
parameters:
- sessionId
consumeOptions:
filters:
headers:
- key: langstream-client-session-id
valueFromParameters: sessionId


- id: "llm-debug"
type: consume
topic: "log-topic"