Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harrison/matching engine vectorstore #3350

Merged
merged 50 commits into from
May 31, 2023
Merged
Show file tree
Hide file tree
Changes from 48 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
5dc6bce
Initial implementation for matching engine
Apr 14, 2023
0fc0cdf
Fixing bugs
Apr 14, 2023
43f458f
Finished initial implementation. About to test
Apr 14, 2023
ffd3939
Typos
Apr 14, 2023
a4e4f0a
Typos
Apr 14, 2023
7b1b3d5
Typos
Apr 14, 2023
e0d956c
Typos
Apr 14, 2023
7603e85
Typos
Apr 14, 2023
b492c03
Typos
Apr 14, 2023
de768d0
Typos
Apr 14, 2023
b1ecb20
Typos
Apr 14, 2023
7e00904
Typos
Apr 14, 2023
c3c27fe
Typos
Apr 14, 2023
18e6412
Typos
Apr 14, 2023
c0c10e8
Typos
Apr 14, 2023
21b443f
Typos
Apr 14, 2023
78bfdb2
Typos
Apr 14, 2023
45c2c8f
Typos
Apr 14, 2023
583abe7
Typos
Apr 14, 2023
d5602e6
Typos
Apr 14, 2023
c12b9ab
Typos
Apr 14, 2023
ad5c433
Typos
Apr 14, 2023
dc1db6f
Typos
Apr 14, 2023
cd63328
Typos
Apr 14, 2023
5a3dce9
Typos
Apr 14, 2023
fbdef49
Typos
Apr 14, 2023
a2c3e7e
Typos
Apr 14, 2023
4d80f81
Typos
Apr 14, 2023
014821f
Typos
Apr 14, 2023
bb2c78a
Typos
Apr 14, 2023
bca9c74
fixes
Apr 18, 2023
cb15331
Merge branch 'master' of github.com:tomaspiaggio/langchain
scafati98 Apr 18, 2023
3a1950b
Documenting code. Added TODOs to scafati98
Apr 18, 2023
0643d97
Documenting code. Added TODOs to scafati98
Apr 18, 2023
de31106
Added matchingengine.ipynb to documentation examples.
Apr 18, 2023
0f6a4ea
Fixed the comments. Didn't fix linting yet. Stuck with Poetry.
Apr 19, 2023
3267bb5
Continued with comments
Apr 19, 2023
364b83a
Added dependencies to pyproject.toml
Apr 19, 2023
e2554b1
Removed HUB_MODEL constant as it was the default anyway.
Apr 19, 2023
af73772
Changed gcs_bucket_uri to gcs_bucket_name in all places. Changed argu…
Apr 20, 2023
2f946f5
Moved all object creations out of the object and into from_components.
Apr 20, 2023
89d574a
merge
hwchase17 Apr 22, 2023
0221268
Merge branch 'tomaspiaggio-master' into harrison/matching-engine-vect…
hwchase17 Apr 22, 2023
22fbf99
cr
hwchase17 Apr 22, 2023
22c89c8
cr
hwchase17 Apr 29, 2023
c3097b5
cr
hwchase17 Apr 29, 2023
5443030
cr
dev2049 May 10, 2023
4e9eab4
merge
dev2049 May 10, 2023
ebf840f
cr
dev2049 May 31, 2023
a537dbe
merge
dev2049 May 31, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
346 changes: 346 additions & 0 deletions docs/modules/indexes/vectorstores/examples/matchingengine.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,346 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "655b8f55-2089-4733-8b09-35dea9580695",
"metadata": {},
"source": [
"# MatchingEngine\n",
"\n",
"This notebook shows how to use functionality related to the GCP Vertex AI `MatchingEngine` vector database.\n",
"\n",
"> Vertex AI [Matching Engine](https://cloud.google.com/vertex-ai/docs/matching-engine/overview) provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or an approximate nearest neighbor (ANN) service.\n",
"\n",
"**Note**: This module expects an endpoint and deployed index already created as the creation time takes close to one hour. To see how to create an index refer to the section [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint)"
]
},
{
"cell_type": "markdown",
"id": "a9971578-0ae9-4809-9e80-e5f9d3dcc98a",
"metadata": {},
"source": [
"## Create VectorStore from texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7c96da4-8d97-4f69-8c13-d2fcafc03b05",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import MatchingEngine"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58b70880-edd9-46f3-b769-f26c2bcc8395",
"metadata": {},
"outputs": [],
"source": [
"texts = ['The cat sat on', 'the mat.', 'I like to', 'eat pizza for', 'dinner.', 'The sun sets', 'in the west.']\n",
"\n",
"\n",
"vector_store = MatchingEngine.from_components(\n",
" texts=texts,\n",
" project_id=\"<my_project_id>\",\n",
" region=\"<my_region>\",\n",
" gcs_bucket_uri=\"<my_gcs_bucket>\",\n",
" index_id=\"<my_matching_engine_index_id>\",\n",
" enpoint_id=\"<my_matching_engine_endpoint_id>\"\n",
dev2049 marked this conversation as resolved.
Show resolved Hide resolved
")\n",
"\n",
"vector_store.add_texts(texts=texts)\n",
"\n",
"vector_store.similarity_search(\"lunch\", k=2)"
]
},
{
"cell_type": "markdown",
"id": "0e76e05c-d4ef-49a1-b1b9-2ea989a0eda3",
"metadata": {
"tags": []
},
"source": [
"## Create Index and deploy it to an Endpoint"
]
},
{
"cell_type": "markdown",
"id": "61935a91-5efb-48af-bb40-ea1e83e24974",
"metadata": {},
"source": [
"### Imports, Constants and Configs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "421b66c9-5b8f-4ef7-821e-12886a62b672",
"metadata": {},
"outputs": [],
"source": [
"# Installing dependencies.\n",
"!pip install tensorflow \\\n",
" google-cloud-aiplatform \\\n",
" tensorflow-hub \\\n",
" tensorflow-text "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4e9cc02-371e-40a1-bce9-37ac8efdf2cb",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"\n",
"from google.cloud import aiplatform\n",
"import tensorflow_hub as hub\n",
"import tensorflow_text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "352a05df-6532-4aba-a36f-603327a5bc5b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"PROJECT_ID = \"<my_project_id>\"\n",
"REGION = \"<my_region>\"\n",
"VPC_NETWORK = \"<my_vpc_network_name>\"\n",
"PEERING_RANGE_NAME = \"ann-langchain-me-range\" # Name for creating the VPC peering.\n",
"BUCKET_URI = \"gs://<bucket_uri>\"\n",
"# The number of dimensions for the tensorflow universal sentence encoder. \n",
"# If other embedder is used, the dimensions would probably need to change.\n",
"DIMENSIONS = 512\n",
"DISPLAY_NAME = \"index-test-name\"\n",
"EMBEDDING_DIR = f\"{BUCKET_URI}/banana\"\n",
"DEPLOYED_INDEX_ID = \"endpoint-test-name\"\n",
"\n",
"PROJECT_NUMBER = !gcloud projects list --filter=\"PROJECT_ID:'{PROJECT_ID}'\" --format='value(PROJECT_NUMBER)'\n",
"PROJECT_NUMBER = PROJECT_NUMBER[0]\n",
"VPC_NETWORK_FULL = f\"projects/{PROJECT_NUMBER}/global/networks/{VPC_NETWORK}\"\n",
"\n",
"# Change this if you need the VPC to be created.\n",
"CREATE_VPC = False"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "076e7931-f83e-4597-8748-c8004fd8de96",
"metadata": {},
"outputs": [],
"source": [
"# Set the project id\n",
"! gcloud config set project {PROJECT_ID}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4265081b-a5b7-491e-8ac5-1e26975b9974",
"metadata": {},
"outputs": [],
"source": [
"# Remove the if condition to run the encapsulated code\n",
"if CREATE_VPC:\n",
" # Create a VPC network\n",
" ! gcloud compute networks create {VPC_NETWORK} --bgp-routing-mode=regional --subnet-mode=auto --project={PROJECT_ID}\n",
"\n",
" # Add necessary firewall rules\n",
" ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-icmp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow icmp\n",
"\n",
" ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-internal --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow all --source-ranges 10.128.0.0/9\n",
"\n",
" ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-rdp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:3389\n",
"\n",
" ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-ssh --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:22\n",
"\n",
" # Reserve IP range\n",
" ! gcloud compute addresses create {PEERING_RANGE_NAME} --global --prefix-length=16 --network={VPC_NETWORK} --purpose=VPC_PEERING --project={PROJECT_ID} --description=\"peering range\"\n",
"\n",
" # Set up peering with service networking\n",
" # Your account must have the \"Compute Network Admin\" role to run the following.\n",
" ! gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network={VPC_NETWORK} --ranges={PEERING_RANGE_NAME} --project={PROJECT_ID}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9dfbb847-fc53-48c1-b0f2-00d1c4330b01",
"metadata": {},
"outputs": [],
"source": [
"# Creating bucket.\n",
"! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI"
]
},
{
"cell_type": "markdown",
"id": "f9698068-3d2f-471b-90c3-dae3e4ca6f63",
"metadata": {},
"source": [
"### Using Tensorflow Universal Sentence Encoder as an Embedder"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "144007e2-ddf8-43cd-ac45-848be0458ba9",
"metadata": {},
"outputs": [],
"source": [
"# Load the Universal Sentence Encoder module\n",
"module_url = \"https://tfhub.dev/google/universal-sentence-encoder-multilingual/3\"\n",
"model = hub.load(module_url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94a2bdcb-c7e3-4fb0-8c97-cc1f2263f06c",
"metadata": {},
"outputs": [],
"source": [
"# Generate embeddings for each word\n",
"embeddings = model(['banana'])"
]
},
{
"cell_type": "markdown",
"id": "5a4e6e99-5e42-4e55-90f6-c03aae4fbf14",
"metadata": {},
"source": [
"### Inserting a test embedding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "024c78f3-4663-4d8f-9f3c-b7d82073ada4",
"metadata": {},
"outputs": [],
"source": [
"initial_config = {\"id\": \"banana_id\", \"embedding\": [float(x) for x in list(embeddings.numpy()[0])]}\n",
"\n",
"with open(\"data.json\", \"w\") as f:\n",
" json.dump(initial_config, f)\n",
"\n",
"!gsutil cp data.json {EMBEDDING_DIR}/file.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a11489f4-5904-4fc2-9178-f32c2df0406d",
"metadata": {},
"outputs": [],
"source": [
"aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)"
]
},
{
"cell_type": "markdown",
"id": "e3c6953b-11f6-4803-bf2d-36fa42abf3c7",
"metadata": {},
"source": [
"### Creating Index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c31c3c56-bfe0-49ec-9901-cd146f592da7",
"metadata": {},
"outputs": [],
"source": [
"my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(\n",
" display_name=DISPLAY_NAME,\n",
" contents_delta_uri=EMBEDDING_DIR,\n",
" dimensions=DIMENSIONS,\n",
" approximate_neighbors_count=150,\n",
" distance_measure_type=\"DOT_PRODUCT_DISTANCE\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "50770669-edf6-4796-9563-d1ea59cfa8e8",
"metadata": {},
"source": [
"### Creating Endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20c93d1b-a7d5-47b0-9c95-1aec1c62e281",
"metadata": {},
"outputs": [],
"source": [
"my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(\n",
" display_name=f\"{DISPLAY_NAME}-endpoint\",\n",
" network=VPC_NETWORK_FULL,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b52df797-28db-4b4a-b79c-e8a274293a6a",
"metadata": {},
"source": [
"### Deploy Index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019a7043-ad11-4a48-bec7-18928547b2ba",
"metadata": {},
"outputs": [],
"source": [
"my_index_endpoint = my_index_endpoint.deploy_index(\n",
" index=my_index, \n",
" deployed_index_id=DEPLOYED_INDEX_ID\n",
")\n",
"\n",
"my_index_endpoint.deployed_indexes"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m107",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m107"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading