diff --git a/docs/user_guide/hash_vs_json_05.ipynb b/docs/user_guide/hash_vs_json_05.ipynb index 183cbc58..f9cdf6e2 100644 --- a/docs/user_guide/hash_vs_json_05.ipynb +++ b/docs/user_guide/hash_vs_json_05.ipynb @@ -5,16 +5,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Vectorizers\n", + "# Hash vs JSON Storage\n", + "\n", + "\n", + "Out of the box, Redis provides a [variety of data structures](https://redis.com/redis-enterprise/data-structures/) that can adapt to your domain-specific applications and use cases.\n", + "In this notebook, we will demonstrate how to use RedisVL with both [Hash](https://redis.io/docs/data-types/hashes/) and [JSON](https://redis.io/docs/data-types/json/) data.\n", "\n", - "In this notebook, we will show how to use RedisVL to create embeddings using the built-in text embedding vectorizers. Today RedisVL supports:\n", - "1. OpenAI\n", - "2. HuggingFace\n", - "3. Vertex AI\n", "\n", "Before running this notebook, be sure to\n", "1. Have installed ``redisvl`` and have that environment active for this notebook.\n", - "2. Have a running Redis Stack instance with RediSearch > 2.4 active.\n", + "2. Have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.\n", "\n", "For example, you can run Redis Stack locally with Docker:\n", "\n", @@ -22,7 +22,9 @@ "docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", "```\n", "\n", - "This will run Redis on port 6379 and RedisInsight at http://localhost:8001." + "Or create a [FREE Redis Enterprise instance](https://redis.com/try-free).\n", + "\n", + "This example assumes Redis is running locally on port 6379 and RedisInsight on port 8001." 
] }, { @@ -32,345 +34,459 @@ "outputs": [], "source": [ "# import necessary modules\n", - "import os" + "import os\n", + "import pickle\n", + "from jupyterutils import table_print, result_print\n", + "from redisvl.index import SearchIndex\n", + "\n", + "\n", + "# load in the example data and printing utils\n", + "data = pickle.load(open(\"hybrid_example_data.pkl\", \"rb\"))" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
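<table><tr><th>user</th><th>age</th><th>job</th><th>credit_score</th><th>office_location</th><th>user_embedding</th></tr>
<tr><td>john</td><td>18</td><td>engineer</td><td>high</td><td>-122.4194,37.7749</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>derrick</td><td>14</td><td>doctor</td><td>low</td><td>-122.4194,37.7749</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>nancy</td><td>94</td><td>doctor</td><td>high</td><td>-122.4194,37.7749</td><td>b'333?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>tyler</td><td>100</td><td>engineer</td><td>high</td><td>-122.0839,37.3861</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc>\\x00\\x00\\x00?'</td></tr>
<tr><td>tim</td><td>12</td><td>dermatologist</td><td>high</td><td>-122.0839,37.3861</td><td>b'\\xcd\\xcc\\xcc>\\xcd\\xcc\\xcc>\\x00\\x00\\x00?'</td></tr>
<tr><td>taimur</td><td>15</td><td>CEO</td><td>low</td><td>-122.0839,37.3861</td><td>b'\\x9a\\x99\\x19?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>joe</td><td>35</td><td>dentist</td><td>medium</td><td>-122.0839,37.3861</td><td>b'fff?fff?\\xcd\\xcc\\xcc='</td></tr></table>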
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "table_print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Creating Text Embeddings\n", - "\n", - "This example will show how to create an embedding from 3 simple sentences with a number of different text vectorizers in RedisVL.\n", - "\n", - "- \"That is a happy dog\"\n", - "- \"That is a happy person\"\n", - "- \"Today is a nice day\"\n" + "## Hash or JSON -- how to choose?\n", + "Both storage options offer a variety of features and tradeoffs. Below we will work through a dummy dataset to learn when and how to use each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### OpenAI\n", + "### Working with Hashes\n", + "Hashes in Redis are simple collections of field-value pairs. Think of a hash as a mutable, single-level dictionary that contains multiple \"rows\":\n", "\n", - "The ``OpenAITextVectorizer`` makes it simple to use RedisVL with the embeddings models at OpenAI. For this you will need to install ``openai``. \n", "\n", - "```bash\n", - "pip install openai\n", - "```\n" + "```python\n", + "{\n", + " \"model\": \"Deimos\",\n", + " \"brand\": \"Ergonom\",\n", + " \"type\": \"Enduro bikes\",\n", + " \"price\": 4972,\n", + "}\n", + "```\n", + "\n", + "Hashes are best suited for use cases with the following characteristics:\n", + "- Performance (speed) and storage space (memory consumption) are top concerns\n", + "- Data can be easily normalized and modeled as a single-level dict\n", + "\n", + "> Hashes are typically the default recommendation." 
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# define the hash index schema\n", + "hash_schema = {\n", + " \"index\": {\n", + " \"name\": \"user-hashes\",\n", + " \"storage_type\": \"hash\", # default setting\n", + " \"prefix\": \"hash\",\n", + " \"key_separator\": \":\",\n", + " },\n", + " \"fields\": {\n", + " \"tag\": [{\"name\": \"credit_score\"}, {\"name\": \"user\"}],\n", + " \"text\": [{\"name\": \"job\"}],\n", + " \"numeric\": [{\"name\": \"age\"}],\n", + " \"geo\": [{\"name\": \"office_location\"}],\n", + " \"vector\": [{\n", + " \"name\": \"user_embedding\",\n", + " \"dims\": 3,\n", + " \"distance_metric\": \"cosine\",\n", + " \"algorithm\": \"flat\",\n", + " \"datatype\": \"float32\"}\n", + " ]\n", + " },\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "import getpass\n", + "# construct a search index from the hash schema\n", + "hindex = SearchIndex.from_dict(hash_schema)\n", "\n", - "# setup the API Key\n", - "api_key = os.environ.get(\"OPENAI_API_KEY\") or getpass.getpass(\"Enter your OpenAI API key: \")" + "# connect to local redis instance\n", + "hindex.connect(\"redis://localhost:6379\")\n", + "\n", + "# create the index (no data yet)\n", + "hindex.create(overwrite=True)" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Vector dimensions: 1536\n" - ] - }, { "data": { "text/plain": [ - "[-0.001046799123287201,\n", - " -0.0031105349771678448,\n", - " 0.0024228920228779316,\n", - " -0.004480978474020958,\n", - " -0.010343699716031551,\n", - " 0.012758520431816578,\n", - " -0.00535263866186142,\n", - " -0.003002384677529335,\n", - " -0.007115328684449196,\n", - " -0.03378167003393173]" + "'hash'" ] }, - "execution_count": 3, + "execution_count": 5, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ - "from redisvl.vectorize.text import OpenAITextVectorizer\n", - "\n", - "# create a vectorizer\n", - "oai = OpenAITextVectorizer(\n", - " model=\"text-embedding-ada-002\",\n", - " api_config={\"api_key\": api_key},\n", - ")\n", - "\n", - "test = oai.embed(\"This is a test sentence.\")\n", - "print(\"Vector dimensions: \", len(test))\n", - "test[:10]" + "# show the underlying storage type\n", + "hindex.storage_type" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Vectors as byte strings\n", + "One nuance when working with Hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). An example can be seen below:" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[-0.017399806529283524,\n", - " -2.3427608653037169e-07,\n", - " 0.0014656063867732882,\n", - " -0.02562308870255947,\n", - " -0.019890939816832542,\n", - " 0.016027139499783516,\n", - " -0.0036763285752385855,\n", - " 0.0008253469131886959,\n", - " 0.006609130185097456,\n", - " -0.025165533646941185]" + "{'user': 'john',\n", + " 'age': 18,\n", + " 'job': 'engineer',\n", + " 'credit_score': 'high',\n", + " 'office_location': '-122.4194,37.7749',\n", + " 'user_embedding': b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'}" ] }, - "execution_count": 4, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# Create many embeddings at once\n", - "sentences = [\n", - " \"That is a happy dog\",\n", - " \"That is a happy person\",\n", - " \"Today is a sunny day\"\n", - "]\n", - "\n", - "embeddings = oai.embed_many(sentences)\n", - "embeddings[0][:10]" + "# show a single entry from the data that will be loaded\n", + "data[0]" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# 
load hash data\n", + "hindex.load(data)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Number of Embeddings: 3\n" + "\n", + "Statistics:\n", + "╭─────────────────────────────┬─────────────╮\n", + "│ Stat Key │ Value │\n", + "├─────────────────────────────┼─────────────┤\n", + "│ num_docs │ 7 │\n", + "│ num_terms │ 6 │\n", + "│ max_doc_id │ 7 │\n", + "│ num_records │ 44 │\n", + "│ percent_indexed │ 1 │\n", + "│ hash_indexing_failures │ 0 │\n", + "│ number_of_uses │ 2 │\n", + "│ bytes_per_record_avg │ 3.40909 │\n", + "│ doc_table_size_mb │ 0.000700951 │\n", + "│ inverted_sz_mb │ 0.000143051 │\n", + "│ key_table_size_mb │ 0.000276566 │\n", + "│ offset_bits_per_record_avg │ 8 │\n", + "│ offset_vectors_sz_mb │ 8.58307e-06 │\n", + "│ offsets_per_term_avg │ 0.204545 │\n", + "│ records_per_doc_avg │ 6.28571 │\n", + "│ sortable_values_size_mb │ 0 │\n", + "│ total_indexing_time │ 0.919 │\n", + "│ total_inverted_index_blocks │ 18 │\n", + "│ vector_index_sz_mb │ 0.0202332 │\n", + "╰─────────────────────────────┴─────────────╯\n" ] } ], "source": [ - "# openai also supports asyncronous requests, which we can use to speed up the vectorization process.\n", - "embeddings = await oai.aembed_many(sentences)\n", - "print(\"Number of Embeddings:\", len(embeddings))\n" + "!rvl stats -i user-hashes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Huggingface\n", - "\n", - "[Huggingface](https://huggingface.co/models) is a popular NLP platform that has a number of pre-trained models you can use off the shelf. RedisVL supports using Huggingface \"Sentence Transformers\" to create embeddings from text. 
To use Huggingface, you will need to install the ``sentence-transformers`` library.\n", - "\n", - "```bash\n", - "pip install sentence-transformers\n", - "```" + "#### Performing Queries\n", + "Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
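<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr>
<tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr>
<tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table>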
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ - "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", - "from redisvl.vectorize.text import HFTextVectorizer\n", + "from redisvl.query import VectorQuery\n", + "from redisvl.query.filter import Tag, Text\n", "\n", + "t = (Tag(\"credit_score\") == \"high\") & (Text(\"job\") % \"enginee*\")\n", "\n", - "# create a vectorizer\n", - "# choose your model from the huggingface website\n", - "hf = HFTextVectorizer(model=\"sentence-transformers/all-mpnet-base-v2\")\n", + "v = VectorQuery([0.1, 0.1, 0.5],\n", + " \"user_embedding\",\n", + " return_fields=[\"user\", \"credit_score\", \"age\", \"job\", \"office_location\"],\n", + " filter_expression=t)\n", "\n", - "# embed a sentence\n", - "test = hf.embed(\"This is a test sentence.\")\n", - "test[:10]" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "# You can also create many embeddings at once\n", - "embeddings = hf.embed_many(sentences, as_buffer=True)\n" + "\n", + "results = hindex.query(v)\n", + "result_print(results)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### VertexAI\n", - "\n", - "[VertexAI](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) is GCP's fully-featured AI platform including a number of pretrained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you will first need to install the ``google-cloud-aiplatform`` library.\n", - "\n", - "```bash\n", - "pip install google-cloud-aiplatform>=1.26\n", + "### Working with JSON\n", + "Redis also supports native **JSON** objects. 
These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub elements:\n", + "\n", + "```python\n", + "{\n", + " \"name\": \"bike\",\n", + " \"metadata\": {\n", + " \"model\": \"Deimos\",\n", + " \"brand\": \"Ergonom\",\n", + " \"type\": \"Enduro bikes\",\n", + " \"price\": 4972,\n", + " }\n", + "}\n", "```\n", "\n", - "1. Then you need to gain access to a [Google Cloud Project](https://cloud.google.com/gcp?hl=en) and provide [access to credentials](https://cloud.google.com/docs/authentication/application-default-credentials). This typically accomplished with the `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing to the path of a JSON key file downloaded from your service account on GCP.\n", - "2. Lastly, you need to find your [project ID](https://support.google.com/googleapi/answer/7014113?hl=en) and [geographic region for VertexAI](https://cloud.google.com/vertex-ai/docs/general/locations)." + "JSON is best suited for use cases with the following characteristics:\n", + "- Ease of use and data model flexibility are top concerns\n", + "- Application data is already native JSON\n", + "- Replacing another document storage/db solution" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "from redisvl.vectorize.text import VertexAITextVectorizer\n", - "\n", - "\n", - "# create a vectorizer\n", - "vtx = VertexAITextVectorizer(\n", - " api_config={\n", - " \"project_id\": os.environ[\"GCP_PROJECT_ID\"],\n", - " \"location\": os.environ[\"GCP_LOCATION\"]\n", - " }\n", - ")\n", - "\n", - "# embed a sentence\n", - "test = vtx.embed(\"This is a test sentence.\")\n", - "test[:10]" + "#### Full JSON Path support\n", + "Because RedisJSON enables full path support, when creating an index schema, elements need to be indexed and selected by their path with the `name` param and aliased using the `as_name` param as shown below." 
] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 10, "metadata": {}, + "outputs": [], "source": [ - "## Search with Provider Embeddings\n", - "\n", - "Now that we've created our embeddings, we can use them to search for similar sentences. We will use the same 3 sentences from above and search for similar sentences.\n", - "\n", - "First, we need to create the schema for our index.\n", - "\n", - "Here's what the schema for the example looks like in yaml for the HuggingFace vectorizer:\n", - "\n", - "```yaml\n", - "index:\n", - " name: providers\n", - " prefix: rvl\n", - "\n", - "fields:\n", - " text:\n", - " - name: sentence\n", - " vector:\n", - " - name: embedding\n", - " dims: 768\n", - " algorithm: flat\n", - " distance_metric: cosine\n", - "```" + "# define the json index schema\n", + "json_schema = {\n", + " \"index\": {\n", + " \"name\": \"user-json\",\n", + " \"storage_type\": \"json\", # updated storage_type option\n", + " \"prefix\": \"json\",\n", + " \"key_separator\": \":\",\n", + " },\n", + " \"fields\": {\n", + " \"tag\": [{\"name\": \"$.credit_score\", \"as_name\": \"credit_score\"}, {\"name\": \"$.user\", \"as_name\": \"user\"}],\n", + " \"text\": [{\"name\": \"$.job\", \"as_name\": \"job\"}],\n", + " \"numeric\": [{\"name\": \"$.age\", \"as_name\": \"age\"}],\n", + " \"geo\": [{\"name\": \"$.office_location\", \"as_name\": \"office_location\"}],\n", + " \"vector\": [{\n", + " \"name\": \"$.user_embedding\",\n", + " \"as_name\": \"user_embedding\",\n", + " \"dims\": 3,\n", + " \"distance_metric\": \"cosine\",\n", + " \"algorithm\": \"flat\",\n", + " \"datatype\": \"float32\"}\n", + " ]\n", + " },\n", + "}" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ - "from redisvl.index import SearchIndex\n", - "\n", - "# construct a search index from the schema\n", - "index = SearchIndex.from_yaml(\"./schema.yaml\")\n", + "# construct a search index from the json 
schema\n", + "jindex = SearchIndex.from_dict(json_schema)\n", "\n", "# connect to local redis instance\n", - "index.connect(\"redis://localhost:6379\")\n", + "jindex.connect(\"redis://localhost:6379\")\n", "\n", "# create the index (no data yet)\n", - "index.create(overwrite=True)" + "jindex.create(overwrite=True)" ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "\u001b[32m20:13:35\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", - "\u001b[32m20:13:35\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. providers\n" + "\u001b[32m12:28:36\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", + "\u001b[32m12:28:36\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. user-hashes\n", + "\u001b[32m12:28:36\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 2. user-json\n" ] } ], "source": [ - "# use the CLI to see the created index\n", + "# note the multiple indices in the same database\n", "!rvl index listall" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Vectors as float arrays\n", + "Vectorized data stored in JSON must be stored as a pure array (python list) of floats. 
We will modify our sample data to account for this below:" + ] + }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 13, "metadata": {}, "outputs": [], "source": [ - "# load expects an iterable of dictionaries where\n", - "# the vector is stored as a bytes buffer\n", + "import numpy as np\n", "\n", - "data = [{\"text\": t,\n", - " \"embedding\": v}\n", - " for t, v in zip(sentences, embeddings)]\n", + "json_data = [dict(d) for d in data]  # per-record copies, so data keeps its byte-string vectors\n", "\n", - "index.load(data)" + "for d in json_data:\n", + " d['user_embedding'] = np.frombuffer(d['user_embedding'], dtype=np.float32).tolist()" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 14, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "That is a happy dog\n", - "0.160862445831\n", - "That is a happy person\n", - "0.273598074913\n", - "Today is a sunny day\n", - "0.744559526443\n" - ] + "data": { + "text/plain": [ + "{'user': 'john',\n", + " 'age': 18,\n", + " 'job': 'engineer',\n", + " 'credit_score': 'high',\n", + " 'office_location': '-122.4194,37.7749',\n", + " 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "from redisvl.query import VectorQuery\n", - "\n", - "# use the HuggingFace vectorizer again to create a query embedding\n", - "query_embedding = hf.embed(\"That is a happy cat\")\n", - "\n", - "query = VectorQuery(\n", - " vector=query_embedding,\n", - " vector_field_name=\"embedding\",\n", - " return_fields=[\"text\"],\n", - " num_results=3\n", - ")\n", - "\n", - "results = index.search(query.query, query_params=query.params)\n", - "for doc in results.docs:\n", - " print(doc.text)\n", - " print(doc.vector_distance)" + "# inspect a single JSON record\n", + "json_data[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "jindex.load(json_data)" + ] + 
}, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
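<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr>
<tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr>
<tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table>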
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# we can now run the exact same query as above\n", + "result_print(jindex.query(v))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleanup" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "hindex.delete()\n", + "jindex.delete()" ] } ],