Add JSON Path based Query Engine (run-llama#4595)

* add llama index json path query based index
* update to use jsonpath-ng module for json path query parsing/execution. Update notebook so that it works
* add docstrings + unit tests
* formatting
* remove GPTJSONIndex as it didn't really fit into the paradigm of an index
* updates from PR comments
* fix linting error
* black reformatting
* line length
sandman · Jun 4, 2023 · 6a3b1b9
1 parent 6986d9d
commit 6a3b1b9
Showing 7 changed files with 643 additions and 0 deletions.
diff --git a/data_requirements.txt b/data_requirements.txt
@@ -6,6 +6,7 @@ slack_sdk
 discord.py
 boto3
 moto[s3,dynamodb]
+jsonpath-ng
 
 # google
 google-api-python-client

diff --git a/docs/examples/index_structs/struct_indices/JSONIndexDemo.ipynb b/docs/examples/index_structs/struct_indices/JSONIndexDemo.ipynb
@@ -0,0 +1,358 @@
+{
+    "cells": [
+        {
+            "attachments": {},
+            "cell_type": "markdown",
+            "id": "e45f9b60-cd6b-4c15-958f-1feca5438128",
+            "metadata": {},
+            "source": [
+                "# JSON Query Engine\n",
+                "The JSON query engine is useful for querying JSON documents that conform to a JSON schema.\n",
+                "\n",
+                "The JSON schema is included in a prompt so that the LLM can convert a natural language query into a structured JSONPath query. That JSONPath query is then run against the JSON value to retrieve the data needed to answer the question."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 1,
+            "id": "f7c5da2e",
+            "metadata": {},
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "Requirement already satisfied: jsonpath-ng in /workspaces/llama_index/.venv/lib/python3.10/site-packages (1.5.3)\n",
+                        "Requirement already satisfied: ply in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (3.11)\n",
+                        "Requirement already satisfied: decorator in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (5.1.1)\n",
+                        "Requirement already satisfied: six in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (1.16.0)\n"
+                    ]
+                }
+            ],
+            "source": [
+                "# First, install the jsonpath-ng package which is used by default to parse & execute the JSONPath queries.\n",
+                "!pip install jsonpath-ng"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 2,
+            "id": "119eb42b",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [],
+            "source": [
+                "import logging\n",
+                "import sys\n",
+                "\n",
+                "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
+                "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 3,
+            "id": "7aa21e46",
+            "metadata": {},
+            "outputs": [
+                {
+                    "data": {
+                        "text/plain": [
+                            "False"
+                        ]
+                    },
+                    "execution_count": 3,
+                    "metadata": {},
+                    "output_type": "execute_result"
+                }
+            ],
+            "source": [
+                "import dotenv\n",
+                "dotenv.load_dotenv(\"../../../.env\")"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 4,
+            "id": "107396a9-4aa7-49b3-9f0f-a755726c19ba",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [],
+            "source": [
+                "from IPython.display import Markdown, display"
+            ]
+        },
+        {
+            "attachments": {},
+            "cell_type": "markdown",
+            "id": "5ece7d73-0f67-4ff5-95e5-249a25bd118c",
+            "metadata": {},
+            "source": [
+                "### Let's start with a toy JSON value\n",
+                "\n",
+                "A very simple JSON object containing data from a blog post site with user comments.\n",
+                "\n",
+                "We will also provide a JSON schema (which we were able to generate by giving ChatGPT a sample of the JSON).\n",
+                "\n",
+                "#### Advice\n",
+                "Do make sure that you've provided a helpful `\"description\"` value for each of the fields in your JSON schema.\n",
+                "\n",
+                "As you can see in the given example, the description for the `\"username\"` field mentions that usernames are lowercased. This helps the LLM produce the correct JSONPath query: the question below asks about \"Jerry\", while the generated filter matches `'jerry'`."
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 5,
+            "id": "1484fe58-4853-4a76-bffc-435a9cce3e2e",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [],
+            "source": [
+                "# Sample data to test on\n",
+                "json_value = {\n",
+                "  \"blogPosts\": [\n",
+                "    {\n",
+                "      \"id\": 1,\n",
+                "      \"title\": \"First blog post\",\n",
+                "      \"content\": \"This is my first blog post\"\n",
+                "    },\n",
+                "    {\n",
+                "      \"id\": 2,\n",
+                "      \"title\": \"Second blog post\",\n",
+                "      \"content\": \"This is my second blog post\"\n",
+                "    }\n",
+                "  ],\n",
+                "  \"comments\": [\n",
+                "    {\n",
+                "      \"id\": 1,\n",
+                "      \"content\": \"Nice post!\",\n",
+                "      \"username\": \"jerry\",\n",
+                "      \"blogPostId\": 1\n",
+                "    },\n",
+                "    {\n",
+                "      \"id\": 2,\n",
+                "      \"content\": \"Interesting thoughts\",\n",
+                "      \"username\": \"simon\",\n",
+                "      \"blogPostId\": 2\n",
+                "    },\n",
+                "    {\n",
+                "      \"id\": 3,\n",
+                "      \"content\": \"Loved reading this!\",\n",
+                "      \"username\": \"simon\",\n",
+                "      \"blogPostId\": 2\n",
+                "    }\n",
+                "  ]\n",
+                "}\n",
+                "\n",
+                "# JSON Schema object that the above JSON value conforms to\n",
+                "json_schema = {\n",
+                "  \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n",
+                "  \"description\": \"Schema for a very simple blog post app\",\n",
+                "  \"type\": \"object\",\n",
+                "  \"properties\": {\n",
+                "    \"blogPosts\": {\n",
+                "      \"description\": \"List of blog posts\",\n",
+                "      \"type\": \"array\",\n",
+                "      \"items\": {\n",
+                "        \"type\": \"object\",\n",
+                "        \"properties\": {\n",
+                "          \"id\": {\n",
+                "            \"description\": \"Unique identifier for the blog post\",\n",
+                "            \"type\": \"integer\"\n",
+                "          },\n",
+                "          \"title\": {\n",
+                "            \"description\": \"Title of the blog post\",\n",
+                "            \"type\": \"string\"\n",
+                "          },\n",
+                "          \"content\": {\n",
+                "            \"description\": \"Content of the blog post\",\n",
+                "            \"type\": \"string\"\n",
+                "          }\n",
+                "        },\n",
+                "        \"required\": [\"id\", \"title\", \"content\"]\n",
+                "      }\n",
+                "    },\n",
+                "    \"comments\": {\n",
+                "      \"description\": \"List of comments on blog posts\",\n",
+                "      \"type\": \"array\",\n",
+                "      \"items\": {\n",
+                "        \"type\": \"object\",\n",
+                "        \"properties\": {\n",
+                "          \"id\": {\n",
+                "            \"description\": \"Unique identifier for the comment\",\n",
+                "            \"type\": \"integer\"\n",
+                "          },\n",
+                "          \"content\": {\n",
+                "            \"description\": \"Content of the comment\",\n",
+                "            \"type\": \"string\"\n",
+                "          },\n",
+                "          \"username\": {\n",
+                "            \"description\": \"Username of the commenter (lowercased)\",\n",
+                "            \"type\": \"string\"\n",
+                "          },\n",
+                "          \"blogPostId\": {\n",
+                "            \"description\": \"Identifier for the blog post to which the comment belongs\",\n",
+                "            \"type\": \"integer\"\n",
+                "          }\n",
+                "        },\n",
+                "        \"required\": [\"id\", \"content\", \"username\", \"blogPostId\"]\n",
+                "      }\n",
+                "    }\n",
+                "  },\n",
+                "  \"required\": [\"blogPosts\", \"comments\"]\n",
+                "}\n"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 6,
+            "id": "4fea2edb-b3d4-4313-a656-d6edb00d93c0",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "INFO:numexpr.utils:NumExpr defaulting to 2 threads.\n",
+                        "NumExpr defaulting to 2 threads.\n"
+                    ]
+                },
+                {
+                    "name": "stderr",
+                    "output_type": "stream",
+                    "text": [
+                        "/workspaces/llama_index/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+                        "  from .autonotebook import tqdm as notebook_tqdm\n"
+                    ]
+                }
+            ],
+            "source": [
+                "from llama_index import LLMPredictor, ServiceContext\n",
+                "from langchain.llms.openai import OpenAI\n",
+                "from llama_index.indices.struct_store import GPTJSONQueryEngine\n",
+                "\n",
+                "llm = OpenAI(model_name=\"text-davinci-003\")\n",
+                "service_context = ServiceContext.from_defaults(llm_predictor=LLMPredictor(llm=llm))\n",
+                "nl_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context)\n",
+                "raw_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context, synthesize_response=False)"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 7,
+            "id": "451836bc-b073-4838-8ab8-3def7d2c4d9d",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 797 tokens\n",
+                        "> [query] Total LLM token usage: 797 tokens\n",
+                        "INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n",
+                        "> [query] Total embedding token usage: 0 tokens\n",
+                        "INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 363 tokens\n",
+                        "> [query] Total LLM token usage: 363 tokens\n",
+                        "INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n",
+                        "> [query] Total embedding token usage: 0 tokens\n"
+                    ]
+                }
+            ],
+            "source": [
+                "nl_response = nl_query_engine.query(\n",
+                "    \"What comments has Jerry been writing?\",\n",
+                ")\n",
+                "raw_response = raw_query_engine.query(\n",
+                "    \"What comments has Jerry been writing?\",\n",
+                ")"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 8,
+            "id": "4253d4c3-f3e5-4779-bcd1-2e6e2818305f",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [
+                {
+                    "data": {
+                        "text/markdown": [
+                            "<h1>Natural language Response</h1><br><b> Jerry has written one comment with the content 'Nice post!' on blog post with id 1.</b>"
+                        ],
+                        "text/plain": [
+                            "<IPython.core.display.Markdown object>"
+                        ]
+                    },
+                    "metadata": {},
+                    "output_type": "display_data"
+                },
+                {
+                    "data": {
+                        "text/markdown": [
+                            "<h1>Raw JSON Response</h1><br><b>[{\"id\": 1, \"content\": \"Nice post!\", \"username\": \"jerry\", \"blogPostId\": 1}]</b>"
+                        ],
+                        "text/plain": [
+                            "<IPython.core.display.Markdown object>"
+                        ]
+                    },
+                    "metadata": {},
+                    "output_type": "display_data"
+                }
+            ],
+            "source": [
+                "display(Markdown(f\"<h1>Natural language Response</h1><br><b>{nl_response}</b>\"))\n",
+                "display(Markdown(f\"<h1>Raw JSON Response</h1><br><b>{raw_response}</b>\"))"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 9,
+            "id": "5e10b7da-b355-49b2-9f80-f17541d4f850",
+            "metadata": {
+                "tags": []
+            },
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        " $.comments[?(@.username == 'jerry')]\n"
+                    ]
+                }
+            ],
+            "source": [
+                "# Get the JSONPath query string; the same works for raw_response.\n",
+                "print(nl_response.extra_info[\"json_path_response_str\"])"
+            ]
+        }
+    ],
+    "metadata": {
+        "kernelspec": {
+            "display_name": ".venv",
+            "language": "python",
+            "name": "python3"
+        },
+        "language_info": {
+            "codemirror_mode": {
+                "name": "ipython",
+                "version": 3
+            },
+            "file_extension": ".py",
+            "mimetype": "text/x-python",
+            "name": "python",
+            "nbconvert_exporter": "python",
+            "pygments_lexer": "ipython3",
+            "version": "3.10.4"
+        }
+    },
+    "nbformat": 4,
+    "nbformat_minor": 5
+}
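
Reviewer note on the notebook above: the generated query `$.comments[?(@.username == 'jerry')]` is a JSONPath filter over the `comments` array. As a rough illustration of what that filter selects, here is a minimal plain-Python sketch on a trimmed copy of the notebook's sample data. It does not use jsonpath-ng or the query engine from this commit, and `filter_comments_by_username` is a hypothetical helper introduced only for this note.

```python
import json

# Trimmed copy of the notebook's sample data (comments only).
json_value = {
    "comments": [
        {"id": 1, "content": "Nice post!", "username": "jerry", "blogPostId": 1},
        {"id": 2, "content": "Interesting thoughts", "username": "simon", "blogPostId": 2},
        {"id": 3, "content": "Loved reading this!", "username": "simon", "blogPostId": 2},
    ]
}


def filter_comments_by_username(data, username):
    """Hypothetical helper: plain-Python stand-in for the JSONPath filter
    $.comments[?(@.username == '<username>')]."""
    return [c for c in data["comments"] if c.get("username") == username]


# Matches the lowercased username, as the schema description says.
matches = filter_comments_by_username(json_value, "jerry")
print(json.dumps(matches))

# The comparison is case-sensitive, so the capitalized name matches nothing.
no_matches = filter_comments_by_username(json_value, "Jerry")
print(json.dumps(no_matches))
```

The second call suggests why the "(lowercased)" hint in the schema description matters: an exact string comparison against `'Jerry'` would return an empty list, and the engine would then have no data to answer from.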
diff --git a/llama_index/indices/struct_store/__init__.py b/llama_index/indices/struct_store/__init__.py
@@ -10,6 +10,7 @@
     GPTNLStructStoreQueryEngine,
     GPTSQLStructStoreQueryEngine,
 )
+from llama_index.indices.struct_store.json_query import GPTJSONQueryEngine
 
 __all__ = [
     "GPTSQLStructStoreIndex",
@@ -18,4 +19,5 @@
     "GPTNLPandasQueryEngine",
     "GPTNLStructStoreQueryEngine",
     "GPTSQLStructStoreQueryEngine",
+    "GPTJSONQueryEngine",
 ]