forked from run-llama/llama_index
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add JSON Path based Query Engine (run-llama#4595)
* add llama index json path query based index * update to use jsonpath-ng module for json path query parsing/execution. Update notebook so that it works * add docstrings + unit tests * formatting * remove GPTJSONIndex as it didnt really fit into the paradigm of an index * updates from PR comments * fix linting error * black reformatting * line length
- Loading branch information
1 parent
6986d9d
commit 6a3b1b9
Showing
7 changed files
with
643 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ slack_sdk | |
discord.py | ||
boto3 | ||
moto[s3,dynamodb] | ||
jsonpath-ng | ||
|
||
google-api-python-client | ||
|
358 changes: 358 additions & 0 deletions
358
docs/examples/index_structs/struct_indices/JSONIndexDemo.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,358 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "e45f9b60-cd6b-4c15-958f-1feca5438128", | ||
"metadata": {}, | ||
"source": [ | ||
"# JSON Index\n", | ||
"The JSON index is useful for querying JSON documents that conform to a JSON schema.\n", | ||
"\n", | ||
"This JSON schema is then used in the context of a prompt to convert a natural language query into a structured JSON Path query. This JSON Path query is then used to retrieve data to answer the given question." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "f7c5da2e", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Requirement already satisfied: jsonpath-ng in /workspaces/llama_index/.venv/lib/python3.10/site-packages (1.5.3)\n", | ||
"Requirement already satisfied: ply in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (3.11)\n", | ||
"Requirement already satisfied: decorator in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (5.1.1)\n", | ||
"Requirement already satisfied: six in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (1.16.0)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# First, install the jsonpath-ng package which is used by default to parse & execute the JSONPath queries.\n", | ||
"!pip install jsonpath-ng" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "119eb42b", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import logging\n", | ||
"import sys\n", | ||
"\n", | ||
"logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n", | ||
"logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "7aa21e46", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"False" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"import dotenv\n", | ||
"dotenv.load_dotenv(\"../../../.env\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "107396a9-4aa7-49b3-9f0f-a755726c19ba", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from IPython.display import Markdown, display" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "5ece7d73-0f67-4ff5-95e5-249a25bd118c", | ||
"metadata": {}, | ||
"source": [ | ||
"### Let's start on a Toy JSON\n", | ||
"\n", | ||
"Very simple JSON object containing data from a blog post site with user comments.\n", | ||
"\n", | ||
"We will also provide a JSON schema (which we were able to generate by giving ChatGPT a sample of the JSON).\n", | ||
"\n", | ||
"#### Advice\n", | ||
"Do make sure that you've provided a helpful `\"description\"` value for each of the fields in your JSON schema.\n", | ||
"\n", | ||
"As you can see in the given example, the description for the `\"username\"` field mentions that usernames are lowercased. You'll see that this ends up being helpful for the LLM in producing the correct JSON path query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "1484fe58-4853-4a76-bffc-435a9cce3e2e", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Test on some sample data \n", | ||
"json_value = {\n", | ||
" \"blogPosts\": [\n", | ||
" {\n", | ||
" \"id\": 1,\n", | ||
" \"title\": \"First blog post\",\n", | ||
" \"content\": \"This is my first blog post\"\n", | ||
" },\n", | ||
" {\n", | ||
" \"id\": 2,\n", | ||
" \"title\": \"Second blog post\",\n", | ||
" \"content\": \"This is my second blog post\"\n", | ||
" }\n", | ||
" ],\n", | ||
" \"comments\": [\n", | ||
" {\n", | ||
" \"id\": 1,\n", | ||
" \"content\": \"Nice post!\",\n", | ||
" \"username\": \"jerry\",\n", | ||
" \"blogPostId\": 1\n", | ||
" },\n", | ||
" {\n", | ||
" \"id\": 2,\n", | ||
" \"content\": \"Interesting thoughts\",\n", | ||
" \"username\": \"simon\",\n", | ||
" \"blogPostId\": 2\n", | ||
" },\n", | ||
" {\n", | ||
" \"id\": 3,\n", | ||
" \"content\": \"Loved reading this!\",\n", | ||
" \"username\": \"simon\",\n", | ||
" \"blogPostId\": 2\n", | ||
" }\n", | ||
" ]\n", | ||
"}\n", | ||
"\n", | ||
"# JSON Schema object that the above JSON value conforms to\n", | ||
"json_schema = {\n", | ||
" \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n", | ||
" \"description\": \"Schema for a very simple blog post app\",\n", | ||
" \"type\": \"object\",\n", | ||
" \"properties\": {\n", | ||
" \"blogPosts\": {\n", | ||
" \"description\": \"List of blog posts\",\n", | ||
" \"type\": \"array\",\n", | ||
" \"items\": {\n", | ||
" \"type\": \"object\",\n", | ||
" \"properties\": {\n", | ||
" \"id\": {\n", | ||
" \"description\": \"Unique identifier for the blog post\",\n", | ||
" \"type\": \"integer\"\n", | ||
" },\n", | ||
" \"title\": {\n", | ||
" \"description\": \"Title of the blog post\",\n", | ||
" \"type\": \"string\"\n", | ||
" },\n", | ||
" \"content\": {\n", | ||
" \"description\": \"Content of the blog post\",\n", | ||
" \"type\": \"string\"\n", | ||
" }\n", | ||
" },\n", | ||
" \"required\": [\"id\", \"title\", \"content\"]\n", | ||
" }\n", | ||
" },\n", | ||
" \"comments\": {\n", | ||
" \"description\": \"List of comments on blog posts\",\n", | ||
" \"type\": \"array\",\n", | ||
" \"items\": {\n", | ||
" \"type\": \"object\",\n", | ||
" \"properties\": {\n", | ||
" \"id\": {\n", | ||
" \"description\": \"Unique identifier for the comment\",\n", | ||
" \"type\": \"integer\"\n", | ||
" },\n", | ||
" \"content\": {\n", | ||
" \"description\": \"Content of the comment\",\n", | ||
" \"type\": \"string\"\n", | ||
" },\n", | ||
" \"username\": {\n", | ||
" \"description\": \"Username of the commenter (lowercased)\",\n", | ||
" \"type\": \"string\"\n", | ||
" },\n", | ||
" \"blogPostId\": {\n", | ||
" \"description\": \"Identifier for the blog post to which the comment belongs\",\n", | ||
" \"type\": \"integer\"\n", | ||
" }\n", | ||
" },\n", | ||
" \"required\": [\"id\", \"content\", \"username\", \"blogPostId\"]\n", | ||
" }\n", | ||
" }\n", | ||
" },\n", | ||
" \"required\": [\"blogPosts\", \"comments\"]\n", | ||
"}\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "4fea2edb-b3d4-4313-a656-d6edb00d93c0", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"INFO:numexpr.utils:NumExpr defaulting to 2 threads.\n", | ||
"NumExpr defaulting to 2 threads.\n" | ||
] | ||
}, | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"/workspaces/llama_index/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", | ||
" from .autonotebook import tqdm as notebook_tqdm\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from llama_index.indices.service_context import ServiceContext\n", | ||
"from langchain.llms.openai import OpenAI\n", | ||
"from llama_index.indices.struct_store import GPTJSONQueryEngine\n", | ||
"\n", | ||
"llm = OpenAI(model_name=\"text-davinci-003\")\n", | ||
"service_context = ServiceContext.from_defaults()\n", | ||
"nl_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context)\n", | ||
"raw_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context, synthesize_response=False)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "451836bc-b073-4838-8ab8-3def7d2c4d9d", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 797 tokens\n", | ||
"> [query] Total LLM token usage: 797 tokens\n", | ||
"INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n", | ||
"> [query] Total embedding token usage: 0 tokens\n", | ||
"INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 363 tokens\n", | ||
"> [query] Total LLM token usage: 363 tokens\n", | ||
"INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n", | ||
"> [query] Total embedding token usage: 0 tokens\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"nl_response = nl_query_engine.query(\n", | ||
" \"What comments has Jerry been writing?\",\n", | ||
")\n", | ||
"raw_response = raw_query_engine.query(\n", | ||
" \"What comments has Jerry been writing?\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "4253d4c3-f3e5-4779-bcd1-2e6e2818305f", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/markdown": [ | ||
"<h1>Natural language Response</h1><br><b> Jerry has written one comment with the content 'Nice post!' on blog post with id 1.</b>" | ||
], | ||
"text/plain": [ | ||
"<IPython.core.display.Markdown object>" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"text/markdown": [ | ||
"<h1>Raw JSON Response</h1><br><b>[{\"id\": 1, \"content\": \"Nice post!\", \"username\": \"jerry\", \"blogPostId\": 1}]</b>" | ||
], | ||
"text/plain": [ | ||
"<IPython.core.display.Markdown object>" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"display(Markdown(f\"<h1>Natural language Response</h1><br><b>{nl_response}</b>\"))\n", | ||
"display(Markdown(f\"<h1>Raw JSON Response</h1><br><b>{raw_response}</b>\"))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "5e10b7da-b355-49b2-9f80-f17541d4f850", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
" $.comments[?(@.username == 'jerry')]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# get the json path query string. Same would apply to raw_response\n", | ||
"print(nl_response.extra_info[\"json_path_response_str\"])" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.