Add JSON Path based Query Engine (run-llama#4595)
* add llama index json path query based index

* update to use jsonpath-ng module for json path query parsing/execution. Update notebook so that it works

* add docstrings + unit tests

* formatting

* remove GPTJSONIndex as it didn't really fit into the paradigm of an index

* updates from PR comments

* fix linting error

* black reformatting

* line length
sourabhdesai authored Jun 4, 2023
1 parent 6986d9d commit 6a3b1b9
Showing 7 changed files with 643 additions and 0 deletions.
1 change: 1 addition & 0 deletions data_requirements.txt
@@ -6,6 +6,7 @@ slack_sdk
discord.py
boto3
moto[s3,dynamodb]
jsonpath-ng

# google
google-api-python-client
358 changes: 358 additions & 0 deletions docs/examples/index_structs/struct_indices/JSONIndexDemo.ipynb
@@ -0,0 +1,358 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "e45f9b60-cd6b-4c15-958f-1feca5438128",
"metadata": {},
"source": [
"# JSON Query Engine\n",
"The JSON query engine is useful for querying JSON documents that conform to a JSON schema.\n",
"\n",
"The JSON schema is included in a prompt so that the LLM can convert a natural language query into a structured JSON Path query. That JSON Path query is then executed against the JSON value to retrieve the data needed to answer the question."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f7c5da2e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: jsonpath-ng in /workspaces/llama_index/.venv/lib/python3.10/site-packages (1.5.3)\n",
"Requirement already satisfied: ply in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (3.11)\n",
"Requirement already satisfied: decorator in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (5.1.1)\n",
"Requirement already satisfied: six in /workspaces/llama_index/.venv/lib/python3.10/site-packages (from jsonpath-ng) (1.16.0)\n"
]
}
],
"source": [
"# First, install the jsonpath-ng package which is used by default to parse & execute the JSONPath queries.\n",
"!pip install jsonpath-ng"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "119eb42b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import logging\n",
"import sys\n",
"\n",
"logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
"logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7aa21e46",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import dotenv\n",
"dotenv.load_dotenv(\"../../../.env\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "107396a9-4aa7-49b3-9f0f-a755726c19ba",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from IPython.display import Markdown, display"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5ece7d73-0f67-4ff5-95e5-249a25bd118c",
"metadata": {},
"source": [
"### Let's start with a toy JSON\n",
"\n",
"This is a very simple JSON object containing data from a blog site with user comments.\n",
"\n",
"We will also provide a JSON schema (which we generated by giving ChatGPT a sample of the JSON).\n",
"\n",
"#### Advice\n",
"Make sure you provide a helpful `\"description\"` value for each field in your JSON schema.\n",
"\n",
"In this example, the description for the `\"username\"` field mentions that usernames are lowercased. This helps the LLM produce the correct JSON Path query: the question asks about \"Jerry\", but the generated filter matches `'jerry'`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1484fe58-4853-4a76-bffc-435a9cce3e2e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Test on some sample data\n",
"json_value = {\n",
"    \"blogPosts\": [\n",
"        {\n",
"            \"id\": 1,\n",
"            \"title\": \"First blog post\",\n",
"            \"content\": \"This is my first blog post\"\n",
"        },\n",
"        {\n",
"            \"id\": 2,\n",
"            \"title\": \"Second blog post\",\n",
"            \"content\": \"This is my second blog post\"\n",
"        }\n",
"    ],\n",
"    \"comments\": [\n",
"        {\n",
"            \"id\": 1,\n",
"            \"content\": \"Nice post!\",\n",
"            \"username\": \"jerry\",\n",
"            \"blogPostId\": 1\n",
"        },\n",
"        {\n",
"            \"id\": 2,\n",
"            \"content\": \"Interesting thoughts\",\n",
"            \"username\": \"simon\",\n",
"            \"blogPostId\": 2\n",
"        },\n",
"        {\n",
"            \"id\": 3,\n",
"            \"content\": \"Loved reading this!\",\n",
"            \"username\": \"simon\",\n",
"            \"blogPostId\": 2\n",
"        }\n",
"    ]\n",
"}\n",
"\n",
"# JSON Schema object that the above JSON value conforms to\n",
"json_schema = {\n",
"    \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n",
"    \"description\": \"Schema for a very simple blog post app\",\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        \"blogPosts\": {\n",
"            \"description\": \"List of blog posts\",\n",
"            \"type\": \"array\",\n",
"            \"items\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"id\": {\n",
"                        \"description\": \"Unique identifier for the blog post\",\n",
"                        \"type\": \"integer\"\n",
"                    },\n",
"                    \"title\": {\n",
"                        \"description\": \"Title of the blog post\",\n",
"                        \"type\": \"string\"\n",
"                    },\n",
"                    \"content\": {\n",
"                        \"description\": \"Content of the blog post\",\n",
"                        \"type\": \"string\"\n",
"                    }\n",
"                },\n",
"                \"required\": [\"id\", \"title\", \"content\"]\n",
"            }\n",
"        },\n",
"        \"comments\": {\n",
"            \"description\": \"List of comments on blog posts\",\n",
"            \"type\": \"array\",\n",
"            \"items\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"id\": {\n",
"                        \"description\": \"Unique identifier for the comment\",\n",
"                        \"type\": \"integer\"\n",
"                    },\n",
"                    \"content\": {\n",
"                        \"description\": \"Content of the comment\",\n",
"                        \"type\": \"string\"\n",
"                    },\n",
"                    \"username\": {\n",
"                        \"description\": \"Username of the commenter (lowercased)\",\n",
"                        \"type\": \"string\"\n",
"                    },\n",
"                    \"blogPostId\": {\n",
"                        \"description\": \"Identifier for the blog post to which the comment belongs\",\n",
"                        \"type\": \"integer\"\n",
"                    }\n",
"                },\n",
"                \"required\": [\"id\", \"content\", \"username\", \"blogPostId\"]\n",
"            }\n",
"        }\n",
"    },\n",
"    \"required\": [\"blogPosts\", \"comments\"]\n",
"}\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4fea2edb-b3d4-4313-a656-d6edb00d93c0",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:numexpr.utils:NumExpr defaulting to 2 threads.\n",
"NumExpr defaulting to 2 threads.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/llama_index/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"from llama_index import LLMPredictor, ServiceContext\n",
"from langchain.llms.openai import OpenAI\n",
"from llama_index.indices.struct_store import GPTJSONQueryEngine\n",
"\n",
"llm = OpenAI(model_name=\"text-davinci-003\")\n",
"service_context = ServiceContext.from_defaults(llm_predictor=LLMPredictor(llm=llm))\n",
"nl_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context)\n",
"raw_query_engine = GPTJSONQueryEngine(json_value=json_value, json_schema=json_schema, service_context=service_context, synthesize_response=False)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "451836bc-b073-4838-8ab8-3def7d2c4d9d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 797 tokens\n",
"> [query] Total LLM token usage: 797 tokens\n",
"INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n",
"> [query] Total embedding token usage: 0 tokens\n",
"INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 363 tokens\n",
"> [query] Total LLM token usage: 363 tokens\n",
"INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens\n",
"> [query] Total embedding token usage: 0 tokens\n"
]
}
],
"source": [
"nl_response = nl_query_engine.query(\n",
"    \"What comments has Jerry been writing?\",\n",
")\n",
"raw_response = raw_query_engine.query(\n",
"    \"What comments has Jerry been writing?\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4253d4c3-f3e5-4779-bcd1-2e6e2818305f",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/markdown": [
"<h1>Natural language Response</h1><br><b> Jerry has written one comment with the content 'Nice post!' on blog post with id 1.</b>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"<h1>Raw JSON Response</h1><br><b>[{\"id\": 1, \"content\": \"Nice post!\", \"username\": \"jerry\", \"blogPostId\": 1}]</b>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(Markdown(f\"<h1>Natural language Response</h1><br><b>{nl_response}</b>\"))\n",
"display(Markdown(f\"<h1>Raw JSON Response</h1><br><b>{raw_response}</b>\"))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5e10b7da-b355-49b2-9f80-f17541d4f850",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" $.comments[?(@.username == 'jerry')]\n"
]
}
],
"source": [
"# Get the JSON Path query string; the same key is available on raw_response.extra_info\n",
"print(nl_response.extra_info[\"json_path_response_str\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
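Note on the notebook above: the generated JSON Path query `$.comments[?(@.username == 'jerry')]` filters the `comments` array by username. The following is a minimal plain-Python sketch of what that query selects from the sample data. It is not how `GPTJSONQueryEngine` is implemented; the helper name `select_comments_by_username` is hypothetical, the data is a trimmed copy of the notebook's `json_value`, and the optional cross-check assumes that `jsonpath_ng.ext.parse` accepts this filter syntax, as the notebook output suggests.

```python
import json

# Trimmed copy of the notebook's sample data (only the "comments" list).
json_value = {
    "comments": [
        {"id": 1, "content": "Nice post!", "username": "jerry", "blogPostId": 1},
        {"id": 2, "content": "Interesting thoughts", "username": "simon", "blogPostId": 2},
        {"id": 3, "content": "Loved reading this!", "username": "simon", "blogPostId": 2},
    ]
}


def select_comments_by_username(data, username):
    """Hypothetical helper: plain-Python equivalent of
    $.comments[?(@.username == '<username>')]."""
    return [c for c in data.get("comments", []) if c.get("username") == username]


# Usernames are stored lowercased (see the schema description), so the
# question's "Jerry" has to be lowercased before comparing.
jerry_comments = select_comments_by_username(json_value, "Jerry".lower())
print(json.dumps(jerry_comments))

# Optional cross-check with jsonpath-ng; skipped if the package is not installed.
try:
    from jsonpath_ng.ext import parse  # assumption: the ".ext" parser handles filters
except ImportError:
    parse = None

if parse is not None:
    matches = parse("$.comments[?(@.username == 'jerry')]").find(json_value)
    print([m.value for m in matches] == jerry_comments)
```

The lowercasing step is the part the schema description makes possible: without the "(lowercased)" hint, a filter on `'Jerry'` would match nothing in this data.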
2 changes: 2 additions & 0 deletions llama_index/indices/struct_store/__init__.py
@@ -10,6 +10,7 @@
GPTNLStructStoreQueryEngine,
GPTSQLStructStoreQueryEngine,
)
from llama_index.indices.struct_store.json_query import GPTJSONQueryEngine

__all__ = [
"GPTSQLStructStoreIndex",
@@ -18,4 +19,5 @@
"GPTNLPandasQueryEngine",
"GPTNLStructStoreQueryEngine",
"GPTSQLStructStoreQueryEngine",
"GPTJSONQueryEngine",
]