Add JSON Path based Query Engine #4595

sourabhdesai · 2023-06-01T07:48:37Z

Putting up a PR for the implementation so far of a JSON Path based query index for arbitrary JSON objects. Let me know what you think of this approach and implementation of it!

The user needs to supply both the JSON value and the JSON schema.

Future improvements:

- Consider workarounds for when the entire JSON schema doesn't fit in the context window.
- Try to figure out a way to include embedding-based vector search as part of this.
  - One idea that was discussed is replacing the parts of the JSON Path query that filter for objects within a list with a semantic search over those objects instead.
- Consider giving the user the option to have an LLM generate a JSON schema from a given sample JSON.
- A way to store the JSON besides just having it in memory.
  - Currently, this is the same as GPTPandasIndex, which just stores the given input dataframe as a private instance variable.
- This would be the third use case so far that involves taking a dataset and a schema for that dataset, then operating the index by converting the user query into a structured query based on the supplied schema. We may want to do this again at some point for something like GraphQL APIs, for example. Perhaps we can consider making a generic abstraction for these kinds of use cases? My implementation here is already very similar to that of GPTPandasIndex, but since I hadn't delved into this codebase too much before, I was hesitant to architecture astronaut this too much.

jerryjliu

I will take a deeper look soon.

jerryjliu · 2023-06-01T16:36:53Z

llama_index/indices/struct_store/json.py

+
+
+class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]):
+    index_struct_cls = JSONStructDatapoint


index_struct_cls is supposed to be an IndexStruct class (JSONStructDatapoint is not an index struct)

Ah ok that may explain the error I'm seeing in the Jupyter notebook too then!

jerryjliu · 2023-06-01T16:39:13Z

llama_index/indices/struct_store/json.py

+
+    def _build_index_from_nodes(self, nodes: Sequence[Node]) -> JSONStructDatapoint:
+        """Build index from documents."""
+        index_struct = self.index_struct_cls(fields=self.json_value)


high-level i'm not clear what the "index" stores in this case

Yeah I was not sure how to store any of the JSON data in this case. I was going off of GPTPandasIndex, which just assigns the given dataframe to an internal instance variable on the class, and then executes pandas code against the stored instance variable. I'm following a similar approach here where I just store the JSON value in an instance variable and run JSON path queries against it.

Disiok · 2023-06-02T18:23:56Z

llama_index/prompts/default_prompts.py

+    "Given a task, respond with a JSON Path query that "
+    "can retrieve data from a JSON value that matches the schema.\n"
+    "Task: {query_str}\n"
+    "JSONPath: "


I'm surprised that this works without providing examples of what a JSONPath query looks like.
Saw in the notebook that you tried with GPT-4. Does it work with weaker models?

I think it might be worthwhile to do a bit of prompt engineering here, and ideally have this work for at least text-davinci and turbo as well.

So I tried with text-davinci-003 & gpt-3.5-turbo and they seem to do just as well on the example JSON in the notebook. I think with this approach, it'll be very important for users to fill out the correct JSON schema with description fields. I'll include a note about that in the example Jupyter notebook
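
To illustrate the point about description fields, here is a minimal sketch of a JSON value, a matching schema whose `description` entries carry the hints the LLM relies on, and a prompt built from them. The sample data, the schema, the `build_prompt` helper, and the first two lines of the template (including the `{schema}` variable) are assumptions made for illustration; only the last three template lines appear in the diff above.

```python
import json

# Hypothetical sample data; not taken from the PR's notebook.
json_value = {
    "blogPosts": [
        {"id": 1, "title": "First post", "content": "Hello"},
        {"id": 2, "title": "Second post", "content": "World"},
    ]
}

# Hypothetical schema. The "description" fields are what tell the model
# which part of the document a natural language task refers to.
json_schema = {
    "type": "object",
    "properties": {
        "blogPosts": {
            "description": "List of blog posts",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {
                        "description": "Unique identifier of the post",
                        "type": "integer",
                    },
                    "title": {
                        "description": "Title of the post",
                        "type": "string",
                    },
                    "content": {
                        "description": "Body text of the post",
                        "type": "string",
                    },
                },
            },
        }
    },
}

# Assumed template: the first two lines and the {schema} variable are
# guesses; the remaining lines are the ones shown in the diff.
PROMPT_TMPL = (
    "We have provided a JSON schema below:\n"
    "{schema}\n"
    "Given a task, respond with a JSON Path query that "
    "can retrieve data from a JSON value that matches the schema.\n"
    "Task: {query_str}\n"
    "JSONPath: "
)


def build_prompt(schema: dict, query_str: str) -> str:
    """Serialize the schema and fill in the prompt template."""
    return PROMPT_TMPL.format(
        schema=json.dumps(schema, indent=2), query_str=query_str
    )


prompt = build_prompt(json_schema, "What is the title of the second post?")
```

Without the descriptions, the model only sees field names and types, so ambiguous names would give it little to go on.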

Disiok · 2023-06-02T18:25:30Z

llama_index/prompts/prompts.py

+
+Required template variables: `query_str`.
+"""
+JSONPathPrompt = Prompt


Probably not clear from the codebase, but we started deprecating these type-specific prompt classes.

Disiok · 2023-06-02T18:26:05Z

llama_index/data_structs/table.py

+
+
+@dataclass
+class JSONStructDatapoint(BaseStructTable):


Why is this called a datapoint?

Disiok · 2023-06-02T18:28:02Z

llama_index/indices/struct_store/json.py

+JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None]
+
+
+class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]):


I think a meta question here (for everyone), should we even call these "index"?
It barely fits with the Index interface as it stands.

It does not store state in the index struct, does not support build/insert/update/delete, and does not expose a retriever.

One possibility is that we only add this as query engine.

cc @jerryjliu

Calling it something else would help with some common misconceptions. Lots of people try to call persist on the SQL or Pandas indexes, which doesn't actually work

Disiok · 2023-06-02T18:31:10Z

llama_index/indices/struct_store/json_query.py

+
+
+def default_output_processor(llm_output, json_value: JSONType) -> JSONType:
+    from jsonpath_ng.ext import parse


nit: Add a nicer import error that tells users what to pip install.
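
One way to address this nit, sketched with the standard library only. The `import_with_hint` helper is hypothetical and not part of the PR; `jsonpath-ng` is the PyPI package that provides the `jsonpath_ng` module imported in the diff.

```python
import importlib
from types import ModuleType


def import_with_hint(module_name: str, pip_name: str) -> ModuleType:
    """Import a module, or raise an ImportError that names the pip package."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"`{module_name}` is required for this feature. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


# Possible use inside default_output_processor (needs jsonpath-ng installed):
# parse = import_with_hint("jsonpath_ng.ext", "jsonpath-ng").parse
```

Keeping the import inside the function, as the diff already does, means the dependency stays optional for users who never touch the JSON query engine.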

Disiok · 2023-06-02T18:32:44Z

llama_index/indices/struct_store/json_query.py

+            "json_path_response_str": json_path_response_str,
+        }
+
+        return Response(


I think we should add an option to synthesize a natural language response using the JSONPath output.
This would be similar to the SQL query engine.
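
A rough sketch of what such an option could look like. The template text, the `synthesize_response` function, and its parameters are all assumptions for illustration and do not reproduce the prompt that the PR eventually added; `predict` stands in for whatever callable sends a prompt to the LLM.

```python
import json
from typing import Any, Callable

# Hypothetical template, written for this example only.
EXAMPLE_SYNTHESIS_TMPL = (
    "Given a query, a JSON schema, and the value found at a JSON Path, "
    "write a natural language answer.\n"
    "Query: {query_str}\n"
    "JSON Schema: {json_schema}\n"
    "JSON Path: {json_path}\n"
    "Value at path: {json_path_value}\n"
    "Response: "
)


def synthesize_response(
    predict: Callable[[str], str],
    query_str: str,
    json_schema: Any,
    json_path: str,
    json_path_output: Any,
    synthesize: bool = True,
) -> str:
    """Return the raw JSON output, or an LLM-written answer based on it."""
    raw = json.dumps(json_path_output)
    if not synthesize:
        # Behaviour without synthesis: return the JSON Path output as-is.
        return raw
    prompt = EXAMPLE_SYNTHESIS_TMPL.format(
        query_str=query_str,
        json_schema=json.dumps(json_schema),
        json_path=json_path,
        json_path_value=raw,
    )
    return predict(prompt)
```

Making synthesis a flag keeps the raw JSON output available for callers that want to post-process it themselves.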

Disiok · 2023-06-02T18:33:12Z

llama_index/indices/struct_store/json_query.py

+            response=json.dumps(json_path_output), extra_info=response_extra_info
+        )
+
+    async def _aquery(self, query_bundle: QueryBundle) -> Response:


Let's actually implement this. It should just be calling apredict instead of predict.
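
The suggestion can be sketched as follows. `StubPredictor` and `SketchQueryEngine` are stand-ins written for this example, not the classes in the PR; the only point shown is that the async method mirrors the sync one and awaits `apredict` where the sync path calls `predict`.

```python
import asyncio
import json
from typing import Any, Callable


class StubPredictor:
    """Stand-in for an LLM predictor; always returns a fixed JSON Path."""

    def predict(self, prompt: str) -> str:
        return "$.blogPosts[1].title"

    async def apredict(self, prompt: str) -> str:
        # A real predictor would await a network call here.
        await asyncio.sleep(0)
        return self.predict(prompt)


class SketchQueryEngine:
    """Hypothetical engine with matching sync and async query paths."""

    def __init__(
        self, predictor: StubPredictor, output_processor: Callable[[str], Any]
    ) -> None:
        self._predictor = predictor
        self._output_processor = output_processor

    def _query(self, query_str: str) -> str:
        json_path = self._predictor.predict(query_str)
        return json.dumps(self._output_processor(json_path))

    async def _aquery(self, query_str: str) -> str:
        # Same flow as _query, except the LLM call is awaited.
        json_path = await self._predictor.apredict(query_str)
        return json.dumps(self._output_processor(json_path))


engine = SketchQueryEngine(StubPredictor(), lambda path: {"path": path})
sync_result = engine._query("What is the title of the second post?")
async_result = asyncio.run(engine._aquery("What is the title of the second post?"))
```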

Disiok

Looks great as a starting point!

My main question (not just to @sourabhdesai) is whether having an Index makes sense here.

sourabhdesai · 2023-06-03T03:24:25Z

Should be ready for review! After discussing with @Disiok we decided to move forward with an approach where we just expose a query engine and an example notebook that shows how to use the query engine directly. The GPTJSONIndex from previous commits has been removed.

Disiok · 2023-06-04T21:08:01Z

docs/examples/index_structs/struct_indices/JSONIndexDemo.ipynb

+                "from langchain.llms.openai import OpenAI\n",
+                "from llama_index.indices.struct_store import GPTJSONQueryEngine\n",
+                "\n",
+                "llm = OpenAI(model_name=\"text-davinci-003\")\n",


nit: don't think we actually passed this in anywhere.

Disiok · 2023-06-04T21:08:48Z

llama_index/indices/struct_store/json_query.py

+JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None]
+
+
+DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL = (


very nit: we don't have a consistent policy for where to put prompts right now. Some are in the prompts files, some are inline in the query engine definition.

Disiok · 2023-06-04T21:09:11Z

llama_index/indices/struct_store/__init__.py

@@ -18,4 +19,5 @@
    "GPTNLPandasQueryEngine",
    "GPTNLStructStoreQueryEngine",
    "GPTSQLStructStoreQueryEngine",
+    "GPTJSONQueryEngine",


nit: should add this to docs.

Disiok

Looks great, gonna merge and address my own comments in separate PR.

add llama index json path query based index

ba07a9d

jerryjliu reviewed Jun 1, 2023

sourabhdesai added 4 commits June 1, 2023 21:06

update to use jsonpath-ng module for json path query parsing/execution. Update notebook so that it works

cffe064

Merge branch 'main' into json-path

c3d281d

add docstrings + unit tests

5976319

formatting

87cfb2f

Disiok reviewed Jun 2, 2023

sourabhdesai added 5 commits June 2, 2023 21:43

remove GPTJSONIndex as it didnt really fit into the paradigm of an index

f7824ea

updates from PR comments

66ecfe6

fix linting error

0fb2ab2

black reformatting

e3d0e89

line length

9d3b549

sourabhdesai changed the title ~~DRAFT: add json path query based index~~ Add JSON Path based Query Engine Jun 3, 2023

sourabhdesai marked this pull request as ready for review June 3, 2023 03:22

Disiok reviewed Jun 4, 2023

Disiok approved these changes Jun 4, 2023

Disiok merged commit 6a3b1b9 into run-llama:main Jun 4, 2023

sourabhdesai deleted the json-path branch June 4, 2023 23:05
