Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON Path based Query Engine #4595

Merged
merged 10 commits into from
Jun 4, 2023
Merged

Conversation

sourabhdesai
Copy link
Contributor

@sourabhdesai sourabhdesai commented Jun 1, 2023

Putting up a PR for the implementation so far of a JSON Path based query index for arbitrary JSON objects. Let me know what you think of this approach and implementation of it!

The user needs to supply both the JSON value and the JSON schema.

Future improvements:

  • Consider workarounds for when the entire JSON schema doesn't fit in the context window
  • Try to figure out a way to include embedding based vector search as part of this
    • One idea that was discussed is replacing the parts of the JSON Path query where it is filtering for objects within a list with a semantic search over those objects instead.
  • Consider giving the user the option/capability to have an LLM generate a JSON schema from a given sample JSON
  • A way to store the JSON besides just having it in memory
    • Currently, this is the same as GPTPandasIndex which just stores the given input dataframe as a private instance variable.
  • This would be the third use case so far that involves taking a dataset, a schema for that dataset, and then operating the index by converting the user query into a structured query based on the supplied schema. We may want to do this again at some point for something like GraphQL APIs for example. Perhaps we can consider making a generic abstraction for these kinds of use cases? My implementation here is already very similar to that of GPTPandasIndex but since I hadn't delved into this codebase too much before I was hesitant to architecture astronaut this too much.

Copy link
Collaborator

@jerryjliu jerryjliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will take a deeper look soon -



class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]):
index_struct_cls = JSONStructDatapoint
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_struct_cls is supposed to be a class of an IndexStruct (JSONStructDatapoint is not an index struct)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok that may explain the error I'm seeing in the Jupyter notebook too then!


def _build_index_from_nodes(self, nodes: Sequence[Node]) -> JSONStructDatapoint:
"""Build index from documents."""
index_struct = self.index_struct_cls(fields=self.json_value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high-level i'm not clear what the "index" stores in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was not sure how to store any of the JSON data in this case. I was going off of GPTPandasIndex, which just assigns the given dataframe to an internal instance variable on the class, and then executes pandas code against the stored instance variable. I'm following a similar approach here where I just store the JSON value in an instance variable and run JSON path queries against it.

"Given a task, respond with a JSON Path query that "
"can retrieve data from a JSON value that matches the schema.\n"
"Task: {query_str}\n"
"JSONPath: "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that this works without providing examples of what JSONPath query looks like.
Saw the notebook that you tried with GPT4. Does it work with weaker models?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be worthwhile to do a bit of prompt engineering here, and ideally have this work for at least text-davinci and turbo as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tried with text-davinci-003 & gpt-3.5-turbo and they seem to do just as well on the example JSON in the notebook. I think with this approach, it'll be very important for users to fill out the correct JSON schema with description fields. I'll include a note about that in the example Jupyter notebook


Required template variables: `query_str`.
"""
JSONPathPrompt = Prompt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not clear from the codebase, but we started deprecating these type specific prompt classes.



@dataclass
class JSONStructDatapoint(BaseStructTable):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called a datapoint?

JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None]


class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a meta question here (for everyone), should we even call these "index"?
It barely fits with the Index interface as it stands.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not store state in the index struct, does not support build/insert/update/delete, and does not expose a retriever.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility is that we only add this as query engine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling it something else would help with some common misconceptions. Lots of people try to call persist on the SQL or Pandas indexes, which doesn't actually work



def default_output_processor(llm_output, json_value: JSONType) -> JSONType:
from jsonpath_ng.ext import parse
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add nicer import error that tells users what to pip install.

"json_path_response_str": json_path_response_str,
}

return Response(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add option to synthesize a natural language response using the JSONPath output.
This would be similar to the SQL query engine.

response=json.dumps(json_path_output), extra_info=response_extra_info
)

async def _aquery(self, query_bundle: QueryBundle) -> Response:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's actually implement this. It should just be calling apredict instead of predict.

Copy link
Collaborator

@Disiok Disiok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great as a starting point!

My main question (not just to @sourabhdesai) is whether if having an Index makes sense here.

@sourabhdesai sourabhdesai changed the title DRAFT: add json path query based index Add JSON Path based Query Engine Jun 3, 2023
@sourabhdesai sourabhdesai marked this pull request as ready for review June 3, 2023 03:22
@sourabhdesai
Copy link
Contributor Author

Should be ready for review! After discussing with @Disiok we decided to move forward with an approach where we just expose a query engine and an example notebook that shows how to use the query engine directly. The GPTJSONIndex from previous commits has been removed.

"from langchain.llms.openai import OpenAI\n",
"from llama_index.indices.struct_store import GPTJSONQueryEngine\n",
"\n",
"llm = OpenAI(model_name=\"text-davinci-003\")\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't think we actually passed this in anywhere.

JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None]


DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nit: we don't have a consistent policy for where to put prompts right now. Some are in the prompts files, some are inline in the query engine definition.

@@ -18,4 +19,5 @@
"GPTNLPandasQueryEngine",
"GPTNLStructStoreQueryEngine",
"GPTSQLStructStoreQueryEngine",
"GPTJSONQueryEngine",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should add this to docs.

Copy link
Collaborator

@Disiok Disiok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, gonna merge and address my own comments in separate PR.

@Disiok Disiok merged commit 6a3b1b9 into run-llama:main Jun 4, 2023
@sourabhdesai sourabhdesai deleted the json-path branch June 4, 2023 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants