-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JSON Path based Query Engine #4595
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will take a deeper look soon -
|
||
|
||
class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]): | ||
index_struct_cls = JSONStructDatapoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index_struct_cls is supposed to be a class of an IndexStruct
(JSONStructDatapoint is not an index struct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok that may explain the error I'm seeing in the Jupyter notebook too then!
|
||
def _build_index_from_nodes(self, nodes: Sequence[Node]) -> JSONStructDatapoint: | ||
"""Build index from documents.""" | ||
index_struct = self.index_struct_cls(fields=self.json_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
high-level i'm not clear what the "index" stores in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was not sure how to store any of the JSON data in this case. I was going off of GPTPandasIndex
, which just assigns the given dataframe to an internal instance variable on the class, and then executes pandas code against the stored instance variable. I'm following a similar approach here where I just store the JSON value in an instance variable and run JSON path queries against it.
…n. Update notebook so that it works
"Given a task, respond with a JSON Path query that " | ||
"can retrieve data from a JSON value that matches the schema.\n" | ||
"Task: {query_str}\n" | ||
"JSONPath: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that this works without providing examples of what JSONPath query looks like.
Saw the notebook that you tried with GPT4. Does it work with weaker models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think might be worthwhile to do a bit of prompt engineering here, and ideally have this work for at least text-davinci and turbo as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried with text-davinci-003
& gpt-3.5-turbo
and they seem to do just as well on the example JSON in the notebook. I think with this approach, it'll be very important for users to fill out the correct JSON schema with description fields. I'll include a note about that in the example Jupyter notebook
llama_index/prompts/prompts.py
Outdated
|
||
Required template variables: `query_str`. | ||
""" | ||
JSONPathPrompt = Prompt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not clear from the codebase, but we started deprecating these type specific prompt classes.
llama_index/data_structs/table.py
Outdated
|
||
|
||
@dataclass | ||
class JSONStructDatapoint(BaseStructTable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this called a datapoint?
JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None] | ||
|
||
|
||
class GPTJSONIndex(BaseGPTStructStoreIndex[JSONStructDatapoint]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a meta question here (for everyone), should we even call these "index"?
It barely fits with the Index
interface as it stands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not store state in the index struct, does not support build/insert/update/delete, and does not expose a retriever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possibility is that we only add this as query engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jerryjliu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling it something else would help with some common misconceptions. Lots of people try to call persist
on the SQL or Pandas indexes, which doesn't actually work
|
||
|
||
def default_output_processor(llm_output, json_value: JSONType) -> JSONType: | ||
from jsonpath_ng.ext import parse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add nicer import error that tells users what to pip install.
"json_path_response_str": json_path_response_str, | ||
} | ||
|
||
return Response( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add option to synthesize a natural language response using the JSONPath output.
This would be similar to the SQL query engine.
response=json.dumps(json_path_output), extra_info=response_extra_info | ||
) | ||
|
||
async def _aquery(self, query_bundle: QueryBundle) -> Response: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's actually implement this. It should just be calling apredict instead of predict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great as a starting point!
My main question (not just to @sourabhdesai) is whether if having an Index
makes sense here.
Should be ready for review! After discussing with @Disiok we decided to move forward with an approach where we just expose a query engine and an example notebook that shows how to use the query engine directly. The |
"from langchain.llms.openai import OpenAI\n", | ||
"from llama_index.indices.struct_store import GPTJSONQueryEngine\n", | ||
"\n", | ||
"llm = OpenAI(model_name=\"text-davinci-003\")\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't think we actually passed this in anywhere.
JSONType = Union[Dict[str, "JSONType"], List["JSONType"], str, int, float, bool, None] | ||
|
||
|
||
DEFAULT_RESPONSE_SYNTHESIS_PROMPT_TMPL = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nit: we don't have a consistent policy for where to put prompts right now. Some are in the prompts files, some are inline in the query engine definition.
@@ -18,4 +19,5 @@ | |||
"GPTNLPandasQueryEngine", | |||
"GPTNLStructStoreQueryEngine", | |||
"GPTSQLStructStoreQueryEngine", | |||
"GPTJSONQueryEngine", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should add this to docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, gonna merge and address my own comments in separate PR.
Putting up a PR for the implementation so far of a JSON Path based query index for arbitrary JSON objects. Let me know what you think of this approach and implementation of it!
The user needs to supply both the JSON value and the JSON schema.
Future improvements:
GPTPandasIndex
which just stores the given input dataframe as a private instance variable.GPTPandasIndex
but since I hadn't delved into this codebase too much before I was hesitant to architecture astronaut this too much.