Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2.7.0b1 pydantic_core.from_json(..., allow_partial=True) truncates partial strings completely #82

Closed
1 task done
willbakst opened this issue Apr 9, 2024 · 8 comments · Fixed by #101
Closed
1 task done

Comments

@willbakst
Copy link

Initial Checks

  • I confirm that I'm using Pydantic V2

Description

In the below example, I would expect for the author field to still populate into the model as it's being streamed. For example, consider a description field where it might be on the longer side. This will only give me the description once it's fully formed and closed, which somewhat defeats the purpose of it being partial here for streaming.

Example Code

from pydantic_core import from_json

obj = from_json(
    '{"title": "Pride and Prejudice", "author": "Jane A',
    allow_partial=True,
)
print(obj)
# Actual: {'title': 'Pride and Prejudice'}
# Expected: {'title': 'Pride and Prejudice', 'author': 'Jane A'}

Python, Pydantic & OS Version

pydantic version: 2.7.0b1
pydantic-core version: 2.18.0
pydantic-core build: profile=release pgo=false
python version: 3.9.16 (main, Apr  2 2023, 22:08:02)  [Clang 14.0.0 (clang-1400.0.29.202)]
platform: macOS-14.2.1-arm64-arm-64bit
related packages: typing_extensions-4.10.0 fastapi-0.109.2 pydantic-extra-types-2.6.0 mypy-1.1.1 pydantic-settings-2.2.1
@samuelcolvin
Copy link
Member

String parsing in JSON is extremely complex. We might be able to recover from some end of string situations, but it will require a lot of work.

PR welcome to try, but I think this is an optimisation that isn't required for 2.7.

I'm not even sure you'd always want partial strings to be included, for example:

  • if you're then parsing the string to an int, the partial string "1 is very different from "123.45"
  • Could could imagine "invoice-ACX could have a very different meaning to "invoice-ACXAA"
  • Even with human text "the answer is obvious could mean something different to "The answer is obviously competely worng"

@samuelcolvin samuelcolvin transferred this issue from pydantic/pydantic Apr 10, 2024
@Folyd
Copy link

Folyd commented May 2, 2024

Same for me:

>>> from_json('{"message": "The **Chihuly Garden and Glass** is open at different times throughout the week. Here are', allow_partial=True)
{}

@willbakst
Copy link
Author

willbakst commented May 2, 2024

Hmm, I see what you're saying, but ultimately I don't expect the parsed partial string to be "correct" until it's done streaming, and I want the partial string during streaming so that I can display the chunks as they stream.

In particular this leads to poor real-time behavior when working the LLM streaming, namely for really long values. For example, let's say I have the following model:

from pydantic import BaseModel, Field

class MyModel(BaseModel):
    description: str = Field(
        ...,
        description="A really long, verbose description about pydantic"
)

If I try to stream this output, I will have to wait for the entire output to be completed before I actually get anything parsed, ultimately defeating the purpose of partial parsing / streaming in this case at all.

If we detect an open string, can we not just parse it as a string as-is by closing it? Does this make it less complex? Or is this too simplified a way of handling things?

I can't really think of a case where I would want to allow for partial parsing outside of streaming where the desired behavior would be as originally described; however, we could always expose an additional boolean to let the user decide if they want the more or less strict partial parsing?

Would be interested in working on a PR for this once aligned on desired behavior.

@samuelcolvin
Copy link
Member

This should be fixed in #101, @willbakst can you take try it out and let me know if it solves your problem.

@willbakst
Copy link
Author

@samuelcolvin just took a look, the partial_mode='trailing-string' works as expected for my use case!

I'm assuming this will be a similar API change in pydantic-core / pydantic with a switch from allow_partial to partial_mode?

@samuelcolvin
Copy link
Member

Yes, we'll probably avoid changing the name of the kwarg in pydantic-core to avoid breaking changes.

In retrospect maybe it was a mistake therefore to change them in jiter.

@willbakst
Copy link
Author

Got it, makes sense.

Mostly unaffected by this since I'll just update to use jiter directly

@samuelcolvin
Copy link
Member

Yup, I think pydantic_core.from_json is pretty redundant now, it's also about 30% slower than jiter due to the PGO test cases used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants