Coercion of incomplete floats (eg.`1.`) #146

gislerro · 2024-09-28T00:36:07Z

I've noticed that when iteratively parsing a json stream character by character 'incomplete' floats are 'stepped' over:

import json
import jiter

obj = {'float': 1.0}
obj_json = json.dumps(obj)

token_stream = []
sub = ''
for c in obj_json:
    sub += c
    token_stream.append(sub)

for mode in ['trailing-strings', 'on']:
    for stream in token_stream:
        parsed = jiter.from_json(stream.encode(), partial_mode=mode)
        print(f'{mode}: {stream} -> {parsed}')
    print('\n')

prints the following output:

trailing-strings: { -> {}
trailing-strings: {" -> {}
trailing-strings: {"f -> {}
trailing-strings: {"fl -> {}
trailing-strings: {"flo -> {}
trailing-strings: {"floa -> {}
trailing-strings: {"float -> {}
trailing-strings: {"float" -> {}
trailing-strings: {"float": -> {}
trailing-strings: {"float":  -> {}
trailing-strings: {"float": 1 -> {'float': 1}
trailing-strings: {"float": 1. -> {}
trailing-strings: {"float": 1.0 -> {'float': 1.0}
trailing-strings: {"float": 1.0} -> {'float': 1.0}


on: { -> {}
on: {" -> {}
on: {"f -> {}
on: {"fl -> {}
on: {"flo -> {}
on: {"floa -> {}
on: {"float -> {}
on: {"float" -> {}
on: {"float": -> {}
on: {"float":  -> {}
on: {"float": 1 -> {'float': 1}
on: {"float": 1. -> {}
on: {"float": 1.0 -> {'float': 1.0}
on: {"float": 1.0} -> {'float': 1.0}

I'd suggest to coerce 1. into the float 1.0 (or expose an option to enable that)

The text was updated successfully, but these errors were encountered:

samuelcolvin · 2024-10-30T21:15:52Z

Humm, I think the behaviour we have now is fine.

I think it’s perfectly reasonable for 1. to be interpreted as an invalid value and dropped (sp [1.1, 2.2, 3. is interpreted as [1.1, 2,2]), same as fal is invalid (so [true, null, fal is interpretted as [true, null]) even though fal could be part of a valid value when it completes.

samuelcolvin · 2024-10-30T21:16:27Z

or presumably 123e will be dropped from partial parsing even though it could be the beginning of a value number.

samuelcolvin · 2024-10-30T21:25:11Z

I think the correct thing to do is drop any incomplete number.

The thing is that 10 is not similar to or a good proxy for 1024 in the way that "Samuel is a" is a useful stand-in for "Samuel is aging".

gislerro · 2024-10-31T10:02:46Z

The thing is that 10 is not similar to or a good proxy for 1024 in the way that "Samuel is a" is a useful stand-in for "Samuel is aging".

I believe 1. is a good enough proxy for 1 and it's not the same as fal - false.

Let me explain:

float: when getting a tokenstream for the float 1.1 character by character you get the following numbers: [1, undefined, 1.1]
boolean: false -> [undefined, undefined, undefined, undefined, False]
so for floats there's a jump from defined to undefined and back, whereas for the boolean it stays undefined until the parsing succeeds

For some context, im using jiter to parse a json stream generated by an LLM, the parsed dictionary is then dumped into a new pydantic model when the output from jiter changes.

An assumption I'd like to make is that when jiter parses a key for the first time- the key/field is then present in all subsequent models.

Of course I could enforce this myself by detecting when a parsed dictionary drops a key and then just readd the previous parse, but I thought it would be reasonable to add this to jiter itself since it supports a partial-string mode

davidhewitt · 2024-11-01T11:07:02Z

so for floats there's a jump from defined to undefined and back, whereas for the boolean it stays undefined until the parsing succeeds

The point above is that we think it would be better to remove the jump by having the parse be undefined until we're sure the number is terminated. Agreed that the jump is bad, and

An assumption I'd like to make is that when jiter parses a key for the first time- the key/field is then present in all subsequent models.

Yes, I think we agree with this too 👍

In our proposal, the parse would change to:

on: { -> {}
on: {" -> {}
on: {"f -> {}
on: {"fl -> {}
on: {"flo -> {}
on: {"floa -> {}
on: {"float -> {}
on: {"float" -> {}
on: {"float": -> {}
on: {"float":  -> {}
on: {"float": 1 -> {}
on: {"float": 1. -> {}
on: {"float": 1.0 -> {}
on: {"float": 1.0} -> {'float': 1.0}

davidhewitt mentioned this issue Oct 30, 2024

handle trailing period in partial float mode #156

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coercion of incomplete floats (eg.`1.`) #146

Coercion of incomplete floats (eg.`1.`) #146

gislerro commented Sep 28, 2024

samuelcolvin commented Oct 30, 2024

samuelcolvin commented Oct 30, 2024 •

edited

Loading

samuelcolvin commented Oct 30, 2024

gislerro commented Oct 31, 2024 •

edited

Loading

davidhewitt commented Nov 1, 2024

Coercion of incomplete floats (eg.1.) #146

Coercion of incomplete floats (eg.1.) #146

Comments

gislerro commented Sep 28, 2024

samuelcolvin commented Oct 30, 2024

samuelcolvin commented Oct 30, 2024 • edited Loading

samuelcolvin commented Oct 30, 2024

gislerro commented Oct 31, 2024 • edited Loading

davidhewitt commented Nov 1, 2024

Coercion of incomplete floats (eg.`1.`) #146

Coercion of incomplete floats (eg.`1.`) #146

samuelcolvin commented Oct 30, 2024 •

edited

Loading

gislerro commented Oct 31, 2024 •

edited

Loading