
[Parser] Fix tokenizing inf #7370

Merged
merged 10 commits on Feb 1, 2021

Conversation

masahi
Member

@masahi masahi commented Jan 29, 2021

Fixes #7339

Not sure if this is the best solution. Having inff in the printer output is dubious, but I kept it. If we want to special-case inf in the printer, that would also be easy.

Also, do we care about single-precision inf vs double-precision inf (if there is any difference)?

The output after AnnotateSpans:

def @main(%x: Tensor[(3, 4), float32]) -> Tensor[(3, 4), float32] {
  clip(%x, a_min=-inff, a_max=inff) /* GeneratedSource */ /* ty=Tensor[(3, 4), float32] */
}

@jroesch @altanh

@altanh
Contributor

altanh commented Jan 29, 2021

Can you use std::numeric_limits<T>::infinity()? I'm not sure there is a difference between single and double precision inf once it gets cast to the correct dtype; floating-point infinity is generally treated specially, so the conversion should work.
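As a quick illustration of the point above (a Python sketch, not part of the PR): in IEEE-754, infinity survives narrowing from double to single precision, so the parser can produce a double-precision inf and let the later dtype cast take care of the rest.

```python
import struct

double_inf = float("inf")  # Python floats are double precision

# Round-trip through the 32-bit float encoding ("f" = single precision).
single_inf = struct.unpack("f", struct.pack("f", double_inf))[0]

print(single_inf == float("inf"))  # True: still infinity after narrowing
```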

@altanh
Contributor

altanh commented Jan 29, 2021

BTW just to confirm, this fix should correctly handle the case of -np.inf right? Might be worth adding a test case to be sure.

src/parser/tokenizer.h (outdated review thread, resolved)
@masahi
Member Author

masahi commented Jan 29, 2021

BTW just to confirm, this fix should correctly handle the case of -np.inf right? Might be worth adding a test case to be sure.

@altanh Supporting -inf was a bit tricky, but it works now. This is the new output after AnnotateSpans:

def @main(%x: Tensor[(3, 4), float32]) -> Tensor[(3, 4), float32] {
  clip(%x, a_min=-inff, a_max=inff) /* GeneratedSource */ /* ty=Tensor[(3, 4), float32] */
}

@masahi
Member Author

masahi commented Jan 29, 2021

@jroesch To support -inf, I had to change the Tokenize() loop, because - and inf have to be tokenized separately. I refactored the tokenizer a bit to clean up how negation is handled for both normal numbers and inf.

Now, digit parsing in TokenizeOnce does not handle negation, and ParseNumber always returns a positive number. Any preceding negations are handled in the Tokenize() loop.

This way, negation of ordinary numbers and of inf is handled in a unified way. Let me know if you like this change.
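The refactor described above can be sketched in Python (hypothetical names, not the actual C++ tokenizer): number parsing always yields a positive value, and the outer tokenize loop folds any pending minus tokens into the sign. For brevity this toy always folds a minus into a following number, so unlike the real tokenizer it does not distinguish binary subtraction.

```python
import re

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == "-":
            tokens.append(("minus", "-"))
            i += 1
        elif ch.isdigit() or src.startswith("inf", i):
            if src.startswith("inf", i):
                value, i = float("inf"), i + 3
            else:
                m = re.match(r"\d+", src[i:])
                value, i = int(m.group()), i + m.end()
            # Fold any directly preceding minus tokens into the sign.
            sign = 1
            while tokens and tokens[-1][0] == "minus":
                tokens.pop()
                sign = -sign
            tokens.append(("number", sign * value))
        else:
            tokens.append(("op", ch))
            i += 1
    return tokens

print(tokenize("--10"))   # [('number', 10)]
print(tokenize("-inf"))   # [('number', -inf)]
```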

@altanh
Contributor

altanh commented Jan 29, 2021

Hmm, the -inf case actually makes me wonder about the previous behavior for parsing numbers. I personally don't think something like ------4 should be accepted and tokenized as 4, for example. IMO we should reject these either right here in the tokenizer or in the parser. The behavior of stod also agrees with this.
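As a quick illustration (Python's float standing in for C++ std::stod here, an assumption of this sketch): standard numeric parsers reject stacked negation, which is the behavior the tokenizer would match by rejecting inputs like ------4.

```python
# float() refuses doubled minus signs, just as stod("--4") fails to parse.
try:
    float("--4")
    accepted = True
except ValueError:
    accepted = False

print(accepted)  # False: "--4" is not a valid numeric literal
```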

I also noticed we could maybe use MatchString to directly match on "inf". I feel like the most durable solution is to have a single branch for handling -, like this:

// ...
} else if (next == "-") {
  // assuming More()
  Next();
  if (IsDigit(Peek()) || MatchString("inff")) {
    // handle *negative* num normally, we might want to refactor previous handling to make things nicer
  } else {
    // return TokenType::kMinus normally
  }
}

@altanh
Contributor

altanh commented Jan 29, 2021

I suppose this is exactly the opposite approach to your current change; I think it depends on how much we care about this nested-negation thing.

@masahi
Member Author

masahi commented Jan 29, 2021

Yes, I don't want to bother with nested negation either, but there is an annoying test:

def test_negative():
    # need to handle parsing non-literal operations
    # assert isinstance(parse_text("let %x = 1; -%x").body, relay.Call)
    assert get_scalar(parse_text("--10")) == 10
    assert get_scalar(parse_text("---10")) == -10

I didn't notice MatchString; I'll try it to clean up the inf case.

@altanh
Contributor

altanh commented Jan 29, 2021

Hah, got it. I'm in favor of just deleting that test 😂 but I guess someone might depend on this behavior.

Actually... we could do the negs++ counter thing in the handling for the - branch and only collapse the signs if we eventually hit a digit or inff; otherwise reset pos and return a single kMinus. I understand Jared's comment about the "multi-token return" now, as this will be incredibly slow if someone writes a lot of minus signs without a number after them. This approach should avoid modifying the tokenization loop.
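The counter approach above can be sketched as follows (a toy Python stand-in with hypothetical helper names, not the real C++ tokenizer): count consecutive - signs inside the minus branch and collapse them only when a digit or "inf" follows; otherwise fall back to emitting a single minus token.

```python
import re

def parse_number(src, i):
    """Parse a positive number or inf starting at i; return (token, next_i)."""
    if src.startswith("inf", i):
        return ("number", float("inf")), i + 3
    m = re.match(r"\d+", src[i:])
    return ("number", int(m.group())), i + m.end()

def next_token(src, i):
    """Return (token, next_index) for the token starting at position i."""
    if src[i] == "-":
        negs, j = 0, i
        while j < len(src) and src[j] == "-":
            negs, j = negs + 1, j + 1
        if j < len(src) and (src[j].isdigit() or src.startswith("inf", j)):
            tok, j = parse_number(src, j)
            if negs % 2:                  # odd number of minus signs
                tok = ("number", -tok[1])
            return tok, j
        return ("minus", "-"), i + 1      # reset: emit a single minus token
    if src[i].isdigit() or src.startswith("inf", i):
        return parse_number(src, i)
    return ("op", src[i]), i + 1

print(next_token("---10", 0))  # (('number', -10), 5)
print(next_token("-x", 0))     # (('minus', '-'), 1)
```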

@masahi
Member Author

masahi commented Jan 29, 2021

@altanh Thanks for the suggestion, I believe we have the cleanest solution now.

@jroesch @altanh ready for review.

Contributor

@altanh altanh left a comment

LGTM

@@ -212,6 +212,25 @@ struct Tokenizer {
    }
  }

  Token ParseNumber(bool is_pos) {
    std::stringstream ss;
    while (More() && IsNumeric(Peek())) {
Contributor

Out of the scope of this PR, but I imagine this is too loose in terms of accepting weird strings that satisfy the IsNumeric constraint, such as 1e2e3+..2. That being said, stod("1e2e3+..2") returns 100, so I guess it's alright lol.

Member

This is some of the oldest tokenizer code; I was just quickly trying to write one without getting deep into regex, etc. Might be worth completely replacing soon.

@jroesch
Member

jroesch commented Feb 1, 2021

LGTM

@jroesch jroesch merged commit 0d303b4 into apache:main Feb 1, 2021
alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 11, 2021
* fix tokenizing inf

* use ParseNumber to parse inf, handle -inf

* fix neg handling

* fixed multi negation

* refactor

* use while loop

* simplyfing

* fix lint

* simpler implementation per altan's suggestion

* disable flaky test
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
Lokiiiiii pushed a commit to Lokiiiiii/tvm that referenced this pull request Mar 2, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2021
Successfully merging this pull request may close these issues.

[Bug][Parser] Parser/tokenizer doesn't handle inf float
3 participants