
[Parser] Fix tokenizing inf #7370

Merged
merged 10 commits on Feb 1, 2021

Conversation

masahi
Member

@masahi masahi commented Jan 29, 2021

Fixes #7339

Not sure if this is the best solution. Having inff in the printer output is dubious, but I kept it. If we want to special-case inf in the printer, that would also be easy.

Also, do we care about single-precision inf vs double-precision inf (if there is any difference)?

The output after AnnotateSpans:

def @main(%x: Tensor[(3, 4), float32]) -> Tensor[(3, 4), float32] {
  clip(%x, a_min=-inff, a_max=inff) /* GeneratedSource */ /* ty=Tensor[(3, 4), float32] */
}

@jroesch @altanh

@altanh
Contributor

altanh commented Jan 29, 2021

Can you use std::numeric_limits<T>::infinity()? I'm not sure there is a difference between single and double precision inf once it gets cast to the correct dtype; floating-point infinity is generally treated specially, so the conversion should work.
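As a quick illustration of the point above (a Python sketch, not part of the PR): in IEEE-754, infinity survives narrowing from double to single precision, so the parser can produce a double-precision inf and let the later dtype cast take care of the rest.

```python
import struct

double_inf = float("inf")  # Python floats are double precision

# Round-trip through the 32-bit float encoding ("f" = single precision).
single_inf = struct.unpack("f", struct.pack("f", double_inf))[0]

print(single_inf == float("inf"))  # True: still infinity after narrowing
```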

@altanh
Contributor

altanh commented Jan 29, 2021

BTW just to confirm, this fix should correctly handle the case of -np.inf right? Might be worth adding a test case to be sure.

src/parser/tokenizer.h (outdated review thread, resolved)
@masahi
Member Author

masahi commented Jan 29, 2021

BTW just to confirm, this fix should correctly handle the case of -np.inf right? Might be worth adding a test case to be sure.

@altanh Supporting -inf was a bit tricky, but it works now. This is the new output after AnnotateSpans:

def @main(%x: Tensor[(3, 4), float32]) -> Tensor[(3, 4), float32] {
  clip(%x, a_min=-inff, a_max=inff) /* GeneratedSource */ /* ty=Tensor[(3, 4), float32] */
}

@masahi
Member Author

masahi commented Jan 29, 2021

@jroesch To support -inf, I had to change the Tokenize() loop, because - and inf have to be tokenized separately. I refactored the tokenizer a bit to clean up how negation is handled for both normal numbers and inf.

Now, digit parsing in TokenizeOnce does not handle negation, and ParseNumber always returns a positive number. Any preceding negations are handled in the Tokenize() loop.

This way, negation of ordinary numbers and of inf is handled in a unified way. Let me know if you like this change.
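The refactor described above can be sketched in Python (hypothetical names, not the actual C++ tokenizer): number parsing always yields a positive value, and the outer tokenize loop folds any pending minus tokens into the sign. For brevity this toy always folds a minus into a following number, so unlike the real tokenizer it does not distinguish binary subtraction.

```python
import re

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == "-":
            tokens.append(("minus", "-"))
            i += 1
        elif ch.isdigit() or src.startswith("inf", i):
            if src.startswith("inf", i):
                value, i = float("inf"), i + 3
            else:
                m = re.match(r"\d+", src[i:])
                value, i = int(m.group()), i + m.end()
            # Fold any directly preceding minus tokens into the sign.
            sign = 1
            while tokens and tokens[-1][0] == "minus":
                tokens.pop()
                sign = -sign
            tokens.append(("number", sign * value))
        else:
            tokens.append(("op", ch))
            i += 1
    return tokens

print(tokenize("--10"))   # [('number', 10)]
print(tokenize("-inf"))   # [('number', -inf)]
```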

@altanh
Contributor

altanh commented Jan 29, 2021

Hmm, the -inf case actually makes me wonder about the previous behavior for parsing numbers. I personally don't think something like ------4 should be accepted and tokenized as 4, for example. IMO we should reject these either right here in the tokenizer or in the parser. The behavior of stod also agrees with this.
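As a quick illustration (Python's float standing in for C++ std::stod here, an assumption of this sketch): standard numeric parsers reject stacked negation, which is the behavior the tokenizer would match by rejecting inputs like ------4.

```python
# float() refuses doubled minus signs, just as stod("--4") fails to parse.
try:
    float("--4")
    accepted = True
except ValueError:
    accepted = False

print(accepted)  # False: "--4" is not a valid numeric literal
```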

I also noticed we could maybe use MatchString to directly match on "inf". I feel like the most durable solution is to have a single branch for handling -, like this:

// ...
} else if (next == "-") {
  // assuming More()
  Next();
  if (IsDigit(Peek()) || MatchString("inff")) {
    // handle *negative* num normally, we might want to refactor previous handling to make things nicer
  } else {
    // return TokenType::kMinus normally
  }
}

@altanh
Contributor

altanh commented Jan 29, 2021

I suppose this is exactly the opposite approach to your current change; I think it depends on how much we care about this nested-negation thing.

@masahi
Member Author

masahi commented Jan 29, 2021

Yes, I don't want to bother with nested negation either, but there is an annoying test:

def test_negative():
    # need to handle parsing non-literal operations
    # assert isinstance(parse_text("let %x = 1; -%x").body, relay.Call)
    assert get_scalar(parse_text("--10")) == 10
    assert get_scalar(parse_text("---10")) == -10

I didn't notice MatchString; I'll try it to clean up the inf case.

@altanh
Contributor

altanh commented Jan 29, 2021

Hah, got it. I'm in favor of just deleting that test 😂 but I guess someone might depend on this behavior.

Actually... we could do the negs++ counter thing in the handling for the - branch and only collapse the signs if we eventually hit a digit or inff; otherwise reset pos and return a single kMinus. I understand Jared's comment about the "multi-token return" now, as this will be incredibly slow if someone writes a lot of minus signs without a number after them. This approach should avoid modifying the tokenization loop.
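The counter approach above can be sketched as follows (a toy Python stand-in with hypothetical helper names, not the real C++ tokenizer): count consecutive - signs inside the minus branch and collapse them only when a digit or "inf" follows; otherwise fall back to emitting a single minus token.

```python
import re

def parse_number(src, i):
    """Parse a positive number or inf starting at i; return (token, next_i)."""
    if src.startswith("inf", i):
        return ("number", float("inf")), i + 3
    m = re.match(r"\d+", src[i:])
    return ("number", int(m.group())), i + m.end()

def next_token(src, i):
    """Return (token, next_index) for the token starting at position i."""
    if src[i] == "-":
        negs, j = 0, i
        while j < len(src) and src[j] == "-":
            negs, j = negs + 1, j + 1
        if j < len(src) and (src[j].isdigit() or src.startswith("inf", j)):
            tok, j = parse_number(src, j)
            if negs % 2:                  # odd number of minus signs
                tok = ("number", -tok[1])
            return tok, j
        return ("minus", "-"), i + 1      # reset: emit a single minus token
    if src[i].isdigit() or src.startswith("inf", i):
        return parse_number(src, i)
    return ("op", src[i]), i + 1

print(next_token("---10", 0))  # (('number', -10), 5)
print(next_token("-x", 0))     # (('minus', '-'), 1)
```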

@masahi
Member Author

masahi commented Jan 29, 2021

@altanh Thanks for the suggestion, I believe we have the cleanest solution now.

@jroesch @altanh ready for review.

Contributor

@altanh altanh left a comment

LGTM

@@ -212,6 +212,25 @@ struct Tokenizer {
    }
  }

  Token ParseNumber(bool is_pos) {
    std::stringstream ss;
    while (More() && IsNumeric(Peek())) {
Contributor

Out of the scope of this PR, but I imagine this is too loose in terms of accepting weird strings that satisfy the IsNumeric constraint, such as 1e2e3+..2. That being said, stod("1e2e3+..2") returns 100, so I guess it's alright lol.

Member

This is some of the oldest tokenizer code; I was just quickly trying to write one without getting deep into regex, etc. Might be worth completely replacing soon.

@jroesch
Member

jroesch commented Feb 1, 2021

LGTM

@jroesch jroesch merged commit 0d303b4 into apache:main Feb 1, 2021
alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 11, 2021
* fix tokenizing inf

* use ParseNumber to parse inf, handle -inf

* fix neg handling

* fixed multi negation

* refactor

* use while loop

* simplyfing

* fix lint

* simpler implementation per altan's suggestion

* disable flaky test
electriclilies pushed a commit to electriclilies/tvm that referenced this pull request Feb 18, 2021
Lokiiiiii pushed a commit to Lokiiiiii/tvm that referenced this pull request Mar 2, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2021
Successfully merging this pull request may close these issues.

[Bug][Parser] Parser/tokenizer doesn't handle inf float
3 participants