No spaces before quotes #31

tradingaddict · 2022-12-31T13:44:59Z

The detokenizer isn't prepending spaces before quotes, even though the examples in Tokenizer.py say it should.
If I run one of those examples through the detokenizer:

["He", "said", "''", "heya", "!", "''", "yesterday", "."]

it returns:

He said"heya!" yesterday.
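
For reference, the spacing I'd expect can be sketched with a small standalone function. This is only an illustration: `toy_detokenize` is a hypothetical name, it is not the code in Tokenizer.py, and it assumes quotes simply alternate between opening and closing.

```python
def toy_detokenize(tokens):
    """Join tokens, keeping a space before an opening quote.

    Hypothetical illustration only; not the project's detokenizer.
    Assumes quote tokens strictly alternate: open, close, open, ...
    """
    quote_marks = {"''", "``", '"'}
    closing_punct = {".", ",", "!", "?", ";", ":"}
    out = ""
    inside_quote = False
    glue_next = False  # True right after an opening quote
    for tok in tokens:
        if tok in quote_marks:
            if inside_quote:
                # A closing quote attaches to the text on its left.
                out += '"'
            else:
                # An opening quote keeps a space before it (unless first).
                out += (" " if out else "") + '"'
                glue_next = True
            inside_quote = not inside_quote
            continue
        if tok in closing_punct or glue_next or not out:
            out += tok  # no space before punctuation or after an opening quote
        else:
            out += " " + tok
        glue_next = False
    return out


tokens = ["He", "said", "''", "heya", "!", "''", "yesterday", "."]
print(toy_detokenize(tokens))  # He said "heya!" yesterday.
```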

tomaarsen · 2023-01-02T13:30:56Z

Hello!

I think I've narrowed this down a bit: the output differs between NLTK versions.
With NLTK version 3.5:

>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
"He said ''heya!''  yesterday."

With NLTK version 3.6.7:

>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
'He said"heya!" yesterday.'

I'll work on a quick fix to improve the output.

tomaarsen · 2023-01-02T13:37:57Z

I've got this quick testing script:

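# tokenize/detokenize are this project's helpers (assumed to come from Tokenizer.py)
for sentence in ["Hello, you're Tom!",
                 'He said "heya!" yesterday.',
                 'He said \'heya!\' yesterday.',
                 'He said \'\'heya!\'\' yesterday.',
                 'He\'s doing well, I think.',
                 ]:
    # Round-trip each sentence: ideally the output matches the input
    tokens = tokenize(sentence)
    detokenized = detokenize(tokens)
    print(detokenized)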

The new output is:

Hello, you're Tom!
He said "heya!" yesterday.
He said 'heya! 'yesterday.
He said "heya!" yesterday.
He's doing well, I think.

versus the old output:

Hello, you're Tom!
He said"heya!" yesterday.
He said 'heya! 'yesterday.
He said"heya!" yesterday.
He's doing well, I think.

(Note: using NLTK 3.6.7)
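
To compare the outputs automatically instead of by eye, a round-trip check along these lines might help. It is a minimal sketch: `roundtrip_failures` is a hypothetical helper, and `split_tokenize`, `join_detokenize` and `glued_detokenize` are toy stand-ins so the snippet runs on its own. In practice you would pass in the project's real tokenize and detokenize functions.

```python
def roundtrip_failures(sentences, tokenize, detokenize):
    """Return (original, result) pairs where detokenize(tokenize(s)) != s."""
    failures = []
    for sentence in sentences:
        result = detokenize(tokenize(sentence))
        if result != sentence:
            failures.append((sentence, result))
    return failures


# Toy stand-ins (assumptions for the demo, not the project's functions).
def split_tokenize(sentence):
    return sentence.split()


def join_detokenize(tokens):
    return " ".join(tokens)


def glued_detokenize(tokens):
    # Deliberately broken: drops every space, so each round trip fails.
    return "".join(tokens)


sentences = ["He said hello yesterday", "Doing well"]
print(roundtrip_failures(sentences, split_tokenize, join_detokenize))
print(roundtrip_failures(sentences, split_tokenize, glued_detokenize))
```

An empty list means every sentence survived the round trip; any entries show the original next to what came back.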

Thank you for reporting this!

Closed via f994465

tomaarsen closed this as completed Jan 2, 2023
