No spaces before quotes #31

Closed
tradingaddict opened this issue Dec 31, 2022 · 2 comments

Comments

@tradingaddict

The detokenizer isn't inserting a space before opening quotes, contrary to what the examples in Tokenizer.py show.
If I pass one of those examples to the detokenizer:

["He", "said", "''", "heya", "!", "''", "yesterday", "."]

it returns:

He said"heya!" yesterday.
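
For reference, the spacing the report expects can be written down as a small standalone function. This is only a sketch under stated assumptions: `naive_detokenize` is a hypothetical helper, not the code in Tokenizer.py and not NLTK's detokenizer, and it assumes quote tokens simply alternate between opening and closing.

```python
def naive_detokenize(tokens):
    """Join tokens into a sentence (hypothetical helper, not the repo's code).

    Assumes quote tokens alternate between opening and closing.
    """
    quote_tokens = {"``", "''", '"'}
    closing_punct = {".", ",", "!", "?", ";", ":"}
    parts = []
    inside_quote = False
    after_opening_quote = False
    for tok in tokens:
        if tok in quote_tokens:
            if inside_quote:
                # A closing quote attaches to the preceding token.
                parts.append('"')
            else:
                # An opening quote gets a leading space unless it starts the text.
                parts.append(' "' if parts else '"')
                after_opening_quote = True
            inside_quote = not inside_quote
            continue
        if after_opening_quote or not parts or tok in closing_punct:
            parts.append(tok)  # no space before this token
        else:
            parts.append(" " + tok)
        after_opening_quote = False
    return "".join(parts)


print(naive_detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."]))
```

The sketch ignores contractions and single-quote handling, so it only illustrates the expected spacing around double quotes.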

@tomaarsen
Owner

Hello!

I think I've narrowed this down a bit:
With NLTK version 3.5:

>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
"He said ''heya!''  yesterday."

With NLTK version 3.6.7:

>>> detokenize(["He", "said", "''", "heya", "!", "''", "yesterday", "."])
'He said"heya!" yesterday.'
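
Since the output differs between NLTK releases, it may help to confirm which version is installed before comparing results. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the NLTK version in the current environment, if any.
print("nltk:", installed_version("nltk") or "not installed")
```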

I'll work on a quick fix to improve the output.

@tomaarsen
Owner

tomaarsen commented Jan 2, 2023

I've got this quick testing script:

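# Assumes tokenize and detokenize are imported from this repository's Tokenizer.py.
# Round-trip each sentence through tokenize/detokenize and print the result.
for sentence in ["Hello, you're Tom!",
                 'He said "heya!" yesterday.',
                 'He said \'heya!\' yesterday.',
                 'He said \'\'heya!\'\' yesterday.',
                 'He\'s doing well, I think.',
                 ]:
    tokens = tokenize(sentence)
    detoken = detokenize(tokens)
    print(detoken)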

The new output is:

Hello, you're Tom!
He said "heya!" yesterday.
He said 'heya! 'yesterday.
He said "heya!" yesterday.
He's doing well, I think.

versus the old output:

Hello, you're Tom!
He said"heya!" yesterday.
He said 'heya! 'yesterday.
He said"heya!" yesterday.
He's doing well, I think.

(Note: using NLTK 3.6.7)
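
A comparison like the one above can also be automated rather than checked by eye. The following is a sketch, not the repository's test suite: `round_trip_failures` is a hypothetical helper, and the whitespace-based `tokenize`/`detokenize` stand-ins are placeholders for the real functions so that the snippet runs on its own.

```python
def round_trip_failures(sentences, tokenize, detokenize):
    """Return (expected, actual) pairs where detokenize(tokenize(s)) != s."""
    failures = []
    for sentence in sentences:
        result = detokenize(tokenize(sentence))
        if result != sentence:
            failures.append((sentence, result))
    return failures


# Stand-ins so the sketch is self-contained; swap in the real functions.
def whitespace_tokenize(text):
    return text.split()


def whitespace_detokenize(tokens):
    return " ".join(tokens)


sentences = ['He said "heya!" yesterday.', "Two  spaces here."]
failures = round_trip_failures(sentences, whitespace_tokenize, whitespace_detokenize)
for expected, actual in failures:
    print(f"expected {expected!r}, got {actual!r}")
```

Exact round-trip equality is stricter than what the outputs above aim for: the `''heya!''` input is printed back with `"` quotes, so a check like this would flag that sentence unless the expected strings are listed separately.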

Thank you for reporting this!

Closed via f994465
