User abbreviation not working properly #109
Comments
I have the same problem.

Me too. Was this fixed already?
Hi,

**Short version**

Can you try downloading nltk's averaged_perceptron_tagger and universal_tagset in your environment?

```python
import nltk
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")
```

**Long version**

**Apologies for slowness with the issue**

Sorry for the very slow reply to the original issue - I had been meaning to look into it at the time but never got around to it and forgot about it completely. Unfortunately, maintaining this repo hasn't been a top priority for me. One of the reasons I didn't put more effort into tracking down the issue at the time was that I couldn't reproduce it on my laptop, and I feared it would be tricky to debug.

**Root cause**

However, I've just looked into it and I believe I've found the cause of the problem: I was able to reproduce the issue in a fresh environment. For me, a bare except was causing the issue - it's a last-ditch effort to ensure we still get some output even if something has gone wrong, by just returning the original 'non-standard word' before we start trying to normalise it. However, this except clause was covering up the fact that nltk resources the repo expects might not be present.

**Follow-up actions**

If I were writing this program again, I wouldn't rely on this bare except, which can cover up a multitude of sins (as it did here). Unfortunately, I don't have time at the moment for a substantial rewrite, but I should be able to find time to add an extra except clause for LookupError, which is what nltk raises when it doesn't have the expected resources. That should make the required download transparent to other users. Does that sound reasonable as a mitigation?

Thanks everyone on the ticket for pointing out the issue, and sorry again for the slow response!
Address #109 and #117: for these, in expand_EXPN, nltk requires 'averaged_perceptron_tagger' and 'universal_tagset' to be downloaded, but expand_EXPN has a bare except that just returns the original nsw if anything fails. This obscured the missing data dependencies and meant that expand_EXPN never did anything. By explicitly re-raising LookupError, we can surface the issue to the user. It would be better to handle this dependency for the user, but this at least makes the problem transparent.
Hi,

There is a fix for this in release 0.1.8, which is on both PyPI and GitHub. The fix raises the LookupError so that it is clear to the user what needs to be done. There doesn't appear to be a standard pattern for including nltk data dependencies in Python packages, but I'll be sure to update this ticket if I find a solution which installs the relevant dependencies automatically - that would obviously be a better experience for users than having to run the package repeatedly and deal with the errors.

Thanks for pointing out the issue!

Best,
Elliot
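In the meantime, one possible pattern - sketched here as an assumption, not something the package currently does - is to check for the two resources at startup and download them only when missing. nltk.data.find raises the same LookupError that the 0.1.8 fix surfaces:

```python
import nltk

# Hypothetical helper: pairs of (id for nltk.download, path for nltk.data.find).
_REQUIRED_NLTK_DATA = [
    ("averaged_perceptron_tagger", "taggers/averaged_perceptron_tagger"),
    ("universal_tagset", "taggers/universal_tagset"),
]

def ensure_nltk_data():
    for resource, path in _REQUIRED_NLTK_DATA:
        try:
            nltk.data.find(path)  # raises LookupError if not installed
        except LookupError:
            nltk.download(resource, quiet=True)
```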
Good day,

I am trying your library, but it is not working as expected. I am trying to use a custom abbreviation, but it does not replace the abbreviation with the equivalent word. I have even tried the example listed in the README and got the following output:
['four bdrm', 'house', 'for', 'sale', ',', 'four hundred and fifty nine thousand pounds', 'ONO']
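For context, the call that produces this output was presumably something like the sketch below; the input tokens and the user_abbrevs argument are assumptions based on the mention of custom abbreviations, not copied from the README:

```python
# Assumed reconstruction - the real README example may differ.
from normalise import normalise

tokens = ["4", "bdrm", "house", "for", "sale", ",", "£459,000", "ONO"]
print(normalise(tokens, user_abbrevs={"bdrm": "bedroom", "ONO": "or nearest offer"}))
# Reported output - note 'bdrm' and 'ONO' are left unexpanded:
# ['four bdrm', 'house', 'for', 'sale', ',',
#  'four hundred and fifty nine thousand pounds', 'ONO']
```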