
Mistakenly identify 3.2 in "test version 3.2 by tomorrow" as a date using nlp #150

Open
phoebebright opened this issue Dec 9, 2015 · 3 comments

@phoebebright

I am using the Australian locale and, to be sure, explicitly defining the date separators:

import parsedatetime as pdt

c = pdt.Constants('en_AU')
c.dateSep = ['-', '/']

But the parser still insists that 3.2 is a date. Is there any way to prevent this?

@phoebebright
Author

The relevant code is in __init__.py, line 2581, where . and - are added as separators. I have removed these for my purposes, as I expect the separators I have defined to be respected.

    dateSeps = ''.join(re.escape(s)
                       for s in self.locale.dateSep + ['-', '.'])
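To illustrate why appending those two separators produces the false positive, here is a minimal sketch using only the standard re module. The two-component pattern and the names build_date_pattern and locale_date_sep are hypothetical simplifications for illustration; the real parsedatetime expression is more involved.

```python
import re

# Hypothetical stand-in for the separators configured on the locale.
locale_date_sep = ['-', '/']


def build_date_pattern(separators):
    """Build a simplified two-component date pattern (illustrative only)."""
    seps = ''.join(re.escape(s) for s in separators)
    return re.compile(r'\b(\d{1,2})[%s](\d{1,2})\b' % seps)


# Mirrors the snippet above: '-' and '.' are appended unconditionally.
with_extras = build_date_pattern(locale_date_sep + ['-', '.'])
# Uses only the separators the caller configured.
configured_only = build_date_pattern(locale_date_sep)

text = 'test version 3.2 by tomorrow'
print(bool(with_extras.search(text)))      # True: 3.2 looks like a date
print(bool(configured_only.search(text)))  # False: '.' is not a separator
```

Under these assumptions, the decimal only matches because '.' ends up in the character class regardless of what the caller configured.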

@idpaterson
Collaborator

I added the - separator to all locales a while back to support the standard yyyy-mm-dd format, which should be supported in all locales. However, that alternate separator should only be allowed for patterns that include all three components. I will work on fixing that pattern to avoid false positives.

Does anyone know why the . character is added here? It was there before I added the -, so I preserved it, but I cannot recall whether we discussed the origin of that separator. I suspect it was meant to behave similarly, matching only yyyy.mm.dd and not x.x as in OP's example, but that does not look like a standard format.
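The idea of allowing the alternate - separator only when all three components are present could be sketched roughly as follows. This is not the actual fix applied to parsedatetime; LOCALE_SEPS, DATE_RE and find_dates are hypothetical names, and the patterns are deliberately simplified.

```python
import re

# Assumed locale-configured separators (hypothetical, for illustration).
LOCALE_SEPS = ['/']
locale_class = ''.join(re.escape(s) for s in LOCALE_SEPS)

# Two- or three-component dates using only the locale's own separators.
locale_date = r'\d{1,2}[%s]\d{1,2}(?:[%s]\d{2,4})?' % (locale_class,
                                                       locale_class)
# ISO-style dates: '-' is accepted only when year, month and day are present.
iso_date = r'\d{4}-\d{1,2}-\d{1,2}'

DATE_RE = re.compile(r'\b(?:%s|%s)\b' % (iso_date, locale_date))


def find_dates(text):
    """Return every substring that the simplified pattern treats as a date."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


print(find_dates('test version 3.2 by tomorrow'))  # []
print(find_dates('due 2015-12-09'))                # ['2015-12-09']
print(find_dates('due 9/12/2015'))                 # ['9/12/2015']
```

With the alternate separator confined to its own three-component branch, short forms such as 3-2 or 3.2 no longer match, while yyyy-mm-dd is still recognized in every locale.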

@idpaterson
Collaborator

There are actually test cases for the Australian locale suggesting that it should match with . as a date separator, even though the locale never explicitly defines it as one. There are countries that use that separator, but I would be very hesitant to ever include it by default because of the high likelihood of false positive matches on decimal numbers.

I am going to proceed by fixing the overzealous expression and removing those Australian test cases. The . is actually included in the default base locale (dateSep = ['/', '.']), so even after this change most people will still have issues with decimals parsing as dates.
