Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it expected behaviour that LIKE_URL also matches on e-mail addresses? #1698

Closed
Bri-Will opened this issue Dec 7, 2017 · 5 comments
Closed
Labels
help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome!

Comments

@Bri-Will
Copy link
Contributor

Bri-Will commented Dec 7, 2017

For the code below, an e-mail address triggers both matching rules:

import en_core_web_sm
from spacy.matcher import Matcher

nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)

matcher.add('EMAIL', None, [{'LIKE_EMAIL': True}])
matcher.add('URL', None, [{'LIKE_URL': True}])

doc = nlp(u'What happens with an email@address.com and how about a URL like http://www.google.com?')
matches = matcher(doc)

print (matches)

Results:
[(17587345535198158200, 4, 5), (2582013287274679728, 4, 5), (2582013287274679728, 11, 12)]

Seems like they should be distinct...I didn't think an "@" symbol could be present in a URL...

Your Environment

  • Operating System: Windows
  • Python Version Used: 3
  • spaCy Version Used: 2.0
  • Environment Information:
@ines
Copy link
Member

ines commented Dec 8, 2017

Thanks for the report 👍 This is definitely not intentional.

For performance reasons, the like_url getter uses simple conditionals instead of regular expressions. This means it should be easy to adjust – maybe even as simple as adding something along the lines of '@' not in text. The relevant source is here:

def like_url(text):
# We're looking for things that function in text like URLs. So, valid URL
# or not, anything they say http:// is going to be good.
if text.startswith('http://') or text.startswith('https://'):
return True
elif text.startswith('www.') and len(text) >= 5:
return True
if text[0] == '.' or text[-1] == '.':
return False
for i in range(len(text)):
if text[i] == '.':
break
else:
return False
tld = text.rsplit('.', 1)[1].split(':', 1)[0]
if tld.endswith('/'):
return True
if tld.isalpha() and tld in _tlds:
return True
return False

I'll add a help wanted label in case anyone wants to give it a go, adjust the getter and add a few tests for it. Otherwise, we're happy to take care of this, too.

@ines ines added help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) performance labels Dec 8, 2017
@Bri-Will
Copy link
Contributor Author

Bri-Will commented Dec 8, 2017

FYI, I added an if statement like so to lex_attrs.py:

if '@' in text:
    return False

and it seems to works fine.

I'd love to add the fix, but I'm too much of a newbie to be trusted to do that!

@ines
Copy link
Member

ines commented Dec 9, 2017

@Bri-Will Thanks for updating – nice to hear that it's working!

I'd love to add the fix, but I'm too much of a newbie to be trusted to do that!

Don't worry – of course, you don't have to, but if you'd like to submit your first PR, I'm happy to assist! Your fix looks fine and you don't have to worry about adding tests for now (I'll take care of this later).

The easiest way is to just fork spaCy so it's added to your GitHub account. Then add the fix and push it to your fork, go to the New Pull Request page and hit "Compare across forks". Select your fork as the head repo, and your change should show up. You can then create a new pull request, write a sentence describing the change and submit it 🎉 There's actually very little that can go wrong – all new PRs are tested and we'll review them before merging, so you can't break anything.

@Bri-Will
Copy link
Contributor Author

OK, I've done this and submitted a pull request (I did not add tests for this)

ines added a commit to Bri-Will/spaCy that referenced this issue Dec 12, 2017
@ines ines closed this as completed in 1e61fff Dec 12, 2017
@lock
Copy link

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome!
Projects
None yet
Development

No branches or pull requests

2 participants