Is it expected behaviour that LIKE_URL also matches on e-mail addresses? #1698

Bri-Will · 2017-12-07T20:28:19Z

For the code below, an e-mail address triggers both matching rules:

import en_core_web_sm
from spacy.matcher import Matcher

nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)

matcher.add('EMAIL', None, [{'LIKE_EMAIL': True}])
matcher.add('URL', None, [{'LIKE_URL': True}])

doc = nlp(u'What happens with an email@address.com and how about a URL like http://www.google.com?')
matches = matcher(doc)

print(matches)

Results:
[(17587345535198158200, 4, 5), (2582013287274679728, 4, 5), (2582013287274679728, 11, 12)]

It seems like the two rules should be distinct. I didn't think an "@" symbol could be present in a URL.
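To make the overlap easier to see, the match tuples from the results above can be grouped by token span. This is only a sketch over the reported output; `rules_by_span` and `overlaps` are illustrative names, not part of spaCy. Token 4 is the e-mail address and token 11 is the URL, so the ID that shows up for both spans is presumably the URL rule.

```python
from collections import defaultdict

# Match tuples copied from the results above: (match_id, start, end).
matches = [
    (17587345535198158200, 4, 5),
    (2582013287274679728, 4, 5),
    (2582013287274679728, 11, 12),
]

# Group the rule IDs by the token span they matched.
rules_by_span = defaultdict(set)
for match_id, start, end in matches:
    rules_by_span[(start, end)].add(match_id)

# Spans hit by more than one rule are the overlaps in question.
overlaps = {span: ids for span, ids in rules_by_span.items() if len(ids) > 1}
print(overlaps)
```

Only the span covering the e-mail token ends up in `overlaps`, which is the behaviour being questioned here.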

Your Environment

Operating System: Windows
Python Version Used: 3
spaCy Version Used: 2.0
Environment Information:

ines · 2017-12-08T09:28:22Z

Thanks for the report 👍 This is definitely not intentional.

For performance reasons, the like_url getter uses simple conditionals instead of regular expressions. This means it should be easy to adjust – maybe even as simple as adding something along the lines of '@' not in text. The relevant source is here:

spaCy/spacy/lang/lex_attrs.py

Lines 76 to 95 in 51d3ab2

    
    def like_url(text):
        # We're looking for things that function in text like URLs. So, valid URL
        # or not, anything they say http:// is going to be good.
        if text.startswith('http://') or text.startswith('https://'):
            return True
        elif text.startswith('www.') and len(text) >= 5:
            return True
        if text[0] == '.' or text[-1] == '.':
            return False
        for i in range(len(text)):
            if text[i] == '.':
                break
        else:
            return False
        tld = text.rsplit('.', 1)[1].split(':', 1)[0]
        if tld.endswith('/'):
            return True
        if tld.isalpha() and tld in _tlds:
            return True
        return False

I'll add a help wanted label in case anyone wants to give it a go, adjust the getter and add a few tests for it. Otherwise, we're happy to take care of this, too.
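As a rough illustration of why an e-mail address gets through, the TLD expression from the getter can be run on its own. The helper name `extract_tld` is made up for this sketch and does not exist in spaCy; only the expression inside it is taken from the source quoted above.

```python
def extract_tld(text):
    # Same expression as in like_url: take whatever follows the last '.',
    # then drop anything after a ':' (for example a port number).
    return text.rsplit('.', 1)[1].split(':', 1)[0]


for text in ['email@address.com', 'google.com', 'example.com:8080']:
    print(text, '->', extract_tld(text))
```

All three inputs reduce to the same candidate TLD, so nothing in that step can tell an e-mail address from a bare domain. If that TLD is in `_tlds`, the getter returns True.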

Bri-Will · 2017-12-08T18:30:19Z

FYI, I added an if statement like so to lex_attrs.py:

if '@' in text:
    return False

and it seems to work fine.

I'd love to add the fix, but I'm too much of a newbie to be trusted to do that!
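For anyone following along, here is a minimal, standalone sketch of what the patched getter could look like. It makes two assumptions that are not from this thread: `_tlds` is a tiny stand-in for spaCy's real TLD set, and the '@' check sits after the scheme checks (the comment above does not say where the statement was added).

```python
# Hypothetical stand-in for spaCy's much larger _tlds set.
_tlds = {'com', 'org', 'net'}


def like_url(text):
    if text.startswith('http://') or text.startswith('https://'):
        return True
    elif text.startswith('www.') and len(text) >= 5:
        return True
    if text[0] == '.' or text[-1] == '.':
        return False
    # Proposed fix: reject anything containing '@'. Putting it here, after
    # the scheme checks, is an assumption made for this sketch.
    if '@' in text:
        return False
    # Equivalent to the for/else loop in the original: no '.' means no URL.
    if '.' not in text:
        return False
    tld = text.rsplit('.', 1)[1].split(':', 1)[0]
    if tld.endswith('/'):
        return True
    if tld.isalpha() and tld in _tlds:
        return True
    return False


print(like_url('email@address.com'))      # e-mail address
print(like_url('http://www.google.com'))  # URL with a scheme
print(like_url('google.com'))             # bare domain
```

With the check placed after the `http://` test, a URL that carries user info (such as `http://user@example.com`) would still count as URL-like, while a plain e-mail address would not.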

ines · 2017-12-09T08:00:49Z

@Bri-Will Thanks for updating – nice to hear that it's working!

"I'd love to add the fix, but I'm too much of a newbie to be trusted to do that!"

Don't worry – of course, you don't have to, but if you'd like to submit your first PR, I'm happy to assist! Your fix looks fine and you don't have to worry about adding tests for now (I'll take care of this later).

The easiest way is to just fork spaCy so it's added to your GitHub account. Then add the fix and push it to your fork, go to the New Pull Request page and hit "Compare across forks". Select your fork as the head repo, and your change should show up. You can then create a new pull request, write a sentence describing the change and submit it 🎉 There's actually very little that can go wrong – all new PRs are tested and we'll review them before merging, so you can't break anything.

Bri-Will · 2017-12-11T22:47:46Z

OK, I've done this and submitted a pull request. (I did not add tests for this.)

lock · 2018-05-08T05:55:13Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "help wanted", "help wanted (easy)" and "performance" labels · Dec 8, 2017

ines mentioned this issue · Dec 8, 2017: LIKE_URL property is not working correctly #1606 (Closed)

Bri-Will mentioned this issue · Dec 11, 2017: Update lex_attrs.py. Fix like_url from matching on e-mail #1715 (Merged)

ines added a commit to Bri-Will/spaCy that referenced this issue · Dec 12, 2017: Add regression test for explosion#1698 (9c1ee65)

ines closed this as completed in 1e61fff · Dec 12, 2017

lock bot locked as resolved and limited conversation to collaborators · May 8, 2018
