Tokenization of URLs needs work #344

Closed
rlvoyer opened this issue Apr 19, 2016 · 2 comments · Fixed by #700

Comments


rlvoyer commented Apr 19, 2016

In [2]: from spacy.en import English  # import needed to run this snippet (spaCy as of 2016)

In [3]: nlp = English()

In [4]: doc = nlp("Do you agree that this is a URL: http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0")

In [5]: [s.lemma_.lower() for s in doc if not s.like_url]
Out[5]:
['do',
 'you',
 'agree',
 'that',
 'this',
 'be',
 'a',
 'url',
 ':',
 '-',
 'york-primary-preview.html?hp&action=click&pgtype=homepage&clicksource=story-heading&module=a-lede-package-region&region=top-news&wt.nav=top-news&_r=0']
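
To see where the tokenizer splits, the raw token texts can be printed in the same session rather than the filtered lemmas (output omitted; per the list above, the URL comes out broken into several tokens instead of one):

In [6]: [t.text for t in doc]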

rlvoyer commented Apr 19, 2016

In fact, the problem here might be better characterized as a tokenization problem (so I'm going to rename the issue).
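
A possible workaround, sketched below against the modern spaCy API (spacy.blank and the Tokenizer.token_match hook; the regex is an illustrative assumption, and this is not necessarily the approach taken in #700), is to let a URL pattern claim whole whitespace-delimited strings before the usual splitting rules apply:

import re
import spacy

# Rough URL pattern for illustration only; spaCy ships a much more careful one.
URL_RE = re.compile(r"https?://\S+")

nlp = spacy.blank("en")                   # blank English pipeline
nlp.tokenizer.token_match = URL_RE.match  # matching strings are kept as single tokens

doc = nlp("Do you agree that this is a URL: http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0")
print([t.text for t in doc])              # the URL should now come out as a single token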

rlvoyer changed the title from "URL identification needs work" to "Tokenization of URLs needs work" on Apr 19, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 9, 2018