URL Detection Problem #82

ghost · 2021-01-11T00:57:56Z

I tried to use the module to detect a link in the following string.

Link:https://www.google.com

but it failed to detect that there is a url.

lipoja · 2021-01-11T11:02:50Z

Hi @Ricardolcm888, thanks for reporting it.
However, this is not an easily fixed issue. The text is not typographically correct; there should be a space after the first colon, as in "Link: https://example.com". And yeah, I know, the internet is full of these mistakes and typos.
Is it possible for you to somehow pre-process the text?

I will think about it, but right now I do not see any general solution for this.
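
As an illustration of the pre-processing suggested above, here is a minimal sketch that inserts the missing space before the text reaches the extractor. The helper name `separate_glued_urls` and the choice to handle only `http`, `https`, and `ftp` are assumptions made for this example; it is not part of urlextract. It only splits when the scheme directly follows a word character plus a colon, so a URL embedded in a query string (`?u=https://...`) is left alone, but a legitimate prefix such as `view-source:` would also be split.

```python
import re

# Zero-width match: the position right after "<word char>:" and right before
# a scheme we recognise. The scheme list is an assumption for this sketch.
_GLUED_SCHEME = re.compile(r"(?<=\w:)(?=(?:https?|ftp)://)")


def separate_glued_urls(text: str) -> str:
    """Insert a space between a 'label:' prefix and a URL glued to it."""
    return _GLUED_SCHEME.sub(" ", text)


if __name__ == "__main__":
    print(separate_glued_urls("Link:https://www.google.com"))
```

The cleaned string can then be passed to `find_urls` as usual.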

amoldavsky · 2022-03-14T02:43:18Z

Yes, this is in fact a problem:

from urlextract import URLExtract

extractor = URLExtract()
extractor.find_urls('earn $600 every week, work from home job:https://2.ua/YHfw38')

results in:

["job:https://2.ua/YHfw38"]

@lipoja RE: "The text is not typographically correct, there should be space after the first colon sign"
Well, what should be and what is sadly rarely coincide 🤣

I had to fix this for my ML pre-processing. It is a pretty straightforward fix; I will submit a PR shortly...

amoldavsky · 2022-03-14T03:06:49Z

Here is the PR: #120

@lipoja I would appreciate it if you could merge that in and release, so I do not have to ship a production model off of my code change in the form of a hack in a git branch :)

lipoja · 2022-03-15T09:19:02Z

@amoldavsky Thank you for contributing! Sure, I can merge it and release it. But before I do that, I would like to discuss a few of my ideas with you so we do not break extraction or unintentionally filter out some URLs that the current code extracts. Please have a look at your PR.

amoldavsky · 2022-03-16T15:59:20Z

Yup, I started a discussion in the PR

Stvad · 2022-11-30T16:30:41Z

Facing the same issue; I was curious what the state of the PR for fixing this is! :)

amoldavsky mentioned this issue Mar 14, 2022

fix-multiple-protocols-in-url #120

Closed

lipoja added a commit that referenced this issue Dec 14, 2022

Adding the ability to set stop characters inside of scheme - default stop chars ':' fixes #82

d6c88e3

lipoja added a commit that referenced this issue Dec 14, 2022

Adding the ability to set stop characters inside of scheme - default stop chars ':' fixes #82

6566b09

lipoja added a commit that referenced this issue Dec 14, 2022

Adding the ability to set stop characters inside of scheme - default stop chars ':' fixes #82

fc62fac

lipoja added a commit that referenced this issue Dec 14, 2022

Adding the ability to set stop characters inside of scheme - default stop chars ':' fixes #82

2bffddf

lipoja closed this as completed in 6b2b1fc Dec 14, 2022
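
The commit message describes making the stop characters inside the scheme configurable, with ':' as the default. The sketch below illustrates that idea only; it is not the library's code, and `scheme_start` and `stop_chars` are hypothetical names chosen for this example. Starting from the `://` separator, it walks left until it hits whitespace or a stop character and treats that position as the start of the URL.

```python
def scheme_start(text: str, sep_index: int, stop_chars=frozenset(":")) -> int:
    """Return the index where the scheme begins, scanning left from '://'.

    sep_index is the index of '://' in text. Scanning stops at whitespace
    or at any character in stop_chars (hypothetical parameter).
    """
    i = sep_index
    while i > 0 and not text[i - 1].isspace() and text[i - 1] not in stop_chars:
        i -= 1
    return i


if __name__ == "__main__":
    sample = "job:https://2.ua/YHfw38"
    sep = sample.find("://")
    # With ':' as a stop character the 'job:' prefix is left out.
    print(sample[scheme_start(sample, sep):])
    # With no stop characters the prefix is swallowed, as reported in this issue.
    print(sample[scheme_start(sample, sep, stop_chars=frozenset()):])
```

Keeping the stop characters as a parameter mirrors the concern raised in the discussion: a fixed rule could filter out URLs that the previous behaviour extracted, so callers can opt out.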
