URL Detection Problem #82

Closed
ghost opened this issue Jan 11, 2021 · 6 comments

Comments

ghost commented Jan 11, 2021

I tried to use the module to detect a link in the following string.

Link:https://www.google.com

but it failed to detect the URL.

Owner

lipoja commented Jan 11, 2021

Hi @Ricardolcm888, thanks for reporting it.
However, this is not an easy issue to fix. The text is not typographically correct: there should be a space after the first colon, as in Link: https://example.com (and yes, I know, the internet is full of these mistakes and typos).
Is it possible for you to somehow pre-process the text?

I will think about it, but right now I do not see a general solution for this.
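
As a rough illustration of the pre-processing idea mentioned above, here is a minimal sketch that inserts a space wherever a colon is immediately followed by an http or https scheme. The helper name `separate_glued_urls` is hypothetical and not part of urlextract, and the regex only covers http(s). It could also split a URL that legitimately embeds another URL right after a colon, so treat it as a starting point rather than a general solution.

```python
import re

# Hypothetical helper, not part of urlextract.
# Matches the empty position between a colon and an http(s) scheme.
_GLUED_SCHEME = re.compile(r"(?<=:)(?=https?://)", re.IGNORECASE)


def separate_glued_urls(text: str) -> str:
    """Rewrite 'label:https://...' as 'label: https://...'."""
    return _GLUED_SCHEME.sub(" ", text)


# The string from the original report
print(separate_glued_urls("Link:https://www.google.com"))
# Text that already has the space is left unchanged
print(separate_glued_urls("Link: https://example.com"))
```

The pre-processed text can then be passed to the extractor as usual.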

@amoldavsky

Yes, this is in fact a problem:

from urlextract import URLExtract

extractor = URLExtract()
extractor.find_urls('earn $600 every week, work from home job:https://2.ua/YHfw38')

results in:

["job:https://2.ua/YHfw38"]
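
Until a fix lands in the library, one possible workaround is to clean up the extracted tokens afterwards. The sketch below is only an assumption about how that could look: `strip_prefix_before_scheme` is a hypothetical helper (not a urlextract API) that drops everything before the first http(s) scheme and leaves tokens without a scheme untouched. It does not reflect how the PR mentioned in this thread handles the problem.

```python
import re

# Hypothetical helper, not part of urlextract.
_SCHEME = re.compile(r"https?://", re.IGNORECASE)


def strip_prefix_before_scheme(candidate: str) -> str:
    """Drop any text glued in front of the first http(s) scheme."""
    match = _SCHEME.search(candidate)
    return candidate[match.start():] if match else candidate


extracted = ["job:https://2.ua/YHfw38"]  # the result reported above
cleaned = [strip_prefix_before_scheme(url) for url in extracted]
print(cleaned)  # ['https://2.ua/YHfw38']
```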

@lipoja RE: "The text is not typographically correct, there should be space after the first colon sign"
Well, what should be and what is sadly rarely coincide 🤣

I had to fix this for my ML pre-processing. It is a pretty straightforward fix; I will submit a PR shortly...

@amoldavsky

Here is the PR: #120

@lipoja I would appreciate it if you could merge that in and cut a release, so I do not have to ship a production model that depends on my change as a hack in a git branch :)

Owner

lipoja commented Mar 15, 2022

@amoldavsky Thank you for contributing! Sure, I can merge it and release it. But before I do that, I would like to discuss a few of my ideas with you, so that we do not break extraction or unintentionally filter out URLs that the current code extracts. Please have a look at your PR.

@amoldavsky

Yup, I started a discussion in the PR


Stvad commented Nov 30, 2022

Facing the same issue; I was curious what the state of the PR for fixing this is! :)

lipoja added 4 commits that referenced this issue Dec 14, 2022
lipoja closed this as completed in 6b2b1fc Dec 14, 2022