Unable to Crawl JS Dependent Sites #3

Open
Texan1835 opened this issue Jun 4, 2020 · 7 comments
Labels: enhancement (New feature or request)


Texan1835 commented Jun 4, 2020

Ran this scraper exactly as it was created, only changing the log path to a .txt file in a Windows folder. Captures about half the email addresses on a given webpage, but never captures phone numbers. Running the code against an entire website, not just a single webpage, but the error seems to occur even when I run it against a single webpage with multiple phone numbers listed.
Windows 10, Python 3.8, PyCharm.
Please note: I'm a newbie to Python, so it's possible the error is on my end.

Edit: Ran scraper against this link because it has lots of phone/email: https://www.hamradio.com/contact.cfm

Result:

`Crawling https://www.hamradio.com/contact.cfm

Emails:

Phone Numbers:

Process finished with exit code 0`

Owner

z7r1k3 commented Jun 4, 2020

Are those phone numbers linked? i.e. If you view the source, does it have an href="tel:1234567890"?

Currently, only linked phones/emails are supported, but I do plan to eventually add support to search the entire page for anything that looks like a phone/email, linked or not.
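For readers unsure what "linked" means here, the following is a minimal sketch of pulling contacts out of `tel:` and `mailto:` hrefs only. It is not the crawler's actual code: `ContactLinkParser` and `extract_linked_contacts` are hypothetical names, and the sketch uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class ContactLinkParser(HTMLParser):
    """Collects targets of tel: and mailto: hrefs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.phones = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags can carry a tel: or mailto: link.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        scheme, _, rest = href.partition(":")
        scheme = scheme.lower()
        if scheme == "tel":
            self.phones.append(unquote(rest))
        elif scheme == "mailto":
            # Drop any ?subject=... query that follows the address.
            self.emails.append(unquote(rest.split("?", 1)[0]))


def extract_linked_contacts(html):
    """Return (phones, emails) found in link hrefs; plaintext is ignored."""
    parser = ContactLinkParser()
    parser.feed(html)
    return parser.phones, parser.emails
```

With an approach like this, a number that appears only as page text (for example `Phone: 713-533-7373`) is never seen, which would match the empty output reported above.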

Author

Texan1835 commented Jun 4, 2020

It is not formatted like that. Uses br tags.

Code for phone looks like this on the webpage:
`


Phone: 713-533-7373


Toll Free: 800-854-6046

`

Email code:

`
anaheim@hamradio.com

`

[Attached image: HamRadioCode]

Owner

z7r1k3 commented Jun 5, 2020

Unfortunately the crawler doesn't support scraping for plaintext phones/emails yet, although that is on the to-do list. For now it has to be an actual tel or mailto link.
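As a rough illustration of what plaintext scraping could look like, here is a sketch using regular expressions. The patterns and the name `find_plaintext_contacts` are assumptions for illustration, not the project's planned implementation; the phone pattern only covers common US-style formats such as `713-533-7373`.

```python
import re

# Simplified patterns: they will miss some valid addresses and numbers
# and may accept some invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def find_plaintext_contacts(text):
    """Return (phones, emails) that appear as plain text in `text`."""
    return PHONE_RE.findall(text), EMAIL_RE.findall(text)


sample = "Phone: 713-533-7373\nToll Free: 800-854-6046\nanaheim@hamradio.com"
phones, emails = find_plaintext_contacts(sample)
print(phones)
print(emails)
```

A real implementation would likely strip HTML tags first and deduplicate results; both steps are left out here to keep the sketch short.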

As for that .cfm link, since .cfm isn't added to the whitelist, it's treating it as an unsupported filetype. I'll go ahead and add it, but you should be able to put that in as the original scraping URL as a workaround. Is that what you tried and did it still not work?
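The whitelist behaviour described above might be sketched as follows. The set contents and the function name `is_supported_page` are hypothetical, since the project's real list is not shown in this thread; the sketch only illustrates why a `.cfm` link would be skipped until its extension is added.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical whitelist; "" covers URLs with no file extension at all.
SUPPORTED_EXTENSIONS = {"", ".html", ".htm", ".php", ".cfm"}


def is_supported_page(url):
    """Return True if the URL's path extension is on the whitelist."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in SUPPORTED_EXTENSIONS
```

Without `".cfm"` in the set, `is_supported_page("https://www.hamradio.com/contact.cfm")` would return `False`, and a crawler that applies this check to discovered links would treat the page as an unsupported filetype.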

z7r1k3 closed this as completed Jun 5, 2020
z7r1k3 reopened this Jun 5, 2020
Owner

z7r1k3 commented Jun 5, 2020

After debugging, I found that the crawler is unable to view the webpage because the page requires JavaScript.

As such it would appear this site (and any site like it) is unsupported. I may add a fix for this in the future if it becomes common enough, but I'll need to deep dive it a bit.

This is all the HTML the crawler gets to see:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='eD0nMG1TYicuc3Vic3RyKDMsIDEpICsgJycgKycnKyJlIi5zbGljZSgwLDEpICsgIjRzdWN1ciIuY2hhckF0KDApKyJjbCIuY2hhckF0KDApICsgICcnICsgCiIwc3VjdXIiLmNoYXJBdCgwKSsnYScgKyAgICcnICsnMmEnLnNsaWNlKDEsMikrJzMnICsgICJmIiArICIiICsiM3N1Ii5zbGljZSgwLDEpICsgImNzdWN1ciIuY2hhckF0KDApKyIiICsndUc5Jy5jaGFyQXQoMikrJz1mJy5zbGljZSgxLDIpK1N0cmluZy5mcm9tQ2hhckNvZGUoMHgzNCkgKyAnNicgKyAgJzAnICsgICAnJyArJzQnICsgICI2ayIuY2hhckF0KDApICsgICcnICsgCiI5dyIuY2hhckF0KDApICsgIjgiICsgIjIiICsgIiIgKyczJyArICBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzUpICsgIjgiLnNsaWNlKDAsMSkgKyAgJycgKyAKIjRzdWN1ciIuY2hhckF0KDApKyAnJyArJzAnICsgICI1bSIuY2hhckF0KDApICsgICcnICsgCiJhc3UiLnNsaWNlKDAsMSkgKyAiIiArU3RyaW5nLmZyb21DaGFyQ29kZSgweDM0KSArICdWeD4wJy5zdWJzdHIoMywgMSkgKyAnJyArImZzdWN1ciIuY2hhckF0KDApKyJjIiArICAnJyArJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjdXJpJy5jaGFyQXQoMCkgKyAndXMnLmNoYXJBdCgwKSsnYycrJ3VzdScuY2hhckF0KDApICsnc3VjdXJyJy5jaGFyQXQoNSkgKyAnaScrJ19zdScuY2hhckF0KDApICsnc3VjdXJpYycuY2hhckF0KDYpKydsJy5jaGFyQXQoMCkrJ29zdWN1Jy5jaGFyQXQoMCkgICsnc3UnLmNoYXJBdCgxKSsnc3VjdXJkJy5jaGFyQXQoNSkgKyAncCcuY2hhckF0KDApKydyJysnJysnc3VjdXJpbycuY2hhckF0KDYpKyd4c3VjdScuY2hhckF0KDApICArJ3lzJy5jaGFyQXQoMCkrJ18nKyd1JysndScrJ2knKydkJysnX3N1Y3VyJy5jaGFyQXQoMCkrICdiJysnc3VjdXJmJy5jaGFyQXQoNSkgKyAnOScrJzlzJy5jaGFyQXQoMCkrJ2JzdWN1cmknLmNoYXJBdCgwKSArICdlJy5jaGFyQXQoMCkrJ2NzdScuY2hhckF0KDApICsnNnN1YycuY2hhckF0KDApKyAnZnMnLmNoYXJBdCgwKSsiPSIgKyB4ICsgJztwYXRoPS87bWF4LWFnZT04NjQwMCc7IGxvY2F0aW9uLnJlbG9hZCgpOw==';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
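A crawler could at least detect a response like the one above and report it, rather than silently printing empty results. The sketch below is a heuristic under stated assumptions: `looks_js_gated` is a hypothetical name, and the two markers it checks (the `sucuri_cloudproxy_js` variable and a `<noscript>` notice demanding JavaScript on a page with no links) are taken from this one example and will not cover every JavaScript-dependent site.

```python
import re

NOSCRIPT_RE = re.compile(
    r"<noscript>(.*?)</noscript>", re.IGNORECASE | re.DOTALL
)


def looks_js_gated(html):
    """Heuristically detect a page that needs JavaScript to show content."""
    lowered = html.lower()
    # Marker seen in the challenge page quoted in this thread.
    if "sucuri_cloudproxy_js" in lowered:
        return True
    # Fallback: a noscript notice demanding JavaScript, on a page that
    # exposes no links for the crawler to follow.
    has_links = "<a " in lowered
    for body in NOSCRIPT_RE.findall(html):
        text = body.lower()
        if "javascript" in text and "required" in text and not has_links:
            return True
    return False
```

Actually crawling such a page would need something that executes JavaScript, such as a headless browser; that is outside what this sketch attempts.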

z7r1k3 closed this as completed Jun 5, 2020
z7r1k3 reopened this Jun 5, 2020
Owner

z7r1k3 commented Jun 5, 2020

Reopening since the OP stated that it successfully scraped emails from another site, but not the phone numbers.

I'm assuming it's because the phone numbers on the other site (not hamradio) were in plaintext, but I will give the OP a chance to respond on the off-chance there's something else going on here.

OP, can you provide the URL that "Captures about half the email addresses on a given webpage, but never captures phone numbers"? Or at least a snippet of the source code?

Owner

z7r1k3 commented Jun 11, 2020

No reply. Closing, as nothing presented in this issue is currently supported.

z7r1k3 closed this as completed Jun 11, 2020
Owner

z7r1k3 commented Aug 31, 2020

Reopening: since this is a website with proper links, etc. in the HTML, it should be supported.

There is no timeline for fixing this issue, but it is officially on the agenda.

z7r1k3 reopened this Aug 31, 2020
z7r1k3 added the "bug" label (Something isn't working) and removed the "not supported" label Aug 31, 2020
z7r1k3 self-assigned this Sep 3, 2020
z7r1k3 closed this as completed Sep 3, 2020
z7r1k3 reopened this Sep 3, 2020
z7r1k3 changed the title from "Phone numbers not pulling into file" to "Unable to Crawl JS-Dependent Sites" Sep 4, 2020
z7r1k3 changed the title from "Unable to Crawl JS-Dependent Sites" to "Unable to Crawl JS Dependent Sites" Sep 4, 2020
z7r1k3 added the "enhancement" label (New feature or request) and removed the "bug" label (Something isn't working) Mar 24, 2021
2 participants