# How To Crawl Locally
Crawling on your local machine using Wgit is possible, but with a caveat. Consider the following:
```ruby
require 'wgit'

url = Wgit::Url.new 'http://localhost:3000'
url.valid? # => false

url = Wgit::Url.new 'http://127.0.0.1:3000'
url.valid? # => true
```
Only URLs that are `#valid?` are crawl-able. As you can see from above, `localhost` isn't regarded as a valid URL by Wgit, whereas `127.0.0.1` is, because it's an IP address. Obviously, the two URLs are equivalent in that they both reference your local machine. So don't use `localhost`; use `127.0.0.1` for crawling content locally, as shown in the sketch below.
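For example, here's a minimal sketch of crawling a locally running app with `Wgit::Crawler`, assuming a server is listening on port 3000 (the port and the block's logic are illustrative):

```ruby
require 'wgit'

url     = Wgit::Url.new 'http://127.0.0.1:3000'
crawler = Wgit::Crawler.new

# Each crawled page is yielded (and the last is returned) as a Wgit::Document.
doc = crawler.crawl(url) do |d|
  puts d.title
end
```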
So why isn't `localhost` valid? Because technically... it isn't. You can't `curl http://example` successfully, but you can `curl http://example.com`, because the latter is a valid host. Wgit applies the same principle to `localhost`; it doesn't get special treatment.
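You can see the same principle at work with `#valid?` directly (the `example` host here is just an illustration, assuming the validation behaviour described above):

```ruby
require 'wgit'

# A bare hostname without a domain suffix isn't a valid host.
Wgit::Url.new('http://example').valid?     # => false
Wgit::Url.new('http://example.com').valid? # => true
```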