diff --git a/HISTORY.md b/HISTORY.md
index f2bb66a..5df7b83 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -3,7 +3,8 @@
 ### 0.9.5
 
-- normalization: encode unicode chars, strip common trackers (#58, #60, #65)
+- IRI to URI normalization: encode path, query and fragments (#58, #60)
+- normalization: strip common trackers (#65)
 - new function `is_valid_url()` (#63)
 - hardening of domain filter (#64)
 
diff --git a/README.rst b/README.rst
index 316ac81..2a6eaa4 100644
--- a/README.rst
+++ b/README.rst
@@ -27,9 +27,9 @@ Why coURLan?
 
 “Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)
 
-This library provides an additional “brain” for web crawling, scraping and management of web archives:
+This library provides an additional “brain” for web crawling, scraping and management of web pages:
 
-- Avoid loosing bandwidth capacity and processing time for webpages which are probably not worth the effort.
+- Avoid losing bandwidth capacity and processing time for pages which are probably not worth the effort.
 - Stay away from pages with little text content or explicitly target synoptic pages to gather links.
 
 Using content and language-focused filters, Courlan helps navigating the Web so as to improve the resulting document collections. Additional functions include straightforward domain name extraction and URL sampling.
@@ -40,14 +40,14 @@ Features
 
 Separate `the wheat from the chaff `_ and optimize crawls by focusing on non-spam HTML pages containing primarily text.
 
-- Heuristics for triage of links
-  - Targeting spam and unsuitable content-types
-  - Language-aware filtering
-  - Crawl management
 - URL handling
   - Validation
-  - Canonicalization/Normalization
+  - Normalization
   - Sampling
+- Heuristics for link filtering
+  - Spam, trackers and unsuitable content-types
+  - Language/Locale-aware processing
+  - Crawl management (e.g. frontier, scheduling)
 - Usable with Python or on the command-line
 
@@ -91,14 +91,25 @@ All useful operations chained in ``check_url(url)``:
 
 .. code-block:: python
 
     >>> from courlan import check_url
-    # returns url and domain name
+
+    # return url and domain name
     >>> check_url('https://github.com/adbar/courlan')
     ('https://github.com/adbar/courlan', 'github.com')
-    # noisy query parameters can be removed
-    my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
+
+    # filter out bogus domains
+    >>> check_url('http://666.0.0.1/')
+    >>>
+
+    # tracker removal
+    >>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
+    ('http://test.net/foo.html', 'test.net')
+
+    # use strict for additional noisy query parameters
+    >>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
     >>> check_url(my_url, strict=True)
     ('https://httpbin.org/redirect-to', 'httpbin.org')
-    # Check for redirects (HEAD request)
+
+    # check for redirects (HEAD request)
     >>> url, domain_name = check_url(my_url, with_redirects=True)
 
@@ -106,14 +117,17 @@ Language-aware heuristics, notably internationalization in URLs, are available i
 
 .. code-block:: python
 
-    # optional argument targeting webpages in English or German
+    # optional language argument
     >>> url = 'https://www.un.org/en/about-us'
+    # success: returns clean URL and domain name
     >>> check_url(url, language='en')
     ('https://www.un.org/en/about-us', 'un.org')
+    # failure: doesn't return anything
     >>> check_url(url, language='de')
     >>>
+
     # optional argument: strict
     >>> url = 'https://en.wikipedia.org/'
     >>> check_url(url, language='de', strict=False)
 
@@ -390,7 +404,7 @@ Software ecosystem: see `this graphic `_
 - `ural `_
 
diff --git a/tests/unit_tests.py b/tests/unit_tests.py
index 3c87dde..1053973 100644
--- a/tests/unit_tests.py
+++ b/tests/unit_tests.py
@@ -1135,6 +1135,11 @@ def test_examples():
         "https://github.com/adbar/courlan",
         "github.com",
     )
+    assert check_url("http://666.0.0.1/") is None
+    assert check_url("http://test.net/foo.html?utm_source=twitter#gclid=123") == (
+        "http://test.net/foo.html",
+        "test.net",
+    )
     assert check_url(
         "https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org", strict=True
     ) == ("https://httpbin.org/redirect-to", "httpbin.org")
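For context on the two new changelog entries (IRI-to-URI normalization and tracker stripping), the underlying ideas can be sketched with the Python standard library alone. This is an illustrative approximation, not courlan's implementation: the tracker list is a hypothetical subset, and the helper names are made up for this sketch.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

# Hypothetical subset of tracking parameters (courlan maintains its own list).
TRACKER_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def strip_trackers(url: str) -> str:
    """Drop known tracking parameters from the query string."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKER_PARAMS]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

def iri_to_uri(url: str) -> str:
    """Percent-encode non-ASCII characters in path, query and fragment."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))

print(strip_trackers("http://test.net/foo.html?utm_source=twitter&id=1"))
# http://test.net/foo.html?id=1
print(iri_to_uri("https://example.org/dépôt?q=café"))
# https://example.org/d%C3%A9p%C3%B4t?q=caf%C3%A9
```

Note this sketch only cleans the query string; the README example also discards a tracker-bearing fragment, which courlan handles itself.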
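The new ``check_url('http://666.0.0.1/')`` example and its unit test reject hosts that look like IP addresses but are out of range (666 exceeds the IPv4 octet maximum of 255). A stand-alone sketch of that kind of check using only the standard library — ``has_valid_host`` is a hypothetical helper, not part of courlan's API:

```python
from ipaddress import ip_address
from urllib.parse import urlsplit

def has_valid_host(url: str) -> bool:
    """Reject URLs whose host looks numeric but is not a valid IP address."""
    host = urlsplit(url).hostname or ""
    if host.replace(".", "").isdigit():  # looks like a dotted-quad address
        try:
            ip_address(host)  # raises ValueError on e.g. '666.0.0.1'
        except ValueError:
            return False
    return bool(host)

print(has_valid_host("http://666.0.0.1/"))    # False
print(has_valid_host("http://10.0.0.1/"))     # True
print(has_valid_host("https://github.com/"))  # True
```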
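The language examples in the README hunk rely on internationalization clues in the URL itself, such as the ``/en/`` path segment of the UN page. As a rough illustration of the idea — courlan's actual heuristics are more extensive, and these helper names are invented for the sketch — a toy version might look like:

```python
import re

# Toy pattern: a two-letter, lowercase path segment such as '/en/' or '/de'.
LANG_SEGMENT = re.compile(r"/([a-z]{2})(?=/|$)")

def path_language(url):
    """Return the first two-letter path segment of a URL, or None."""
    # drop the scheme prefix so the '//' does not confuse the segment search
    match = LANG_SEGMENT.search(url.split("//", 1)[-1])
    return match.group(1) if match else None

def matches_language(url, language):
    """Accept the URL if no language clue is found or the clue matches."""
    found = path_language(url)
    return found is None or found == language

print(path_language("https://www.un.org/en/about-us"))        # en
print(matches_language("https://www.un.org/en/about-us", "de"))  # False
```

Like ``check_url(url, language='de', strict=False)`` on ``https://en.wikipedia.org/``, this toy check accepts URLs with no path-level clue; subdomain hints such as ``en.`` are a separate, stricter matter.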