
Commit

update readme

adbar committed Nov 27, 2023
1 parent ec2bedb commit d7cef52
Showing 3 changed files with 34 additions and 14 deletions.
HISTORY.md: 3 changes (2 additions, 1 deletion)
@@ -3,7 +3,8 @@

### 0.9.5

-- normalization: encode unicode chars, strip common trackers (#58, #60, #65)
+- IRI to URI normalization: encode path, query and fragments (#58, #60)
+- normalization: strip common trackers (#65)
- new function `is_valid_url()` (#63)
- hardening of domain filter (#64)
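
A minimal sketch of the 0.9.5 changes above (the boolean behavior of `is_valid_url()` follows the changelog entry; the percent-encoded output of `normalize_url()` is an assumption about the IRI-to-URI step, not a verbatim capture):

.. code-block:: python

    >>> from courlan import is_valid_url, normalize_url

    # new boolean validation helper (#63)
    >>> is_valid_url('https://example.org/page')
    True
    >>> is_valid_url('not a url')
    False

    # IRI to URI normalization (#58, #60): non-ASCII path segments
    # should come back percent-encoded (illustrative output)
    >>> normalize_url('https://example.org/wiki/Überblick')
    'https://example.org/wiki/%C3%9Cberblick'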

README.rst: 40 changes (27 additions, 13 deletions)
@@ -27,9 +27,9 @@ Why coURLan?
“Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)


-This library provides an additional “brain” for web crawling, scraping and management of web archives:
+This library provides an additional “brain” for web crawling, scraping and management of web pages:

-- Avoid losing bandwidth capacity and processing time for webpages which are probably not worth the effort.
+- Avoid losing bandwidth capacity and processing time for pages which are probably not worth the effort.
- Stay away from pages with little text content or explicitly target synoptic pages to gather links.

Using content and language-focused filters, Courlan helps navigate the Web so as to improve the resulting document collections. Additional functions include straightforward domain name extraction and URL sampling.
@@ -40,14 +40,14 @@ Features

Separate `the wheat from the chaff <https://en.wiktionary.org/wiki/separate_the_wheat_from_the_chaff>`_ and optimize crawls by focusing on non-spam HTML pages containing primarily text.

-- Heuristics for triage of links
-- Targeting spam and unsuitable content-types
-- Language-aware filtering
-- Crawl management
- URL handling
- Validation
-- Canonicalization/Normalization
+- Normalization
- Sampling
+- Heuristics for link filtering
+- Spam, trackers and unsuitable content-types
+- Language/Locale-aware processing
+- Crawl management (e.g. frontier, scheduling)
- Usable with Python or on the command-line
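
A minimal sketch of how the reorganized feature set above maps onto courlan's public helpers (``check_url`` is demonstrated in the next section; the tuple return of ``validate_url`` is an assumption consistent with the simpler ``is_valid_url()`` added in 0.9.5):

.. code-block:: python

    >>> from courlan import extract_domain, sample_urls, validate_url

    # validation: assumed to return a (boolean, parsed URL) tuple
    >>> validate_url('http://test.net/foo')[0]
    True

    # straightforward domain name extraction
    >>> extract_domain('https://www.example.org/page')
    'example.org'

    # sampling: draw a smaller sample out of a URL list
    >>> urls = ['http://test.net/1', 'http://test.net/2', 'http://example.org/1']
    >>> my_sample = sample_urls(urls, 2)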


@@ -91,29 +91,43 @@ All useful operations chained in ``check_url(url)``:
.. code-block:: python

    >>> from courlan import check_url

-    # returns url and domain name
+    # return url and domain name
    >>> check_url('https://github.com/adbar/courlan')
    ('https://github.com/adbar/courlan', 'github.com')

-    # noisy query parameters can be removed
-    my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
+    # filter out bogus domains
+    >>> check_url('http://666.0.0.1/')
+    >>>
+    # tracker removal
+    >>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
+    ('http://test.net/foo.html', 'test.net')
+    # use strict for additional noisy query parameters
+    >>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
    >>> check_url(my_url, strict=True)
    ('https://httpbin.org/redirect-to', 'httpbin.org')

-    # Check for redirects (HEAD request)
+    # check for redirects (HEAD request)
    >>> url, domain_name = check_url(my_url, with_redirects=True)
Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)``:

.. code-block:: python

-    # optional argument targeting webpages in English or German
+    # optional language argument
    >>> url = 'https://www.un.org/en/about-us'

    # success: returns clean URL and domain name
    >>> check_url(url, language='en')
    ('https://www.un.org/en/about-us', 'un.org')

    # failure: doesn't return anything
    >>> check_url(url, language='de')
    >>>

    # optional argument: strict
    >>> url = 'https://en.wikipedia.org/'
    >>> check_url(url, language='de', strict=False)
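
The underlying ``lang_filter(url, language)`` helper can also be called on its own; a minimal sketch, assuming it returns a boolean as its use inside ``check_url`` suggests:

.. code-block:: python

    >>> from courlan import lang_filter

    # True if the URL's language markers match the target language
    >>> lang_filter('https://www.un.org/en/about-us', 'en')
    True
    >>> lang_filter('https://www.un.org/en/about-us', 'de')
    False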
@@ -390,7 +404,7 @@ Software ecosystem: see `this graphic <https://github.com/adbar/trafilatura/blob
Similar work
------------

-These Python libraries perform similar normalization tasks but do not entail language or content filters. They also do not focus on crawl optimization:
+These Python libraries perform similar handling and normalization tasks but do not entail language or content filters. They also do not primarily focus on crawl optimization:

- `furl <https://github.com/gruns/furl>`_
- `ural <https://github.com/medialab/ural>`_
tests/unit_tests.py: 5 changes (5 additions, 0 deletions)
@@ -1135,6 +1135,11 @@ def test_examples():
"https://github.com/adbar/courlan",
"github.com",
)
assert check_url("http://666.0.0.1/") is None
assert check_url("http://test.net/foo.html?utm_source=twitter#gclid=123") == (
"http://test.net/foo.html",
"test.net",
)
assert check_url(
"https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org", strict=True
) == ("https://httpbin.org/redirect-to", "httpbin.org")
