
Commit

update readme

adbar committed Nov 27, 2023
1 parent ec2bedb commit d7cef52
Showing 3 changed files with 34 additions and 14 deletions.
HISTORY.md: 3 changes (2 additions, 1 deletion)
@@ -3,7 +3,8 @@

### 0.9.5

-- normalization: encode unicode chars, strip common trackers (#58, #60, #65)
+- IRI to URI normalization: encode path, query and fragments (#58, #60)
+- normalization: strip common trackers (#65)
- new function `is_valid_url()` (#63)
- hardening of domain filter (#64)
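
A minimal sketch of the 0.9.5 changes above (the boolean behavior of `is_valid_url()` follows the changelog entry; the percent-encoded output of `normalize_url()` is an assumption about the IRI-to-URI step, not a verbatim capture):

.. code-block:: python

    >>> from courlan import is_valid_url, normalize_url

    # new boolean validation helper (#63)
    >>> is_valid_url('https://example.org/page')
    True
    >>> is_valid_url('not a url')
    False

    # IRI to URI normalization (#58, #60): non-ASCII path segments
    # should come back percent-encoded (illustrative output)
    >>> normalize_url('https://example.org/wiki/Überblick')
    'https://example.org/wiki/%C3%9Cberblick'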

README.rst: 40 changes (27 additions, 13 deletions)
@@ -27,9 +27,9 @@ Why coURLan?
“Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)


-This library provides an additional “brain” for web crawling, scraping and management of web archives:
+This library provides an additional “brain” for web crawling, scraping and management of web pages:

-- Avoid losing bandwidth capacity and processing time for webpages which are probably not worth the effort.
+- Avoid losing bandwidth capacity and processing time for pages which are probably not worth the effort.
- Stay away from pages with little text content or explicitly target synoptic pages to gather links.

Using content and language-focused filters, Courlan helps navigate the Web so as to improve the resulting document collections. Additional functions include straightforward domain name extraction and URL sampling.
@@ -40,14 +40,14 @@ Features

Separate `the wheat from the chaff <https://en.wiktionary.org/wiki/separate_the_wheat_from_the_chaff>`_ and optimize crawls by focusing on non-spam HTML pages containing primarily text.

-- Heuristics for triage of links
-- Targeting spam and unsuitable content-types
-- Language-aware filtering
-- Crawl management
- URL handling
- Validation
-- Canonicalization/Normalization
+- Normalization
- Sampling
+- Heuristics for link filtering
+- Spam, trackers and unsuitable content-types
+- Language/Locale-aware processing
+- Crawl management (e.g. frontier, scheduling)
- Usable with Python or on the command-line
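
A minimal sketch of how the reorganized feature set above maps onto courlan's public helpers (``check_url`` is demonstrated in the next section; the tuple return of ``validate_url`` is an assumption consistent with the simpler ``is_valid_url()`` added in 0.9.5):

.. code-block:: python

    >>> from courlan import extract_domain, sample_urls, validate_url

    # validation: assumed to return a (boolean, parsed URL) tuple
    >>> validate_url('http://test.net/foo')[0]
    True

    # straightforward domain name extraction
    >>> extract_domain('https://www.example.org/page')
    'example.org'

    # sampling: draw a smaller sample out of a URL list
    >>> urls = ['http://test.net/1', 'http://test.net/2', 'http://example.org/1']
    >>> my_sample = sample_urls(urls, 2)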


@@ -91,29 +91,43 @@ All useful operations chained in ``check_url(url)``:
.. code-block:: python

    >>> from courlan import check_url

-    # returns url and domain name
+    # return url and domain name
    >>> check_url('https://github.com/adbar/courlan')
    ('https://github.com/adbar/courlan', 'github.com')

-    # noisy query parameters can be removed
-    my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
+    # filter out bogus domains
+    >>> check_url('http://666.0.0.1/')
+    >>>
+    # tracker removal
+    >>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
+    ('http://test.net/foo.html', 'test.net')
+    # use strict for additional noisy query parameters
+    >>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
    >>> check_url(my_url, strict=True)
    ('https://httpbin.org/redirect-to', 'httpbin.org')

-    # Check for redirects (HEAD request)
+    # check for redirects (HEAD request)
    >>> url, domain_name = check_url(my_url, with_redirects=True)
Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)``:

.. code-block:: python

-    # optional argument targeting webpages in English or German
+    # optional language argument
    >>> url = 'https://www.un.org/en/about-us'

    # success: returns clean URL and domain name
    >>> check_url(url, language='en')
    ('https://www.un.org/en/about-us', 'un.org')

    # failure: doesn't return anything
    >>> check_url(url, language='de')
    >>>

    # optional argument: strict
    >>> url = 'https://en.wikipedia.org/'
    >>> check_url(url, language='de', strict=False)
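
The underlying ``lang_filter(url, language)`` helper can also be called on its own; a minimal sketch, assuming it returns a boolean as its use inside ``check_url`` suggests:

.. code-block:: python

    >>> from courlan import lang_filter

    # True if the URL's language markers match the target language
    >>> lang_filter('https://www.un.org/en/about-us', 'en')
    True
    >>> lang_filter('https://www.un.org/en/about-us', 'de')
    False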
@@ -390,7 +404,7 @@ Software ecosystem: see `this graphic <https://github.com/adbar/trafilatura/blob
Similar work
------------

-These Python libraries perform similar normalization tasks but do not entail language or content filters. They also do not focus on crawl optimization:
+These Python libraries perform similar handling and normalization tasks but do not entail language or content filters. They also do not primarily focus on crawl optimization:

- `furl <https://github.com/gruns/furl>`_
- `ural <https://github.com/medialab/ural>`_
tests/unit_tests.py: 5 changes (5 additions, 0 deletions)
@@ -1135,6 +1135,11 @@ def test_examples():
"https://github.com/adbar/courlan",
"github.com",
)
assert check_url("http://666.0.0.1/") is None
assert check_url("http://test.net/foo.html?utm_source=twitter#gclid=123") == (
"http://test.net/foo.html",
"test.net",
)
assert check_url(
"https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org", strict=True
) == ("https://httpbin.org/redirect-to", "httpbin.org")
