Atra is a novel web crawling solution, implemented in Rust, designed with the primary goal of scraping websites as comprehensively as possible, while ensuring ease of use and accessibility.
Download the precompiled executable and run the following command:
- Windows:
./atra.exe single -s test_crawl -d 2 --absolute https://choosealicense.com
- Linux:
./atra single -s test_crawl -d 2 --absolute https://choosealicense.com
You will then find a folder named atra_data next to the binary.
- Create a file with the name seeds.txt
- Add a single URL per line (see the example after this list)
- Put it in the directory with the atra binary
- Call
./atra.exe --generate-example-config
or ./atra --generate-example-config
- Modify the values to meet your needs
- Rename them to atra.ini and crawl.yaml
- Call
./atra.exe multi --log-to-file file:seeds.txt
or ./atra multi --log-to-file file:seeds.txt
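A seeds.txt is simply a plain list of start URLs, one per line. The URLs below are only illustrative:
https://choosealicense.com
https://www.rust-lang.org
https://en.wikipedia.org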
In order to build Atra you need Rust.
Initialize the submodules:
git submodule update --init --recursive
Update the submodules:
git submodule update --remote --recursive
On Windows, after installing Rust you need LLVM and MSVC, with the proper environment paths set. This works as long as the pdf feature is not activated.
On Linux, after installing Rust you need pkg-config, libssl-dev, clang, and llvm in order to compile Atra.
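On Debian/Ubuntu based systems the build dependencies can, for example, be installed like this (package names may differ on other distributions):
sudo apt-get install pkg-config libssl-dev clang llvm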
You can also use the Docker container to build a binary.
Due to the dynamic linking you will need libc6, openssl, and ca-certificates installed on your system.
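On a Debian based system or image the runtime dependencies could, for example, be installed like this (again, exact package names depend on the distribution):
apt-get install -y libc6 openssl ca-certificates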
Atra is designed with a clear set of guiding principles to ensure simplicity, efficiency, and longevity:
- Minimal Dependencies: Atra focuses on a streamlined architecture, ensuring only necessary components are included.
- Sustainability and Durability: Atra is built for long-term reliability, prioritizing robustness and stability.
- Unified Configuration: All settings are managed within a single configuration file, simplifying setup and operation.
If you appreciate Atra, please consider supporting the Zeehondencentrum Pietersburen. Even a small donation of 10€ can provide food for a seal for an entire day.
The Zeehondencentrum Pietersburen is the largest seal hospital in Europe. Since the 1970s, they have been dedicated to the care of sick, injured, or weakened seals. After recovery, the seals are released back into the wild, with no animal remaining in permanent care. The center operates solely on donations, gifts, and entrance fees.
Atra takes its name from Erigone atra, a species of dwarf spider measuring 1.8mm to 2.8mm in length. These spiders play an essential role in natural pest control, particularly in agriculture, and are known for their unique ability to travel long distances by "ballooning," or kiting.
For more information, visit Wikipedia.
The exit codes returned by Atra when some kind of error happens:
Code | Meaning |
---|---|
0 | Success |
1 | Unknown Error |
2 | A file was not found, or Atra was not able to interact with the file system. |
3 | The config was faulty in some way. |
4 | Was not able to deserialize the config.json |
5 | The directory already exists. |
10 | Atra was not able to initialize the context by some unknown error. |
11 | Atra was not able to initialize the context due to some IO problem. |
12 | Atra was not able to open the database. |
13 | Atra had a RocksDB error, most probably a problem with the layout of the database. |
14 | Atra was not able to work with the queue file. |
15 | Atra failed to read the blacklist. |
16 | Atra had an error while initializing the SVM |
17 | Atra had an error while initializing the webgraph |
18 | Atra failed to serialize/deserialize some kind of data. |
40 | Atra failed to initialize a worker context |
50 | Atra failed to fill the queue |
70 | Atra failed to serialize some data while dumping |
100 | The crawl failed in some unexpected way. |
101 | Failed to store some crawl information in the database. |
102 | Failed to handle some link information. |
103 | Failed to deserialize the link state for a URL |
104 | Failed to update/read the link state from the database |
105 | Failed to write a crawl result to the database and/or filesystem. |
106 | Failed to read/write from/to the queue. |
107 | The client failed for some reason. |
108 | Failed to execute a request. |
109 | Failed to interact with the file system. |
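In a POSIX shell the exit code of the last run can be inspected like this, for example:
./atra single -s test_crawl -d 2 --absolute https://choosealicense.com
echo $?  # prints 0 on success, otherwise one of the codes above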
Atra is configured using JSON configs and environment variables. The following table shows the settings written as qualified paths for a JSON.
Path | format | Explanation |
---|---|---|
system | JSON | Contains all configs regarding the system. Usually caches. |
system.robots_cache_size | uInt > 0; Element Count | The cache size of the robots manager. (default: 32) |
system.web_graph_cache_size | uInt > 0; Element Count | The cache size of the webgraph manager. (default: 20,000) |
system.max_file_size_in_memory | uLong; in Byte | Max size of the files stored in memory. (default: 100MB). If set to 0 nothing will be stored in memory. |
system.max_temp_file_size_on_disc | uLong; in Byte | Max size of a temp file on the disc. (default: 16384 Pebibyte). If set to 0 nothing will be stored on the disc. |
system.log_level | String; Enum (see Log Level) | The log level of the crawler. (default: Info) |
system.log_to_file | boolean | Log to a file and not to console. (default: false) |
paths | JSON | Contains all configs regarding the paths. |
paths.root | String; Path | The root path where the application runs. (default: "./atra_data") |
paths.directories | JSON | |
paths.directories.database | String; Path | Path to the database directory. (default: root/rocksdb) |
paths.directories.big_files | String; Path | Path to the big files directory. (default: root/big_files) |
paths.files | JSON | |
paths.files.queue | String; Path | Path to the queue file (if one is needed) (default: root/queue.tmp) |
paths.files.blacklist | String; Path | Path to the blacklist (default: root/blacklist.txt) |
paths.files.web_graph | String; Path | Path to the web graph generated by atra (default: root/web_graph.ttl) |
session | JSON | The config of the session |
session.service | String | The name of the service (default: "atra") |
session.collection | String | The name of the collection created (default: "unnamed") |
session.crawl_job_id | uInt | The crawl job id. To differentiate for the same service and collection. |
session.warc_compression_level | uInt | - unused - |
crawl | JSON | |
crawl.user_agent | String; Enum (see User Agents) | The user agent used by the crawler. (default: Default) |
crawl.respect_robots_txt | boolean | Respect the robots.txt file and do not scrape disallowed files. This may slow down crawls if the robots.txt file has a delay included. (default: true) |
crawl.generate_web_graph | boolean | If set Atra generates the webgraph. This can impact the overall performance of the crawl. (default: true) |
crawl.respect_nofollow | boolean | Respect the nofollow attribute during the link extraction (default: true) |
crawl.crawl_embedded_data | boolean | Extract links to embedded data like audio/video files for the crawl-queue (default: false) |
crawl.crawl_forms | boolean | Extract links from form action. (default: false) |
crawl.crawl_javascript | boolean | Extract links to/from javascript files for the crawl-queue (default: true) |
crawl.crawl_onclick_by_heuristic | boolean | Try to extract links from tags with onclick attribute for the crawl-queue (default: false) |
crawl.apply_gdbr_filter_if_possible | boolean | Tries to apply a GDBR filter, if one was properly configured. |
crawl.store_only_html_in_warc | boolean | Only store HTML files in the WARC. |
crawl.store_big_file_hints_in_war | boolean | Also store the big-file hints in the WARC. |
crawl.max_file_size | uInt/null; in Byte | The maximum size to download. If null there is no limit. (default: null) |
crawl.max_robots_age | String/null; "[whole_seconds].[whole_nanoseconds]" | The maximum age of a cached robots.txt. If null, it never gets too old. |
crawl.ignore_sitemap | boolean | Prevent including the sitemap links with the crawl. (default: false) |
crawl.subdomains | boolean | Allow sub-domains. (default: false) |
crawl.cache | boolean | Cache the page following HTTP caching rules. (default: false) |
crawl.use_cookies | boolean | Use cookies (default: false) |
crawl.cookies | JSON/null; (see Cookie Settings) | Domain bound cookie config. (default: null) |
crawl.headers | JSON/null; {"- header_name -": "- header_value -"} | Headers to include with requests. (default: null) |
crawl.proxies | List; ["- proxy -", "- proxy -"] | Use a proxy list for performing network requests. (default: null) |
crawl.tld | boolean | Allow all TLDs for the domain. (default: false) |
crawl.delay | String; "[whole_seconds].[whole_nanoseconds]" | Polite crawling delay (default: 1 second) |
crawl.budget | JSON/null; (see Crawl Budget) | The budget settings for this crawl. |
crawl.max_queue_age | uInt | How often can we fail to crawl an entry in the queue until it is dropped? (0 means never drop) (default: 20) |
crawl.redirect_limit | uInt | The max redirections allowed for request. (default: 5 like Google-Bot) |
crawl.redirect_policy | String; Enum (see Redirection Policy) | The redirect policy type to use. (default: Loose) |
crawl.accept_invalid_certs | boolean | Dangerously accept invalid certificates (default: false) |
crawl.link_extractors | JSON; [- Command -, - Command -] (see Link Extractor Settings) | A custom configuration of extractors. (default: see Extractor Settings) |
crawl.decode_big_files_up_to | uInt/null; in Byte | If this value is set, Atra tries to decode and process files that are only downloaded as blobs but do not exceed the provided size (in Bytes). Null means off. (default: null) |
crawl.use_default_stopwords | boolean | If this is set, all default stopwords known to Atra are used. (default: true) |
crawl.stopword_registry | JSON/null; (see Stopword Registry) | Used to configure the global registry for stopwords. |
crawl.gbdr | JSON/null; (see GDBR Filter) | Used to configure the SVM for filtering GDBR content. The model used is the L2R_L2LOSS_SVR. |
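For orientation, a minimal config following the qualified paths above might look like the sketch below. The values are illustrative; everything that is omitted falls back to its documented default:
{
  "system": {
    "log_level": "Info",
    "log_to_file": true
  },
  "paths": {
    "root": "./atra_data"
  },
  "crawl": {
    "user_agent": "Default",
    "respect_robots_txt": true,
    "delay": "1.000000000",
    "budget": {
      "default": {
        "depth_on_website": 2,
        "depth": 1,
        "recrawl_interval": null,
        "request_timeout": null
      }
    }
  }
}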
Level | Explanation |
---|---|
Off | Logging off |
Error | Log only errors |
Warn | Log errors and warnings |
Info | Log errors, warnings and infos |
Debug | Log errors, warnings, infos and debug information |
Trace | Log everything. |
Name | Value | Explanation |
---|---|---|
Spoof | "Spoof" | Uses a static agent that looks like a browser. |
Default | "Default" | Uses "Crawler/Atra/-version-" |
Custom | { "Custom": "-content-" } |
Uses a custom user agent |
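For example, a custom user agent could be configured like this (the agent string itself is only illustrative):
{
  "crawl": {
    "user_agent": { "Custom": "MyCrawler/1.0" }
  }
}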
Sub-Path | Value | Explanation |
---|---|---|
default | String | Cookie string to use for network requests e.g.: "foo=bar; Domain=blog.spider" |
per_domain | JSON/null; {"- domain -": "- cookie string -"} | A map between domains and cookie string. A domain is e.g.: "ebay.de" or an IP address |
Example entry in a JSON:
{
  "cookies": {
    "default": "foo=bar; Domain=blog.spider",
    "per_host": {
      "ebay.de": "foo=cat; Domain=blog.spider2"
    }
  }
}
Sub-Path | Value | Explanation |
---|---|---|
default | JSON;BudgetSetting; see Budget Setting | The default budget setting for a crawl. |
per_host | JSON/null; {"- domain/host -": - BudgetSetting - } | A map between domains and budget settings. A domain is e.g.: "ebay.de" or an IP address |
Example entry in a JSON:
{
  "budget": {
    "default": {
      "depth_on_website": 2,
      "depth": 1,
      "recrawl_interval": null,
      "request_timeout": null
    },
    "per_host": {
      "ebay.de": {
        "depth": 3,
        "recrawl_interval": "360.000000000",
        "request_timeout": null
      }
    }
  }
}
Budget settings exist in four different kinds:
- SinglePage: Only crawls the provided seed. It is used when both depths are null.
- SeedOnly: Only crawls the seed domains
- Normal: Crawls the seed and follows external links
- Absolute: Crawls the seed and follows external links, but only until a specific amount of jumps is reached.
The kind is determined by which fields are present, as described in the following table.
Sub-Path | used in | Value | Explanation |
---|---|---|---|
depth_on_website | SeedOnly, Normal | uInt/null | The max depth to crawl on a website. (default: null) |
depth | Normal, Absolute | uInt/null | The maximum depth of websites, outgoing from the seed. (default: null) |
recrawl_interval | SeedOnly, Normal, Absolute, SinglePage | String/null; "[whole_seconds].[whole_nanoseconds]" | Crawl interval (if set to null crawl only once) (default: null) |
request_timeout | SeedOnly, Normal, Absolute, SinglePage | String/null; "[whole_seconds].[whole_nanoseconds]" | Request max timeout per page. Set to null to disable. (default: 15.000000000) |
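For example, a SinglePage budget is simply one where both depth fields are null:
{
  "depth_on_website": null,
  "depth": null,
  "recrawl_interval": null,
  "request_timeout": null
}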
Name | Value | Explanation |
---|---|---|
Loose | "Loose" | A loose policy that allows all request up to the redirect limit. |
Strict | "Strict" | A strict policy only allowing request that match the domain set for crawling. |
The extractor settings are a list of commands.
Sub-Path | Value | Explanation |
---|---|---|
extractor_method | String; Enum (see the extractor methods below) | The method used to extract links. |
apply_when | String; Enum (see Apply When) | Decides when this extractor is applied. |
Example:
{
  "link_extractor": [
    {
      "extractor_method": "HtmlV1",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "JSV1",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "PlainText",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "RawV1",
      "apply_when": "IfSuitable"
    }
  ]
}
Name | Value | Explanation |
---|---|---|
HtmlV1 | "HtmlV1"/"HTML_v1" | Extracts links from HTML. Can respect NO_FOLLOW and is capable of resolving most of the common references of HTML. |
JSV1 | "JSV1"/"js_v1"/"JavaScript_v1"/"JS_v1" | Extracts links from JavaScript by searching for href identifiers. |
PlainText | "PlainText"/"PlainText_v1"/"PT_v1"/"Plain_v1" | Extracts links from plain text by using linkify. |
RawV1 | "RawV1"/"RAW_v1" | Tries to extract links from raw bytes by using a modified linkify version for raw data. Relatively robust. Can theoretically process anything that can be decoded by atra. |
Rtf | "rtf_v1" | Extracts links from an RTF. |
Ooxml | "ooxml" | Extracts links from Office Open XMLs. |
Odf | "odf" | Extracts links from Open Document Formats. |
Exif | "image" | Extracts links from any kind of image, as long as there exists EXIF data. |
Xml | "xml" | Extracts links from an XML. |
Svg | "svg" | Extracts links from an SVG. |
Xlink | "xlink" | Extracts links from an XML with XLINK. |
"pdf_v1" | Extracts links from an HTML. (Currently deactivated due to a compiler bug.) |
Decides when to apply a link-extractor on some kind of data.
Name | Value | Explanation |
---|---|---|
Always | "Always" | Always applies this extractor. |
IfSuitable | "IfSuitable" | Only applies the extractor if the file is of a fitting type |
Fallback | "Fallback" | If everything fails, try this extractor. |
Consists of a list of stopword repository configurations; each configuration can contain the following fields:
Sub-Path | Value | Explanation |
---|---|---|
with_iso_default | bool | If this is set, the registry uses the default ISO stopwords in addition to the provided stopwords. (default: false) If neither a directory nor a file with a language is set, it activates the default stopwords as independent stopwords. |
dir/directory | Path/Null | Defines a folder containing txt files named after the language. Each file contains one stopword per line. |
file | Path/Null | Points to a txt file containing one stopword per line. The language is defined by the required "language" field in the config. |
language | String; Isolang Name | A language hint for file based stopword lists. |
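Based on the fields above, a registry with a single file-based repository might be sketched like this (the path and language are illustrative, and the exact wrapping follows the list form described above):
{
  "stopword_registry": [
    {
      "file": "./stopwords/deu.txt",
      "language": "Deu",
      "with_iso_default": true
    }
  ]
}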
Sub-Path | Value | Explanation |
---|---|---|
default | JSON/Null; IdentifierConfig; see GBDR Identifier Config | The default gdbr filter for the crawler. |
by_language | JSON/null; {"- Iso Lang -": - IdentifierConfig - } | A map between languages and language specific gdbr identifiers. |
Sub-Path | Value | Explanation |
---|---|---|
threshold | f64 | Sets the minimum score needed for a successful SVM prediction. (default: 0.1) |
filter_threshold | f64 | Sets the score needed for a high-confidence SVM prediction. (default: 0.5) |
filter_by | String; FilterMode; see GBDR FilterMode | Configures which score is used to identify the correct node in the HTML. |
svm | JSON; SvmConfig; see SVM Config | Configures the SVM |
Name | Value | Explanation |
---|---|---|
OnScore | "OnScore" | Identify the target node by the score of the current node. |
OnMaxScore | "OnMaxScore" | Identify the target node by the maximum score in one of the children. |
OnAverageScore | "OnAverageScore" | Calculate the average between the score and max score in the child. The result will be used. |
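Putting these pieces together, a gbdr entry might be sketched like this. The SVM sub-config is abbreviated and the model path is illustrative; see the SVM Config section and the full parameter example further below:
{
  "gbdr": {
    "default": {
      "threshold": 0.1,
      "filter_threshold": 0.5,
      "filter_by": "OnMaxScore",
      "svm": {
        "language": "Deu",
        "trained_svm": "gdbr_svm_deu.bin"
      }
    }
  }
}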
The SVM can be configured in three ways:
- Load: Tries to load a pretrained model
- Train: Tries to train a model
- All: Tries to load a model, if it fails it tries to train one.
Sub-Path | used in | Value | Explanation |
---|---|---|---|
language | Load, Train, All | String; Iso Language | Sets the language for the svm |
retrain_if_possible | All | bool/Null | Retrains even if there is something to load. |
tf | Train, All | String/Null;TF; see TF | The TF used for the vectorisation. |
idf | Train, All | String/Null;IDF; see IDF | The IDF used for the vectorisation. |
tf_idf_data | Train, All | Path/Null; TF-IDF-Data; see SVM Data Formats | A path to some train data for the tf-idf-vectorizer. |
train_data | Train, All | Path/Null; TRAIN-DATA; see SVM Data Formats | A path to some train data for the svm. |
test_data | Load, Train, All | Path/Null; -undefined- | - unused - |
trained_svm | Load, All | Path/Null | Path to a stored SVM. If a relative path is provided, it is resolved relative to the Atra root. |
normalize_tokens | Train, All | bool/null | Normalizes the tokens according to Unicode Standard Annex #15 |
filter_stopwords | Train, All | bool/null | Filters the tokens by a stopword filter. |
stemmer | Train, All | String; StemmerName; see Stemmer Names | Stems the tokens with the provided Snowball stemmer. |
parameters | Train, All | JSON; SvmParameters; see SVM Parameters | The parameters used to configure the Liblinear SVM. |
min_doc_length | Load, Train, All | uInt/null | The minimum length of a document needed to be used for training/production. |
min_vector_length | Load, Train, All | uInt/null | The minimum length of a vector needed to be used for training/production. |
Sub-Path | used in | Value | Explanation |
---|---|---|---|
epsilon | L2R_L2LOSS_SVR | f64/null | Set tolerance of termination criterion for optimization (parameter e ). (default: 0.01) |
cost | L2R_L2LOSS_SVR | f64/null | Set cost of constraints violation (parameter C ). Rules the trade-off between regularization and correct classification on data. It can be seen as the inverse of a regularization constant. (default: 1.0) |
p | L2R_L2LOSS_SVR | f64/null | Set the tolerance margin/loss sensitivity of support vector regression (parameter p ). (default: 0.1) |
nu | | f64/null | Set the fraction of data that is to be classified as outliers (parameter nu ). (default: 0.5) |
cost_penalty | L2R_L2LOSS_SVR | Array<[Int, f64]>/null; e.g. {"cost_penalty": [[1, 0.5], [2, 3.4]]} | Set weights to adjust the cost of constraints violation for specific classes. Each element is a tuple where the first value is the label and the second its corresponding weight penalty. Useful when training classifiers on unbalanced input data or with asymmetric mis-classification cost. (default: null) |
initial_solutions | L2R_L2LOSS_SVR | Array/null | Set the initial solution specification. (default: null) |
bias | L2R_L2LOSS_SVR | f64/null | Set the bias of the training data. If bias >= 0 , it's appended to the feature vector of each training data instance. (default: -1.0) |
regularize_bias | L2R_L2LOSS_SVR | bool | Toggle bias regularization during training. If set to false , the bias value will automatically be set to 1 . (default: true) |
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf for the explanations:
Name | Value | Explanation |
---|---|---|
Binary | "Binary" | |
RawCount | "RawCount" | |
TermFrequency | "TermFrequency" | |
LogNormalization | "LogNormalization" | |
DoubleNormalization | "DoubleNormalization" |
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf for the explanations:
Name | Value | Explanation |
---|---|---|
Unary | "Unary" | |
InverseDocumentFrequency | "InverseDocumentFrequency" | |
InverseDocumentFrequencySmooth | "InverseDocumentFrequencySmooth" | |
InverseDocumentFrequencyMax | "InverseDocumentFrequencyMax" | |
ProbabilisticInverseDocumentFrequency | "ProbabilisticInverseDocumentFrequency" |
These parameters proved very robust for German GDBR recognition:
{
  "language": "Deu",
  "tf": "TermFrequency",
  "idf": "InverseDocumentFrequency",
  "normalize_tokens": true,
  "filter_stopwords": true,
  "stemmer": "German",
  "parameters": {
    "epsilon": 0.0003,
    "p": 0.1,
    "cost": 10.0
  },
  "min_doc_length": 5,
  "min_vector_length": 5
}
A plain text file with line-wise training data:
<document 1>
<document 2>
<document 3>
A CSV text file with line-wise training data. The first value is a boolean indicating whether the document belongs to the class "has_gdbr"; the second is an escaped document.
is_gdbr,text
true, "<document 1>"
false, "<document 2>"
true, "<document 3>"
Stemmers are available for the following languages; the parameter names are the same:
- Arabic
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hungarian
- Italian
- Norwegian
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
- Tamil
- Turkish