Atra is a novel web crawling solution, implemented in Rust, designed with the primary goal of scraping websites as comprehensively as possible, while ensuring ease of use and accessibility.
Download the precompiled executable and run the following command:
- Windows:
./atra.exe single -s test_crawl -d 2 --absolute https://choosealicense.com
- Linux:
./atra single -s test_crawl -d 2 --absolute https://choosealicense.com
You will then find a folder named atra_data next to the binary.
- Create a file with the name seeds.txt
- Add a single URL per line (see the example after this list)
- Put it in the directory with the atra binary
- Call
./atra.exe --generate-example-config
or ./atra --generate-example-config
- Modify the values to meet your needs
- Rename them to atra.ini and crawl.yaml
- Call
./atra.exe multi --log-to-file file:seeds.txt
or ./atra multi --log-to-file file:seeds.txt
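A seeds.txt is simply a plain list of start URLs, one per line. The URLs below are only illustrative:
https://choosealicense.com
https://www.rust-lang.org
https://en.wikipedia.org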
In order to build Atra you need Rust.
Initialize the submodules:
git submodule update --init --recursive
Update the submodules:
git submodule update --remote --recursive
On Windows, after installing Rust you need LLVM and MSVC, with the proper environment paths set. This works as long as the pdf feature is not activated.
On Linux, after installing Rust you need pkg-config, libssl-dev, clang, and llvm in order to compile Atra.
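On Debian/Ubuntu based systems the build dependencies can, for example, be installed like this (package names may differ on other distributions):
sudo apt-get install pkg-config libssl-dev clang llvm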
You can also use the Docker container to build a binary.
Due to the dynamic linking you will need libc6, openssl, and ca-certificates installed on your system.
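On a Debian based system or image the runtime dependencies could, for example, be installed like this (again, exact package names depend on the distribution):
apt-get install -y libc6 openssl ca-certificates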
Atra is designed with a clear set of guiding principles to ensure simplicity, efficiency, and longevity:
- Minimal Dependencies: Atra focuses on a streamlined architecture, ensuring only necessary components are included.
- Sustainability and Durability: Atra is built for long-term reliability, prioritizing robustness and stability.
- Unified Configuration: All settings are managed within a single configuration file, simplifying setup and operation.
If you appreciate Atra, please consider supporting the Zeehondencentrum Pietersburen. Even a small donation of 10€ can provide food for a seal for an entire day.
The Zeehondencentrum Pietersburen is the largest seal hospital in Europe. Since the 1970s, they have been dedicated to the care of sick, injured, or weakened seals. After recovery, the seals are released back into the wild, with no animal remaining in permanent care. The center operates solely on donations, gifts, and entrance fees.
Atra takes its name from Erigone atra, a species of dwarf spider measuring 1.8mm to 2.8mm in length. These spiders play an essential role in natural pest control, particularly in agriculture, and are known for their unique ability to travel long distances by "ballooning," or kiting.
For more information, visit Wikipedia.
The exit codes returned by Atra when some kind of error happens:
Code | Meaning |
---|---|
0 | Success |
1 | Unknown Error |
2 | A file was not found, or Atra was not able to interact with the file system. |
3 | The config was faulty in some way. |
4 | Was not able to deserialize the config.json |
5 | The directory already exists. |
10 | Atra was not able to initialize the context by some unknown error. |
11 | Atra was not able to initialize the context due to some IO problem. |
12 | Atra was not able to open the database. |
13 | Atra had a RocksDB error, most probably a problem with the layout of the database. |
14 | Atra was not able to work with the queue file. |
15 | Atra failed to read the blacklist. |
16 | Atra had an error while initializing the SVM |
17 | Atra had an error while initializing the webgraph |
18 | Atra failed to serialize/deserialize some kind of data. |
40 | Atra failed to initialize a worker context |
50 | Atra failed to fill the queue |
70 | Atra failed to serialize some data while dumping |
100 | The crawl failed in some unexpected way. |
101 | Failed to store some crawl information in the database. |
102 | Failed to handle some link information. |
103 | Failed to deserialize the link state for a URL |
104 | Failed to update/read the link state from the database |
105 | Failed to write a crawl result to the database and/or filesystem. |
106 | Failed to read/write from/to the queue. |
107 | The client failed for some reason. |
108 | Failed to execute a request. |
109 | Failed to interact with the file system. |
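In a POSIX shell the exit code of the last run can be inspected like this, for example:
./atra single -s test_crawl -d 2 --absolute https://choosealicense.com
echo $?  # prints 0 on success, otherwise one of the codes above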
Atra is configured using JSON configs and environment variables. The following table shows the settings written as qualified paths for a JSON.
Path | format | Explanation |
---|---|---|
system | JSON | Contains all configs regarding the system. Usually caches. |
system.robots_cache_size | uInt > 0; Element Count | The cache size of the robots manager. (default: 32) |
system.web_graph_cache_size | uInt > 0; Element Count | The cache size of the webgraph manager. (default: 20,000) |
system.max_file_size_in_memory | uLong; in Byte | Max size of the files stored in memory. (default: 100MB). If set to 0 nothing will be stored in memory. |
system.max_temp_file_size_on_disc | uLong; in Byte | Max size of a temp file on the disc. (default: 16384 Pebibyte). If set to 0 nothing will be stored on the disc. |
system.log_level | String; Enum (see Log Level) | The log level of the crawler. (default: Info) |
system.log_to_file | boolean | Log to a file and not to console. (default: false) |
paths | JSON | Contains all configs regarding the paths. |
paths.root | String; Path | The root path where the application runs. (default: "./atra_data") |
paths.directories | JSON | |
paths.directories.database | String; Path | Path to the database directory. (default: root/rocksdb) |
paths.directories.big_files | String; Path | Path to the big files directory. (default: root/big_files) |
paths.files | JSON | |
paths.files.queue | String; Path | Path to the queue file (if one is needed) (default: root/queue.tmp) |
paths.files.blacklist | String; Path | Path to the blacklist (default: root/blacklist.txt) |
paths.files.web_graph | String; Path | Path to the web graph generated by atra (default: root/web_graph.ttl) |
session | JSON | The config of the session |
session.service | String | The name of the service (default: "atra") |
session.collection | String | The name of the collection created (default: "unnamed") |
session.crawl_job_id | uInt | The crawl job id. To differentiate for the same service and collection. |
session.warc_compression_level | uInt | - unused - |
crawl | JSON | |
crawl.user_agent | String; Enum (see User Agents) | The user agent used by the crawler. (default: Default) |
crawl.respect_robots_txt | boolean | Respect the robots.txt file and do not scrape disallowed files. This may slow down crawls if the robots.txt file has a delay included. (default: true) |
crawl.generate_web_graph | boolean | If set Atra generates the webgraph. This can impact the overall performance of the crawl. (default: true) |
crawl.respect_nofollow | boolean | Respect the nofollow attribute during the link extraction (default: true) |
crawl.crawl_embedded_data | boolean | Extract links to embedded data like audio/video files for the crawl-queue (default: false) |
crawl.crawl_forms | boolean | Extract links from form action. (default: false) |
crawl.crawl_javascript | boolean | Extract links to/from javascript files for the crawl-queue (default: true) |
crawl.crawl_onclick_by_heuristic | boolean | Try to extract links from tags with onclick attribute for the crawl-queue (default: false) |
crawl.apply_gdbr_filter_if_possible | boolean | Tries to apply a GDBR filter, if one was properly configured. |
crawl.store_only_html_in_warc | boolean | Only store HTML files in the WARC. |
crawl.store_big_file_hints_in_war | boolean | Also store the big-file hints in the WARC. |
crawl.max_file_size | uInt/null; in Byte | The maximum size to download. If null there is no limit. (default: null) |
crawl.max_robots_age | String/null; "[whole_seconds].[whole_nanoseconds]" | The maximum age of a cached robots.txt. If null, it never gets too old. |
crawl.ignore_sitemap | boolean | Prevent including the sitemap links with the crawl. (default: false) |
crawl.subdomains | boolean | Allow sub-domains. (default: false) |
crawl.cache | boolean | Cache the page following HTTP caching rules. (default: false) |
crawl.use_cookies | boolean | Use cookies (default: false) |
crawl.cookies | JSON/null; (see Cookie Settings) | Domain bound cookie config. (default: null) |
crawl.headers | JSON/null; {"- header_name -": "- header_value -"} | Headers to include with requests. (default: null) |
crawl.proxies | List; ["- proxy -", "- proxy -"] | Use a proxy list for performing network requests. (default: null) |
crawl.tld | boolean | Allow all TLDs for the domain. (default: false) |
crawl.delay | String; "[whole_seconds].[whole_nanoseconds]" | Polite crawling delay (default: 1 second) |
crawl.budget | JSON/null; (see Crawl Budget) | The budget settings for this crawl. |
crawl.max_queue_age | uInt | How often can we fail to crawl an entry in the queue until it is dropped? (0 means never drop) (default: 20) |
crawl.redirect_limit | uInt | The max redirections allowed for request. (default: 5 like Google-Bot) |
crawl.redirect_policy | String; Enum (see Redirection Policy) | The redirect policy type to use. (default: Loose) |
crawl.accept_invalid_certs | boolean | Dangerously accept invalid certificates (default: false) |
crawl.link_extractors | JSON; [- Command -, - Command -] (see Link Extractor Settings) | A custom configuration of extractors. (default: see Extractor Settings) |
crawl.decode_big_files_up_to | uInt/null; in Byte | If this value is set, Atra tries to decode and process files that are only downloaded as blobs but do not exceed the provided size (in Bytes). Null means off. (default: null) |
crawl.use_default_stopwords | boolean | If this is set, all default stopwords known to Atra are used. (default: true) |
crawl.stopword_registry | JSON/null; (see Stopword Registry) | Used to configure the global registry for stopwords. |
crawl.gbdr | JSON/null; (see GDBR Filter) | Used to configure the SVM for filtering GDBR content. The model used is the L2R_L2LOSS_SVR. |
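For orientation, a minimal config following the qualified paths above might look like the sketch below. The values are illustrative; everything that is omitted falls back to its documented default:
{
  "system": {
    "log_level": "Info",
    "log_to_file": true
  },
  "paths": {
    "root": "./atra_data"
  },
  "crawl": {
    "user_agent": "Default",
    "respect_robots_txt": true,
    "delay": "1.000000000",
    "budget": {
      "default": {
        "depth_on_website": 2,
        "depth": 1,
        "recrawl_interval": null,
        "request_timeout": null
      }
    }
  }
}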
Level | Explanation |
---|---|
Off | Logging off |
Error | Log only errors |
Warn | Log errors and warnings |
Info | Log errors, warnings and infos |
Debug | Log errors, warnings, infos and debug information |
Trace | Log everything. |
Name | Value | Explanation |
---|---|---|
Spoof | "Spoof" | Uses a static agent that looks like a browser. |
Default | "Default" | Uses "Crawler/Atra/-version-" |
Custom | { "Custom": "-content-" } |
Uses a custom user agent |
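For example, a custom user agent could be configured like this (the agent string itself is only illustrative):
{
  "crawl": {
    "user_agent": { "Custom": "MyCrawler/1.0" }
  }
}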
Sub-Path | Value | Explanation |
---|---|---|
default | String | Cookie string to use for network requests e.g.: "foo=bar; Domain=blog.spider" |
per_domain | JSON/null; {"- domain -": "- cookie string -"} | A map between domains and cookie string. A domain is e.g.: "ebay.de" or an IP address |
Example entry in a JSON:
{
  "cookies": {
    "default": "foo=bar; Domain=blog.spider",
    "per_host": {
      "ebay.de": "foo=cat; Domain=blog.spider2"
    }
  }
}
Sub-Path | Value | Explanation |
---|---|---|
default | JSON;BudgetSetting; see Budget Setting | The default budget setting for a crawl. |
per_host | JSON/null; {"- domain/host -": - BudgetSetting - } | A map between domains and budget settings. A domain is e.g.: "ebay.de" or an IP address |
Example entry in a JSON:
{
  "budget": {
    "default": {
      "depth_on_website": 2,
      "depth": 1,
      "recrawl_interval": null,
      "request_timeout": null
    },
    "per_host": {
      "ebay.de": {
        "depth": 3,
        "recrawl_interval": "360.000000000",
        "request_timeout": null
      }
    }
  }
}
Budget settings exist in four different kinds:
- SinglePage: Only crawls the provided seed. It is used when both depths are null.
- SeedOnly: Only crawls the seed domains
- Normal: Crawls the seed and follows external links
- Absolute: Crawls the seed and follows external links, but only until a specific amount of jumps is reached.
The kind is determined by which fields are present, as described in the following table.
Sub-Path | used in | Value | Explanation |
---|---|---|---|
depth_on_website | SeedOnly, Normal | uInt/null | The max depth to crawl on a website. (default: null) |
depth | Normal, Absolute | uInt/null | The maximum depth of websites, outgoing from the seed. (default: null) |
recrawl_interval | SeedOnly, Normal, Absolute, SinglePage | String/null; "[whole_seconds].[whole_nanoseconds]" | Crawl interval (if set to null crawl only once) (default: null) |
request_timeout | SeedOnly, Normal, Absolute, SinglePage | String/null; "[whole_seconds].[whole_nanoseconds]" | Request max timeout per page. Set to null to disable. (default: 15.000000000) |
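For example, a SinglePage budget is simply one where both depth fields are null:
{
  "depth_on_website": null,
  "depth": null,
  "recrawl_interval": null,
  "request_timeout": null
}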
Name | Value | Explanation |
---|---|---|
Loose | "Loose" | A loose policy that allows all request up to the redirect limit. |
Strict | "Strict" | A strict policy only allowing request that match the domain set for crawling. |
The extractor settings are a list of commands.
Sub-Path | Value | Explanation |
---|---|---|
extractor_method | String; Enum (see the extractor methods below) | The method used to extract links. |
apply_when | String; Enum (see Apply When) | Decides when this extractor is applied. |
Example:
{
  "link_extractor": [
    {
      "extractor_method": "HtmlV1",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "JSV1",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "PlainText",
      "apply_when": "IfSuitable"
    },
    {
      "extractor_method": "RawV1",
      "apply_when": "IfSuitable"
    }
  ]
}
Name | Value | Explanation |
---|---|---|
HtmlV1 | "HtmlV1"/"HTML_v1" | Extracts links from HTML. Can respect NO_FOLLOW and is capable of resolving most of the common references of HTML. |
JSV1 | "JSV1"/"js_v1"/"JavaScript_v1"/"JS_v1" | Extracts links from JavaScript by searching for href identifiers. |
PlainText | "PlainText"/"PlainText_v1"/"PT_v1"/"Plain_v1" | Extracts links from plain text by using linkify. |
RawV1 | "RawV1"/"RAW_v1" | Tries to extract links from raw bytes by using a modified linkify version for raw data. Relatively robust. Can theoretically process anything that can be decoded by atra. |
Rtf | "rtf_v1" | Extracts links from an RTF. |
Ooxml | "ooxml" | Extracts links from Office Open XMLs. |
Odf | "odf" | Extracts links from Open Document Formats. |
Exif | "image" | Extracts links from any kind of image, as long as there exists EXIF data. |
Xml | "xml" | Extracts links from an XML. |
Svg | "svg" | Extracts links from an SVG. |
Xlink | "xlink" | Extracts links from an XML with XLINK. |
"pdf_v1" | Extracts links from an HTML. (Currently deactivated due to a compiler bug.) |
Decides when to apply a link-extractor on some kind of data.
Name | Value | Explanation |
---|---|---|
Always | "Always" | Always applies this extractor. |
IfSuitable | "IfSuitable" | Only applies the extractor if the file is of a fitting type |
Fallback | "Fallback" | If everything fails, try this extractor. |
Consists of a list of stopword repository configurations; each configuration can contain the following fields:
Sub-Path | Value | Explanation |
---|---|---|
with_iso_default | bool | If this is set, the registry uses the default ISO stopwords in addition to the provided stopwords. (default: false) If neither a directory nor a file with a language is set, it activates the default stopwords as independent stopwords. |
dir/directory | Path/Null | Defines a folder containing txt files named after the language. Each file contains one stopword per line. |
file | Path/Null | Points to a txt file containing one stopword per line. The language is defined by the required "language" field in the config. |
language | String; Isolang Name | A language hint for file based stopword lists. |
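Based on the fields above, a registry with a single file-based repository might be sketched like this (the path and language are illustrative, and the exact wrapping follows the list form described above):
{
  "stopword_registry": [
    {
      "file": "./stopwords/deu.txt",
      "language": "Deu",
      "with_iso_default": true
    }
  ]
}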
Sub-Path | Value | Explanation |
---|---|---|
default | JSON/Null; IdentifierConfig; see GBDR Identifier Config | The default gdbr filter for the crawler. |
by_language | JSON/null; {"- Iso Lang -": - IdentifierConfig - } | A map between languages and language specific gdbr identifiers. |
Sub-Path | Value | Explanation |
---|---|---|
threshold | f64 | Sets the minimum score needed for a successful SVM prediction. (default: 0.1) |
filter_threshold | f64 | Sets the score needed for a high-confidence SVM prediction. (default: 0.5) |
filter_by | String; FilterMode; see GBDR FilterMode | Configures which score is used to identify the correct node in the HTML. |
svm | JSON; SvmConfig; see SVM Config | Configures the SVM |
Name | Value | Explanation |
---|---|---|
OnScore | "OnScore" | Identify the target node by the score of the current node. |
OnMaxScore | "OnMaxScore" | Identify the target node by the maximum score in one of the children. |
OnAverageScore | "OnAverageScore" | Calculate the average between the score and max score in the child. The result will be used. |
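Putting these pieces together, a gbdr entry might be sketched like this. The SVM sub-config is abbreviated and the model path is illustrative; see the SVM Config section and the full parameter example further below:
{
  "gbdr": {
    "default": {
      "threshold": 0.1,
      "filter_threshold": 0.5,
      "filter_by": "OnMaxScore",
      "svm": {
        "language": "Deu",
        "trained_svm": "gdbr_svm_deu.bin"
      }
    }
  }
}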
The SVM can be configured in three ways:
- Load: Tries to load a pretrained model
- Train: Tries to train a model
- All: Tries to load a model, if it fails it tries to train one.
Sub-Path | used in | Value | Explanation |
---|---|---|---|
language | Load, Train, All | String; Iso Language | Sets the language for the svm |
retrain_if_possible | All | bool/Null | Retrains even if there is something to load. |
tf | Train, All | String/Null;TF; see TF | The TF used for the vectorisation. |
idf | Train, All | String/Null;IDF; see IDF | The IDF used for the vectorisation. |
tf_idf_data | Train, All | Path/Null; TF-IDF-Data; see SVM Data Formats | A path to some train data for the tf-idf-vectorizer. |
train_data | Train, All | Path/Null; TRAIN-DATA; see SVM Data Formats | A path to some train data for the svm. |
test_data | Load, Train, All | Path/Null; -undefined- | - unused - |
trained_svm | Load, All | Path/Null | Path to a stored SVM. If a relative path is provided, it is resolved relative to the Atra root. |
normalize_tokens | Train, All | bool/null | Normalizes the tokens according to Unicode Standard Annex #15 |
filter_stopwords | Train, All | bool/null | Filters the tokens by a stopword filter. |
stemmer | Train, All | String; StemmerName; see Stemmer Names | Stems the tokens with the provided Snowball stemmer. |
parameters | Train, All | JSON; SvmParameters; see SVM Parameters | The parameters used to configure the Liblinear SVM. |
min_doc_length | Load, Train, All | uInt/null | The minimum length of a document needed to be used for training/production. |
min_vector_length | Load, Train, All | uInt/null | The minimum length of a vector needed to be used for training/production. |
Sub-Path | used in | Value | Explanation |
---|---|---|---|
epsilon | L2R_L2LOSS_SVR | f64/null | Set tolerance of termination criterion for optimization (parameter e ). (default: 0.01) |
cost | L2R_L2LOSS_SVR | f64/null | Set cost of constraints violation (parameter C ). Rules the trade-off between regularization and correct classification on data. It can be seen as the inverse of a regularization constant. (default: 1.0) |
p | L2R_L2LOSS_SVR | f64/null | Set the tolerance margin/loss sensitivity of support vector regression (parameter p ). (default: 0.1) |
nu | | f64/null | Set the fraction of data that is to be classified as outliers (parameter nu ). (default: 0.5) |
cost_penalty | L2R_L2LOSS_SVR | Array<[Int, f64]>/null; e.g. {"cost_penalty": [[1, 0.5], [2, 3.4]]} | Set weights to adjust the cost of constraints violation for specific classes. Each element is a tuple where the first value is the label and the second its corresponding weight penalty. Useful when training classifiers on unbalanced input data or with asymmetric mis-classification cost. (default: null) |
initial_solutions | L2R_L2LOSS_SVR | Array/null | Set the initial solution specification. (default: null) |
bias | L2R_L2LOSS_SVR | f64/null | Set the bias of the training data. If bias >= 0 , it's appended to the feature vector of each training data instance. (default: -1.0) |
regularize_bias | L2R_L2LOSS_SVR | bool | Toggle bias regularization during training. If set to false , the bias value will automatically be set to 1 . (default: true) |
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf for the explanations:
Name | Value | Explanation |
---|---|---|
Binary | "Binary" | |
RawCount | "RawCount" | |
TermFrequency | "TermFrequency" | |
LogNormalization | "LogNormalization" | |
DoubleNormalization | "DoubleNormalization" |
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf for the explanations:
Name | Value | Explanation |
---|---|---|
Unary | "Unary" | |
InverseDocumentFrequency | "InverseDocumentFrequency" | |
InverseDocumentFrequencySmooth | "InverseDocumentFrequencySmooth" | |
InverseDocumentFrequencyMax | "InverseDocumentFrequencyMax" | |
ProbabilisticInverseDocumentFrequency | "ProbabilisticInverseDocumentFrequency" |
These parameters proved very robust for German GDBR recognition:
{
  "language": "Deu",
  "tf": "TermFrequency",
  "idf": "InverseDocumentFrequency",
  "normalize_tokens": true,
  "filter_stopwords": true,
  "stemmer": "German",
  "parameters": {
    "epsilon": 0.0003,
    "p": 0.1,
    "cost": 10.0
  },
  "min_doc_length": 5,
  "min_vector_length": 5
}
A plain text file with line-wise training data:
<document 1>
<document 2>
<document 3>
A CSV text file with line-wise training data. The first value is a boolean indicating whether the document belongs to the class "has_gdbr"; the second is an escaped document.
is_gdbr,text
true, "<document 1>"
false, "<document 2>"
true, "<document 3>"
Stemmers are available for the following languages; the parameter names are the same:
- Arabic
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hungarian
- Italian
- Norwegian
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
- Tamil
- Turkish