Commit

Added a URL options flag to control content duplicates for redirects.
sonnykt committed Jun 22, 2021
1 parent 88e1727 commit 1e8a8a6
Showing 2 changed files with 15 additions and 4 deletions.
3 changes: 2 additions & 1 deletion docs/URLOptions.md
@@ -82,10 +82,11 @@ There are a number of options that can apply to the URL list. These options are
 | `include_query` | Will include the **query** part of the URL in the request. If set to false, the crawler will only fetch the path component of the URL. | boolean `false` |
 | `include_fragment` | Will include the **fragment** part of the URL. If set to false, the crawler will only fetch the path component of the URL. | boolean `false` |
 | `find_content_duplicates ` | Will check for **content** duplicates. This will create a file called `url-content-duplicates.json` that contains a list of URLs that appear to resolve to the same content. This is to avoid content duplication in the target system as well as provide a way to easily generate aliases. | boolean `true` |
+| `count_redirects_as_content_duplicates` | Will also check redirects for content duplicates. If the `hash_selector` is set to body, *all* redirects will be counted as the duplicates of a single hash due to redirects do not have body content. | boolean `true` |
 | `hash_selector` | This is an **XPath** selector that is used to generate the hash of content that is used to detect duplicates. By default `sha1` is used as the hash algorithm and uses the `<body>` tag of the page as the determining content.| string `"//body"` |
 | `hash_exclude_nodes ` | This is an array of **XPath** selectors to *exclude* when generating the hash to detect duplicates. This could include elements that may appear on the page that might be metadata/cache busters or contain timestamps etc that can be safely excluded from building a hash for duplicate detection. By default all `<script>`, `<!-- Comment -->`, `<style>`, `<input>` and `<head>` tags will be ignored. | array |
 | `urls` | This is an associative array of urls and their corresponding `include_query` and `include_fragment` settings (as above) to override the global setting, if required.| array |
-| `raw_strip_script_tags` | Uses a regular expression on the raw fetched content to strip script tags before being read by the DOM library. These script tags can somtimes cause unexpected rewriting by the library if it considers them to be non-conforming markup. | `false` |
+| `raw_strip_script_tags` | Uses a regular expression on the raw fetched content to strip script tags before being read by the DOM library. These script tags can sometimes cause unexpected rewriting by the library if it considers them to be non-conforming markup. | `false` |
 | `raw_pattern_replace` | Assocative array with keys `pattern` and `replace`. Uses a regular expression on the raw fetched content to do a search and replace. You must specify both keys for it to be enabled. | array |
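
The note on `count_redirects_as_content_duplicates` says that every redirect collapses onto a single hash when the body is hashed. A minimal sketch of why, assuming a redirect response yields an empty string for the selected content: `sketchContentHash()` is a hypothetical helper written for this illustration, not a function from the project, and it only mirrors the documented default of `sha1`.

```php
<?php
// Hypothetical helper (not part of the project): hashes whatever text the
// selector matched, using sha1 as the documented default algorithm.
function sketchContentHash(string $selectedContent): string
{
    return sha1($selectedContent);
}

// Assumption: two different redirecting URLs both produce an empty body.
$hashA = sketchContentHash('');
$hashB = sketchContentHash('');

// A normal page with real body content.
$hashPage = sketchContentHash('<body><h1>About us</h1></body>');

var_dump($hashA === $hashB);    // bool(true): the two redirects collide
var_dump($hashA === $hashPage); // bool(false): the real page hashes differently
```

Under that assumption, leaving the option at `true` groups all redirects as duplicates of each other, which is the case the new flag lets you switch off.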


16 changes: 13 additions & 3 deletions src/Fetcher/FetcherBase.php
@@ -232,14 +232,24 @@ public function processContent(string $url, string $html, array $redirect=[]) {
         }//end if

         // Check if duplicate if we are doing that.
-        $duplicate = false;
+        $duplicate = FALSE;
+        // Only records the duplicate if this is not a redirect,
+        // or redirect with count_redirects_as_content_duplicates enabled.
+        $count_redirect_as_duplicates = ($this->config->get('url_options')['count_redirects_as_content_duplicates'] ?? true);
+        $is_real_redirect = !empty($redirect['redirect'])
+            || !empty($redirect['redirect_count'])
+            || !empty($redirect['status_code_original'])
+            || (!empty($redirect['status_code']) && ($redirect['status_code'] >= 300 && $redirect['status_code'] < 400));
+
         if ($this->hashes instanceof ContentHash) {
-            $duplicate = $this->hashes->put($url, $html);
+            if (empty($is_real_redirect) || (!empty($is_real_redirect) && $count_redirect_as_duplicates)) {
+                $duplicate = $this->hashes->put($url, $html);
+            }
         }

         if ($duplicate === false) {

-            if (($redirect['redirect'] ?? false)) {
+            if ($is_real_redirect) {
                 // Add a property to the row for checking on redirects
                 $row->_redirected_from = $url;
             }
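
The decision made in this hunk can be read in isolation as two small predicates. The sketch below is a standalone restatement, not code from the commit: `isRealRedirect()` and `shouldRecordHash()` are hypothetical names, and the array keys are taken from the diff without knowing exactly how the fetcher populates them. It also shows that the condition `empty($r) || (!empty($r) && $flag)` reduces to `!$r || $flag`.

```php
<?php
// Hypothetical helper: mirrors the $is_real_redirect expression in the diff.
function isRealRedirect(array $redirect): bool
{
    $status = $redirect['status_code'] ?? 0;

    return !empty($redirect['redirect'])
        || !empty($redirect['redirect_count'])
        || !empty($redirect['status_code_original'])
        || ($status >= 300 && $status < 400);
}

// Hypothetical helper: should the content hash be recorded for this response?
// Logically equivalent to "not a redirect, or (a redirect and the flag is on)".
function shouldRecordHash(array $redirect, bool $countRedirects = true): bool
{
    return !isRealRedirect($redirect) || $countRedirects;
}

var_dump(shouldRecordHash([], false));                      // bool(true)
var_dump(shouldRecordHash(['redirect' => true], false));    // bool(false)
var_dump(shouldRecordHash(['status_code' => 301], true));   // bool(true)
var_dump(shouldRecordHash(['status_code' => 200], false));  // bool(true)
```

With the flag disabled, only the redirect case skips the hash store, so ordinary pages are still checked for duplicates.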