[#25] Redirect links with configuration rules
Problem: We previously changed the default behaviour of Xrefcheck when
following link redirects, but did not provide a way to configure it.

Solution: We are adding a new field in the configuration file to allow
writing a list of redirect rules that will be applied to links that
match them.
aeqz committed Dec 30, 2022
1 parent a4dc29b commit e84dc9c
Showing 28 changed files with 980 additions and 106 deletions.
28 changes: 16 additions & 12 deletions CHANGES.md
@@ -9,8 +9,8 @@ Unreleased

* [#176](https://github.com/serokell/xrefcheck/pull/176)
+ Enabled `autolink` extension for `cmark-gfm`, so now we're finding strings
like `www.google.com` or `https://google.com`, treating them as links
and checking.
like `www.google.com` or `https://google.com`, treating them as links
and checking.
* [#175](https://github.com/serokell/xrefcheck/pull/175)
+ Reorganize top-level config keys.
* [#178](https://github.com/serokell/xrefcheck/pull/178)
@@ -19,13 +19,13 @@ Unreleased
+ Add support for image links.
* [#199](https://github.com/serokell/xrefcheck/pull/199)
+ Now annotation
`<!-- xrefcheck: ignore all -->` instead of `<!-- xrefcheck: ignore file -->`
should be used to disable checking for links in file, so it's clearer that
file itself is not ignored (and links can target it).
`<!-- xrefcheck: ignore all -->` instead of `<!-- xrefcheck: ignore file -->`
should be used to disable checking for links in file, so it's clearer that
file itself is not ignored (and links can target it).
* [#215](https://github.com/serokell/xrefcheck/pull/215)
+ Now we notify user when there are scannable files that were not added to Git
yet. Also added CLI option `--include-untracked` to scan such files and treat
as existing.
yet. Also added CLI option `--include-untracked` to scan such files and treat
as existing.
* [#191](https://github.com/serokell/xrefcheck/pull/191)
+ Now we consider slash `/` (and only it) as path separator in local links for all OS,
so xrefcheck's report is OS-independent
@@ -40,10 +40,14 @@ Unreleased
redirect responses (i.e. 301 and 308) and passes for temporary ones (i.e. 302, 303, 307).
* [#231](https://github.com/serokell/xrefcheck/pull/231)
+ Anchor analysis takes now into account the appropriate case-sensitivity depending on
the configured Markdown flavour.
the configured Markdown flavour.
* [#254](https://github.com/serokell/xrefcheck/pull/254)
+ Now the `dump-config` command does not overwrite a file unless explicitly told to with the
`--force` flag. Also, the `--stdout` flag prints the config to stdout instead.
the configured Markdown flavour.
* [#250](https://github.com/serokell/xrefcheck/pull/250)
+ Now the redirect behavior for external references can be modified via rules in the
configuration file with the `externalRefRedirects` parameter.

0.2.2
==========
@@ -95,7 +99,7 @@ Unreleased
+ Make possible to specify whether ignore localhost links, use
`check-localhost` CLA argument (by default localhost links will not be checked).
+ Make possible to ignore auth failures (assume 'protected' links
valid), use `ignoreAuthFailures` parameter of config.
valid), use `ignoreAuthFailures` parameter of config.
* [#66](https://github.com/serokell/xrefcheck/pull/66)
+ Added support for ftp links.
* [#74](https://github.com/serokell/xrefcheck/pull/83)
@@ -144,10 +148,10 @@ Unreleased
+ Switch to lts-17.3.
* [#53](https://github.com/serokell/xrefcheck/pull/53)
+ Make possible to include a regular expression in
`ignoreRefs` parameter of config to ignore external
references.
`ignoreRefs` parameter of config to ignore external
references.
+ Add support of right in-place ignoring annotations
such as `ignore file`, `ignore paragraph` and `ignore link`.
such as `ignore file`, `ignore paragraph` and `ignore link`.

0.1.2
=======
34 changes: 31 additions & 3 deletions README.md
@@ -141,10 +141,38 @@ There are several ways to fix this:
* This behavior can be disabled by setting `ignoreAuthFailures: false` in the config file.

1. How does `xrefcheck` handle redirects?
* Permanent redirects (i.e. 301 and 308) are reported as errors.
* Temporary redirects (i.e. 302, 303 and 307) are assumed to be valid.
* The rules from the default configuration are as follows:
* Permanent redirects (i.e. 301 and 308) are reported as errors.
* Temporary redirects (i.e. 302, 303 and 307) are assumed to be valid.
* Redirect rules can be specified with the `externalRefRedirects` parameter within `networking`, which accepts an array of
rules with keys `from`, `to`, `on` and `outcome`. The rule applied is the first one that matches with
the `from`, `to` and `on` fields, if any, where
* `from` is a regular expression, as in `ignoreExternalRefsTo`, for the source link in a single redirection step. Its absence means that
every link matches.
* `to` is a regular expression for the target link in a single redirection step. Its absence also means that every link matches.
* `on` accepts `temporary`, `permanent` or a specific redirect HTTP code. Its absence also means that
every response code matches.
* The `outcome` parameter accepts `valid`, `invalid` or `follow`. The last one follows the redirect by applying the
same configuration rules.

For example, this configuration forbids 307 redirects to a specific domain and follows redirects from HTTP to HTTPS:

1. How does `xrefcheck` handle localhost links?
```
externalRefRedirects:
- to: "https?://forbidden.com.*"
on: 307
outcome: invalid
- from: "^http://.*"
to: "^https://.*"
outcome: follow
```

If both rules match, the first one applies.

* The number of redirects allowed in a single redirect chain is limited and can be configured with the
`maxRedirectFollows` parameter, also within `networking`. A number smaller than 0 disables the limit.

2. How does `xrefcheck` handle localhost links?
* By default, `xrefcheck` will ignore links to localhost.
* This behavior can be disabled by removing the corresponding entry from the `ignoreExternalRefsTo` list in the config file.
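
The first-match selection of redirect rules described in the README change above can be sketched as follows. This is a minimal illustration, not xrefcheck's implementation: the `Rule`, `On` and `Outcome` types and the `decide` function are hypothetical names, and plain prefix matching stands in for the regular expressions that the real `from` and `to` fields accept.

```haskell
import Data.List (find, isPrefixOf)

-- Hypothetical, simplified types; not xrefcheck's actual definitions.
data Outcome = Valid | Invalid | Follow
  deriving (Show, Eq)

data On = Temporary | Permanent | Code Int
  deriving (Show, Eq)

data Rule = Rule
  { rFrom    :: Maybe String  -- prefix standing in for the `from` regex
  , rTo      :: Maybe String  -- prefix standing in for the `to` regex
  , rOn      :: Maybe On
  , rOutcome :: Outcome
  }

-- Temporary and permanent codes as listed in the README.
matchesOn :: Int -> On -> Bool
matchesOn code Temporary = code `elem` [302, 303, 307]
matchesOn code Permanent = code `elem` [301, 308]
matchesOn code (Code c)  = code == c

-- An absent field matches everything.
matchesRule :: String -> String -> Int -> Rule -> Bool
matchesRule from to code rule =
     maybe True (`isPrefixOf` from) (rFrom rule)
  && maybe True (`isPrefixOf` to) (rTo rule)
  && maybe True (matchesOn code) (rOn rule)

-- The first matching rule wins; with no match the redirect is valid.
decide :: [Rule] -> String -> String -> Int -> Outcome
decide rules from to code =
  maybe Valid rOutcome (find (matchesRule from to code) rules)

-- Rough analogue of the README example, using prefixes instead of regexes.
exampleRules :: [Rule]
exampleRules =
  [ Rule Nothing (Just "https://forbidden.com") (Just (Code 307)) Invalid
  , Rule (Just "http://") (Just "https://") Nothing Follow
  ]

main :: IO ()
main = do
  print (decide exampleRules "http://a.com" "https://forbidden.com/x" 307)  -- Invalid
  print (decide exampleRules "http://a.com" "https://a.com" 301)            -- Follow
  print (decide exampleRules "https://a.com" "https://b.com" 302)           -- Valid
```

The first call matches both rules, so the order of the list decides the result. A `Follow` outcome would then run the same decision again on the next step of the chain.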

12 changes: 6 additions & 6 deletions ftp-tests/Test/Xrefcheck/FtpLinks.hs
@@ -15,7 +15,7 @@ import Test.Tasty (TestTree, askOption, testGroup)
import Test.Tasty.HUnit (assertBool, assertFailure, testCase, (@?=))
import Test.Tasty.Options as Tasty (IsOption (..), OptionDescription (Option), safeRead)

import Xrefcheck.Config (Config, cExclusionsL, defConfig)
import Xrefcheck.Config
import Xrefcheck.Core (Flavor (GitHub))
import Xrefcheck.Scan (ecIgnoreExternalRefsToL)
import Xrefcheck.Verify (VerifyError (..), checkExternalResource)
@@ -48,27 +48,27 @@ test_FtpLinks = askOption $ \(FtpHostOpt host) -> do
testGroup "Ftp links handler"
[ testCase "handles correct link to file" $ do
let link = host <> "/pub/file_exists.txt"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles empty link (host only)" $ do
let link = host
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles correct link to non empty directory" $ do
let link = host <> "/pub/"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles correct link to empty directory" $ do
let link = host <> "/empty/"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "throws exception when file not found" $ do
let link = host <> "/pub/file_does_not_exists.txt"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
case result of
Right () ->
assertFailure "No exception was raised, FtpEntryDoesNotExist expected"
18 changes: 17 additions & 1 deletion src/Xrefcheck/Config.hs
@@ -7,6 +7,7 @@

module Xrefcheck.Config
( module Xrefcheck.Config
, module Xrefcheck.Data.Redirect
, defConfigText
) where

@@ -15,11 +16,12 @@ import Universum
import Control.Lens (makeLensesWith)
import Data.Aeson (genericParseJSON)
import Data.Yaml (FromJSON (..), decodeEither', prettyPrintParseException, withText)

import Text.Regex.TDFA.Text ()
import Time (KnownRatName, Second, Time (..), unitsP)

import Xrefcheck.Config.Default
import Xrefcheck.Core
import Xrefcheck.Data.Redirect
import Xrefcheck.Scan
import Xrefcheck.Scanners.Markdown
import Xrefcheck.Util (Field, aesonConfigOption, postfixFields)
@@ -78,8 +80,16 @@ data NetworkingConfig' f = NetworkingConfig
-- this `maxTimeoutRetries` option limits only the number of retries
-- caused by timeouts, and `maxRetries` limits the number of retries
-- caused both by 429s and timeouts.
, ncMaxRedirectFollows :: Field f Int
-- ^ Maximum number of links that can be followed in a single redirect
-- chain.
, ncExternalRefRedirects :: Field f RedirectConfig
-- ^ Rules to override the redirect behavior for external references.
} deriving stock (Generic)

-- | A list of custom redirect rules.
type RedirectConfig = [RedirectRule]

-- | Type alias for ScannersConfig' with all required fields.
type ScannersConfig = ScannersConfig' Identity

@@ -119,6 +129,7 @@ overrideConfig config
defScanners = cScanners $ defConfig flavor
defExclusions = cExclusions $ defConfig flavor
defNetworking = cNetworking $ defConfig flavor
defRedirectConfig = []

overrideExclusions exclusionConfig
= ExclusionConfig
@@ -138,11 +149,16 @@ overrideConfig config
, ncDefaultRetryAfter = overrideField ncDefaultRetryAfter
, ncMaxRetries = overrideField ncMaxRetries
, ncMaxTimeoutRetries = overrideField ncMaxTimeoutRetries
, ncMaxRedirectFollows = overrideField ncMaxRedirectFollows
, ncExternalRefRedirects = externalRefRedirects
}
where
overrideField :: (forall f. NetworkingConfig' f -> Field f a) -> a
overrideField field = fromMaybe (field defNetworking) $ field networkingConfig

externalRefRedirects :: RedirectConfig
externalRefRedirects = fromMaybe defRedirectConfig $ ncExternalRefRedirects networkingConfig
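
The fallback logic in `overrideConfig` above treats the two new fields differently: `ncMaxRedirectFollows` falls back to the default configuration, while `ncExternalRefRedirects` falls back to an empty rule list. The sketch below illustrates that difference with cut-down stand-in types. The names `Networking`, `NetworkingOverride`, `defaults` and `applyOverride` are hypothetical, and rules are shown as plain strings rather than `RedirectRule` values.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical, cut-down stand-ins for the real config types.
data Networking = Networking
  { maxRedirectFollows   :: Int
  , externalRefRedirects :: [String]  -- rules shown as plain strings here
  } deriving (Show, Eq)

-- User-supplied values; Nothing means the key was omitted.
data NetworkingOverride = NetworkingOverride
  { ovMaxRedirectFollows   :: Maybe Int
  , ovExternalRefRedirects :: Maybe [String]
  }

defaults :: Networking
defaults = Networking 10 ["permanent -> invalid", "temporary -> valid"]

applyOverride :: NetworkingOverride -> Networking
applyOverride ov = Networking
  { -- An omitted value falls back to the default configuration.
    maxRedirectFollows =
      fromMaybe (maxRedirectFollows defaults) (ovMaxRedirectFollows ov)
    -- An omitted value falls back to an empty list, not to the default rules.
  , externalRefRedirects =
      fromMaybe [] (ovExternalRefRedirects ov)
  }

main :: IO ()
main = do
  -- Both keys omitted: the limit keeps its default, the rule list is empty.
  print (applyOverride (NetworkingOverride Nothing Nothing))
  -- Both keys given: the user values are used as-is.
  print (applyOverride (NetworkingOverride (Just 3) (Just ["custom rule"])))
```

With an empty rule list no rule can match, which is why a config that sets `networking` without `externalRefRedirects` lets every redirect pass.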

-----------------------------------------------------------
-- Yaml instances
-----------------------------------------------------------
61 changes: 47 additions & 14 deletions src/Xrefcheck/Config/Default.hs
@@ -65,22 +65,55 @@ networking:
# On other errors xrefcheck fails immediately, without retrying.
maxRetries: 3

# Querying a given domain that ever returned 429 before,
# this defines how many timeouts are allowed during retries.
#
# For such domains, timeouts likely mean hitting the rate limiter,
# and so xrefcheck considers timeouts in the same way as 429 errors.
#
# For other domains, a timeout results in a respective error, no retry
# attempts will be performed. Use `externalRefCheckTimeout` option
# to increase the time after which timeout is declared.
#
# This option is similar to `maxRetries`, the difference is that
# this `maxTimeoutRetries` option limits only the number of retries
# caused by timeouts, and `maxRetries` limits the number of retries
# caused both by 429s and timeouts.
# Querying a given domain that ever returned 429 before,
# this defines how many timeouts are allowed during retries.
#
# For such domains, timeouts likely mean hitting the rate limiter,
# and so xrefcheck considers timeouts in the same way as 429 errors.
#
# For other domains, a timeout results in a respective error, no retry
# attempts will be performed. Use `externalRefCheckTimeout` option
# to increase the time after which timeout is declared.
#
# This option is similar to `maxRetries`, the difference is that
# this `maxTimeoutRetries` option limits only the number of retries
# caused by timeouts, and `maxRetries` limits the number of retries
# caused both by 429s and timeouts.
maxTimeoutRetries: 1

# Maximum number of links that can be followed in a single redirect
# chain.
#
# The link is considered as invalid if the limit is exceeded.
maxRedirectFollows: 10

# Rules to override the redirect behavior for external references that
# match, where
# - 'from' is a regular expression for the source link in a single
# redirection step. Its absence means that every link matches.
# - 'to' is a regular expression for the target link in a single
# redirection step. Its absence also means that every link matches.
# - 'on' accepts 'temporary', 'permanent' or a specific redirect HTTP code.
# Its absence also means that every response code matches.
# - 'outcome' accepts 'valid', 'invalid' or 'follow'. The last one follows
# the redirect by applying the same configuration rules so, for instance,
# exclusion rules would also apply to the following links.
#
# The first one that matches is applied, and the link is considered
# as valid if none of them does match.
#
# If a value is provided for 'networking' but not for 'externalRefRedirects',
# then it will default to an empty list of rules and every redirect will pass.
externalRefRedirects:
- from: .*
to: .*
on: permanent
outcome: invalid
- from: .*
to: .*
on: temporary
outcome: valid
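
The `maxRedirectFollows` limit documented above can be illustrated with a small pure model of a redirect chain. This is a sketch under stated assumptions, not xrefcheck's code: `ChainResult` and `followChain` are hypothetical names, an association list stands in for real HTTP responses, and the exact boundary behavior of the real limit may differ. A negative limit is treated as "no limit", as the README change states.

```haskell
-- Hypothetical result type for this sketch.
data ChainResult = Reached String | LimitExceeded
  deriving (Show, Eq)

-- `redirects` pairs each link with its redirect target, standing in for
-- real HTTP responses. A link without an entry is a final destination.
followChain :: Int -> [(String, String)] -> String -> ChainResult
followChain limit redirects = go (0 :: Int)
  where
    go followed link =
      case lookup link redirects of
        Nothing -> Reached link
        Just next
          -- A negative limit disables the check entirely.
          | limit >= 0 && followed >= limit -> LimitExceeded
          | otherwise -> go (followed + 1) next

-- A chain with three redirect steps: a -> b -> c -> d.
chain :: [(String, String)]
chain = [("a", "b"), ("b", "c"), ("c", "d")]

main :: IO ()
main = do
  print (followChain 2 chain "a")     -- LimitExceeded
  print (followChain 3 chain "a")     -- Reached "d"
  print (followChain (-1) chain "a")  -- Reached "d"
```

Note that with the limit disabled this model does not terminate on a cyclic chain, which is one reason to keep a finite default.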

# Parameters of scanners for various file types.
scanners:
# On 'anchor not found' error, how much similar anchors should be displayed as