Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#25] Redirect links with configuration rules #250

Merged
merged 2 commits into from
Dec 30, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 16 additions & 12 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@ Unreleased

* [#176](https://github.com/serokell/xrefcheck/pull/176)
+ Enabled `autolink` extension for `cmark-gfm`, so now we're finding strings
like `www.google.com` or `https://google.com`, treating them as links
and checking.
like `www.google.com` or `https://google.com`, treating them as links
and checking.
* [#175](https://github.com/serokell/xrefcheck/pull/175)
+ Reorganize top-level config keys.
* [#178](https://github.com/serokell/xrefcheck/pull/178)
Expand All @@ -19,13 +19,13 @@ Unreleased
+ Add support for image links.
* [#199](https://github.com/serokell/xrefcheck/pull/199)
+ Now annotation
`<!-- xrefcheck: ignore all -->` instead of `<!-- xrefcheck: ignore file -->`
should be used to disable checking for links in file, so it's clearer that
file itself is not ignored (and links can target it).
`<!-- xrefcheck: ignore all -->` instead of `<!-- xrefcheck: ignore file -->`
should be used to disable checking for links in file, so it's clearer that
file itself is not ignored (and links can target it).
* [#215](https://github.com/serokell/xrefcheck/pull/215)
+ Now we notify user when there are scannable files that were not added to Git
yet. Also added CLI option `--include-untracked` to scan such files and treat
as existing.
yet. Also added CLI option `--include-untracked` to scan such files and treat
as existing.
* [#191](https://github.com/serokell/xrefcheck/pull/191)
+ Now we consider slash `/` (and only it) as path separator in local links for all OS,
so xrefcheck's report is OS-independent
Expand All @@ -40,10 +40,14 @@ Unreleased
redirect responses (i.e. 301 and 308) and passes for temporary ones (i.e. 302, 303, 307).
* [#231](https://github.com/serokell/xrefcheck/pull/231)
+ Anchor analysis takes now into account the appropriate case-sensitivity depending on
the configured Markdown flavour.
the configured Markdown flavour.
* [#254](https://github.com/serokell/xrefcheck/pull/254)
+ Now the `dump-config` command does not overwrite a file unless explicitly told with a
`--force` flag. Also, a `--stdout` flag allows to print the config to stdout instead.
the configured Markdown flavour.
* [#250](https://github.com/serokell/xrefcheck/pull/250)
+ Now the redirect behavior for external references can be modified via rules in the
configuration file with the `externalRefRedirects` parameter.

0.2.2
==========
Expand Down Expand Up @@ -95,7 +99,7 @@ Unreleased
+ Make possible to specify whether ignore localhost links, use
`check-localhost` CLA argument (by default localhost links will not be checked).
+ Make possible to ignore auth failures (assume 'protected' links
valid), use `ignoreAuthFailures` parameter of config.
valid), use `ignoreAuthFailures` parameter of config.
* [#66](https://github.com/serokell/xrefcheck/pull/66)
+ Added support for ftp links.
* [#74](https://github.com/serokell/xrefcheck/pull/83)
Expand Down Expand Up @@ -144,10 +148,10 @@ Unreleased
+ Switch to lts-17.3.
* [#53](https://github.com/serokell/xrefcheck/pull/53)
+ Make possible to include a regular expression in
`ignoreRefs` parameter of config to ignore external
references.
`ignoreRefs` parameter of config to ignore external
references.
+ Add support of right in-place ignoring annotations
such as `ignore file`, `ignore paragraph` and `ignore link`.
such as `ignore file`, `ignore paragraph` and `ignore link`.

0.1.2
=======
Expand Down
32 changes: 30 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -141,8 +141,36 @@ There are several ways to fix this:
* This behavior can be disabled by setting `ignoreAuthFailures: false` in the config file.

1. How does `xrefcheck` handle redirects?
* Permanent redirects (i.e. 301 and 308) are reported as errors.
* Temporary redirects (i.e. 302, 303 and 307) are assumed to be valid.
* The rules from the default configuration are as follows:
* Permanent redirects (i.e. 301 and 308) are reported as errors.
* Temporary redirects (i.e. 302, 303 and 307) are assumed to be valid.
Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved
* Redirect rules can be specified with the `externalRefRedirects` parameter within `networking`, which accepts an array of
rules with keys `from`, `to`, `on` and `outcome`. The rule applied is the first one that matches with
the `from`, `to` and `on` fields, if any, where
* `from` is a regular expression, as in `ignoreExternalRefsTo`, for the source link in a single redirection step. Its absence means that
every link matches.
* `to` is a regular expression for the target link in a single redirection step. Its absence also means that every link matches.
* `on` accepts `temporary`, `permanent` or a specific redirect HTTP code. Its absence also means that
every response code matches.
* The `outcome` parameter accepts `valid`, `invalid` or `follow`. The last one follows the redirect by applying the
same configuration rules.

Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved
For example, this configuration forbids 307 redirects to a specific domain and makes redirections from HTTP to HTTPS to be followed:

```
externalRefRedirects:
- to: "https?://forbidden.com.*"
on: 307
outcome: invalid
- from: "http://.*"
to: "https://.*"
outcome: follow
```
Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved

The first one applies if both of them match.

* The number of redirects allowed in a single redirect chain is limited and can be configured with the
`maxRedirectFollows` parameter, also within `networking`. A number smaller than 0 disables the limit.

1. How does `xrefcheck` handle localhost links?
* By default, `xrefcheck` will ignore links to localhost.
Expand Down
12 changes: 6 additions & 6 deletions ftp-tests/Test/Xrefcheck/FtpLinks.hs
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ import Test.Tasty (TestTree, askOption, testGroup)
import Test.Tasty.HUnit (assertBool, assertFailure, testCase, (@?=))
import Test.Tasty.Options as Tasty (IsOption (..), OptionDescription (Option), safeRead)

import Xrefcheck.Config (Config, cExclusionsL, defConfig)
import Xrefcheck.Config
import Xrefcheck.Core (Flavor (GitHub))
import Xrefcheck.Scan (ecIgnoreExternalRefsToL)
import Xrefcheck.Verify (VerifyError (..), checkExternalResource)
Expand Down Expand Up @@ -48,27 +48,27 @@ test_FtpLinks = askOption $ \(FtpHostOpt host) -> do
testGroup "Ftp links handler"
[ testCase "handles correct link to file" $ do
let link = host <> "/pub/file_exists.txt"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles empty link (host only)" $ do
let link = host
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles correct link to non empty directory" $ do
let link = host <> "/pub/"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "handles correct link to empty directory" $ do
let link = host <> "/empty/"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
result @?= Right ()

, testCase "throws exception when file not found" $ do
let link = host <> "/pub/file_does_not_exists.txt"
result <- runExceptT $ checkExternalResource config link
result <- runExceptT $ checkExternalResource emptyChain config link
case result of
Right () ->
assertFailure "No exception was raised, FtpEntryDoesNotExist expected"
Expand Down
17 changes: 15 additions & 2 deletions src/Xrefcheck/Config.hs
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

module Xrefcheck.Config
( module Xrefcheck.Config
, module Xrefcheck.Data.Redirect
, defConfigText
) where

Expand All @@ -15,11 +16,12 @@ import Universum
import Control.Lens (makeLensesWith)
import Data.Aeson (genericParseJSON)
import Data.Yaml (FromJSON (..), decodeEither', prettyPrintParseException, withText)

import Text.Regex.TDFA.Text ()
import Time (KnownRatName, Second, Time (..), unitsP)

import Xrefcheck.Config.Default
import Xrefcheck.Core
import Xrefcheck.Data.Redirect
import Xrefcheck.Scan
import Xrefcheck.Scanners.Markdown
import Xrefcheck.Util (Field, aesonConfigOption, postfixFields)
Expand Down Expand Up @@ -78,8 +80,16 @@ data NetworkingConfig' f = NetworkingConfig
-- this `maxTimeoutRetries` option limits only the number of retries
-- caused by timeouts, and `maxRetries` limits the number of retries
-- caused both by 429s and timeouts.
, ncMaxRedirectFollows :: Field f Int
-- ^ Maximum number of links that can be followed in a single redirect
-- chain.
, ncExternalRefRedirects :: Field f RedirectConfig
-- ^ Rules to override the redirect behavior for external references.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually make it optional and treat it as if the user said: I don't care about redirects, handle them somehow. And I would default this to [].

And later when we see a redirect that matched no rule, we would treat that as OK, since we have to be optimistic.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we would be changing the default behavior that we previously implemented. Right? It is now provided with these two default rules:

- to: .*
  on: permanent
  outcome: invalid
- to: .*
  on: temporary
  outcome: valid

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think we really want two things:

  1. By default, suggest to the user those defaults you implemented in the previous PR: temporary=ok, permanent=failure.
  2. Give the user a simple way to say "don't bother me with redirects".

And I think, making ncExternalRefRedirects optional and defaulting it to [] that would then be interpreted as "always ok" is the best way to handle the second point.

And IMO insisting on the old behaviour is not so important, different users may find it weird depending on the use cases they face in real life. It is easy to obtain our recommended defaults for config, and I think that's enough.

A made a little detour with my reasoning for this, but this is how I see it now. There are other options to approach all this, but to me [] default yet seems to be the winning one.

Copy link
Contributor

@YuriRomanowski YuriRomanowski Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use [] for the defaults, but add our previous behaviour as an example in the default config comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something. If we default ncExternalRefRedirects to [] (in the default config I guess) then, how do we suggest the previously implemented defaults? With a section in the README file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe we could write it as a comment in the default config, as our "recommendation" :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under "using [] as default" I understood - if ncExternalRefRedirects is missing in the config, parse it as if it was [] (it really required clarification). And our recommendation for this field will be in the default config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't considered that option. It sounds good 👍

} deriving stock (Generic)

-- | A list of custom redirect rules.
type RedirectConfig = [RedirectRule]
Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved

-- | Type alias for ScannersConfig' with all required fields.
type ScannersConfig = ScannersConfig' Identity

Expand Down Expand Up @@ -118,7 +128,8 @@ overrideConfig config

defScanners = cScanners $ defConfig flavor
defExclusions = cExclusions $ defConfig flavor
defNetworking = cNetworking $ defConfig flavor
defNetworking = cNetworking (defConfig flavor)
& ncExternalRefRedirectsL .~ []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍


overrideExclusions exclusionConfig
= ExclusionConfig
Expand All @@ -138,6 +149,8 @@ overrideConfig config
, ncDefaultRetryAfter = overrideField ncDefaultRetryAfter
, ncMaxRetries = overrideField ncMaxRetries
, ncMaxTimeoutRetries = overrideField ncMaxTimeoutRetries
, ncMaxRedirectFollows = overrideField ncMaxRedirectFollows
, ncExternalRefRedirects = overrideField ncExternalRefRedirects
}
where
overrideField :: (forall f. NetworkingConfig' f -> Field f a) -> a
Expand Down
75 changes: 58 additions & 17 deletions src/Xrefcheck/Config/Default.hs
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ module Xrefcheck.Config.Default where

import Universum

import Fmt (Builder)
import Text.Interpolation.Nyan

import Xrefcheck.Core
Expand Down Expand Up @@ -36,7 +37,7 @@ exclusions:
# List of POSIX extended regular expressions.
ignoreExternalRefsTo:
# Ignore localhost links by default
- ^(https?|ftps?)://(localhost|127\\.0\\.0\\.1).*
- (https?|ftps?)://(localhost|127\\.0\\.0\\.1).*

# Networking parameters.
networking:
Expand Down Expand Up @@ -65,22 +66,45 @@ networking:
# On other errors xrefcheck fails immediately, without retrying.
maxRetries: 3

# Querying a given domain that ever returned 429 before,
# this defines how many timeouts are allowed during retries.
#
# For such domains, timeouts likely mean hitting the rate limiter,
# and so xrefcheck considers timeouts in the same way as 429 errors.
#
# For other domains, a timeout results in a respective error, no retry
# attempts will be performed. Use `externalRefCheckTimeout` option
# to increase the time after which timeout is declared.
#
# This option is similar to `maxRetries`, the difference is that
# this `maxTimeoutRetries` option limits only the number of retries
# caused by timeouts, and `maxRetries` limits the number of retries
# caused both by 429s and timeouts.
# Querying a given domain that ever returned 429 before,
# this defines how many timeouts are allowed during retries.
#
# For such domains, timeouts likely mean hitting the rate limiter,
# and so xrefcheck considers timeouts in the same way as 429 errors.
#
# For other domains, a timeout results in a respective error, no retry
# attempts will be performed. Use `externalRefCheckTimeout` option
# to increase the time after which timeout is declared.
#
# This option is similar to `maxRetries`, the difference is that
# this `maxTimeoutRetries` option limits only the number of retries
# caused by timeouts, and `maxRetries` limits the number of retries
# caused both by 429s and timeouts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

maxTimeoutRetries: 1

# Maximum number of links that can be followed in a single redirect
# chain.
#
# The link is considered as invalid if the limit is exceeded.
maxRedirectFollows: 10
Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved

# Rules to override the redirect behavior for external references that
# match, where
# - 'from' is a regular expression for the source link in a single
# redirection step. Its absence means that every link matches.
# - 'to' is a regular expression for the target link in a single
# redirection step. Its absence also means that every link matches.
# - 'on' accepts 'temporary', 'permanent' or a specific redirect HTTP code.
# Its absence also means that every response code matches.
# - 'outcome' accepts 'valid', 'invalid' or 'follow'. The last one follows
# the redirect by applying the same configuration rules so, for instance,
# exclusion rules would also apply to the following links.
#
# The first one that matches is applied, and the link is considered
# as valid if none of them does match.
externalRefRedirects:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved the default behaviour from code to default rules.

#{interpolateIndentF 4 externalRefRedirects}

# Parameters of scanners for various file types.
scanners:
# On 'anchor not found' error, how much similar anchors should be displayed as
Expand All @@ -95,7 +119,7 @@ scanners:
|]
where
ignoreLocalRefsFrom :: NonEmpty Text
ignoreLocalRefsFrom = fromList $ case flavor of
ignoreLocalRefsFrom = fromList $ case flavor of
GitHub ->
[ ".github/pull_request_template.md"
, ".github/issue_template.md"
Expand All @@ -108,7 +132,7 @@ scanners:
]

ignoreLocalRefsTo :: NonEmpty Text
ignoreLocalRefsTo = fromList $ case flavor of
ignoreLocalRefsTo = fromList $ case flavor of
GitHub ->
[ "../../../issues"
, "../../../issues/*"
Expand All @@ -121,3 +145,20 @@ scanners:
, "../../merge_requests"
, "../../merge_requests/*"
]

externalRefRedirects :: Builder
externalRefRedirects = case flavor of
GitHub ->
[int||
- on: permanent
outcome: invalid|]
GitLab ->
[int||
- on: permanent
outcome: invalid
# GitLab redirects non-existing files to the repository's main page
# with a 302 code instead of answering with a 404 response.
- from: https?://gitlab.com/.*/-/blob/.*
to: https?://gitlab.com/.*
on: 302
outcome: invalid|]
Martoon-00 marked this conversation as resolved.
Show resolved Hide resolved
Loading