Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ignored behaving like notScanned #118

Closed
dcastro opened this issue Jul 22, 2022 · 10 comments
Closed

[BUG] ignored behaving like notScanned #118

dcastro opened this issue Jul 22, 2022 · 10 comments

Comments

@dcastro
Copy link
Member

dcastro commented Jul 22, 2022

Description

In the default config file, generated by dump-config, the traversal.ignored option is documented as follows:

  # Files and folders which we pretend do not exist
  # (so they are neither analyzed nor can be referenced).

The --help text also says the following:

--ignored FILEPATH       Files and folders which we pretend do not exist.

However, it seems like this is not entirely true. If a file contains a reference to an --ignored file, that reference will still be verified (see demo below).

So the question now is: is the implementation of ignored wrong? Or is the implementation right, and the documentation/help text wrong?

If it's the latter (the implementation is right, the docs are wrong), then how is ignored different from notScanned? They seem to have the same behavior at the moment... If a file is in either ignored or in notScanned, then we will not check that its references are valid, but references to those files will still be checked.

To Reproduce

Steps to reproduce the behavior:

Create a folder with these 2 files

mkdir temp && cd temp
echo "[b](b.md)" >> a.md
touch b.md

Run xrefcheck and tell it to "pretend b.md doesn't exist":

$ xrefcheck --ignored b.md 
All repository links are valid.

Expected behavior

I expected the link from a.md to b.md to fail.

Environment

@dcastro
Copy link
Member Author

dcastro commented Jul 22, 2022

cc'ing @Martoon-00

@dcastro
Copy link
Member Author

dcastro commented Jul 22, 2022

Also, it's extremely confusing:

  • That notScanned is in the verification section of the config, and not in the traversal section as I would expect
  • The vcNotScanned option is used in the Verify.hs module, and not in the Scan.hs module?

Furthermore, it seems we used the terms "traversal" and "scan" in the documentation/config/source code to refer to the same thing.. I've opened #119 to solve this particular problem.

@dcastro
Copy link
Member Author

dcastro commented Jul 22, 2022

Okay, I think I found the difference between ignored and notScanned.

(Note: there's currently a bug in the latest master, see #120, so I'm using an older version from d6fb95f)

mkdir temp && cd temp
echo "[b](b.md#invalid-anchor)" >> a.md
echo "[?](invalid-file.md)" >> b.md

To sum up, there are 3 things we need to check:

  1. The reference in a.md to b.md is valid, i.e. the file b.md exists
    1.1. The anchor #invalid-anchor exists in b.md
  2. The reference in b.md to invalid-file.md is valid, i.e. the file invalid-file.md exists
  • If we run xrefcheck with the default config, all 3 checks will fail, as expected. It'll complain about both the invalid anchor in a.md and the invalid link in b.md:
=== Invalid references found ===                                                                                                           

  ➥  In file ./a.md
     bad reference (relative) at src:1:1-24:
       - text: "b"
       - link: b.md
       - anchor: invalid-anchor
     
     ⛀  Anchor 'invalid-anchor' is not present
     
     
  ➥  In file ./b.md
     bad reference (relative) at src:1:1-20:
       - text: "?"
       - link: invalid-file.md
       - anchor: -
     
     ⛀  File does not exist:
        ./invalid-file.md
     
     
Invalid references dumped, 2 in total.
  • Adding b.md to notScanned will tell xrefcheck to not check any references in b.md. BUT the reference in a.md to b.md#invalid-anchor will still be checked. In other words, check (2) will be skipped, but not (1) or (1.1).
== Invalid references found ===

  ➥  In file ./a.md
     bad reference (relative) at src:1:1-24:
       - text: "b"
       - link: b.md
       - anchor: invalid-anchor
     
     ⛀  Anchor 'invalid-anchor' is not present
     
     
Invalid references dumped, 1 in total.
  • Adding b.md to ignored will tell xrefcheck to not check any references in b.md AND not parse any anchors in b.md. BUT we will still check that the reference in a.md to b.md is valid. In other words, checks (2) and (1.1) will be skipped, but not (1).
All repository links are valid.

To re-cap:

  • If a link references a file in either ignored or notScanned, we will always check that the file exists
  • If a link references a specific anchor in a file:
    • If that file is in ignored, we will not check the anchor
    • If that file is in notScanned, we will check the anchor

My conclusions:

  • The docs/help text for ignored say we "pretend the file doesn't exist", but we don't. We still check the file does exist. So this is definitely wrong.
  • The difference between ignored and notScanned is extremely subtle (it really only affects anchors) and not documented anywhere.
  • I don't see how ignored is useful... what's the point of checking that the file exists, but not check the anchor? In what situation would someone want to use ignored instead of notScanned?

Given the above, I think the simplest solution may be to just delete ignored and stick to notScanned.

@Martoon-00
Copy link
Member

Ohhh, this is an extremely weird situation. I now realize that I got an inconsistent vision on this project, and this issue is the result 😒

The documentation is correct about the meaning of notScanned and ignored, and the implementation is incorrect. The difference was not assumed to be only in the anchor being checked or not.

What I imagined initially

I thought that we will have two stages - collection of files and verification. Initially verification had to account only for the files collected at the previous stage, and all other files - they like did not exist in the filesystem. If there was a reference to a file not from the list of collected files, we would throw an error.

In this sense, the fact that ignored appears in traversal section of config and notScanned in verification section - makes sense. ignored affects only traversal stage - which files we collect and thus later account for, and notScanned affects verification - is something verified or not.

Though I now admit that this separation in config relies on understanding of how the tool works, and this is not good.

And the current implementation of ignored is written with the mentioned model in mind.

What I got

Eventually I made verification seek the referred files not in the list of the files collected at traversal stage, but in the filysystem using IO. Probably I was motivated by two reasons:

  1. It was simpler to implement. We have to recognize that reference to ../../a/b can refer to a file with path /q/a/b, but I didn't yet write a proper canonicalization. This causes problems, like too many .. will make the tool to read the file outside of the project and this in fact sounds very bad.
  2. This allowed to collect only .md files at the traversal stage and not keep in memory all the filepaths of the project.

And now

Maybe first it would be worth re-assessing, whether do we want to collect all the files in RepoInfo (analyzing only .md files, and for other files just remember that they are present), and use that in verification instead of IO. This may be more performant, and will likely fix the ignored's flag logic automatically.

If we don't want to change this part, then we should include ignored into the verification logic.

Overall interface

Actually, given these problems, I would not be surprised if all the exclusion settings (ignored, notScanned, vritualFiles) will be completely revised eventually and re-added from scratch.

I think the difference between ignored and notScanned (as per documentation) matters and can be useful. A use case:

  • I copied some file verbatim from somewhere else, e.g. it may be an example of something or a resource for tests. Then I don't want to analyse references in it (they are likely invalid), but any reference to it should be checked.

Also, we need at least one option that would say "this directory should not be even traversed", otherwise things like .stack-word and node-modules will consume most of the time to traverse.

So at least ignored vs notScanned distinction (as they are documented) seems to make some sense.

What do you think?

@dcastro
Copy link
Member Author

dcastro commented Sep 9, 2022

Thank you for the deep clarification @Martoon-00, much appreciated! 😄

One more re-cap!

How the options should be working, as per the docs:

  • ignoreRefs (list of regexes)
    • Do not verify external references to these links.
  • virtualFiles (glob patterns)
    • Do not verify local references to these files.
  • notScanned (glob patterns)
    • Do not verify local references from these files.
  • ignored (glob patterns):
    • Pretend these files don't exist.
    • References to these files will fail.
    • References from these files will not be verified.

How they work at the moment:

  • ignoreRefs: (works as intended)
  • virtualFiles: (works as intended)
  • notScanned: (works as intended)
  • ignored:
    • We accidentally verify references to these files. We do not verify anchors.
    • References from these files will not be verified (works as intended).

Though I now admit that this separation in config relies on understanding of how the tool works, and this is not good.

Agreed. The user only cares about "what" the application does, not "how" it does it. I think that in the config file, we should get rid of the verification and traversal top-level keys, and just move everything (externalRefCheckTimeout, virtualFiles, etc) to the top-level. And also make sure each option's description is very clear about what it does / does not do. What do you think?

I copied some file verbatim from somewhere else, e.g. it may be an example of something or a resource for tests. Then I don't want to analyse references in it (they are likely invalid), but any reference to it should be checked.

Just to clarify, in the situation you described, the user would use notScanned, right?

And a use-case for ignored would be to ignore build-related files (like dist-newstyle and node_modules), right?

So at least ignored vs notScanned distinction (as they are documented) seems to make some sense.
What do you think?

I agree, both use cases seem useful.


Yet another problem I hadn't thought of before: the meaning of the word "ignore" in the ignored and in the ignoreRefs options is different.

  • In ignored, "ignore" means "if you reference this file, verification will fail"
  • In ignoreRefs, "ignore" means "if you reference this URL, verification will skip it and succeed"

What I would like to propose:

  • skipExternalRefsTo (regexes): does what ignoreRefs currently does.
  • skipLocalRefsTo (glob patterns): does what virtualFiles currently does.
  • skipRefsFrom (glob patterns): does what notScanned currently does.
  • ignore (glob patterns): does what ignored is supposed to do.

This solves the naming inconsistencies I mentioned above. Furthermore, these names describe more accurately "what" the options do, and do not mention "how" they do it.

@dcastro
Copy link
Member Author

dcastro commented Sep 9, 2022

Furthermore, I now realize the recently-introduced checkLocalhost option could instead just be another item in ignoreRefs/skipExternalRefsTo 🤔

@dcastro
Copy link
Member Author

dcastro commented Sep 9, 2022

This causes problems, like too many .. will make the tool to read the file outside of the project and this in fact sounds very bad.

That does sound bad - just created an issue for this: #138

@Martoon-00
Copy link
Member

Martoon-00 commented Sep 10, 2022

One more re-cap!

  • virtualFiles is rather: do not verify local references from these files.

And that seems like another point of weirdness, since this mirrors the behaviour of ignored or notScanned.

I think will benefit from an option that is "do not verify local references to these files" too though. And there seems to be two sub-options here:

  • Fail on refs to the given existing files. Not sure when this is useful.
  • Do not fail on refs to non-existing files. In theory, this is useful if I'm writing an instruction and want to provide a link to files that are built in the process. But probably a local ignore command via a comment suits better here.

Other items seems to be correct 👍


I think that in the config file, we should get rid of the verification and traversal top-level keys, and just move everything (externalRefCheckTimeout, virtualFiles, etc) to the top-level.

I fully agree about getting rid of verification and traversal keys. But I'm also afraid that making the config flat would turn it into a mess where it is hard to find a necessary option, given how many options it now has.

Maybe introduce at least the following sections?:

  • Exclusions: ignored, notScanned, ...
  • Networking options: externalRefCheckTimeout (externalRef prefix might be stripped at this point), ignoreAuthFailures, defaultRetryAfter.

Just to clarify, in the situation you described, the user would use notScanned, right?

And a use-case for ignored would be to ignore build-related files (like dist-newstyle and node_modules), right?

Yes to both.


Yet another problem I hadn't thought of before: the meaning of the word "ignore" in the ignored and in the ignoreRefs options is different.

Hmm, really. Also there is <--! xrefcheck: ignore X --> command that can be inserted in the file, it also has semantics as of ignoreRefs - skipping verification.


What I would like to propose:

Sounds good, yeah 👍

Maybe also distinguish skipLocalRefsTo (verification succeeds) and failLocalRefsTo (verification fails saying that file does not exist) or something like that.

@dcastro
Copy link
Member Author

dcastro commented Sep 11, 2022

virtualFiles is rather: do not verify local references from these files.

I think it's "to", not "from". The docs say:

  # Glob patterns describing the files which do not physically exist in the
  # repository but should be treated as existing nevertheless.
  virtualFiles:
    - ../../../issues
    - ../../../issues/*
    - ../../../pulls
    - ../../../pulls/*

And, in the implementation, it seems to be used to "skip" checking whether a target file exists.


Fail on refs to the given existing files. Not sure when this is useful.

Hmm let's not implement this then. At least not until a real-world use case presents itself.


Maybe introduce at least the following sections?

Agreed, let's have exclusions, networking and scanners. I'll make a follow up issue.


Hmm, really. Also there is <--! xrefcheck: ignore X --> command

Oh ouch, good catch, completely forgot about that one!

What's worse is that --ignored <file> and <--! xrefcheck: ignore file --> really look like they should do the same thing, but they don't!

<--! xrefcheck: ignore file --> doesn't ignore the file itself. It only ignores references from this file, not to this file.

How about this: if we (1) rename the ignore file annotation to ignore all, and (2) change our perspective a little bit and look at it from another angle, then the semantics can be unified once again:

  • Config:
    • ignoreExternalRefsTo: "ignore" means "pretend references to these external resources don't exist"
    • ignoreLocalRefsTo: "ignore" means "pretend references to these files don't exist"
    • ignoreRefsFrom: "ignore" means "pretend references from these files don't exist"
    • ignore: "ignore" means "pretend these files don't exist at all"
  • Annotations
    • ignore link: "ignore" means "pretend this reference doesn't exist"
    • ignore paragraph: "ignore" means "pretend references in this paragraph don't exist"
    • ignore all: "ignore" means "pretend all references in this file don't exist"

@Martoon-00
Copy link
Member

I think it's "to", not "from". The docs say:

Ah right, that's true.

How about this: if we (1) rename the ignore file annotation to ignore all, and (2) change our perspective a little bit and look at it from another angle, then the semantics can be unified once again:

Sounds good to me 👍

I only have one comment, will leave it in #171.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants