Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anchor links are encoded in html-proofer output #757

Closed
mattcone opened this issue Sep 8, 2022 · 11 comments
Closed

Anchor links are encoded in html-proofer output #757

mattcone opened this issue Sep 8, 2022 · 11 comments

Comments

@mattcone
Copy link

mattcone commented Sep 8, 2022

Hey there! In our translated files, Japanese anchor links are being encoded in the html-proofer output even though the anchor links are not encoded in our HTML. So, for example, we have the following in an HTML file:

<a href="#authorization-ヘッダーを作成する">Authorization ヘッダーを作成する</a>

But see the following in the html-proofer output:

internally linking to #authorization-%E3%83%98%E3%83%83%E3%83%80%E3%83%BC%E3%82%92%E4%BD%9C%E6%88%90%E3%81%99%E3%82%8B; the file exists, but the hash ‘authorization-%E3%83%98%E3%83%83%E3%83%80%E3%83%BC%E3%82%92%E4%BD%9C%E6%88%90%E3%81%99%E3%82%8B’ does not

I'm not sure if this is a bug with html-proofer or a configuration problem on our end, but I don't recall seeing this before. Thanks for any help you can provide.

@gjtorikian
Copy link
Owner

Oh I’m sorry I missed this. Was the problem fixed?

@ahpook
Copy link
Contributor

ahpook commented Apr 12, 2024

I am seeing this same problem w/ html-proofer 5.0.8

* At _site/de/best-practices/index.html:591:

  internally linking to building-community/#teilen-sie-die-eigent%C3%BCmerschaft-an-ihrem-projekt, which does not exist

I note the encoding is correct in the debug output as it's checking:

(354 / 1737) Internal link ../building-community/#teilen-sie-die-eigentümerschaft-an-ihrem-projekt: Checking 2 references

ahpook added a commit to ahpook/html-proofer that referenced this issue Apr 30, 2024
There wasn't a specific test for this, and it fails on some setups.

The link in the fixture is a direct copy-paste from one that fails on my project.

But the test passes on this test suite... So there is something else going on.

Relates to gjtorikian#757
@ahpook
Copy link
Contributor

ahpook commented Apr 30, 2024

I have done some more investigation on this issue. I think it's caused by an upstream dependency rather than something in html-proofer itself.

You can see the error in action on this test run : https://github.com/github/opensource.guide/actions/runs/8887912548/job/24403938507?pr=3214

I added one of the failing links from that project as a test fixture, and added a test to exercise it, which you can see on my fork:
ahpook@7da8d04

That test passes on my branch. And I'm using the same ruby version on both html-proofer and opensource.guide. Looking through the Gemfile, one possible difference is that html-proofer is using unicode-display_width (2.5.0) and opensource.guide is at unicode-display_width (1.8.0), I think due to a jekyll dependency, but manually pinning that new version did not fix the problem.

@gjtorikian
Copy link
Owner

Yo, thanks for the nice write-up and repros. Always a huge help to see.

I'll clone the guides repo and see what I can find, if not an error in this project than at least to help someone else pinpoint this down the line.

@gjtorikian
Copy link
Owner

I can't seem to repro this. I created some text to get this outputted into the site build:

<p><a href="#eigentümerschaft">Hash mit ein umlaut</a></p>

<h2 id="eigentümerschaft">Hier gib's</h2>

Running script/test let this pass.

I do note that both the Ruby (2.7.5) and Node (12.14.0) are pretty wildly out of date, so it's possible that some Ruby gem dependency which needs to be updated, hasn't been, because of the old Ruby version.

@ahpook
Copy link
Contributor

ahpook commented May 2, 2024

OK I was able to make got a reproducible test case on my fork of html-proofer : ahpook@bd8e0d4

As I put in that commit, only links to different page using "../page/#häsh" syntax failed. rake test now gives me:

Failures:

  1) HTMLProofer::Check::Links passes for internal links with non-ASCII characters from implicit indexes
     Failure/Error: expect(proofer.failed_checks).to(eq([]))

       expected: []
            got: [#<HTMLProofer::Failure:0x00000001067d3988 @path="spec/html-proofer/fixtures/links/hash_nonascii_dir/...inking to target/#ei
gent%C3%BCmerschaft, which does not exist", @line=3, @status=nil, @content=nil>]

       (compared using ==)

       Diff:
       @@ -1 +1 @@
       -[]
       +[#<HTMLProofer::Failure:0x00000001067d3988 @path="spec/html-proofer/fixtures/links/hash_nonascii_dir/source/index.html", @check_name=
"Links > Internal", @description="internally linking to target/#eigent%C3%BCmerschaft, which does not exist", @line=3, @status=nil, @content=
nil>]

     # ./spec/html-proofer/links_spec.rb:52:in `block (2 levels) in <top (required)>'

The second anchor link in the file is found ok.

@ahpook
Copy link
Contributor

ahpook commented May 2, 2024

I added a little debug line to lib/html_proofer/url_validator/internal.rb to see how that url is getting constructed and the difference between the two anchors - note the umlaut one comes back without a ../ prefix...

         matched_files.each do |metadata|
            url = HTMLProofer::Attribute::Url.new(@runner, link, base_url: metadata[:base_url], source: metadata[:source], filename: metadata[:filename])
            @logger.log(:debug, "Internal link #{link}: base_url: #{metadata[:base_url]}, source: #{metadata[:source]}, filename: #{metadata[:filename]} -- resulting url: #{url}")
❯ /opt/homebrew/Cellar/ruby@3.2/3.2.2/bin/ruby exe/htmlproofer --log_level=debug spec/html-proofer/fixtures/links/hash_nonascii_dir
Checking 2 internal links
(1 / 2) Internal link ../target/#eigentümerschaft: Checking 1 reference
Internal link ../target/#eigentümerschaft: base_url: , source: spec/html-proofer/fixtures/links/hash_nonascii_dir, filename: spec/html-proofer/fixtures/links/hash_nonascii_dir/source/index.html -- resulting url: target/#eigent%C3%BCmerschaft
(2 / 2) Internal link ../target/#mitarbeit: Checking 1 reference
Internal link ../target/#mitarbeit: base_url: , source: spec/html-proofer/fixtures/links/hash_nonascii_dir, filename: spec/html-proofer/fixtures/links/hash_nonascii_dir/source/index.html -- resulting url: ../target/#mitarbeit

...

@ahpook
Copy link
Contributor

ahpook commented May 2, 2024

So, in the clean_url! method, with only ascii characters, the regexp matches and the method returns early.

If there are non-ascii characters, the URL is passed through to Addressable::URI's normalize method, which is (wrongly, i think) stripping out the leading ../ and causing the links not to be found because they are now relative to the current directory.

https://github.com/gjtorikian/html-proofer/blob/main/lib/html_proofer/attribute/url.rb#L223-L228

Pry session:

[5] pry(#<HTMLProofer::Attribute::Url>)> @url
=> "../target/#eigentümerschaft"
[3] pry(#<HTMLProofer::Attribute::Url>)> @url =~ /^([!#{Regexp.last_match(0)}-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/
=> nil
...
[1] pry(#<HTMLProofer::Attribute::Url>)> @url
=> "../target/#mitarbeit"
[2] pry(#<HTMLProofer::Attribute::Url>)> @url =~ /^([!#{Regexp.last_match(0)}-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/
=> 0
[3] pry(#<HTMLProofer::Attribute::Url>)> Regexp.last_match
=> #<MatchData "../target/#mitarbeit" 1:"t">

@ahpook
Copy link
Contributor

ahpook commented May 13, 2024

@gjtorikian Hi, sending a friendly ping on this - do you have a suggestion how to proceed? I think either (a) the regexp check in clean_url! ought to accept UTF-8 characters rather than only urlencoded + ascii, or (b) some deeper debugging is needed to figure out why the normalize method is stripping relative pathing if there are non-ascii characters.

I'm blocked on working a patch up because not sure what kinds of "bad" urls you've seen and are trying to catch with the regexp. Also, maybe I am misunderstanding this, but where is the previous regexp that the string interpolation #{Regexp.last_match(0)} is catching?

@gjtorikian
Copy link
Owner

Sorry but can you create a PR with a reproducible test case? I remember digging through your comments and commits but I got distracted, and having a single reproducible issue would help in my fix resolution. Thank you!

@ahpook
Copy link
Contributor

ahpook commented May 23, 2024

I'll make a PR with just my test changes to avoid regressions, but the refactor you did in #820 fixed this issue as well!

ahpook added a commit to ahpook/html-proofer that referenced this issue May 23, 2024
There wasn't a specific test for this, and it fails on some setups.

The link in the fixture is a direct copy-paste from one that fails on my project.

But the test passes on this test suite... So there is something else going on.

Relates to gjtorikian#757
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants