Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roffit: fix special characters and broken links #37

Closed
wants to merge 0 commits into from

Conversation

jhauga
Copy link

@jhauga jhauga commented Jul 8, 2023

Pull Summary

Tested this with files matching issues this pull request resolves and with repos that have an extensive amount of files roffit can convert to html. Made two supporting repos with test results. They are:

  1. File Test - roffit-test
  2. Repo Test - roffit-full-test

The two tests used more or less the same procedure, and focused on:

  1. Files with differences
  2. Build time

The two support repos have full details of the procedure, but in short the steps were:

  1. Extract data and ready test elements
  2. Clone source (bagder) and pull (jhauga) roffit repo.
  3. Run roffit and time each runs' build time.
  4. Determine files with differences.
  5. Calculate stats for files with difference and the build times.
  6. Output results to html file.

jhauga

This comment was marked as outdated.

Copy link
Owner

@bagder bagder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any actual roffit changes here? Does it only add tests? How can it then solve any problems?

Also, this has to be possible to test with much less data added. 23K of data is too excessive.

@jhauga

This comment was marked as outdated.

@bagder
Copy link
Owner

bagder commented Jul 11, 2023

I approve of extending the tests to test for more problems, but I think this PR goes about it the wrong way.

A first take would be to extend the existing test case with more text for the problems you have identified. Not adding thousands of lines of new files.

roffit is not curl specific, it converts man pages to html so it has no knowledge of or special handling for curl related issues.

@jhauga jhauga changed the title Special Characters and Broken Links roffit: fix special characters and broken links Jul 13, 2023
Copy link
Author

@jhauga jhauga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with comments on Makefile, fixBrokenLink, fixSpecialCharacters, and roffit files.

@jhauga jhauga requested a review from bagder July 14, 2023 20:17
@bagder
Copy link
Owner

bagder commented Jul 14, 2023

Sorry, but this PR does not fix roffit as described. It just fixes two minor static HTML texts.

@jhauga
Copy link
Author

jhauga commented Jul 14, 2023

Sorry, but this PR does not fix roffit as described. It just fixes two minor static HTML texts.

That's fine. Thank you for considering it. Adding two bash scripts was the only way I was able to resolve issue 36. Learning more about perl is on my todo list, so if the issue is still up in several months I may try again.

Regardless, this has been a very enjoyable and rewarding process!

@bagder
Copy link
Owner

bagder commented Jul 14, 2023

Separate scripts cannot solve this issue. roffit is the program that runs that converts nroff files, so you need to fix that file. Having separate shell scripts is not a solution to the problem!

@jhauga
Copy link
Author

jhauga commented Jul 14, 2023

Separate scripts cannot solve this issue. roffit is the program that runs that converts nroff files, so you need to fix that file. Having separate shell scripts is not a solution to the problem!

Ok. Closing this for now and will reopen when able to implement in roffit.

@jhauga jhauga closed this Jul 14, 2023
@jhauga jhauga reopened this Jul 16, 2023
Copy link
Owner

@bagder bagder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is totally unacceptable code. Make it clean pure perl and do not generate temp files.

Copy link
Author

@jhauga jhauga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 259 to 285 - perl changes

roffit Outdated
@@ -256,6 +256,33 @@ sub linkfile {
# convert https://, http:// and ftp:// URLs to <a href> links
$field =~ s/(^|\W)((https|http|ftp):\/\/[a-z0-9\-._~%:\/?\#\[\]\@!\$&'()*+,;=]+)/$1<a href=\"$2\">$2<\/a>/gi;

# remove punctuation characters at end of url
$field =~ s/(href=\")([^\"]*)(&quot;)/$1$2/g;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not remove punctuation. What is this for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&quot; seemed to be a troublemaker so for following lines to work it had to be removed first in process.

roffit Outdated
@@ -256,6 +256,33 @@ sub linkfile {
# convert https://, http:// and ftp:// URLs to <a href> links
$field =~ s/(^|\W)((https|http|ftp):\/\/[a-z0-9\-._~%:\/?\#\[\]\@!\$&'()*+,;=]+)/$1<a href=\"$2\">$2<\/a>/gi;

# remove punctuation characters at end of url
$field =~ s/(href=\")([^\"]*)(&quot;)/$1$2/g;
$field =~ s/(href=\"(http:\/\/|https:\/\/|ftp:\/\/)\")*([,.;()*])*(\">)/$1$4/g;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this rather correct the pattern on line 257 to not include punctuation as terminating symbols?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried with variations of pattern. Either my corrections were flawed or the URLs need to be built first.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this rather correct the pattern on line 257 to not include punctuation as terminating symbols?

Kept getting unexpected results like <a href="https://curl">https://curl</a>.se/, or inline punctuations being removed.

roffit Outdated Show resolved Hide resolved
roffit Outdated
# add anchor and anchor links to special character options
my $specialcharacters = "`~!@\$%^*()-_=+{};:\'\\|,.?";
$field =~ s/(<a name=)\"-\"(><\/a>)(<span class=\"nroffip\">-)([$specialcharacters]|[\[\]])(,|<\/span>)/$1\"-$4\"$2$3$4/g;
$field =~ s/(<a class=\"emphasis\") href=\"#-\">-([$specialcharacters]|[\[\]])(,|<\/a>)/$1 href=\"#-$2\">-$2/g;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes should rather be done as updates to text2name() and do_encode() instead of adding this extra stage of edits.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unable to achieve in subroutines, either anchor links were removed, all anchors (name attribute) removed, or no effect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes should rather be done as updates to text2name() and do_encode() instead of adding this extra stage of edits.

With text2name and do_encode new things break. Some examples:

  • name="socks5h://" (should be name="socks5h")
  • name="AUNDERSCORE" (shoud be name="A_UNDERSCORE")
  • name=""--any-option" (2 "" at start).

roffit Outdated
$field =~ s/(<a name=)\"-\"(><\/a>)(<span class=\"nroffip\">-)($htmlentity)(,|<\/span>)/$1\"-$4\"$2$3$4/g;
$field =~ s/&#35;/hash/ | $field =~ s/&amp;/ampersand/ | $field =~ s/&#39;/single-quote/ |
$field =~ s/&lt;/less-than/ | $field =~ s/&gt;/greater-than/;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these ones could also be included into that change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created subroutine field_anchor to handle repeating blocks. Required $field and $htmlentity to be defined globally.

roffit Outdated Show resolved Hide resolved
testpage.1 Outdated

.IP "-?, --special-char"
Options with special characters will be included in anchors.
Such as this option with special character \fI\-?, \-\-special\-char\fP.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any code handling this new option. Plus, I don't think this needs to be managed by an option - I think it makes sense to improve the general handling of anchors and links so it should probably be used by default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test build. Moved into description. Anyway options are named wrong (--mumbo and --jumbo), which gave me impression that it was only for build test purposes. Does not seem like an issue though (maybe good for someone to find who has never contributed on github also) .

testpage.1 Outdated Show resolved Hide resolved
@jhauga
Copy link
Author

jhauga commented Jul 21, 2023

Regarding some suggested edits:

Unable to achieve fix with text2name and do_encode.
In short - new things break.

Quick takeaway ( where - (1) should be => (2) becomes ):
name="--any-option" => name=""--any-option" ("" at start)
name="ANY_NAMES" => name="ANYNAMES" (_ is removed)

For details see:

collapsed section

Detailed report (explains above):


Since curl.1 is a good use case for thoroughly testing a roffit
build I made two files, running "diff" command on them:

  1. with current badger roffit
  2. with (multiple) edited roffit

Below are sample outcomes from two instances of build, and then results
from using current edits.

do_encode --- example case removes ":

sub do_encode($) {
    return encode_entities(shift, q{<>&'#});
}

results in:
name=""--any-option" (2 "" at start)

text2name --- example case adds special characters to $text:

sub text2name { .....
   $text =~ s/[^a-zA-Z0-9-`~!@\$%^*()-_=+{};:\'\\|,.?]//g;
   ...}

results in:
name="socks5h://" (should be name="socks5h").
or
name="AUNDERSCORE" (shoud be name="A_UNDERSCORE").

Several variations of the above edits were done with similar (if not the same) outcome.

So in conclusion; either my approaches were flawed, or I need to study the roffit
file some more.

When "diff" ran using -- 2. with current pull roffit --

  • roffit built as expected, nothing new broken
  • Outputted difference resolved roffit #36 (didn't know that'd close it) and #42.

Edits

  • Reordered htmlentity variable to be same order as call in do_encode (excluding ").

  • Edited regular expression for special character options to fix a missed bug
    that removed inline comma in option text. (i.e. "-? --opt" => "-?, --opt").

  • Removed condition in field_anchor subroutine to reduce line numbers,
    moving line needed to line number 289.

@jhauga jhauga requested a review from bagder July 21, 2023 21:38
@jhauga jhauga marked this pull request as draft August 5, 2023 21:08
@jhauga jhauga marked this pull request as ready for review August 26, 2023 23:27
@jhauga jhauga marked this pull request as draft September 3, 2023 17:36
@jhauga jhauga marked this pull request as ready for review September 4, 2023 01:03
@jhauga jhauga closed this Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants