Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply unicode normalization in help mode #41086

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

epithet
Copy link
Contributor

@epithet epithet commented Jun 4, 2021

This adapts the help mode to recent changes from #40948 and #25157.

Before:

help?> −
"−" can be typed by \minus<tab>

search:

  No documentation found.

  Binding − does not exist.

Now:

help?> −
"−" can be typed by \minus<tab>

search: - ->

  -(x)

  Unary minus operator.
  ...

Same for help?> \cdotp<tab> when \cdot has been bound.

Note that help?> 1e−2 already worked.

@mbauman mbauman added unicode Related to unicode characters and encodings docsystem The documentation building system labels Jun 4, 2021
macro repl(ex, brief::Bool=false) repl(ex; brief=brief) end
macro repl(io, ex, brief) repl(io, ex; brief=brief) end

function repl(io::IO, s::Symbol; brief::Bool=true)
str_orig = string(s)
s = normalize_symbol(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is constructing symbols explicitly, without having parsed them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's quite early in _helpmode(io::IO, line::AbstractString). The whole input line is made into a symbol (apart from stripped white space at the ends):

assym = Symbol(line)

In the case of \minus, \cdot, and \cdotp, the if branch directly afterwards is executed because Base.isoperator(assym) == true, and so @repl is called with assym, which in turn calls repl(io::IO, s::Symbol; brief::Bool=true).

Copy link
Contributor Author

@epithet epithet Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The information which exact characters were entered is needed at that point for repl_latex($io, $str_orig), which shows

"−" can be typed by \minus<tab>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just dug a bit deeper and found #19464. µ (U+00B5 micro) and μ (U+03BC greek small letter mu) both show

"μ" can be typed by \mu<tab>

which is technically only true for the latter. This is because the code goes through the else branch, so the symbol is not assym but x which comes from Meta.parse. But the binding is shown correctly, which is much more important.

In it's current state, this PR is really only a quick fix for specific cases which behave clearly wrong in my opinion. For anything further, the desired behavior should be decided on first.

Copy link
Contributor Author

@epithet epithet Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I found out why the symbols are constructed explicitly without parsing. That was introduced in 08663d4 because of keywords and is also needed for += and .=:

julia> x = Meta.parse("try", raise = false, depwarn = false)
:($(Expr(:incomplete, "incomplete: premature end of input")))

julia> x = Meta.parse("+=", raise = false, depwarn = false)
:($(Expr(:error, "invalid identifier name \"+=\"")))

julia> x = Meta.parse(".=", raise = false, depwarn = false)
:($(Expr(:error, "invalid identifier name \".=\"")))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, it's better to call repl_latex earlier, when we still have the line? I don't find it easy to make changes to help mode since test coverage seems to be low.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this could be better for += and .=, what do you say?

x = Meta.parse(line, raise = false, depwarn = false)
if Meta.isexpr(x, :error)
    asinfix = Meta.parse("x "*line*" x", raise = false, depwarn = false)
    if asinfix isa Expr && length(asinfix.args) == 2 && asinfix.args[1] == asinfix.args[2] == :x    
        x = asinfix.head
    end   
end   

I'm not entirely sure if that breaks other things.

@epithet
Copy link
Contributor Author

epithet commented Jun 8, 2021

I cleaned up a bit according to my findings. It seems to work well. We get LaTeX tab-completion help now for more cases:

help?> "λ"
""λ"" can be typed by "\lambda<tab>"

Base.codeunits
...

help?> 1e−2
"1e−2" can be typed by 1e\minus<tab>2

  Float64(x [, mode::RoundingMode])
...

µ (U+00B5 micro) and μ (U+03BC greek small letter mu) are technically more correct, now. I'm not sure what should be displayed in this case. We could also use Base.Unicode.normalize(line, :NFKC) (I think that's the correct one):

help?> µμ
"µμ" can be typed by µ\mu<tab>
...

−= is also fixed, but unfortunately, broadcasting operators are broken now. I'll fix this tomorrow.

@epithet epithet force-pushed the helpmode-unicode-normalization branch from 3d1e701 to 1eea0b4 Compare June 9, 2021 07:30
@epithet
Copy link
Contributor Author

epithet commented Jun 9, 2021

Now, the following cases are handled: , −=, .−, .−=, 1e−2.
And I also fixed ⊻=, which was already broken in 1.6.1.

Open question: should the LaTeX completion help show how the original input can be typed, or how a normalized version can be typed? If normalized, only something like NFKC (e.g. µ (micro) -> \mu) or including custom substitutions like \cdotp -> \cdot and \minus -> -?

@epithet epithet force-pushed the helpmode-unicode-normalization branch 2 times, most recently from 44a0619 to a0a9fc9 Compare June 9, 2021 10:58
@epithet epithet force-pushed the helpmode-unicode-normalization branch from a0a9fc9 to 034fffd Compare June 9, 2021 20:01
stdlib/REPL/src/docview.jl Outdated Show resolved Hide resolved
stdlib/REPL/src/docview.jl Outdated Show resolved Hide resolved
base/docs/Docs.jl Outdated Show resolved Hide resolved
stdlib/REPL/src/docview.jl Outdated Show resolved Hide resolved
stdlib/REPL/src/docview.jl Outdated Show resolved Hide resolved
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions to improve clarity in some places, but otherwise should be great to merge now. Thanks!

stdlib/REPL/src/docview.jl Outdated Show resolved Hide resolved
@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Jul 19, 2021
@vtjnash
Copy link
Member

vtjnash commented Jul 19, 2021

Decided simply to apply my comments, since it was all minor stuff, so that this can be merged as soon as CI finishes (if I didn't type something wrong)

@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 3, 2021

I'm removing the merge me label, because the CI failures are not unrelated or intermittent failures but are rather related to the contents of the PR.

@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Aug 3, 2021
@stevengj
Copy link
Member

stevengj commented Oct 8, 2021

Ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docsystem The documentation building system unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants