Embedding Markup inside markup #60

Closed
finanalyst opened this issue Nov 23, 2024 · 14 comments

@finanalyst
Contributor

@thoughtstream @lizmat This issue arises with :allow and in particular some examples in the RakuDoc specification.

Several markup codes, such as E and X, have more complex internal structures. Some examples embed B<> inside them in a way that creates invalid code.
Consider:

=for code :allow<B>
 1. an entity E<B<this is fallback |> raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>

The first two are difficult to render, but the third will render easily. The problem is that once the B<...> is rendered into a string (which may include extra elements for the output format) and then substituted back into the embedding markup, the contents of the embedding markup are no longer valid, and the inner structure of those contents is lost.

The following, however, seem to be OK because whatever is on the left of the | is a string.

=for code :allow<B>
 1. an entity E<B<this is fallback> | raquo>
 2. an indexed item X<B<this is display text> | one, two>
 3. an alias A<B<this is fallback >|ALIAS_NAME>

@thoughtstream is this analysis correct?

FYI, We are close to removing most of the remaining RakuDoc issues.

@thoughtstream

I confess that I'm a little confused by this question.

By default, everything inside a code block should be parsed as literal vanilla plaintext
(i.e. not any kind of RakuDoc markup), except when a particular formatting code
has been specifically enabled via an :allow.

So I would have expected that your first “invalid code” example would generate an AST something like:

    RakuAST::Doc::Block.new(
        type       => "code",
        paragraphs => (
            "1. an entity E<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is fallback |" )
            ),
            " raquo>\n2. an indexed item X<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is display text |" )
            ),
            "one, two>\n3. an alias A<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is fallback |" )
            ),
            "ALIAS_NAME>\n"
        )
    )

...which I would have thought would be relatively easy to render correctly.

Is that not the AST you get?
If not, why not?

(BTW, I certainly agree that, if the example code were actual raw RakuDoc, rather than the contents of a code block,
then placing the closing angle of the B<> after the | is definitely an error. But that doesn't seem to be
what you're asking about here.)

@finanalyst
Contributor Author

@thoughtstream @lizmat See the snippet below for what we get at present:

$ raku -e 'say "tmp/test.rakudoc".IO.slurp,"\nGENERATES\n","tmp/test.rakudoc".IO.slurp.AST'
=begin rakudoc
=for code :allow<B>
 1. an entity E<B<this is fallback |>raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>
=end rakudoc
GENERATES
RakuAST::StatementList.new(
  RakuAST::Doc::Block.new(
    type       => "rakudoc",
    paragraphs => (
      RakuAST::Doc::Block.new(
        type       => "code",
        for        => True,
        config     => ${:allow(RakuAST::QuotedString.new(
          processors => <words val>,
          segments   => (
            RakuAST::StrLiteral.new("B"),
          )
        ))},
        paragraphs => (
          RakuAST::Doc::Paragraph.new(
            " 1. an entity ",
            RakuAST::Doc::Markup.new(
              letter => "E",
              opener => "<",
              closer => ">",
              meta   => (
                :raquo("»"),
              )
            ),
            "\n 2. an indexed item ",
            RakuAST::Doc::Markup.new(
              letter => "X",
              opener => "<",
              closer => ">",
              atoms  => (
                RakuAST::Doc::Markup.new(
                  letter => "B",
                  opener => "<",
                  closer => ">",
                  atoms  => (
                    "this is display text |",
                  )
                ),
                "one, two",
              )
            ),
            "\n 3. an alias ",
            RakuAST::Doc::Markup.new(
              letter => "A",
              opener => "<",
              closer => ">",
              atoms  => (
                RakuAST::Doc::Markup.new(
                  letter => "B",
                  opener => "<",
                  closer => ">",
                  atoms  => (
                    "this is fallback |",
                  )
                ),
                "ALIAS_NAME",
              )
            ),
            "\n"
          ),
        )
      ),
    )
  )
)

@lizmat

lizmat commented Nov 24, 2024

Question: does the | (pipe) symbol have meaning inside a B<>? I am living under the assumption that it doesn't.

@thoughtstream

Question: does the | (pipe) symbol have meaning inside a B<>? I am living under the assumption that it doesn't.

Your assumption is correct. An interior | has no special meaning to a B<>.

Though, of course, it might still have special meaning to another formatting code that is itself contained within the B<>.
For example, if a code block had an :allow<B E>, then the pipe symbol would be special (but only to the internal E<>)
inside the following B<>:

=for code :allow<B E>
    B<# Take special care if you see the symbol E<for radiation hazards | ATOM SYMBOL> in the input>

@lizmat

lizmat commented Nov 25, 2024

Right, but E<B<for radiation hazards |> ATOM SYMBOL> would be wrong, because then the pipe is part of the B<>. It should be E<B<for radiation hazards >| ATOM SYMBOL>, aka the pipe after the closing > of the B<>.

Right?

@thoughtstream

Right, but E<B<for radiation hazards |> ATOM SYMBOL> would be wrong, because then the pipe is part of the B<>.
It should be E<B<for radiation hazards >| ATOM SYMBOL>, aka the pipe after the closing > of the B<>.

Right?

If it's actual independent RakuDoc source code, yes, that's correct:
E<B<for radiation hazards |> ATOM SYMBOL> would be wrong.

But the case we're dealing with here is when something like that is inside a code block:

=for code :allow<B>
    if you see E<B<for radiation hazards |> ATOM SYMBOL>

In this case, the entire construct is perfectly valid, because that's not actually an E<> there.
It's just some non-special component of the verbatim code.

As far as RakuDoc is concerned, in terms of things that are special inside the code block,
that's pretty much the equivalent of:

=for code :allow<B>
?????????????????B<???????????????????????>?????????????

(where ? represents something that isn’t special or meaningful to RakuDoc).
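
A minimal Python sketch of that view. It assumes (this is not stated in the thread) that a plain `<` is balanced against its own `>`, so that only the closer of an allowed code counts as special. The name `mask_non_special` is hypothetical; the function replaces every character that is not a delimiter of an allowed formatting code with `?`.

```python
def mask_non_special(text: str, allowed: set) -> str:
    """Replace everything that is not special to RakuDoc with '?'.

    Only the letter-plus-'<' opener and the matching '>' closer of an
    allowed formatting code are kept. Newlines are preserved.
    """
    out = []
    stack = []  # True: opener of an allowed code; False: a plain '<'
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in allowed and text[i + 1 : i + 2] == "<":
            # Opener of an allowed formatting code, e.g. "B<"
            out.append(ch + "<")
            stack.append(True)
            i += 2
            continue
        if ch == "<":
            # A plain '<' (assumption: it pairs with a later plain '>')
            stack.append(False)
        elif ch == ">" and stack:
            if stack.pop():
                # Closer of an allowed formatting code
                out.append(">")
                i += 1
                continue
        out.append(ch if ch == "\n" else "?")
        i += 1
    return "".join(out)


line = "if you see E<B<for radiation hazards |> ATOM SYMBOL>"
print(mask_non_special(line, {"B"}))
```

With only B allowed, the E< opener and the final > are masked like any other character, which is the point being made: the E<> is not markup at all in this context.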

@finanalyst
Contributor Author

@thoughtstream I get the feeling that =for code :allow< A B U> is introducing another dialect, because RakuDoc inside a code block is special.
At this point I am unsure quite how special.
My feeling is that arbitrary embedding of markup within markup leads to insoluble conditions.
My first assumption was that we use the normal grammar to parse the contents of the code block. If a letter is allowed, the renderer returns the content of that markup as rendered output, while the contents of an unallowed letter are embedded in a string consisting of the letter and the surrounding brackets.
This leads to conflicts when the inner markup changes the meaning of the content as far as the outer markup is concerned.
So I think that an allowed markup letter can only be used if its rendered contents are valid inside the enclosing letter.
Hence E<B<left double quotes |>raquo> is invalid because the contents of E cannot be parsed. However, E<B<left double quotes>|raquo> is valid because the contents of E can be parsed.
I think that parsing the entire string is important because we may want to allow some code with :allow<B L X V>, e.g.

=for code :allow<B L X V>
 Some alias with a link and indexed item L<X<A<B<ALIAS_LINK> V«L<some link|#header1>»>|sample link>|#header2>

I think that is correct code. The idea is to create a clickable link to header2 inside a code block, whilst also indexing the Alias. But no Alias is actually created. The renderer would process B L X and V, but not A.

The question then is what happens when any of the letters B, L, X, or V is taken out of the allow list?

@thoughtstream

@finanalyst, you raise some good points.

My view of the matter is as follows...

Normally, everything inside a code block is treated as “something foreign to RakuDoc
so we don’t care about the internal structure of it”. Which means those contents
need not be parsed at all...just matched with a minimal .*?.

In other words, the default rules to parse a code block would be something like:

rule code-block {

    ^^ \h* '=code' >>
    <code-contents>
    <blank-line>
  |
    ^^ \h* '=for'   'code' <metaoption>* \n
    <code-contents>
    <blank-line>
  |
    ^^ \h* '=begin' 'code' <metaoption>* \n
    <code-contents>
    ^^ \h* '=end'   'code' \h*           \n

}

token code-contents {
    .*?
}

token blank-line {
    ^^ \h* $$
}

But if an :allow is one of the metaoptions, then the value of that metaoption
is supposed to configure the way contents are matched, so the parser needs
to be somewhat more sophisticated. Something like:

rule code-block {

    ^^ \h* '=code' >>
    <code-contents>
    <blank-line>
  |
    # Capture :allow option separately...
    ^^ \h* '=for'   'code' [ <allow-option> | <metaoption>]* \n

    # Then pass the allowed values to the contents parser...
    <code-contents($<allow-option><value>)>

    <blank-line>
  |
    # And the same here...
    ^^ \h* '=begin' 'code' [ <allow-option> | <metaoption>]* \n
    <code-contents($<allow-option><value>)>
    ^^ \h* '=end'   'code' \h*           \n

}

token code-contents ($allowed) {
    [
        @($allowed.words)                # Match any of the allowed format code letters
        '<'                              # ...then the left delimiter
        <code-contents($allowed)>        # ...then any nested code contents
        '>'                              # ...then the right delimiter
    |
        .                                # Or else any single non-special character
    ]*?
}

In other words, after we parse an :allow option we pass that option’s values into the
code-contents parser, so that it can parse those particular formatting codes specially.

Of course, in the real parser, the parsing of allowed formatting codes would have to
be more sophisticated, to account for the different structures of various formatting
codes, and to allow for different delimiters than just a single <..> pair.

But the above example (which, BTW, I haven’t verified!) should at least illustrate the
approach I had presupposed for parsing code blocks with :allow exceptions.
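
The approach can also be sketched outside a Raku grammar. The following Python is only an illustration, under the same simplifications as the rules above (single <..> delimiters, no per-code internal structure). The name `parse_code_contents` and its tuple-based node shape are hypothetical stand-ins, not the RakuAST API.

```python
def parse_code_contents(text: str, allowed: set) -> list:
    """Parse code-block contents into strings and (letter, children) nodes.

    Only letters in `allowed` open a formatting code; everything else,
    including other letter-plus-'<' sequences, stays literal text.
    """
    nodes, _ = _parse(text, 0, allowed, nested=False)
    return nodes


def _parse(text, pos, allowed, nested):
    nodes, buf = [], []

    def flush():
        if buf:
            nodes.append("".join(buf))
            buf.clear()

    while pos < len(text):
        ch = text[pos]
        if ch in allowed and text[pos + 1 : pos + 2] == "<":
            # Try to parse a nested, allowed formatting code
            children, end = _parse(text, pos + 2, allowed, nested=True)
            if end is not None:
                flush()
                nodes.append((ch, children))
                pos = end
                continue
            # No closing '>' was found: fall through, treat as literal
        if nested and ch == ">":
            # First unclaimed '>' closes the innermost allowed code
            flush()
            return nodes, pos + 1
        buf.append(ch)
        pos += 1

    flush()
    # Reaching the end inside a nested code means it was never closed
    return nodes, (None if nested else pos)


tree = parse_code_contents(" 1. an entity E<B<this is fallback |> raquo>", {"B"})
print(tree)
```

With `{"B"}` allowed, the E< and its closing > end up in the surrounding literal strings, mirroring the AST expected earlier in this thread. Passing `{"B", "E"}` instead makes E<> a node of its own.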

Having said that, I begin to wonder whether :allow is just too difficult to (ahem) allow.
In reality, the only formatting codes people are likely to actually want to place inside
a code block are: B<>, I<>, U<>, H<>, J<>, T<>, K<>, O<>, R<>, and V<>.
All of which have no special internal structure.

So now I’m wondering whether, instead of a full :allow mechanism, we could just
define a second kind of code block (perhaps a formcode block), which automatically
allows all of those formatting codes.

In which case, our problematical example would become:

=formcode
 1. an entity E<B<this is fallback |> raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>

Now, since E<>, X<>, and A<> are never special in a formcode block,
there’s no need for special parsing of them. Or, in other words,
the rules for parsing a formcode can be hardwired, without any
run-time reconfiguration of the contents parser:

rule formcode-block {

    ^^ \h* '=formcode' >>
    <formcode-contents>
    <blank-line>
  |
    ^^ \h* '=for'   'formcode'  <metaoption>* \n
    <formcode-contents>
    <blank-line>
  |
    ^^ \h* '=begin' 'formcode'  <metaoption>* \n
    <formcode-contents>
    ^^ \h* '=end'   'formcode'  \h*           \n

}

token formcode-contents {
    [
        # Fixed set of permitted formatting codes (possibly nested)...
        <[BHIJKORTUV]>  '<'  <formcode-contents>  '>'
    |
        # Anything else...
        .
    ]*?
}

It wouldn’t be as powerful or as flexible as the :allow option,
but it might be a whole lot simpler to implement (and to use).

The only real downside is that, if you didn’t want one or more of the
permitted formatting codes to actually behave like a formatting code
(i.e. you wanted it to just be literal contents), you’d have to escape
that particular code with a V<>.

For example:

=begin code :allow< B R > :lang<raku>
sub demo {
    B<say> 'Hello R<name>';
    I<note> 'The I format is not recognised';
    U<warn> 'The U format is not recognised either';
}
=end code

...would then have to be written:

=begin formcode :lang<raku>
sub demo {
    B<say> 'Hello R<name>';
    V<I><note> 'The I format is not recognised';
    V<U><warn> 'The U format is not recognised either';
}
=end formcode

That would probably be more annoying (and error-prone) for those of us who really like to mark-up our code,
but maybe that occasional inconvenience would be an acceptable price to pay in order to get the feature at all.
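
To illustrate the rewriting burden, here is a rough Python sketch that converts the contents of an :allow block into the equivalent formcode contents by prefixing each non-allowed code letter with a V<> escape. The helper `allow_to_formcode` is hypothetical, and the textual match (a permitted letter immediately followed by <) is deliberately crude.

```python
import re

# Letters that the proposed formcode block would always treat as markup
FORMCODE_LETTERS = set("BHIJKORTUV")


def allow_to_formcode(code: str, allowed: set) -> str:
    """Escape formcode letters that the original :allow list did not enable.

    `I<note>` becomes `V<I><note>` when I is not in `allowed`, so the
    letter is emitted verbatim and no longer opens a formatting code.
    """
    to_escape = FORMCODE_LETTERS - set(allowed)
    if not to_escape:
        return code
    # A letter to escape, only when directly followed by '<'
    pattern = re.compile("([" + "".join(sorted(to_escape)) + "])(?=<)")
    return pattern.sub(r"V<\1>", code)


source = (
    "sub demo {\n"
    "    B<say> 'Hello R<name>';\n"
    "    I<note> 'The I format is not recognised';\n"
    "    U<warn> 'The U format is not recognised either';\n"
    "}\n"
)
print(allow_to_formcode(source, {"B", "R"}))
```

A real converter would need to respect nesting and alternative delimiters. This version also escapes any identifier that happens to end in one of the letters before a <, which is the same ambiguity a hardwired formcode parser would face.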

@finanalyst
Contributor Author

finanalyst commented Nov 26, 2024 via email

@thoughtstream

Suppose we simply restrict allow to the format codes? They by definition have no internal structure.

I'd be perfectly fine with specifying that :allow is restricted to allowing only one or more of
B<>, I<>, U<>, H<>, J<>, T<>, K<>, O<>, R<>, and V<>.

@patrickbkr
Member

There might be evil edge cases, but Rainbow (a mere highlighter, not a full parser) has just gained support for :allow. Since a highlighter doesn't care about the semantics of the formatting codes, and the syntax is universal, it works with all types of formatting codes in :allow.

But I do wonder, if the parser used to parse =begin code blocks is the same parser that parses regular rakudoc, only with a tiny bit of runtime configuration to reduce the set of formatting codes, why is that more difficult than a fixed set of allowed formatting codes?

@finanalyst
Copy link
Contributor Author

@patrickbkr It is the evil edge cases that cause the problems. :)

  • I need to look at how you handle :allow to see what you are doing
  • It does seem that this specification should have a clever way to implement it
  • The problem for the full parser is that HTML is not the only output format to be considered, and Rainbow is not the only highlighter.
    • Take for example a code block with :lang<Haskell> where the syntax highlighting is done in-browser using highlight-js, and :allow< B L > is also enabled. L has internal structure that is intended to turn into a link.

@patrickbkr
Member

patrickbkr commented Nov 27, 2024

  • The problem for the full parser is that HTML is not the only output format to be considered, and Rainbow is not the only highlighter.
    • Take for example a code block with :lang<Haskell> where the syntax highlighting is done in-browser using highlight-js, and :allow< B L > is also enabled. L has internal structure that is intended to turn into a link.

Yeah. That issue I don't have. In Rainbow I only have to spit out a flat list of tokens, not a tree. As such I don't care about composability. Rainbow does it like this:

  1. Parse the block contents as if they were ordinary RakuDoc (with the :allow-ed markup).
  2. Split the resulting token stream into text and markup, taking note where in the text the markup was found.
  3. Tokenize the text again using a suitable tokenizer (currently only Raku code is recognized, but other tokenizers could be plugged in).
  4. Reinsert the RakuDoc markup removed in step two. (It's sometimes necessary to split a token in two when the RakuDoc token being reinserted falls in the middle of it.)
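
The four steps above can be sketched in Python as follows. This is not Rainbow's implementation: the function names are hypothetical, nesting-aware parsing is reduced to a simple scan, and a toy regex tokenizer stands in for the real Raku tokenizer.

```python
import re


def split_markup(text, allowed):
    """Steps 1-2: strip allowed markup, recording where each piece was.

    Returns (plain_text, marks), where marks is a list of
    (offset_in_plain_text, marker) in document order.
    """
    plain, marks, open_codes = [], [], 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in allowed and text[i + 1 : i + 2] == "<":
            marks.append((len(plain), ch + "<"))
            open_codes += 1
            i += 2
        elif ch == ">" and open_codes:
            # Assumption: inside an allowed code, '>' always closes it
            marks.append((len(plain), ">"))
            open_codes -= 1
            i += 1
        else:
            plain.append(ch)
            i += 1
    return "".join(plain), marks


def tokenize(plain):
    """Step 3: toy tokenizer (words, whitespace runs, single symbols)."""
    return re.findall(r"\w+|\s+|[^\w\s]", plain)


def reinsert(tokens, marks):
    """Step 4: merge markup back in, splitting tokens where needed."""
    out, pos, k = [], 0, 0
    for tok in tokens:
        end = pos + len(tok)
        while k < len(marks) and marks[k][0] < end:
            offset, marker = marks[k]
            if offset > pos:
                # Marker falls inside this token: emit the first part
                out.append(("code", tok[: offset - pos]))
                tok = tok[offset - pos :]
                pos = offset
            out.append(("markup", marker))
            k += 1
        out.append(("code", tok))
        pos = end
    # Markers positioned at the very end of the text
    out.extend(("markup", marker) for _, marker in marks[k:])
    return out


def highlight(text, allowed):
    plain, marks = split_markup(text, allowed)
    return reinsert(tokenize(plain), marks)


for kind, value in highlight("B<say> 'Hello R<name>'", {"B", "R"}):
    print(kind, repr(value))
```

Because the result is a flat list, concatenating the token values reproduces the original text, and a marker that lands mid-token simply yields two shorter tokens.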

This does get tricky when the output is a tree and not a flat token list, because you then need to untangle partially overlapping nodes, e.g. <i>Italic<b>bold-italic</i>bold</b>. I believe this is doable, but annoying. It boils down to merging two independent syntax trees into one. I'd guess this is a well-known problem in computer science. Ping @antononcube, this seems to be an issue right up your alley. Does this ring a bell?

finanalyst added a commit that referenced this issue Nov 29, 2024
finanalyst added a commit that referenced this issue Nov 30, 2024
* relates to issue #60

* re-order sequence
@finanalyst
Contributor Author

I'm closing this issue here but mentioning it in the suggestions for V3. The specification has some commented out examples related in some way to this issue, but not quite.
