Embedding Markup inside markup #60

finanalyst · 2024-11-23T15:45:47Z

@thoughtstream @lizmat This issue arises with :allow and in particular some examples in the RakuDoc specification.

Several markup codes, such as E and X, have a more complex internal structure. The examples embed B<> inside them in a way that then creates invalid code.
Consider

=for code :allow<B>
 1. an entity E<B<this is fallback |> raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>

The first two are difficult to render, but the third will render easily. The problem is that once the B<...> is rendered into a string (which may include extra elements for the output format) and then substituted back into the embedding markup, the contents of the embedding markup are no longer valid, and the inner structure of the contents is lost.

The following, however, seem to be OK because whatever is on the left of the | is a string.

=for code :allow<B>
 1. an entity E<B<this is fallback> | raquo>
 2. an indexed item X<B<this is display text> | one, two>
 3. an alias A<B<this is fallback >|ALIAS_NAME>

@thoughtstream is this analysis correct?

FYI, we are close to removing most of the remaining RakuDoc issues.


thoughtstream · 2024-11-24T03:54:34Z

I confess that I'm a little confused by this question.

By default, everything inside a code block should be parsed as literal vanilla plaintext
(i.e. not any kind of RakuDoc markup), except when a particular formatting code
has been specifically enabled via an :allow.

So I would have expected that your first “invalid code” example would generate an AST something like:

    RakuAST::Doc::Block.new(
        type       => "code",
        paragraphs => (
            "1. an entity E<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is fallback |" )
            ),
            " raquo>\n2. an indexed item X<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is display text |" )
            ),
            "one, two>\n3. an alias A<",
            RakuAST::Doc::Markup.new(
                letter => "B",
                opener => "<",
                closer => ">",
                atoms  => ( "this is fallback |" )
            ),
            "ALIAS_NAME>\n"
        )
    )

...which I would have thought would be relatively easy to render correctly.

Is that not the AST you get?
If not, why not?

(BTW, I certainly agree that, if the example code were actual raw RakuDoc, rather than the contents of a code block,
then placing the closing angle of the B<> after the | is definitely an error. But that doesn't seem to be
what you're asking about here.)

finanalyst · 2024-11-24T10:12:00Z

@thoughtstream @lizmat See the snippet for what we get at present:

$ raku -e 'say "tmp/test.rakudoc".IO.slurp,"\nGENERATES\n","tmp/test.rakudoc".IO.slurp.AST'
=begin rakudoc
=for code :allow<B>
 1. an entity E<B<this is fallback |>raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>
=end rakudoc
GENERATES
RakuAST::StatementList.new(
  RakuAST::Doc::Block.new(
    type       => "rakudoc",
    paragraphs => (
      RakuAST::Doc::Block.new(
        type       => "code",
        for        => True,
        config     => ${:allow(RakuAST::QuotedString.new(
          processors => <words val>,
          segments   => (
            RakuAST::StrLiteral.new("B"),
          )
        ))},
        paragraphs => (
          RakuAST::Doc::Paragraph.new(
            " 1. an entity ",
            RakuAST::Doc::Markup.new(
              letter => "E",
              opener => "<",
              closer => ">",
              meta   => (
                :raquo("»"),
              )
            ),
            "\n 2. an indexed item ",
            RakuAST::Doc::Markup.new(
              letter => "X",
              opener => "<",
              closer => ">",
              atoms  => (
                RakuAST::Doc::Markup.new(
                  letter => "B",
                  opener => "<",
                  closer => ">",
                  atoms  => (
                    "this is display text |",
                  )
                ),
                "one, two",
              )
            ),
            "\n 3. an alias ",
            RakuAST::Doc::Markup.new(
              letter => "A",
              opener => "<",
              closer => ">",
              atoms  => (
                RakuAST::Doc::Markup.new(
                  letter => "B",
                  opener => "<",
                  closer => ">",
                  atoms  => (
                    "this is fallback |",
                  )
                ),
                "ALIAS_NAME",
              )
            ),
            "\n"
          ),
        )
      ),
    )
  )
)

lizmat · 2024-11-24T20:07:23Z

Question: does the | (pipe) symbol have meaning inside a B<>? I am living under the assumption that it doesn't.

thoughtstream · 2024-11-24T22:05:25Z

> Question: does the | (pipe) symbol have meaning inside a B<>? I am living under the assumption that it doesn't.

Your assumption is correct. An interior | has no special meaning to a B<>.

Though, of course, it might still have special meaning to another formatting code that is itself contained within the B<>.
For example, if a code block had an :allow<B E>, then the pipe symbol would be special (but only to the internal E<>)
inside the following B<>:

=for code :allow<B E>
    B<# Take special care if you see the symbol E<for radiation hazards | ATOM SYMBOL> in the input>

lizmat · 2024-11-25T00:13:29Z

Right, but E<B<for radiation hazards |> ATOM SYMBOL> would be wrong, because then the pipe is part of the B<>. It should be E<B<for radiation hazards >| ATOM SYMBOL>, aka the pipe after the closing > of the B<>.

Right?

thoughtstream · 2024-11-25T05:28:34Z

> Right, but E<B<for radiation hazards |> ATOM SYMBOL> would be wrong, because then the pipe is part of the B<>.
> It should be E<B<for radiation hazards >| ATOM SYMBOL>, aka the pipe after the closing > of the B<>.
>
> Right?

If it's actual independent RakuDoc source code, yes, that's correct:
E<B<for radiation hazards |> ATOM SYMBOL> would be wrong.

But the case we're dealing with here is when something like that is inside a code block:

=for code :allow<B>
    if you see E<B<for radiation hazards |> ATOM SYMBOL>

In this case, the entire construct is perfectly valid, because that's not actually an E<> there.
It's just some non-special component of the verbatim code.

As far as RakuDoc is concerned, in terms of things that are special inside the code block,
that's pretty much the equivalent of:

=for code :allow<B>
?????????????????B<???????????????????????>?????????????

(where ? represents something that isn’t special or meaningful to RakuDoc).
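The masking picture above can be reproduced mechanically. What follows is only a minimal sketch, not how any RakuDoc parser actually works: `mask-non-special` is a hypothetical helper, and it assumes allowed codes are not nested and use only single `<...>` delimiters.

```raku
# Hypothetical helper (not part of Rakudo or any RakuDoc tool).
# Replaces every character that is not special to RakuDoc with '?',
# keeping only the delimiters of the allowed formatting codes.
# Assumption: allowed codes are not nested and use single <...> delimiters.
sub mask-non-special(Str $line, Str $allowed) {
    my @letters = $allowed.words;
    my $result  = '';
    my $pos     = 0;
    for $line.match(/ (@letters) '<' (<-[\<\>]>*) '>' /, :g) -> $m {
        $result ~= '?' x ($m.from - $pos);                     # text before the code
        $result ~= ~$m[0] ~ '<' ~ '?' x $m[1].chars ~ '>';     # the code itself
        $pos = $m.to;
    }
    $result ~= '?' x ($line.chars - $pos);                     # trailing text
    $result
}

say mask-non-special('    if you see E<B<for radiation hazards |> ATOM SYMBOL>', 'B');
```

With only B allowed, the `E<` and the final `>` are masked like any other character, which is the point being made: inside the code block there is no E<> at all.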

finanalyst · 2024-11-25T09:53:39Z

@thoughtstream I get the feeling that =for code :allow< A B U> is introducing another dialect because RakuDoc inside a code block is special.
At this point I am unsure quite how special.
My feeling here is that arbitrary embedding of markup within markup is leading to insoluble conditions.
My first assumption was that we use the normal grammar to parse the contents of the code block, which means that if a letter is allowed, the renderer returns the content of that markup as rendered output, while unallowed letters embed the contents within a string consisting of the letter and the surrounding brackets.
This leads to conflicts when the markup itself changes the meaning of the content as far as the outer markup is concerned.
So, I think that an allowed markup letter can only be used if the rendered contents are valid inside another letter.
Hence E<B<left double quotes |>raquo> is invalid because the contents of E cannot be parsed. However, E<B<left double quotes>|raquo> is valid because the contents of E can be parsed.
I think that parsing of the entire string is important because we may want to allow some code with :allow, e.g.

=for code :allow<B L X V>
 Some alias with a link and indexed item L<X<A<B<ALIAS_LINK> V«L<some link|#header1>»>|sample link>|#header2>

I think that is correct code. The idea is to create a clickable link to header2 inside a code block, whilst also indexing the Alias. But no Alias is actually created. The renderer would process B L X and V, but not A.

The question then is what happens when any of the letters BLXV are then taken out of the allow list?

thoughtstream · 2024-11-26T04:39:41Z

@finanalyst, you raise some good points.

My view of the matter is as follows...

Normally, everything inside a code block is treated as “something foreign to RakuDoc
so we don’t care about the internal structure of it”. Which means those contents
need not be parsed at all...just matched with a minimal .*?.

In other words, the default rules to parse a code block would be something like:

rule code-block {

    ^^ \h* '=code' >>
    <code-contents>
    <blank-line>
  |
    ^^ \h* '=for'   'code' <metaoption>* \n
    <code-contents>
    <blank-line>
  |
    ^^ \h* '=begin' 'code' <metaoption>* \n
    <code-contents>
    ^^ \h* '=end'   'code' \h*           \n

}

token code-contents {
    .*?
}

token blank-line {
    ^^ \h* $$
}

But if an :allow is one of the metaoptions, then the value of that metaoption
is supposed to configure the way contents are matched, so the parser needs
to be somewhat more sophisticated. Something like:

rule code-block {

    ^^ \h* '=code' >>
    <code-contents>
    <blank-line>
  |
    # Capture :allow option separately...
    ^^ \h* '=for'   'code' [ <allow-option> | <metaoption>]* \n

    # Then pass the allowed values to the contents parser...
    <code-contents($<allow-option><value>)>

    <blank-line>
  |
    # And the same here...
    ^^ \h* '=begin' 'code' [ <allow-option> | <metaoption>]* \n
    <code-contents($<allow-option><value>)>
    ^^ \h* '=end'   'code' \h*           \n

}

token code-contents ($allowed) {
    [
        @($allowed.words)                # Match any of the allowed format code letters
        '<'                              # ...then the left delimiter
        <code-contents($allowed)>        # ...then any nested code contents
        '>'                              # ...then the right delimiter
    |
        .                                # Or else any single non-special character
    ]*?
}

In other words, after we parse an :allow option we pass that option’s values into the
code-contents parser, so that it can parse those particular formatting codes specially.

Of course, in the real parser, the parsing of allowed formatting codes would have to
be more sophisticated, to account for the different structures of various formatting
codes, and to allow for different delimiters than just a single <..> pair.

But the above example (which, BTW, I haven’t verified!) should at least illustrate the
approach I had presupposed for parsing code blocks with :allow exceptions.
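For comparison, here is a small self-contained variant of the same idea. It is a sketch under assumptions, not the Rakudo implementation: the grammar name `CodeWithAllow`, the dynamic variable `@*ALLOWED` and the helper `segments` are all made up for illustration, every allowed code is treated as having no internal structure, and only single `<...>` delimiters are handled.

```raku
# Illustrative only: names are hypothetical, and every allowed code is
# treated as structure-free with single <...> delimiters.
grammar CodeWithAllow {
    token TOP     { <chunk>* }
    token chunk   { <markup> || <literal> }
    token markup  {
        $<letter>=<[A..Z]>
        <?{ ~$<letter> eq any(@*ALLOWED) }>   # only letters listed in :allow
        '<' <inner>* '>'
    }
    token inner   { <markup> || <-[\>]> }     # nested allowed code, or any char except '>'
    token literal { . }                       # anything else is plain code text
}

# Split code-block contents into literal strings and allowed markup strings.
sub segments(Str $code, *@allowed) {
    my @*ALLOWED = @allowed;
    my $match = CodeWithAllow.parse($code);
    my @out;
    my $text = '';
    for $match<chunk>.list -> $chunk {
        if $chunk<markup> {
            @out.push($text) if $text.chars;
            $text = '';
            @out.push(~$chunk<markup>);
        }
        else {
            $text ~= ~$chunk;
        }
    }
    @out.push($text) if $text.chars;
    @out
}

.say for segments(' 1. an entity E<B<this is fallback |> raquo>', 'B');
```

With only B allowed, the first example line splits into a literal string ending in `E<`, the B<> markup, and a literal string ` raquo>`, which is the shape of the AST suggested earlier in the thread.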

Having said that, I begin to wonder whether :allow is just too difficult to (ahem) allow.
In reality, the only formatting codes people are likely to actually want to place inside
a code block are: B<>, I<>, U<>, H<>, J<>, T<>, K<>, O<>, R<>, and V<>.
All of which have no special internal structure.

So now I’m wondering whether, instead of a full :allow mechanism, what if we just
defined a second kind of code block (perhaps a formcode block), which automatically
allows all of those formatting codes.

In which case, our problematical example would become:

=formcode
 1. an entity E<B<this is fallback |> raquo>
 2. an indexed item X<B<this is display text |>one, two>
 3. an alias A<B<this is fallback |>ALIAS_NAME>

Now, since E<>, X<>, and A<> are never special in a formcode block,
there’s no need for special parsing of them. Or, in other words,
the rules for parsing a formcode can be hardwired, without any
run-time reconfiguration of the contents parser:

rule formcode-block {

    ^^ \h* '=formcode' >>
    <formcode-contents>
    <blank-line>
  |
    ^^ \h* '=for'   'formcode'  <metaoption>* \n
    <formcode-contents>
    <blank-line>
  |
    ^^ \h* '=begin' 'formcode'  <metaoption>* \n
    <formcode-contents>
    ^^ \h* '=end'   'formcode'  \h*           \n

}

token formcode-contents {
    [
        # Fixed set of permitted formatting codes (possibly nested)...
        <[BHIJKORTUV]>  '<'  <formcode-contents>  '>'
    |
        # Anything else...
        .
    ]*?
}

It wouldn’t be as powerful or as flexible as the :allow option,
but it might be a whole lot simpler to implement (and to use).
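To make the hardwired variant concrete, the following sketch parses with the fixed letter set and then strips the formatting codes again, recovering the plain code a reader would copy and paste. `FormCode`, `plain` and `strip-formcode` are hypothetical names; the sketch assumes single `<...>` delimiters and treats the contents of V<> as verbatim.

```raku
# Illustrative only: a hardwired set of structure-free formatting codes.
grammar FormCode {
    token TOP    { <chunk>* }
    token chunk  { <markup> || . }
    token markup { $<letter>=<[BHIJKORTUV]> '<' <inner>* '>' }
    token inner  { <markup> || <-[\>]> }
}

# Text of one parse node with the formatting codes removed.
sub plain($node) {
    with $node<markup> -> $mk {
        my @inner = $mk<inner>.list;
        return $mk<letter> eq 'V'
            ?? @inner.join                  # V<> contents are kept verbatim
            !! @inner.map(&plain).join;     # other codes: drop the wrapper only
    }
    ~$node                                  # a literal character
}

sub strip-formcode(Str $code) {
    FormCode.parse($code)<chunk>.map(&plain).join
}

say strip-formcode(q{B<say> 'Hello R<name>';});
say strip-formcode(q{V<I><note> 'The I format is not recognised';});
```

In the second call the `V<I>` contributes a literal `I`, so the following `<note>` survives as ordinary text.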

The only real downside is that, if you didn’t want one or more of the
permitted formatting codes to actually behave like a formatting code
(i.e. you wanted it to just be literal contents), you’d have to escape
that particular code with a V<>.

For example:

=begin code :allow< B R > :lang<raku>
sub demo {
    B<say> 'Hello R<name>';
    I<note> 'The I format is not recognised';
    U<warn> 'The U format is not recognised either';
}
=end code

...would then have to be written:

=begin formcode :lang<raku>
sub demo {
    B<say> 'Hello R<name>';
    V<I><note> 'The I format is not recognised';
    V<U><warn> 'The U format is not recognised either';
}
=end formcode

That would probably be more annoying (and error-prone) for those of us who really like to mark-up our code,
but maybe that occasional inconvenience would be an acceptable price to pay in order to get the feature at all.

finanalyst · 2024-11-26T08:40:39Z

Suppose we simply restrict allow to the format codes? They by definition have no internal structure.


thoughtstream · 2024-11-26T09:09:37Z

> Suppose we simply restrict allow to the format codes? They by definition have no internal structure.

I'd be perfectly fine with specifying that :allow is restricted to allowing only one or more of
B<>, I<>, U<>, H<>, J<>, T<>, K<>, O<>, R<>, and V<>.
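A renderer could enforce such a restriction with a simple check on the :allow value. The sketch below is only an illustration: the constant `@PERMITTED` and the helper `rejected-allow-letters` are hypothetical names, and the letter list is copied from the comment above.

```raku
# Hypothetical check: which letters in an :allow value fall outside the
# structure-free set listed above?
constant @PERMITTED = <B H I J K O R T U V>;

sub rejected-allow-letters(Str $allow) {
    $allow.words.grep({ not $_ (elem) @PERMITTED }).List
}

say rejected-allow-letters('B R');     # nothing rejected
say rejected-allow-letters('B E X');   # E and X have internal structure
```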

patrickbkr · 2024-11-27T08:59:20Z

There might be evil edge cases, but Rainbow (a mere highlighter, not a full parser) has just gained support for :allow. Because a highlighter doesn't care about the semantics of the formatting codes, and their syntax is the same for all of them, it works with all types of formatting codes in :allow.

But I do wonder, if the parser used to parse =begin code blocks is the same parser that parses regular rakudoc, only with a tiny bit of runtime configuration to reduce the set of formatting codes, why is that more difficult than a fixed set of allowed formatting codes?

finanalyst · 2024-11-27T09:25:53Z

@patrickbkr It is the evil edge cases that cause the problems. :)

I need to look at how you handle :allow to see what you are doing.
It does seem that this specification should have a clever way to implement it.
The problem for the full parser is that HTML is not the only output format to be considered, and Rainbow is not the only highlighter.
Take for example a code block with :lang<Haskell> where the syntax highlighting is done in-browser using highlight-js, and :allow is also enabled. L has internal structure that is intended to turn into a link.

patrickbkr · 2024-11-27T10:18:16Z

> The problem for the full parser is that HTML is not the only output format to be considered, and Rainbow is not the only highlighter.
>
> Take for example a code block with :lang<Haskell> where the syntax highlighting is done in-browser using highlight-js, and :allow is also enabled. L has internal structure that is intended to turn into a link.

Yeah. That issue I don't have. In Rainbow I only have to spit out a flat list of tokens, not a tree. As such I don't care about composability. Rainbow does it like this:

1. Parse the block contents as if they were ordinary RakuDoc (with :allowed markup).
2. Split the resulting token stream into text and markup, noting where in the text each piece of markup was found.
3. Tokenize the text again using a suitable tokenizer (currently only Raku code is recognized, but other tokenizers could be plugged in).
4. Reinsert the RakuDoc markup removed in step two. (It's sometimes necessary to split a token in two if the RakuDoc token to be inserted falls in the middle of it.)
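The steps above can be sketched roughly as follows. This is not Rainbow's code: `reinsert-markup` is a hypothetical function, the "tokenizer" is a toy that splits into word runs and single non-word characters, and the allowed codes are assumed to be non-nested with single `<...>` delimiters.

```raku
# Toy illustration of the flat-token approach; not Rainbow's implementation.
sub reinsert-markup(Str $src, *@allowed) {
    # Steps 1 and 2: separate markup from text, remembering where each
    # delimiter sat as an offset into the plain text.
    my $plain = '';
    my @marks;                       # offset-in-plain => delimiter text
    my $pos = 0;
    for $src.match(/ (@allowed) '<' (<-[\<\>]>*) '>' /, :g) -> $m {
        $plain ~= $src.substr($pos, $m.from - $pos);
        @marks.push: $plain.chars => ~$m[0] ~ '<';
        $plain ~= ~$m[1];
        @marks.push: $plain.chars => '>';
        $pos = $m.to;
    }
    $plain ~= $src.substr($pos);

    # Step 3: tokenize the plain text (toy tokenizer: word runs, or one
    # non-word character at a time).
    my @tokens = $plain.match(/ \w+ | \W /, :g);

    # Step 4: reinsert the delimiters, splitting a token when a delimiter
    # falls in the middle of it.
    my @out;
    for @tokens -> $tok {
        my $start = $tok.from;
        while @marks && @marks[0].key < $tok.to {
            my $mark = @marks.shift;
            @out.push: $plain.substr($start, $mark.key - $start) if $mark.key > $start;
            @out.push: $mark.value;
            $start = $mark.key;
        }
        @out.push: $plain.substr($start, $tok.to - $start);
    }
    @out.append: @marks.map(*.value);    # delimiters at the very end of the text
    @out
}

say reinsert-markup('B<say> foI<ob>ar', 'B', 'I').raku;
```

Here the token `foobar` is split around the I<> code, while the token `say` stays whole because the B<> delimiters sit exactly on its boundaries. Joining the resulting list gives back the original source.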

This does get tricky when the output is a tree and not a flat token list, because you then need to untangle partially overlapping nodes. I.e. "Italic", then "bold-italic", then "bold", where the italic span and the bold span overlap only in the middle. I believe this is doable, but annoying. It boils down to merging two independent syntax trees into one. I'd guess this is a well-known problem in computer science. Ping @antononcube, this seems to be an issue right up your alley. Does this ring a bell?


finanalyst · 2024-12-30T14:42:04Z

I'm closing this issue here but mentioning it in the suggestions for V3. The specification has some commented out examples related in some way to this issue, but not quite.

finanalyst added a commit that referenced this issue Nov 29, 2024

relates to issue #60

1174ca1

finanalyst added a commit that referenced this issue Nov 30, 2024

relates to issue #60 (#62)

916ffc4

* relates to issue #60 * re-order sequence

finanalyst mentioned this issue Dec 30, 2024

suggestions for RakuDoc v3 #57

Open

finanalyst closed this as completed Dec 30, 2024
