parser for """long string""" ? #157

Open
carueda opened this issue Jan 11, 2023 · 2 comments

Comments

@carueda
Contributor

carueda commented Jan 11, 2023

Question: how do I define a parser for arbitrary contents delimited by multi-character markers? As a concrete case, consider Python long strings, e.g.:

"""
some
contents
"""

My case is actually with {{{ and }}} as delimiters, but any hints are appreciated.

@HalosGhost
Collaborator

I would assume you'd have to make it the first rule in your grammar so that it always takes precedence, and then it'd be, more or less, {{{ .+ }}}

@carueda
Contributor Author

carueda commented Jan 11, 2023

Thanks for the hint.

Not as straightforward as I thought. Here's what I've tried:

mpc_re_mode("{{{.+}}}", MPC_RE_MULTILINE | MPC_RE_DOTALL);

The .+ greedily consumes everything, including the closing }}}, so this is not a solution.

mpc_re_mode("{{{[^}]+}}}", MPC_RE_MULTILINE | MPC_RE_DOTALL);

This one works better, but not for contents that include }, which must be allowed, e.g.:

{{{
  foo {
    bar ...
  }
}}}
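
The second attempt can be extended so that one or two } characters are allowed inside the contents as long as they are not part of a closing }}}. The sketch below checks that idea with POSIX regcomp/regexec, chosen only because it runs standalone; whether mpc's own regex engine accepts the same pattern is an assumption that has not been tested here, and match_block_len is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the {{{ ... }}} block that
 * starts at the beginning of `s`, or -1 if there is no such block.
 * Bracket expressions ([{], [}]) are used for the literal braces so the
 * pattern does not depend on how a given engine treats escaped braces. */
int match_block_len(const char *s) {
  /* "{{{", then any run of: a non-'}' character, or one or two '}'
   * followed by a non-'}' character; finally the closing "}}}". */
  static const char *pattern =
      "^[{][{][{]([^}]|[}][^}]|[}][}][^}])*[}][}][}]";
  regex_t re;
  regmatch_t m[1];
  int len = -1;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0) { return -1; }
  if (regexec(&re, s, 1, m, 0) == 0) {
    len = (int)(m[0].rm_eo - m[0].rm_so);
  }
  regfree(&re);
  return len;
}
```

Because the content part can never consume three consecutive } characters, the match always stops at the first }}} even under longest-match rules.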

With combinators, I've tried the following block definition:

  mpc_parser_t *no3b = mpc_not(mpc_string("}}}"), free);

  mpc_parser_t *item = mpc_or(2,
                              mpc_many1(mpcf_strfold, mpc_noneof("}")),
                              mpc_and(2,
                                      mpcf_strfold,
                                      no3b,
                                      mpc_or(2, mpc_string("}}"), mpc_string("}")),
                                      free
                              ));

  mpc_parser_t *block = mpc_and(3, mpcf_strfold,
                                mpc_string("{{{"),
                                mpc_many(mpcf_strfold, item),
                                mpc_string("}}}"),
                                free, free);

I was expecting this to be a solution, but it only works when there is no } embedded in the contents. When there is one, intriguingly, a segmentation fault occurs. Maybe a mistake in the definition?
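
One plausible cause of the crash, offered as a hypothesis rather than a confirmed diagnosis: the mpc README describes mpc_not as returning NULL when it succeeds, and mpcf_strfold appears to call strlen on every value it folds, so the NULL produced by no3b would be dereferenced exactly when the second alternative of item is taken, i.e. when a } appears in the contents. The README also lists mpc_not_lift together with mpcf_ctor_str, which should make the negative lookahead yield an empty string instead. As a standalone illustration, the sketch below shows a NULL-tolerant fold with the same shape as an mpc fold function; the name strfold_skip_null is hypothetical and not part of mpc.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical NULL-tolerant string fold (not part of mpc).
 * Same shape as an mpc fold function: takes n values and returns one.
 * Concatenates the non-NULL strings in xs into a newly allocated string
 * and frees each non-NULL input. NULL entries are skipped.
 * Returns NULL only if the allocation fails. */
void *strfold_skip_null(int n, void **xs) {
  size_t len = 0;
  char *out;
  int i;

  /* First pass: total length of all non-NULL inputs. */
  for (i = 0; i < n; i++) {
    if (xs[i] != NULL) { len += strlen((char *)xs[i]); }
  }

  out = calloc(len + 1, 1); /* zero-filled, so it starts as "" */

  /* Second pass: append and release each non-NULL input. */
  for (i = 0; i < n; i++) {
    if (xs[i] != NULL) {
      if (out != NULL) { strcat(out, (char *)xs[i]); }
      free(xs[i]);
    }
  }
  return out;
}
```

Passing such a function in place of mpcf_strfold in the inner mpc_and would be one way to test the hypothesis; it has not been run against mpc here.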

Now, the following grammar-based block definition seems to work as needed:

  mpc_parser_t *no3b = mpc_new("no3b");
  mpc_parser_t *item = mpc_new("item");
  mpc_parser_t *block = mpc_new("block");

  mpc_define(no3b, mpc_not(mpc_string("}}}"), free));

  mpc_define(item,
             mpca_grammar(MPCA_LANG_WHITESPACE_SENSITIVE,
                          " /[^}]+/  |  ( <no3b> (\"}}\" | \"}\" ) ) ",
                          no3b, NULL));

  mpc_define(block,
             mpca_grammar(MPCA_LANG_WHITESPACE_SENSITIVE,
                          "  \"{{{\" <item>* \"}}}\"  ",
                          item, NULL));

I'll do more testing with this one and will probably go with it.

AFAICT, this grammar-based definition should be basically equivalent to the combinator-only one above.

To summarize the exercise (I'm happy to open separate tickets if that's more convenient):

  • why does the segfault mentioned above occur? (the parser definition seems correct to me)
  • possible additional MPC features (not really sure how difficult to implement):
    • an mpc_until combinator that accepts any input up to the point where the given parser would match.
    • something like " !<parser> ..." to expose the mpc_not combinator at the lang/grammar level.
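
The behaviour asked of the proposed mpc_until can be sketched outside mpc. The following is a minimal plain-C illustration, assuming first-match semantics (the contents end at the first occurrence of the closing delimiter); extract_block is a hypothetical helper, and mpc_until itself is only the name proposed in this ticket, not an existing mpc function as far as this thread shows.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper illustrating "accept anything until <close>".
 * If `s` starts with `open`, returns a malloc'ed copy of the text between
 * `open` and the first following occurrence of `close`.
 * Returns NULL if `s` does not start with `open`, if `close` never
 * appears, or if allocation fails. The caller frees the result. */
char *extract_block(const char *s, const char *open, const char *close) {
  size_t open_len = strlen(open);
  const char *body;
  const char *end;
  size_t body_len;
  char *out;

  if (strncmp(s, open, open_len) != 0) { return NULL; }

  body = s + open_len;
  end = strstr(body, close); /* first occurrence of the terminator */
  if (end == NULL) { return NULL; }

  body_len = (size_t)(end - body);
  out = malloc(body_len + 1);
  if (out == NULL) { return NULL; }

  memcpy(out, body, body_len);
  out[body_len] = '\0';
  return out;
}
```

With "{{{" and "}}}" as delimiters, single or double } characters inside the contents are kept, which matches what the grammar-based block definition above is meant to accept.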

Thanks.
