A CommonMark compliant AST format for the MyST specification.
This is some initial work on a specification for the MyST syntax.
The package currently contains functions to:
- Convert CommonMark to mdast, via parsing with Markdown-it
- Convert the mdast to CommonMark-compliant HTML (tested against https://spec.commonmark.org/0.30/spec.json)
$ pip install .
$ myst-spec --help
usage: myst-spec [-h] COMMAND ...
MyST Specification tools.
optional arguments:
-h, --help show this help message and exit
Commands:
to-mdast Convert CommonMark to MDAST JSON.
to-html Convert CommonMark to HTML.
$ echo "hallo" | myst-spec to-mdast
{"type": "root", "children": [{"type": "paragraph", "position": {"start": {"line": 1, "column": 1}, "end": {"line": 2, "column": 1}}, "children": [{"type": "text", "value": "hallo"}]}]}
$ echo "hallo" | myst-spec to-html
<p>hallo</p>
This can then be extended, to include the MyST syntax nodes.
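Because the output of to-mdast is plain JSON, it can be consumed without this package. The following is a minimal sketch, not part of myst-spec: the helper names iter_nodes and plain_text are hypothetical, and the input is simply the JSON printed in the example above.

```python
import json

# JSON as printed by `echo "hallo" | myst-spec to-mdast` in the example above.
MDAST_JSON = (
    '{"type": "root", "children": [{"type": "paragraph", '
    '"position": {"start": {"line": 1, "column": 1}, '
    '"end": {"line": 2, "column": 1}}, '
    '"children": [{"type": "text", "value": "hallo"}]}]}'
)


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first (hypothetical helper)."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def plain_text(node):
    """Concatenate the `value` of every text node in the tree (hypothetical helper)."""
    return "".join(n["value"] for n in iter_nodes(node) if n["type"] == "text")


tree = json.loads(MDAST_JSON)
node_types = [n["type"] for n in iter_nodes(tree)]
text = plain_text(tree)
print(node_types)
print(text)
```

The same traversal would apply unchanged to any extended node types, since it relies only on the type, children and value keys.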
The creation of commonmark-spec represented a great step forward in Markdown standardisation. However, the current specification only specifies the expected HTML output, which conflates two aspects of markup language processing:
- The reading of the source input
- The writing of the output format
Other aspects of Markdown processing would also benefit from a specification of the intermediate AST, such as:
- Output to formats other than HTML
- Syntax highlighting of the source text
- Language Server Protocol integration
This would promote interoperability between different implementations for reading and processing of Markdown.
Note that there is an open issue (#274) suggesting an XML specification, but the discussion has not been revisited since 2017.
- The format should be language agnostic.
- A program written in any programming language should be able to generate the AST, then offload to a different language for processing.
- The format should be extensible.
- The format should allow for new syntax types to be added, and not hard-code to only the CommonMark types.
- Not all processors may be able to handle extended syntax types, but they should be able to "fail gracefully".
- An example of this would be to allow for the GitHub Flavored Markdown extensions
- The AST format should be lossless.
- The AST should be able to be converted back to the source text, without loss of syntax information.
- Note, this does not mean that round-trip conversion should be "byte equivalent", just that re-parsing the regenerated source will produce the same AST again.
- Line/column information, for example, would not be preserved.
- The format should allow incremental parsing.
- This would allow for sub-parsing of a modified document, without having to re-parse the entire document.
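One possible reading of "fail gracefully" in the extensibility goal above is sketched here. This is an illustration under stated assumptions, not the behaviour of this package: the render function and the COMMONMARK_HANDLERS table are hypothetical, and the delete node stands in for an extended syntax type (it is the mdast node type used for GitHub Flavored Markdown strikethrough).

```python
from html import escape


def render_children(node, handlers):
    """Render every child of the node and join the results."""
    return "".join(render(child, handlers) for child in node.get("children", []))


def render(node, handlers):
    """Render a node to HTML, degrading gracefully for unknown node types."""
    handler = handlers.get(node["type"])
    if handler is None:
        # Unknown (extended) node type: drop the wrapper but keep its content,
        # rather than raising an error.
        if "children" in node:
            return render_children(node, handlers)
        return escape(node.get("value", ""))
    return handler(node, handlers)


# Hypothetical handler table covering only a few CommonMark node types.
COMMONMARK_HANDLERS = {
    "root": render_children,
    "paragraph": lambda n, h: f"<p>{render_children(n, h)}</p>\n",
    "text": lambda n, h: escape(n["value"]),
    "emphasis": lambda n, h: f"<em>{render_children(n, h)}</em>",
}

tree = {
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {"type": "text", "value": "a "},
                # Extended node type that the handler table does not know about.
                {"type": "delete", "children": [{"type": "text", "value": "b"}]},
            ],
        }
    ],
}

html = render(tree, COMMONMARK_HANDLERS)
print(html)
```

Falling back to the children is only one design choice; a processor could equally emit a warning or a placeholder, provided that it does not abort on node types it does not recognise.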
Inspiration is also taken from:
- Markdown-it tokens
- Docutils doctrees
- Pandoc JSON AST
- https://microsoft.github.io/language-server-protocol/specifications/specification-current/#textDocuments
- agoose77/jupyterlab-markup#12
Markdown-it-py is used as the parser here, since it is what we currently use for MyST-Parser. It is the best Python Markdown parser I know of:
- It is pure Python
- It is fast
- It is CommonMark compliant
- It captures source line number information
- It is easy to extend by plugins
However, it is not actually the ideal reference implementation, since it does not capture source column position information (currently the column is always set to 1), or specific line information for inline nodes.
Also, the conversion here is not currently supported by the Markdown-it JavaScript implementation, since we utilise the store_labels and inline_definitions options, which are only implemented in markdown-it-py.
- A general issue with CommonMark is that (inline) link/image references are only recognised if the (block level) definitions have already been parsed. This is an issue for incremental parsing, since we would need to parse all the definitions first, if we were to allow them at "any level".
- docutils records the source for every node, since it may be different to the parent document, if using the include directive.
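The ordering problem with link reference definitions noted above can be illustrated with a two-pass sketch: collect all definitions first, then resolve references. This is a deliberately simplified illustration and not how markdown-it-py works; the regular expressions cover only single-line definitions and full or collapsed references, and all function names are hypothetical.

```python
import re

# Simplified patterns (assumptions, far from the full CommonMark rules):
# a definition such as `[label]: https://example.com` on a single line,
# and a reference such as `[text][label]` or `[text][]`.
DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)\s*$")
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def normalize(label):
    """Case-fold the label and collapse internal whitespace for matching."""
    return " ".join(label.split()).casefold()


def collect_definitions(source):
    """First pass: gather every definition, wherever it occurs in the source."""
    definitions = {}
    for line in source.splitlines():
        match = DEFINITION.match(line)
        if match:
            # The first definition of a label takes precedence.
            definitions.setdefault(normalize(match.group(1)), match.group(2))
    return definitions


def resolve_references(source, definitions):
    """Second pass: return (text, url) pairs for references with a known label."""
    resolved = []
    for match in REFERENCE.finditer(source):
        text = match.group(1)
        label = match.group(2) or text  # collapsed reference: `[text][]`
        url = definitions.get(normalize(label))
        if url is not None:
            resolved.append((text, url))
    return resolved


# The reference appears before its definition, so a single pass over the
# inline content could not resolve it.
source = "See [the spec][cm].\n\n[CM]: https://spec.commonmark.org\n"
definitions = collect_definitions(source)
links = resolve_references(source, definitions)
print(links)
```

The first pass must see the whole document, which is why definitions are awkward for incremental parsing: editing one definition can change how references elsewhere in the document resolve.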