Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor TOC sanitation #1441

Merged
merged 18 commits into from
Mar 8, 2024
Merged

Refactor TOC sanitation #1441

merged 18 commits into from
Mar 8, 2024

Conversation

waylan
Copy link
Member

@waylan waylan commented Feb 9, 2024

  • All postprocessors are run on heading content (not just RawHtmlPostprocessor).
  • Footnote references are stripped from heading content. Fixes Footnote ref number in TOC #660.
  • A more robust striptags is provided to convert headings to plain text. Unlike, markupsafe's implementation, HTML entities are not unescaped.
  • Both the plain text name and rich html and unescaped raw data-toc-label are saved to toc_tokens, which means users can now access the full rich text content of the headings directly from the toc_tokens.
  • data-toc-label is sanitized separate from heading content.
  • A html.unescape call added to slugify and slugify_unicode, which ensures slugify operates on Unicode characters, rather than HTML entities. By including in the functions, users can override with their own slugify functions if they desire. A html.unescape call is made just prior to calling slugify so that slugify only operates on Unicode characters. Note that html.unescape is not run on the name or html.

Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new html attribute of the toc_tokens was added). A refactoring of the tests will be in a separate commit.

- All postprocessors are run on heading content (not just
  `RawHtmlPostprocessor`).
- Footnote references are stripped from heading content. Fixes Python-Markdown#660.
- A more robust `striptags` is provided to convert headings to plain text.
  Unlike, markupsafe's implementation, HTML entities are not unescaped.
- Both the plain text `name` and rich `html` are saved to `toc_tokens`,
  which means users can now access the full rich text content of the
  headings directly from the `toc_tokens`.
- `data-toc-label` is sanitized separate from heading content.
- A `html.unescape` call added to `slugify` and `slugify_unicode`, which
  ensures `slugify` operates on Unicode characters, rather than HTML
  entities. By including in the functions, users can override with their
  own slugify functions if they desire.

Note that this first commit includes minimal changes to the tests to show
very little change in behavior (mostly the new `html` attribute of the
`toc_tokens` was added). A refactoring of the tests will be in a separate
commit.
Copy link
Contributor

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good direction to move in :)

I did a cursory check whether the change affects performance on a large MkDocs site, and seems it doesn't!

markdown/extensions/toc.py Show resolved Hide resolved
markdown/extensions/toc.py Show resolved Hide resolved
markdown/extensions/toc.py Show resolved Hide resolved
@waylan
Copy link
Member Author

waylan commented Feb 9, 2024

In my initial commit, I deleted a couple public functions which are no longer needed. I probably should have left them in and marked them as deprecated instead. Do we need to do that? Are others calling them?

Also, it occurs to me that these changes could result in various slugs being different, which would break existing links out in the wild. True, none of the slugs in our own tests are changed, but by fully rendering the HTML before generating a slug could result in additional/different content from any third-party extensions that make use of postprocessors.

tests/test_extensions.py Outdated Show resolved Hide resolved
@vedranmiletic
Copy link
Contributor

FWIW, I installed this using

pip install git+https://github.com/Python-Markdown/markdown.git@refs/pull/1441/head

and tested it on my group's website page that had issues, angle brackets are removed correctly and the e-mails show up nicely. Not sure I can help with actual code review, but from a user's perspective, this fixes the issue.

markdown/extensions/toc.py Outdated Show resolved Hide resolved
markdown/extensions/toc.py Outdated Show resolved Hide resolved
markdown/extensions/toc.py Outdated Show resolved Hide resolved
Copy link
Contributor

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good. This will be nice :)

@waylan waylan marked this pull request as ready for review March 7, 2024 18:29
@waylan
Copy link
Member Author

waylan commented Mar 7, 2024

I finished updating the tests and documentation. I think this is ready to go.

@waylan waylan merged commit e4ab4a6 into Python-Markdown:master Mar 8, 2024
16 of 17 checks passed
@waylan waylan deleted the toc branch March 8, 2024 14:05
@waylan
Copy link
Member Author

waylan commented Mar 12, 2024

@oprypin @pawamoy I wanted to check with you before I did a release of this. As this is a change in behavior that affects your projects (in that a bug was fixed which allowed markup to be passed through to toc_token['name'] in certain circumstance), I wanted to make sure you where ready for the release before I publish it.

Presumably at a minimum, MkDocs will need to do a release which includes mkdocs/mkdocs#3578 before I do my release. And I am assuming mkdostrings will want to do a release which takes advantage of mkdocs/mkdocs#3578 as well.

For the record, I intend to release this as version 3.6, so any earlier releases of your projects will likely want to be constrained to markdown<3.6 for any users who want markup in their toc token names.

@pawamoy
Copy link
Contributor

pawamoy commented Mar 12, 2024

Thanks for the heads up. I'm not ready for this. I'll start by adding an upper bound in mkdocstrings and publish a new release. I should be able to do that today, but if not, before the end of the week for sure 🙂 I'll report back once it's done!

@oprypin
Copy link
Contributor

oprypin commented Mar 12, 2024

I think actually there's no direct interaction with MkDocs, the release can be made here first. Extracting the primary heading has a separate implementation in MkDocs and extracting secondary headings will just change for the better with this release but still nothing to be done in MkDocs

@waylan
Copy link
Member Author

waylan commented Mar 13, 2024

Well, there definitely is interaction with mkdocstrings. According to my own testing, the show_symbol_type_toc feature breaks with this change. Presumably, mkdocstings will need to retrieve the value from token['data-toc-label'] rather than token['name'] as it does now. Therefore, at a minimum, I'll wait for an indication that mkdocstrings is ready before doing a release.

@pawamoy
Copy link
Contributor

pawamoy commented Mar 13, 2024

I've published a new version of mkdocstrings-python with an upper bound on Python-Markdown 3.6, you can release it :)

@pawamoy
Copy link
Contributor

pawamoy commented Mar 25, 2024

Hello again! Today someone reported that we should support version 3.6, so I made time to work on this. And it turns out I can do this very simple thing:

class _TocLabelsTreeProcessor(Treeprocessor):
    def run(self, root: Element) -> None:  # noqa: ARG002
        self._override_toc_labels(self.md.toc_tokens)  # type: ignore[attr-defined]

    def _override_toc_labels(self, tokens: list) -> None:
        for token in tokens:
            if token["name"] != token["data-toc-label"]:
                token["name"] = token["data-toc-label"]
                self._override_toc_labels(token["children"])

...and register this tree processor in our outer MkdocstringsExtension:

        md.treeprocessors.register(
            _TocLabelsTreeProcessor(md),
            "mkdocstrings_post_toc_labels",
            priority=4.5,  # Right after 'toc'.
        )

I suppose it's a bit brutal and instead of blindly overwriting names with toc-labels, I should add a way to detect when it comes from mkdocstrings and overwrite only those? For example, if token["name"] != token["data-toc-label"] and token["data-toc-label"].startswith("<code")?

Otherwise it means I'm just removing the additional safety that was brought with this PR, right? Is this really bad or is there no practical impact 🤔?

@waylan
Copy link
Member Author

waylan commented Mar 26, 2024

The primary concern for MkDocs is that the page title should not ever contain any markup (to avoid markup in <title> etc). As the page title comes from token['name'], then by extension, token['name'] should never contain any markup... but that only really matters for the top <h1> in a page. That said, in Markdown's own documentation, mkdocstrings does set a value to data-toc-label for the top <h1> on each page it generates.

So, yes, to avoid breaking themes (by causing them to generate invalid HTML), token['name'] should never contain any markup. The correct fix is to pull the value from token['data-toc-label'] (and why I added it). However, that would require an edit to theme templates when building the TOC. And that is something theme devs need to do, unless you can provide a way to override that block of a theme. IIRC, there is not a straightforward way for an extension to do that (it interferes with a user's overrides).

So, in the end, we have a chicken and egg problem. Theme's need to support token['data-toc-label'] directly when building a page's toc, but they are not likely to do that unless users and/or extensions are assigning values to token['data-toc-label'] which contain markup.

@waylan
Copy link
Member Author

waylan commented Mar 26, 2024

Two other factors come into play,

  1. For a theme to access token['data-toc-label'], then MkDocs needs to pass that on. I don't think it does (needs confirmation).
  2. In Full-featured text extraction from the first H1 heading mkdocs/mkdocs#3578, the same sanitation code is used by MkDocs to sanitize page titles. Therefore, assigning markup to `token['name'] may not be a problem after all as MkDocs will sanitize it before setting the value to the page title.

In other words, at this time, your suggested solution may actually be the only workable one.

@pawamoy
Copy link
Contributor

pawamoy commented Mar 26, 2024

Thank you!

I'm not sure to understand why there is a data-toc-label attribute if it is only use to override the item name which is otherwise derived from the heading's text/id 🤔 I thought data-toc-label had exactly this use of providing an alternate label for the table of contents (label as, displayable text that doesn't affect anything else). But from your answer I gather that themes actually never used this attribute and always relied on the name itself, and it only seemed to work as I expected it to work because the toc-label was not sanitized properly. Now that it is, it unveiled the true inner working of themes, which is to use the name directly. Makes sense!

Anyway, I'll go with the solution above then, thanks! 🙂

@waylan
Copy link
Member Author

waylan commented Mar 26, 2024

Prior to this present change, the assumption was that both name and data-toc-label never contained markup. It was a weird anomaly that this ever worked for you. Therefore, with this change only sanitized content of data-toc-label gets copied to name (no markup is preserved as name still never has markup). The thinking is that if you want the raw content of data-toc-label then you need to access it directly from data-toc-label. In other words, with this change, we are officially supporting using markup in data-toc-label but to access that markup you can't get if from name; you have to get it from data-toc-label directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Footnote ref number in TOC
4 participants