-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for preserving non-semantic whitespace and softbreaks #361
Comments
Turndown just follows the common HTML whitespace interpretation. Why do you think it is improper? The thing is that whitespace collapsing behaves differently in Markdown and HTML and there is no easy universal way to transfer arbitrary HTML whitespace syntax to Markdown. So we at least try hard to convert it to a semantic equivalent. Will post more details to your PR. What is your use case for this? |
The HTML standard introduce the possibility to avoid multiple whitespace trimming using My use case is the ability to render markdown code from an I need to use turndown not as a markdown library (that render markdown content to html) but as a html to markdown converter. So the conversion, with options, should be able to convert exactly what is rendered inside the HTML, not applying custom rules or interpretation of the content. |
I know this isn't the best solution, but I normalize the target element text nodes whitespaces before getting the markdown, and them back again to plain white spaces export function getTextNodesIn(el: HTMLElement | Text | Node) {
const textNodes: Text[] = []
if (el.nodeType === 3) {
textNodes.push(el as Text)
} else {
const children = el.childNodes
for (let i in children) {
textNodes.push(...getTextNodesIn(children[i]))
}
}
return textNodes
}
// @NOTE: Convert white space to before getting markdown, since turndown ignores multiple spaces
function normalizeWhiteSpace(el: HTMLElement) {
const target = el.cloneNode(true) as HTMLElement
target.normalize()
const nodes = getTextNodesIn(target)
for (let i in nodes) {
if (nodes[i].textContent) nodes[i].textContent = nodes[i].textContent?.replace(/ /g, '\u00a0') || ''
}
return target
}
const endSpaceRe = new RegExp('\u00a0', 'g')
const normalizedTarget = normalizeWhiteSpace(target)
let result = service.turndown(normalizedTarget).replace(endSpaceRe, ' ') |
Indeed. But I came across just single worthwhile use case so far - the already mentioned preformatted inline code (#318, PR #319).
As I mentioned at your PR, it is not achievable in general. E.g. you can't mirror two consecutive newlines as softbreaks to Markdown, as it will become a paragraph, etc. I can imagine preserving whitespace runs (except newlines, i.e. And a related thing, which can be reasonably demanded in relation to this, is somehow preserving softbreaks. That would be much more tricky. :/ The issue is that whitespace collapsing is deep inside the Turndown's core and we should be pretty sure that we don't break anything, even though it is configuration based. So your placeholder pre/postprocessing for such special use cases[*] is probably a good workaround for now. Some suggestions related to your code:
[*] Don't like when maintainers tell contributors their use case is special. But I wish that the developers who enabled spacebar-formatting in MS Word were never born :D And seriously - do you really use I've also realized that @domchristie hasn't yet released the latest master, so you can check it out manually - there is the improved Unicode whitespace treatment and the |
Okay, got it. To give you a real use case, notion.so created a wysiwyg editor with only |
@elias94 I actually thought the @bruno2ms code snippet was yours, sorry about confusing the reply. Bruno's approach can be used with the tweaks I've described. I've reopened the issue, as it is legit. It's more about benefit vs. introduced complexity and priority indeed. Can I see the notion.so in action eventually? Internal remark: the most tangible reason for legitimity is probably backtranslation preserving. For whitespace and softbreaks that are actually rendered by a CommonMark renderer from pure-Markdown source, it makes sense to have opportunity to convert them back. |
Hello @martincizek, first, kudos for Turndown, it's an amazing tool, and I really appreciate your effort in getting things done right without introducing unwarranted bugs. With that said, I'd like to add my case for reviving the feature that @elias94 tried to implement with #362. We're building a new text editor based on Tiptap, but we need to store Markdown, and this is where Turndown comes into play for us. But we are not only building a rich-text editor, but also an augmented plain-text editor. Before I get into that last part, let me just state that Turndown works perfectly for the rich-text editor mode, it takes the HTML generated by Tiptap, and with a few tweaks to some rules, we get the Markdown output we need. Now, what's this "augmented plain-text editor" thing? It's still a Tiptap-powered editor but only with minimal extensions to provide a plain-text mode, but it also includes custom extensions tailored to our needs (for instance, user mentions). In other words, this editor only allows for a single root As you might've guessed by now, this plain-text editor is meant to be an augmented Markdown-based editor (similar to the GitHub editor we can see right here in the comment box). The problem? A Markdown editor cannot collapse whitespace (think of nested lists). In a nutshell, Turndown is working great for our rich-text editor, but not so much for our plain-text Markdown editor. The option @elias94 introduced would've worked nicely for us, and I'm here requesting if something like that can be added (maybe as an advanced/hidden option, or some sort of hook where we can tweak the collapse whitespace regex/function?). I understand your concerns about adding it as it was presented in the PR, but I don't think those concerns would apply to our plain-text Markdown editor (please correct me if I'm wrong). To finish, because this is a dealbreaker for us, I had to introduce a similar workaround to #361 (comment). I'm passing an HTML string to the // Convert all instances of at least 2 standard space characters to
const htmlInput = input.replace(/ {2,}/g, (m) => m.replace(/ /g, '\u00a0'))
// Convert all characters back to standard spaces
const result turndownService.turndown(htmlInput).replace(/\u00a0/g, ' ') I haven't tested this extensively yet, but it looks to be working for now. Nonetheless, would love your feedback on this. |
Problem
Using the library I noted that the content of the markdown is not dealing properly with multiples whitespaces.
For example
converts to
not respecting the content of the html element.
Replicable example
https://codesandbox.io/s/markdown-to-jsxturndown-1yrfn?file=/src/index.js
I've used
markdown-to-jsx
to convert from markdown to html and back usingturndown
but the result is not always the same. In the example, I've used the css rulewhite-space: pre-wrap;
that helps React to preserve multiple whitespaces with jsx.Possible solution
Using an option as
collapseMultipleWhitespaces: false
that allow the library to disable the collapsing of series of multiple whitespaces, and eventually reduce to one whitespace for leading and trailing.
See related PR #362
The text was updated successfully, but these errors were encountered: