Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marked removes non-breaking spaces in the original text #363

Closed
arturi opened this issue Mar 9, 2014 · 31 comments · Fixed by #1532
Closed

Marked removes non-breaking spaces in the original text #363

arturi opened this issue Mar 9, 2014 · 31 comments · Fixed by #1532
Labels
has PR The issue has a Pull Request associated proposal

Comments

@arturi
Copy link

arturi commented Mar 9, 2014

Consider, say, a text like this:
2 125 euro
I've put a non-breaking space between 2 and 125 so that it would always end up on the same line.

Marked pre-parses the text and completely removes the original non-breaking characters that I've put there:

Lexer.prototype.lex = function(src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u00a0/g, ' ')
    .replace(/\u2424/g, '\n');
  return this.token(src, true);
};

This is where the devil hides: .replace(/\u00a0/g, ' ')

Here is more on why invisible non-breaking space characters are cool: http://destroytoday.com/findings/fix-widows-with-non-breaking-spaces/

@christopherscott
Copy link

+1, if anything this should be an option, or configurable

@daleconboy
Copy link

Yes! I've recently lost a few hours of my life tracking this very same thing down.

@chjj
Copy link
Member

chjj commented Mar 13, 2014

@daleconboy, I'm sorry to hear that, but many people lost several hours of their lives trying to figure out why their spaces weren't getting processed correctly when text was passed in from the DOM (see #52 - cc @OscarGodson), which is why this was added in the first place.

I'll consider adding an option, but I want to keep their removal the default since more people probably get bit by this "feature" of contenteditable elements than not.

@daleconboy
Copy link

Hey, thanks for the response. I definitely sympathize with anyone who's been bitten by this quirk in any way, however I would argue against the default being wholesale replacement of non-breaking spaces.

Reason being, it's not a bug with marked, but rather a browser behavior which shouldn't be the responsibility of marked to manage. Technically the responsibility should fall on the developer who's using the contenteditable elements to be aware of the quirk and to manage the white space handling, or conversion, on their end.

The W3C working draft specifically calls this out to authors working with contenteditable elements:

http://www.w3.org/TR/html51/editing.html#best-practices-for-in-page-editors

Authors are encouraged to set the 'white-space' property on editing hosts and on markup that was originally created through these editing mechanisms to the value 'pre-wrap' …

It seems that with contenteditable regions expected to behave in this way, you would want to preserve their expected behavior by default to avoid confusion. This, in turn, would also avoid the confusion where devs are expecting their explicitly set non-breaking spaces to behave as expected.

And, since marked may also be used in a node environment where contenteditabe does not exist, this replacement behavior by default would be unexpected.

Bottom line, I appreciate you considering it as an option. How you decide to set the default behavior is of course up to you. Any option is definitely better than no option. I'll cast my vote for the default being no replacement. :)

Cheers!

@drscannell
Copy link

@daleconboy's argument is pretty convincing. Are there other use cases for no-break spaces in markdown input? I would think a set of tests would help define the severity of the issue.

@OscarGodson
Copy link

@daleconboy I like your point about the browser, except, in @arturi's post he specifically points out that spaces are good to fix a browser bug haha :) Also, i wouldn't agree that it's a browser issue. Markdown's "spec" doesn't say which kind of spaces are and aren't allowed so IMO Marked, and any markdown parser, should assume all spaces (nbsp, unicode, etc) should be considered what they are: spaces. Your suggestion, unless im misunderstanding it, is wanted to specifically ignore certain kinds of spaces.

@scy
Copy link

scy commented Apr 8, 2014

I’m working on a Markdown-based presentation tool, and I’m using marked to generate HTML.

Having control over when and where text wraps is vital in a good presentation. Currently, the only way I can do that with marked is by overriding the lexer with a custom one that does the same things as the original one, except for the NBSP replacement. This is of course far from future-proof: In case the original lexer changes, I have to adapt my code.

Therefore I’m very much in favor of making this configurable. If you’re interested in a PR, let us know. And although I think that not replacing the NBSPs is the “right” thing to do, I can understand that you don’t want to break existing code that relies on marked fixing the browser behavior. So, I don’t care what the default for this option is, but please introduce one.

@Lendar
Copy link

Lendar commented Jun 17, 2015

@scy suggestion is nice. Let me extend it with an example. It might be helpful for future readers...

import marked from 'marked';

// monkey patch for marked 0.3.3 to preserve non-breaking spaces
marked.Lexer.prototype.lex = function (src) {
  src = src
    .replace(/\r\n|\r/g, '\n')
    .replace(/\t/g, '    ')
    .replace(/\u2424/g, '\n');

  return this.token(src, true);
};

UPDATED 2017-05-11: fixed syntax

@RichardForrester
Copy link

Is anyone aware of an option or a work around for this issue?

There should definitely be an option to allow non-breaking spaces to pass to the HTML.

@arturi
Copy link
Author

arturi commented Feb 15, 2016

I’ve solved it by extending lex, as shown here: #363 (comment).

@deanvaessen
Copy link

Hello everyone,

@Lendar 's suggestion is wonderful and if I edit this in the source code I can fix it this way. However putting it straight into my own code complains about 'this.token' not being a function. What is the best way to implement this?

Cheers!

@davidchambers
Copy link

I imagine the problem, @deanvaessen, is the arrow function. Try replacing (src) => { ... } with function(src) { ... }. :)

@deanvaessen
Copy link

Confirmed. Thank you @davidchambers :)

@Lendar
Copy link

Lendar commented May 11, 2017

@davidchambers @deanvaessen surprised it's still relevant. Updated the example in the comment ⬆️

@yurikhan
Copy link

I am in the same boat as @scy — working on a Markdown-based presentation tool. I, too, want to control where lines break and where they never break. Please make an option that stops breaking non-breaking spaces.

As for browsers and/or WYSIWYG editors inserting non-breaking spaces where not expicitly requested by the user, that’s their bugs and should be fixed there.

@ArTiSTiX
Copy link

ArTiSTiX commented Oct 9, 2017

Up ?
Sometimes, non-breaking spaces are needed by the language (e.g. in French, https://fr.wikipedia.org/wiki/Espace_ins%C3%A9cable, non-breaking spaces are necessary before '?', ':', '!', ';', thousand separators, phone numbers, and i also use then between quotes and where line-breaking should be avoided like brand names).

Thus, there is no reason for removing them (i would say non-breaking spaces should not be interpreted as syntax spaces).

PS: there is also no reason for anyone to monkey-patch marked. But it's a bit annoying to always work with minor-fix forks.

ArTiSTiX added a commit to habx/marked that referenced this issue Oct 9, 2017
ArTiSTiX added a commit to habx/marked that referenced this issue Oct 9, 2017
@oliviertassinari
Copy link

@joshbruce
Copy link
Member

Closing as having a fix or workaround as the Marked library proper figures its life out. :)

@oliviertassinari
Copy link

@joshbruce So it's a won't fix.

@joshbruce
Copy link
Member

joshbruce commented Feb 11, 2018

@oliviertassinari: At this juncture I'm siding with @chjj on this one (#363 (comment)). See #956 as well:

  1. XSS fixes were the focus.
  2. Fixing known issues and complying with CommonMark and GFM are next; so, it's on our radar (@Feder1co5oave and @UziTech) as the spec does see it as a part of mixed content: http://spec.commonmark.org/0.28/#example-302 - see also Release 0.3.9 #958 (if requested to be reopened by the primary contributors at this time, it will be)

I guess what I'm saying is, right now we have bigger fish to fry and it seems like there is a viable workaround in the meantime. Does that help?

@joshbruce
Copy link
Member

Note: This only applies to explicit   inclusion in the Markdown.

@oliviertassinari
Copy link

@joshbruce Thanks for the extra details. I wasn't sure what was the implication of the first answer.

@joshbruce
Copy link
Member

@oliviertassinari: Fair. And sorry for not providing more - was in a rush going through issues. :)

@Feder1co5oave
Copy link
Contributor

What wrote Christopher about people getting bit by this is at least debatable.
I also came across an example in commonmark where a non-breaking space changed the interpretation behaviour because it usually isn't allowed to be used in place of a single space. So if we want to comply I think we need to at least consider this.
I never used one but I guess some people use it so I see no point in replacing them altogether with single spaces.
However, if we merge this we must require it to be tested properly.

@joshbruce
Copy link
Member

@Feder1co5oave: Reopen or no?

Again, I'm not sure if the original ticket was referring to the html encoded   or a unicode character discovery - not the same thing in my book. As a user, I would expect the   to be preserved, but not necessarily a special character injection of UTF-8 or something similar...am I wrong there. I concur that Chris's assessment is debatable.

Leave it to you, brother.

@Feder1co5oave
Copy link
Contributor

Feder1co5oave commented Feb 11, 2018 via email

@joshbruce
Copy link
Member

All right. Leaving closed for now. Flagging with newly minted #1048 for when we're ready to focus there. This could also explain the Chinese character problems with header ids, yeah?

@Feder1co5oave
Copy link
Contributor

Feder1co5oave commented Feb 11, 2018 via email

@Feder1co5oave Feder1co5oave reopened this Feb 23, 2018
@Feder1co5oave Feder1co5oave added this to the 0.5.0 - Architecture and extensibility milestone Feb 23, 2018
@Feder1co5oave
Copy link
Contributor

Tagging as #1048

@joshbruce
Copy link
Member

Tagging #1043 as well, just because of the "header ids" comment.

@joshbruce joshbruce modified the milestones: 0.5.0 - Architecture and extensibility, v1.x - All the nope release Apr 4, 2018
@UziTech
Copy link
Member

UziTech commented Dec 5, 2018

related pr #897

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has PR The issue has a Pull Request associated proposal
Projects
None yet
Development

Successfully merging a pull request may close this issue.