-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently handle inline elements with spaces #281
Merged
domchristie
merged 1 commit into
mixmark-io:master
from
kevindew:vanishing-or-duplicating-spacing
Aug 23, 2019
Merged
Consistently handle inline elements with spaces #281
domchristie
merged 1 commit into
mixmark-io:master
from
kevindew:vanishing-or-duplicating-spacing
Aug 23, 2019
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This resolves some odd situations that can occur when there are inline elements that contain spaces in sentences. The first situation is when there is an element that includes a space between words, for example 'Test<span> </span>content'. This would previously have produced a two space result: 'Test content' because this element would have matched both leading and trailing whitespace tests. The second situation is when there is an element that includes a space outside the tests, which is the case of a non-breaking space character (unicode U+00A0), then the space is removed. An example of this is 'Test<span> </span>content' which would result in 'Testcontent' as this wouldn't match the tests for leading/trailing whitespace. This resolves these problems by changing the whitespace tests to use \s rather than a subset of space characters (which is consistent with the blank test [1]) and only allows a leading space if the test for both leading and trailing whitespace passes on a blank element. [1]: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L14
kevindew
added a commit
to alphagov/paste-html-to-govspeak
that referenced
this pull request
Apr 4, 2019
This was a rather hard problem to solve, so brace yourself as this message might be complex. We were having cases of HTML pasted from browsers adding span elements in for spaces. So we could end up with HTML like: `Some text<span> </span><a href="">Link</a>` where, depending on the environment, the contents of the span could be `<span> </span>` or `<span> </span>`. In browsers it was the former, whereas we experienced the latter on JSDOM. This causes some poor behaviour in turndown where the presence of a nbsp; character (which has an ASCII code of 160) causes the whole space to be stripped out, resulting in `Some text[Link]()` as output. On the other hand the presence of a normal space (ASCII code 32) causes a different problem of two spaces, resulting in output such as `Some text [Link]()`. This problem is due to Turndowns whitespace rules where they only match a normal space character: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L25-L33 With our change to blankReplacement we can fix the case for (which is the more common on we're witnessing) however we can't easily fix the double space issue. We have left tests for both cases to help any future developers that may venture into this hole. We've raised a PR with turndown to fix both of these issues, mixmark-io/turndown#281, so hopefully in the non-too-distant future we can remove this code.
kevindew
added a commit
to alphagov/paste-html-to-govspeak
that referenced
this pull request
Apr 4, 2019
This was a rather hard problem to solve, so brace yourself as this message might be complex. We were having cases of HTML pasted from browsers adding span elements in for spaces. So we could end up with HTML like: `Some text<span> </span><a href="">Link</a>` where, depending on the environment, the contents of the span could be parsed as `<span> </span>` or `<span> </span>`. In browsers it was the former, whereas we experienced the latter on JSDOM. This causes some poor behaviour in turndown where the presence of a nbsp character (which has an ASCII code of 160) causes the whole space to be stripped out, resulting in `Some text[Link]()` as output. On the other hand the presence of a normal space (ASCII code 32) causes a different problem of two spaces, resulting in output such as `Some text [Link]()`. This problem is due to Turndowns whitespace rules where they only match a normal space character: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L25-L33 With our change to blankReplacement we can fix the case for nbsp (which is the more common on we're witnessing) however we can't easily fix the double space issue. We have left tests for both cases to document this scenario. We've raised a PR with turndown to fix both of these issues, mixmark-io/turndown#281, so hopefully in the non-too-distant future we can remove this code.
Thanks for tracking this down! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This resolves some odd situations that can occur when there are inline
elements that contain spaces in sentences.
The first situation is when there is an element that includes a space
between words, for example
Test<span> </span>content
. This wouldpreviously have produced a two space result:
Test content
becausethis element would have matched both leading and trailing whitespace
tests.
The second situation is when there is an element that includes a space
outside the tests, which is the case of a non-breaking space character
(unicode U+00A0), then the space is removed. An example of this is
Test<span> </span>content
which would result inTestcontent
asthis wouldn't match the tests for leading/trailing whitespace.
This resolves these problems by changing the whitespace tests to use \s
rather than a subset of space characters (which is consistent with the
blank test 1) and only allows a leading space if the test for both
leading and trailing whitespace passes on a blank element.