Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new TokeniseOption EnsureLF #336

Merged
merged 2 commits into from
Mar 4, 2020

Conversation

satotake
Copy link
Contributor

@satotake satotake commented Feb 23, 2020

ref #329
related gohugoio/hugo#6596

I have been wondering if it is better to add this functionality to chroma in itself or chroma's client (such as Hugo) because it will be a breaking change somehow.

Walking around chroma's source code and playing with your online demo (it always uses \n and works fine), I am inclined to implement this functionality on chroma rather than on its client because I assume that logics behind chroma expect \n as EOL .

So I added EnsureLF as TokeniseOption and the default values is true.

Any comments are welcome.

@chmike
Copy link

chmike commented Mar 3, 2020

Could we replace the instructions

text = strings.ReplaceAll(text, "\r\n", "\n")
text = strings.ReplaceAll(text, "\r", "\n")

with a call to this function which will be more efficient

func replaceCRNLWithNLAndCRxWithNL(text string) string {
	buf := make([]byte,len(text))
	var j int
	for i := 0; i < len(text); i++ {
		c := text[i]
		if c == '\r' {
			if i < len(text)-1 && text[i+1] == '\n' {
				continue
			}
			c = '\n'
		}
		buf[j] = c
		j++
	}
	return string(buf[:j])	
}

I tested this function with the following code. You may use it for the test function.

func main() {
	tests := []struct{in,out string}{
	        {in:"", out:""},
		{in:"abc", out:"abc"},
		{in:"\r", out:"\n"},
		{in:"a\r", out:"a\n"},
		{in:"\rb", out:"\nb"},
		{in:"a\rb", out:"a\nb"},
		{in:"\r\n", out: "\n"},
		{in:"a\r\n", out:"a\n"},
		{in:"\r\nb", out:"\nb"},
		{in:"a\r\nb", out:"a\nb"},
		{in:"\r\r\r\n\r",out:"\n\n\n\n"},
	}
	for _,test := range tests {
		out := replaceCRNLWithNLAndCRxWithNL(test.in)
		if out != test.out {
			fmt.Printf("error: got %v, expected %v for %v\n", []byte(out), []byte(test.out), []byte(test.in))
		}
	}
}

See and test it in the playground: https://play.golang.org/p/t4-rE3grHH8

@satotake
Copy link
Contributor Author

satotake commented Mar 3, 2020

Confirmed. Your method is x2-x3 times faster than mine.
Thank you, @chmike .

❯ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/alecthomas/chroma/quick
BenchmarkReplaceCRNLWithNLAndCRxWithNL-4   	 5880343	       205 ns/op
BenchmarkMyReplace-4                       	 2135847	       576 ns/op
PASS
ok  	github.com/alecthomas/chroma/quick	4.255s

func ensureLF(text string) string {
buf := make([]byte, len(text))
var j int
for i := 0; i < len(text); i++ {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is unicode safe is it? If a unicode code point contains \r\n this will mangle the codepoint.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this would iterate over the bytes (uint8s). We have to use a for range loop here, so we iterate over the individual runes.

@muesli
Copy link

muesli commented Mar 4, 2020

Here's a version which operates on the runes directly. It should be unicode-safe, but I didn't benchmark it yet.

func replaceCRNLWithNLAndCRxWithNL(text string) string {
	rs := []rune(text)
	buf := make([]rune, len(rs))
	var j int
	for i, c := range rs {
		if c == '\r' {
			if i < len(rs)-1 && rs[i+1] == '\n' {
				continue
			}
			c = '\n'
		}
		buf[j] = c
		j++
	}
	return string(buf[:j])
}

@chmike
Copy link

chmike commented Mar 4, 2020

ASCII characters are single byte unicode code points. Multi-byte code points have at least one of the two most significant bits set to one in all their bytes. There is no possible confusion with ASCII control characters. Working with runes will be less efficient and it's not necessary.

@muesli
Copy link

muesli commented Mar 4, 2020

Multi-byte code points have at least one of the two most significant bits set to one in all their bytes.

Good point!

@alecthomas alecthomas merged commit 34d9c71 into alecthomas:master Mar 4, 2020
mrsdizzie pushed a commit to mrsdizzie/chroma that referenced this pull request Jul 15, 2020
* Add new TokeniseOption EnsureLF

ref alecthomas#329

* Use efficient process suggested by @chmike
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants