
RFE: streaming input processing #501

Closed

scop opened this issue May 9, 2021 · 5 comments

@scop (Contributor) commented May 9, 2021

What problem does this feature solve?

Currently chroma seems to require the entire source contents to be in memory before processing, which is not memory efficient and limits how large a file can be processed.

What feature do you propose?

Process input as a stream instead of requiring a string: accept a reader and process it in, say, 1024 or 2048 byte chunks.

With --fail, the CLI could then exit early if a lexer cannot be determined after examining just the first chunk, or whatever small amount of input is deemed enough.
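
Roughly the shape I have in mind, as a hypothetical Go sketch (this is not chroma's actual API; detectLexer is a stand-in for whatever analysis would run on the first chunk):

package main

import (
	"fmt"
	"io"
	"os"
)

// detectLexer is a placeholder: real detection would inspect the sample's content.
func detectLexer(sample []byte) bool {
	return len(sample) > 0
}

func main() {
	// Read only the first chunk before deciding anything.
	buf := make([]byte, 2048)
	n, err := io.ReadFull(os.Stdin, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !detectLexer(buf[:n]) {
		os.Exit(1) // what --fail would do, without slurping the rest
	}
	// ... hand buf[:n] to the lexer, then keep feeding it chunk by chunk ...
}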

CLI case/consideration

A particular problem for the CLI when used as a less preprocessor is that it's not that uncommon to pass e.g. large compressed archives to less, and lesspipe or the like will then output a listing of the archive's contents. Passing large compressed archives to the chroma CLI doesn't do anything useful, and the newly added --fail flag doesn't help, because the whole input is read before that check gets a chance to kick in.

Incidentally, the current slurping behavior is why I initially didn't document the specifics of how to use the CLI as a less preprocessor when implementing --fail, but I failed to remember that when asked to document it. Currently I'm using this as my ~/.lessfilter to avoid hitting the issue:

#!/bin/sh
set -eu
for source in "$@"; do
	# Don't feed "too large" files to chroma (it slurps them), nor ones we know it doesn't handle (e.g. binary)
	if [ "$(stat --format=%s "$source")" -gt 100000000 ] || ! grep -qFI "" "$source"; then
		exit 1
	fi
	chroma --formatter=terminal16m --style=dracula "$source"
done
@alecthomas (Owner)

To be honest, your use case seems very niche for a potentially large amount of work. Your workaround seems more practical.

@scop (Contributor, Author) commented May 9, 2021

I'm not sure I agree with very niche, but I guessed it might be a lot of work. FWIW, I can't immediately think of a case for which streaming/chunked processing wouldn't be the right thing to do.

But never mind: I had already started implementing a CLI-only workaround for this, finished it, and opened #502 in case you'd be interested in that. Unfortunately, as expected, it does add a bunch of lines just for this purpose.

If you don't like that approach, do you think we should document the current behavior and perhaps add the above script as an example somewhere instead?

@alecthomas (Owner)

It is definitely niche within the context of what Chroma is used for, which is syntax highlighting source code. It's vanishingly rare for source code to be large enough to be an issue in terms of buffering. For example, the amalgamated sqlite3.c is 8MB, and it can be loaded entirely into RAM and lexed by Chroma without issue.

The issue with buffering in chunks as you suggest is that it is slower in the common case, for a few reasons (a sketch of point 2 follows the list):

  1. Buffering machinery overhead - minimal, it's true, but still overhead.
  2. Retrying tokens - some tokens overflow the buffer (e.g. strings, heredocs) and have to be retried after extending the buffer, potentially several times.
  3. Complexity - there needs to be coupling between the lexer and the input source to deal with extending the buffer.
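
To make point 2 concrete, here is an illustrative stdlib-only Go sketch (not chroma's code): a string token straddles a chunk boundary, so the buffer has to be extended and the match retried each time a new chunk arrives.

package main

import (
	"fmt"
	"regexp"
)

// A string token: an opening quote, anything but quotes, a closing quote.
var stringRe = regexp.MustCompile(`^"[^"]*"`)

func main() {
	// Pretend the input arrives in 8-byte chunks; the string literal
	// spans three of them.
	input := `"a fairly long string" rest`
	var buf []byte
	for off := 0; off < len(input); off += 8 {
		end := off + 8
		if end > len(input) {
			end = len(input)
		}
		buf = append(buf, input[off:end]...)
		loc := stringRe.FindIndex(buf)
		if loc != nil && loc[1] < len(buf) {
			fmt.Printf("matched %q after buffering %d chunks\n", buf[loc[0]:loc[1]], off/8+1)
			return
		}
		// No complete token yet, or the match ends right at the buffer
		// edge and might continue: extend the buffer and retry the match.
	}
}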

Another option would be to use an io.Reader, but dlclark/regexp2 does not support consuming from an io.Reader, and even if it did, it would again be significantly slower. I know this because I've benchmarked the stdlib's regexp.MatchReader() and it is significantly slower than matching on bytes or strings.
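
For anyone who wants to reproduce that comparison, a rough stdlib-only benchmark along the same lines (illustrative only: the pattern and input here are made up, and exact numbers will vary; save as e.g. stream_test.go and run go test -bench=.):

package main

import (
	"regexp"
	"strings"
	"testing"
)

var (
	re    = regexp.MustCompile(`\bfunc\s+\w+\(`)
	input = strings.Repeat("var x = 1\n", 10000) + "func main() {}\n"
)

func BenchmarkMatchString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		re.MatchString(input)
	}
}

func BenchmarkMatchReader(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// strings.Reader implements io.RuneReader, which MatchReader
		// consumes one rune at a time.
		re.MatchReader(strings.NewReader(input))
	}
}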

@alecthomas (Owner)

I think the PR is a good compromise - I don't mind adding this complexity to the command line tool for this purpose.

@scop (Contributor, Author) commented May 10, 2021

Thanks for taking the time to explain.
