
RFE: streaming input processing #501

Closed

scop opened this issue May 9, 2021 · 5 comments

@scop (Contributor) commented May 9, 2021

What problem does this feature solve?

Currently chroma seems to require the entire source contents to be in memory before processing, which is not memory efficient and limits how large a file can be processed.

What feature do you propose?

Process input as a stream instead of requiring a string: accept a reader and process it in, say, 1024 or 2048 byte chunks.

With --fail, the CLI could then exit early if a lexer cannot be determined after examining just the first chunk, or whatever small amount of input is deemed enough.
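
Roughly the shape I have in mind, as a hypothetical Go sketch (this is not chroma's actual API; detectLexer is a stand-in for whatever analysis would run on the first chunk):

package main

import (
	"fmt"
	"io"
	"os"
)

// detectLexer is a placeholder: real detection would inspect the sample's content.
func detectLexer(sample []byte) bool {
	return len(sample) > 0
}

func main() {
	// Read only the first chunk before deciding anything.
	buf := make([]byte, 2048)
	n, err := io.ReadFull(os.Stdin, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !detectLexer(buf[:n]) {
		os.Exit(1) // what --fail would do, without slurping the rest
	}
	// ... hand buf[:n] to the lexer, then keep feeding it chunk by chunk ...
}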

CLI case/consideration

A particular problem for the CLI when used as a less preprocessor is that it's not that uncommon to pass e.g. large compressed archives to less, and lesspipe or the like will then output a listing of the archive's contents. Passing large compressed archives to the chroma CLI doesn't do anything useful, and the newly added --fail flag doesn't help, because the whole input is read before that check gets a chance to kick in.

Incidentally, the current slurping behavior is why I initially didn't document the specifics of how to use the CLI as a less preprocessor when implementing --fail, but I failed to remember that when asked to document it. Currently I'm using this as my ~/.lessfilter to avoid hitting the issue:

#!/bin/sh
set -eu
for source in "$@"; do
	# Don't feed "too large" files to chroma (it slurps them), nor ones we know it doesn't handle (e.g. binary)
	if [ "$(stat --format=%s "$source")" -gt 100000000 ] || ! grep -qFI "" "$source"; then
		exit 1
	fi
	chroma --formatter=terminal16m --style=dracula "$source"
done
@alecthomas (Owner)

To be honest, your use case seems very niche for a potentially large amount of work. Your workaround seems more practical.

@scop (Contributor, Author) commented May 9, 2021

I'm not sure I agree with very niche, but I guessed it might be a lot of work. FWIW, I can't immediately think of a case for which streaming/chunked processing wouldn't be the right thing to do.

But never mind: I had already started implementing a CLI-only workaround for this, finished it, and opened #502 in case you'd be interested in that. Unfortunately, as expected, it does add a bunch of lines just for this purpose.

If you don't like that approach, do you think we should document the current behavior and perhaps add the above script as an example somewhere instead?

@alecthomas (Owner)

It is definitely niche within the context of what Chroma is used for, which is syntax highlighting source code. It's vanishingly rare for source code to be large enough to be an issue in terms of buffering. For example, the amalgamated sqlite3.c is 8MB, and it can be loaded entirely into RAM and lexed by Chroma without issue.

The issue with buffering in chunks as you suggest is that it is slower in the common case, for a few reasons (a sketch of point 2 follows the list):

  1. Buffering machinery overhead - minimal, it's true, but still overhead.
  2. Retrying tokens - some tokens overflow the buffer (e.g. strings, heredocs) and have to be retried after extending the buffer, potentially several times.
  3. Complexity - there needs to be coupling between the lexer and the input source to deal with extending the buffer.
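
To make point 2 concrete, here is an illustrative stdlib-only Go sketch (not chroma's code): a string token straddles a chunk boundary, so the buffer has to be extended and the match retried each time a new chunk arrives.

package main

import (
	"fmt"
	"regexp"
)

// A string token: an opening quote, anything but quotes, a closing quote.
var stringRe = regexp.MustCompile(`^"[^"]*"`)

func main() {
	// Pretend the input arrives in 8-byte chunks; the string literal
	// spans three of them.
	input := `"a fairly long string" rest`
	var buf []byte
	for off := 0; off < len(input); off += 8 {
		end := off + 8
		if end > len(input) {
			end = len(input)
		}
		buf = append(buf, input[off:end]...)
		loc := stringRe.FindIndex(buf)
		if loc != nil && loc[1] < len(buf) {
			fmt.Printf("matched %q after buffering %d chunks\n", buf[loc[0]:loc[1]], off/8+1)
			return
		}
		// No complete token yet, or the match ends right at the buffer
		// edge and might continue: extend the buffer and retry the match.
	}
}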

Another option would be to use an io.Reader, but dlclark/regexp2 does not support consuming from an io.Reader, and even if it did, it would again be significantly slower. I know this because I've benchmarked the stdlib's regexp.MatchReader() and it is significantly slower than matching on bytes or strings.
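
For anyone who wants to reproduce that comparison, a rough stdlib-only benchmark along the same lines (illustrative only: the pattern and input here are made up, and exact numbers will vary; save as e.g. stream_test.go and run go test -bench=.):

package main

import (
	"regexp"
	"strings"
	"testing"
)

var (
	re    = regexp.MustCompile(`\bfunc\s+\w+\(`)
	input = strings.Repeat("var x = 1\n", 10000) + "func main() {}\n"
)

func BenchmarkMatchString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		re.MatchString(input)
	}
}

func BenchmarkMatchReader(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// strings.Reader implements io.RuneReader, which MatchReader
		// consumes one rune at a time.
		re.MatchReader(strings.NewReader(input))
	}
}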

@alecthomas (Owner)

I think the PR is a good compromise - I don't mind adding this complexity to the command line tool for this purpose.

@scop (Contributor, Author) commented May 10, 2021

Thanks for taking the time to explain.
