maximum line length #129

BurntSushi · 2016-09-28T17:47:19Z

I'd like ripgrep to have the ability to either hide or trim lines that are very long. Some lines take up my entire screen and are borderline useless to look at. It's possible that finding an intelligent way to shorten them would be best, since my guess is that the actual matched text is much smaller than the full line. However, this is harder to implement.

I don't think this should be enabled by default. It seems a little surprising for ripgrep to hide lines like that. In general, I like the work flow of, "run a search, see huge lines, confirm that I don't care about them and run ripgrep again with an option to hide them." It may however be plausible to enable this limit if results are being dumped to a terminal (we already enable colors, line numbers and file headings).

The text was updated successfully, but these errors were encountered:

kaushalmodi · 2016-09-29T00:46:46Z

A very good use case is searching in minified js/css.

BurntSushi · 2016-09-29T00:47:42Z

@kaushalmodi Yup. I'm pretty sure that was precisely the thing I hit. :-)

tjayrush · 2016-10-25T18:01:08Z

I'd rather see a Linux-like approach to this. Pipe the file through something like 'tr' first, then pipe the output through 'cut' Like this: cat file.js | sed 's/;/;\n/' | ripgrep whatever | cut -c1-100

edited: to correct command line to use sed

BurntSushi · 2016-10-25T18:07:43Z

@tjayrush What is the purpose of tr there?

In any case, rg has the opportunity to show better output here, for example, by only showing the match and a few characters around it.

tjayrush · 2016-10-25T18:56:36Z

I meant for 'tr' to convert some character into new lines (I forgot to finish editing my comment) to alleviate the long line problem. If 'rg' can already limit to the context that's better.

The larger point was that instead of adding a new command line option, pipe the output (or input) into another already-existing command line program.

Thomas Jay Rush
jrush@greathill.com

On Oct 25, 2016, at 2:07 PM, Andrew Gallant notifications@github.com wrote:

@tjayrush What is the purpose of tr there?

In any case, rg has the opportunity to show better output here, for example, by only showing the match and a few characters around it.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

BurntSushi · 2016-10-25T21:55:46Z

I think my comment still stands.

Also, rg searches recursively by default. In that case, it doesn't make
sense to modify the input.

I don't think you need to argue in favor of piping. I'm all for it. But it
isn't always sufficient.

On Oct 25, 2016 2:56 PM, "Thomas Jay Rush" notifications@github.com wrote:

I meant for 'tr' to convert some character into new lines (I forgot to
finish editing my comment) to alleviate the long line problem. If 'rg' can
already limit to the context that's better.

The larger point was that instead of adding a new command line option,
pipe the output (or input) into another already-existing command line
program.

Thomas Jay Rush
jrush@greathill.com

On Oct 25, 2016, at 2:07 PM, Andrew Gallant notifications@github.com
wrote:

@tjayrush What is the purpose of tr there?

In any case, rg has the opportunity to show better output here, for
example, by only showing the match and a few characters around it.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#129 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAb34iE21_xLvAOvcbGrT8TAZgCpsleNks5q3lDmgaJpZM4KJHLX
.

BurntSushi · 2016-11-19T15:05:17Z

OK, in the interest of moving forward, here is a proposed specification.

Add a new flag, -M/--max-columns NUM, which is disabled by default. When enabled, lines with more than NUM columns are suppressed from output. Columns are counted as the number of bytes in a line.

I mentioned above it might be cool to only show the context around each match (within a line), but I think that's a complicated feature that deserves its own issue/motivation/specification. I think the above flag solves the most pressing problem.

hpsu · 2016-11-22T09:31:13Z

ag has
-W --width NUM Truncate match lines after NUM characters
but yeah, I would really like this!

forgottenswitch · 2016-11-22T15:52:46Z

Spec

Make length of printed lines limit-able with conventional COLUMNS environment variable.
If the latter is not a parseable integer, query the tty or Console width.

Every output line should consist of a contiguous input.

Amount of context before first match on output line should be equal to that after last.
Add the --min-context NUM parameter which specifies minimal context length.

Print at least one match on every output line.
If contexts of a single match do not fit, decrease the min-context for them.
If a single match itself does not fit, drop its end.

BurntSushi · 2016-11-22T16:48:05Z

If the latter is not a parseable integer, query the tty or Console width.

I would prefer to not add more code that does this to ripgrep. There's already too much of it and it has been a terrible pain to maintain. It's sucked up unbelievable amounts of time. Let's just use COLUMNS and/or a user-providable setting.

Amount of context before first match on output line should be equal to that after last.
Add the --min-context NUM parameter which specifies minimal context length.

How does this interact with character encodings? You can't just arbitrarily cut a byte sequence, since you may wind up cutting a UTF-8 sequence in half, for example. While the input isn't necessarily valid UTF-8, our output better be valid UTF-8 if the input is also valid UTF-8.

Print at least one match on every output line.
If contexts of a single match do not fit, decrease the min-context for them.
If a single match itself does not fit, drop its end.

How does this generalize to multiple matches on the same line?

What about lines emitted from --context/--before-context/--after-context?

forgottenswitch · 2016-11-22T16:58:17Z

Let's just use COLUMNS and/or a user-providable setting.

But what can go wrong? Explicit setting is meant for editor plugins.

How does this generalize to multiple matches on the same line?

This paragraph only applies to the case of the single one (on the output line).

Lines from --context get trimmed at end.

BurntSushi · 2016-11-22T17:02:55Z

But what can go wrong?

I spent my entire weekend dealing with tty stuff on Windows. I don't want to do it again.

forgottenswitch · 2016-11-22T17:03:16Z

It's already been dealt with in mingw-w64, and isn't really rust/ripgrep issue as they do obey the tty api.

The tty fiddling (size/isatty) should be in a stable separate crate.

BurntSushi · 2016-11-22T17:14:16Z

@forgottenswitch ripgrep has to work with MSVC too.

I don't really feel like debating this much further. I'd like to leave it out for now. We can revisit it later.

forgottenswitch · 2016-11-22T17:26:58Z

How does this interact with character encodings?

All printing is done in UTF-8 characters.
The font is assumed to be monospaced, i.e. not to have characters that take up fractional number of columns.
Each incorrect byte is assumed to take up 1 column.
Each character is assumed to take up 1 column, with the exception of Tab taking 8.
(as long as there is no Char.len_in_glyphs; some characters take 0, but ignore it).

BurntSushi · 2016-11-22T17:40:11Z

Consider the following line where foo corresponds to the match:

☃☃☃foo☃☃☃

Its UTF-8 encoding looks like this:

\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83foo\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83

Now consider a --min-context of 2. Does 2 refer to Unicode codepoints? Or bytes? If bytes, you'd end up splitting the snowmen on either side of foo.

If 2 refers to Unicode codepoints, then I guess you also need to consider graphemes? Should we worry about that? And even if you figure all of that out, you have to realize that the input may not be valid UTF-8.

I see, OK, you updated your comment. You want the 2 to refer to Unicode codepoints, and you want each codepoint to be 1 column. I think that's OK, but we probably want to use the unicode-width crate to estimate the display columns of a single codepoint. But it also works with &str, which is nice.

forgottenswitch · 2016-11-22T17:51:53Z

OK, will update the pull when done.

BurntSushi · 2016-11-22T17:57:25Z

Here are some thoughts that I would like you to consider:

I still consider this an incredibly complex feature with lots of edge cases. Could you please write up the full spec in one comment so it's easy to follow? (Editing a prior comment is fine.)
Hacking it into the existing printer without a better abstraction is something I don't want to maintain.
The existing printer, in fact, isn't really something I want to maintain. I would like to refactor it and possibly move it to an external library. I would like you to consider holding off implementing this until then.

forgottenswitch · 2016-11-22T18:15:00Z

Spec

Make length of printed lines limit-able with conventional COLUMNS environment variable.

Every output line consist of a contiguous input.

Amount of context before the first match on output line should be equal to that after last.
Add the --min-context NUM parameter which specifies minimal context length on each side.

Print at least one match on every output line.
If only one match on this output line:

if contexts do not fit, decrease the min-context for them
if the match itself does not fit, drop its end

Lines from --context get trimmed at end.

All printing is done in UTF-8 characters.
The font is assumed to be monospaced, i.e. not to have characters that take up fractional number of columns.
Each incorrect byte is assumed to take up 1 column.
How many columns each character takes is determined with unicode-width crate.

forgottenswitch · 2016-11-22T18:18:10Z

Each incorrect byte is assumed to take up 1 column.
How many columns each character takes is determined with unicode-width crate.

Unclear. Implementing should bring more light.

samuelcolvin · 2017-01-06T14:33:37Z

This would be extremely useful. I've just searched a directory which happened to include a few .min.js files - the output is basically useless.

I agree COLUMNS is the best way to do, preferably without a required switch so things "just work".

A few suggestions:

just truncate lines at COLUMNS x 3 unless a --no-truncate flag is set
show [2 matches found in line of length 12,345,678 characters] instead of showing the line if the line is greater than COLUMNS x 3.
use -uu or -uuu flag. *.min.js or any file with only very long lines can basically be considered equivalent to binary (?), or at least they are in my head.

forgottenswitch · 2017-01-06T17:31:48Z

I agree COLUMNS is the best way to do, preferably without a required switch so things "just work".

They don't show up in env output. Therefore at least export COLUMNS would be required in .bashrc. But they are only needed util there is a (reliable) tty crate.

Also, --min-context should probably be a fraction not to easily exceed COLUMNS/2.

samuelcolvin · 2017-01-06T17:36:26Z

COLUMNS works fine on linux:

~ || echo "number of columns: $COLUMNS"
number of columns: 80

Is anyone serious really using another OS and trying to work in the terminal?

In 2017, I mean really?

forgottenswitch · 2017-01-06T17:36:58Z

$ env | grep COL
$

samuelcolvin · 2017-01-06T17:40:56Z

What OS is that?

env doesn't work for me either, I need to either reference COLUMNS as above or use printenv.

forgottenswitch · 2017-01-06T17:42:31Z

It is linux, and man bash does not say that COLUMNS are exported, only that it is "set by the shell".

samuelcolvin · 2017-01-06T17:43:43Z

ye, i see the problem now, COLUMNS is available in the shell but not in a subprocess.

tty is needed. My mistake, sorry.

BurntSushi · 2017-01-07T15:57:31Z

We absolutely, positively, cannot make it the default. We could make it the default when ripgrep is emitting to a tty (like we've done for colors, line numbers and match groupings). We can't make it the true default because ripgrep (like grep) is useful for filtering data as well, and a tool that automatically drops lines just because they are too long does not sound good to me. I am still weary about making it the default even for just emitting to a tty. I worry that dropping data is a bridge too far. However, we've already kind of crossed that bridge by filtering corpora using .gitignore files.

I would be fine with using --truncate as the name. However, this does imply that some part of the line is still visible instead of being dropped completely, which I imagine is also useful. I'd like to only add a single flag for this.

Finally, when --truncate is given, does it specify bytes, Unicode codepoints or graphemes? There are performance and usability trade offs here. Simply hiding the lines is considerably simpler.

BurntSushi · 2017-01-07T15:59:03Z

@forgottenswitch I would like to move away from "fitting" the lines. I think we should consider two options: simple truncation or complete hiding.

forgottenswitch · 2017-01-07T16:00:23Z

I expected that as the simpler way, but isn't it the #129 (comment) with columns counted as graphemes?

samuelcolvin · 2017-01-07T16:03:41Z

I would say truncation was preferable and "columns" in the terminal (whatever that translates to) would be best but, at least for me.

BurntSushi · 2017-01-07T16:06:16Z

@forgottenswitch I'm sorry, but I don't understand. This comment mentions "fitting" the lines and looking at matches.

What I'm trying to say is this: if we can avoid writing a specification that needs to do something that is encoding aware, then this feature becomes much simpler. The --max-columns flag, which simply drops lines that are too long, achieves this because it uses the length of a line in bytes. The --truncate NUM flag, however, poses interesting difficulties. If NUM is bytes, then it's possible to naively emit invalid UTF-8 even when the input is completely valid UTF-8.

@samuelcolvin "columns" in the terminal is a visual thing. A single column can contain an arbitrary number of bytes because of combining characters in Unicode. Checking if every line satisfies this limit is not feasible because of the performance hit it requires.

BurntSushi · 2017-01-07T16:08:35Z

The reason why I like having --max-columns disabled by default (even for emitting to tty) is that it makes dropping the lines a specifically user driven action. The --truncate NUM approach satisfies this by marking truncatation with ellipsis, but then you wind up with other problems described in my previous comment. We could add a short flag, -M, to make this a bit easier to apply iteratively.

samuelcolvin · 2017-01-07T16:11:54Z

I understand why don't want to apply it by default, fine with me.

For length, I guess just bytes is fine. The primary use case is just to make viewing easier where a line which shows 111 columns instead of 120 is no problem. I'm guessing 99% of ripgrep usage is on ascii code anyway.

BurntSushi · 2017-01-07T16:13:32Z

I'm guessing 99% of ripgrep usage is on ascii code anyway.

@samuelcolvin Possibly. I do track mentions of ripgrep on the Internet, and there seems to be quite a few mentions of it in otherwise Chinese, Russian and Japanese publications.

forgottenswitch · 2017-01-07T16:16:57Z

The --truncate NUM flag, however, poses interesting difficulties. If NUM is bytes, then it's possible to naively emit invalid UTF-8 even when the input is completely valid UTF-8.

Backtrack to beginning of the last character?
In UTF-8, it is "backtrack to last non-10xxxxxx byte".

Also, --truncate[-as-utf] NUM (or -MM) for slow mode, considering graphemes.
But then again, -utf depends on unicode-width.

Fitting was to improve both readability and pager's scrolling of overly long lines
(dozens of screen lines, either from binary or minified/log files).
It probably doesn't matter at all because 1. terminal scroll could be used instead of pager 2. rg is for code search
It would have also be slow by requiring utf parsing.
If grapheme width determination was too slow, maybe a dedicated thread could have been tried.

Maybe ripgrep should just consider any file having a line longer than 256 bytes a binary.

samuelcolvin · 2017-01-07T16:19:36Z

@forgottenswitch you seem determined to make this complicated. :-)

Surely best to start with the simplest and most obvious solution and make it more complicated when required.

utf8 counting would be nice but isn't required; anything else is surely excessive at this stage.

forgottenswitch · 2017-01-08T04:18:11Z

OK, I tried addressing different problem - printing matches even in huge lines, instead of omitting them.

forgottenswitch · 2017-01-26T16:33:16Z

Context before a match could be ellipsized, and total number of matches on a line limited.
Implementation, benchmark.
Incorrect UTF-8 is handled by String::to_utf8_lossy-ing the context.

RalfJung · 2017-02-02T12:37:33Z

I have to say I wouldn't find just dropping lines very useful; I still may want to know that these lines matched, but I also want to be able to see other matches. I hacked something quick-n-dirty together in https://github.com/RalfJung/ripgrep/tree/longlines, but I'm not proposing this as a solution to anything, I was just looking for an excuse to write some Rust code ;)

The simplest thing I can come up with that I'd actually like to use is something where we can set a --max-width or so, and if the line in bytes is longer than that, instead of printing the line, it prints how many matches were found in the line. So nothing would be entirely omitted. Would that be accepted? I'd be happy to give it a try.

Of course, something that actually ellipsizes long lines and prints them as "... context match1 context ... context match2 context ..." would be better, but then as you discussed above things quickly become really complicated because unicode.

BurntSushi · 2017-02-02T12:43:04Z

The simplest thing I can come up with that I'd actually like to use is something where we can set a --max-width or so, and if the line in bytes is longer than that, instead of printing the line, it prints how many matches were found in the line. So nothing would be entirely omitted. Would that be accepted? I'd be happy to give it a try.

@RalfJung I think I like that idea. It would have to be disabled by default though (and perhaps enabled by default when emitting to a tty).

RalfJung · 2017-02-02T14:09:51Z

All right, I prototyped this at RalfJung@9618f87. This is just about the behavior if the option is set (to 80), obviously; what remains to be done is adding the CLI argument, deciding about the default, and wiring the printer to that.

There are some open questions however already on this level:

What to do in "replace" mode? Right now, the option would not affect that mode at all. I think this makes sense. Probably, in replace mode, you actually want to do something with that replaced text.
Dies ripgrep support any form of internationalization? I now plugged an English string right into there.
Bikeshedding about the way the message is displayed. ;)
How to write the newline after the message. I currently just call write_eol, but it seems like it could happen that write_match then calls write_eol again for the same line. What is the best way to avoid that? To avoid duplicate newlines, I could test the exact negative of what write_match tests when writing the "stuff omitted" message, but that sounds like an awful idea. I could move the check performed by write_match into write_matched_line, but that will lead to code duplication. So I guess the cleanest thing to do is to write a little function write_eol_if_missing_from_buf or so, that takes an &[u8] and calls write_eol if that buffer does not end in a newline.
What to do with context lines? Should it just silently not write them (could lead to weird gaps, where some context line is printed and others are not), or print a message saying that a line has been omitted?

EDIT: Slightly updated version at RalfJung@50c07fc

forgottenswitch · 2017-02-02T15:39:44Z

Of course, something that actually ellipsizes long lines and prints them as "... context match1 context ... context match2 context ..." would be better, but then as you discussed above things quickly become really complicated because unicode.

I posted implementation above; it does not seem to be so.

RalfJung · 2017-02-03T10:46:58Z

Wow, impressive! That does look nice indeed. I somehow missed these links before, sorry.

BurntSushi · 2017-02-03T12:03:59Z

I would like to avoid the contextual display. That should be a separate issue. I do not want to bring that code (if at all) until libripgrep is done. The code and the feature are too complex.

@RalfJung I don't have rock solid answers for your questions, but here's an attempt:

If the -r/--replace flag is used, then ignoring --max-width doesn't seem quite right. I think it should still be applied to the result of the replacement.
ripgrep has no internationalization.
We can bikeshed in the PR.
I'm not sure about the eol writing. I'd have to look more closely.
For context lines, I think printing something like "context line omitted because of --max-width" is reasonable.

RalfJung · 2017-02-03T12:13:30Z

If the -r/--replace flag is used, then ignoring --max-width doesn't seem quite right. I think it should still be applied to the result of the replacement.

So should it also still show the number of matches? That'd be extra work, unless there is a way I don't know about to get this from re.replace_all.

BurntSushi · 2017-02-03T12:16:04Z

@RalfJung re.replace_all accepts anything that satisfies the regex::bytes::Replacer trait (including FnMut(&Captures) -> Vec<u8> or your own type), which should enable you to both do the replacement and count the number of times it occurred.

RalfJung · 2017-02-11T17:58:28Z

Right, I could use a closure or implement the trait for a custom type that also does the counting... but that will cost some performance, it seems, since we want to be called for every match and hence cannot use the no_expansion optimization.

This permits setting the maximum line width with respect to the number of bytes in a line. Omitted lines (whether part of a match, replacement or context) are replaced with a message stating that the line was elided. Fixes #129

BurntSushi added the enhancement An enhancement to the functionality of the software. label Sep 28, 2016

BurntSushi added the help wanted Others are encouraged to work on this issue. label Nov 19, 2016

forgottenswitch mentioned this issue Nov 22, 2016

Limit line lengths #245

Closed

BurntSushi added the libripgrep An issue related to modularizing ripgrep into libraries. label Jan 11, 2017

BurntSushi added this to the libripgrep milestone Jan 11, 2017

RalfJung mentioned this issue Feb 11, 2017

Add option --max-columns to truncate long lines #356

Closed

BurntSushi mentioned this issue Feb 15, 2017

option to limit context shown per line: match +- K columns) #365

Closed

BurntSushi mentioned this issue Mar 13, 2017

Add new -M/--max-columns option. #406

Merged

BurntSushi closed this as completed in #406 Mar 13, 2017

maximum line length #129

maximum line length #129

Comments

BurntSushi commented Sep 28, 2016

kaushalmodi commented Sep 29, 2016

BurntSushi commented Sep 29, 2016

tjayrush commented Oct 25, 2016 • edited Loading

BurntSushi commented Oct 25, 2016

tjayrush commented Oct 25, 2016

BurntSushi commented Oct 25, 2016

BurntSushi commented Nov 19, 2016

hpsu commented Nov 22, 2016

forgottenswitch commented Nov 22, 2016 • edited Loading

Spec

BurntSushi commented Nov 22, 2016 • edited Loading

forgottenswitch commented Nov 22, 2016 • edited Loading

BurntSushi commented Nov 22, 2016

forgottenswitch commented Nov 22, 2016 • edited Loading

BurntSushi commented Nov 22, 2016

forgottenswitch commented Nov 22, 2016 • edited Loading

BurntSushi commented Nov 22, 2016

forgottenswitch commented Nov 22, 2016

BurntSushi commented Nov 22, 2016

forgottenswitch commented Nov 22, 2016 • edited Loading

Spec

forgottenswitch commented Nov 22, 2016

samuelcolvin commented Jan 6, 2017

forgottenswitch commented Jan 6, 2017 • edited Loading

samuelcolvin commented Jan 6, 2017

forgottenswitch commented Jan 6, 2017

samuelcolvin commented Jan 6, 2017

forgottenswitch commented Jan 6, 2017

samuelcolvin commented Jan 6, 2017

BurntSushi commented Jan 7, 2017

BurntSushi commented Jan 7, 2017

forgottenswitch commented Jan 7, 2017

samuelcolvin commented Jan 7, 2017

BurntSushi commented Jan 7, 2017

BurntSushi commented Jan 7, 2017 • edited Loading

samuelcolvin commented Jan 7, 2017

BurntSushi commented Jan 7, 2017

forgottenswitch commented Jan 7, 2017 • edited Loading

samuelcolvin commented Jan 7, 2017

forgottenswitch commented Jan 8, 2017

forgottenswitch commented Jan 26, 2017 • edited Loading

RalfJung commented Feb 2, 2017

BurntSushi commented Feb 2, 2017

RalfJung commented Feb 2, 2017 • edited Loading

forgottenswitch commented Feb 2, 2017

RalfJung commented Feb 3, 2017

BurntSushi commented Feb 3, 2017

RalfJung commented Feb 3, 2017

BurntSushi commented Feb 3, 2017

RalfJung commented Feb 11, 2017

tjayrush commented Oct 25, 2016 •

edited

Loading

forgottenswitch commented Nov 22, 2016 •

edited

Loading

BurntSushi commented Nov 22, 2016 •

edited

Loading

forgottenswitch commented Nov 22, 2016 •

edited

Loading

forgottenswitch commented Nov 22, 2016 •

edited

Loading

forgottenswitch commented Nov 22, 2016 •

edited

Loading

forgottenswitch commented Nov 22, 2016 •

edited

Loading

forgottenswitch commented Jan 6, 2017 •

edited

Loading

BurntSushi commented Jan 7, 2017 •

edited

Loading

forgottenswitch commented Jan 7, 2017 •

edited

Loading

forgottenswitch commented Jan 26, 2017 •

edited

Loading

RalfJung commented Feb 2, 2017 •

edited

Loading