[WIP] Add support for compressed file formats (closes #539) #751
Conversation
complete/_rg
Outdated
@@ -87,6 +87,7 @@ _rg() {
'(-w -x --line-regexp --word-regexp)'{-w,--word-regexp}'[only show matches surrounded by word boundaries]'
'(-e -f --file --files --regexp --type-list)1: :_rg_pattern'
'(--type-list)*:file:_files'
'*'{-z,--search-zip}'search through zip files'
The exclusions and the syntax itself are wrong here. At a minimum it should be something like this:
'(-z --search-zip)'{-z,--search-zip}'[search through zip files]'
We might also want to add other exclusions, but i'd like to do another pass on this file within the next few weeks, so it can wait until then to get fancier.
I would also suggest changing the description wording to something like search contents of compressed files — 'through' is a bit ambiguous, and we don't actually search zip files at all.
I'll fix the syntax as per your suggestion. This file is always throwing me off. Sorry about that.
I thought of using -C/--search-compressed instead of -z/--search-zip. The only reason I went with the latter is that it would be familiar to folks coming from using ag. Like you said, search-zip doesn't really make a lot of sense here. Should we just go with the former option?
The completion-function syntax takes a while to get used to, yeah. Every few weeks i learn that something i used to do in these files is wrong, lol.
I like -z for the short option. It's what ag and sift both use, and the connection with 'Ziv' is immediately intuitive. As far as the long option, idk. --search-zip feels slightly inaccurate, but you're right about ag — it's one of the most common recursive file searchers, and definitely the most well known one that supports searching compressed files, so it's not unreasonable to borrow the terminology. --search-compressed makes sense too tho, w/e 🤷‍♀️
doc/rg.1
Outdated
.RS
.PP
Note that rg expects to find the decompression binaries for the
respective formats on your system\[aq]s PATH for use with this flag.
'On your PATH' feels strange to me — 'in your PATH'?
src/app.rs
Outdated
doc!(h, "search-zip",
    "Search through compressed files.",
    "Search through compressed files. Currently gz, bz2, xz, and \
    lzma files are supported. This options expects the uncompression \
Same comment about 'on your PATH', but also i think 'uncompression' should be 'decompression' here
There is a problem with the completion function, and i added a few bike-shed wording remarks. I won't comment on the Rust stuff except to say that it does at least make sense to me. I tested the patch on macOS (just briefly) and it works! This is neat as hell, ty!
}

lazy_static! {
    static ref DECOMPRESSION_COMMANDS: HashMap<&'static str, DecompressionCommand> = {
why isn't this an enum?
better yet, can't this stuff be done without spawning subprocesses? i imagine there are rust libraries for most of these.
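On the enum question, here is a rough sketch of what that could look like. The type and method names and the exact commands below are invented for illustration; they are not from the PR, which uses a HashMap keyed by extension. On the second question, pure-Rust decompression crates such as flate2 (for gzip) do exist, though this patch shells out to external binaries instead.

use std::ffi::OsStr;
use std::path::Path;

/// Hypothetical enum-based alternative to the HashMap lookup.
#[derive(Clone, Copy, Debug)]
enum CompressionFormat {
    Gzip,
    Bzip2,
    Xz,
    Lzma,
}

impl CompressionFormat {
    /// Guess the format from the file extension, if there is one.
    fn from_path(path: &Path) -> Option<CompressionFormat> {
        match path.extension().and_then(OsStr::to_str) {
            Some("gz") => Some(CompressionFormat::Gzip),
            Some("bz2") => Some(CompressionFormat::Bzip2),
            Some("xz") => Some(CompressionFormat::Xz),
            Some("lzma") => Some(CompressionFormat::Lzma),
            _ => None,
        }
    }

    /// The external decompression command for this format.
    fn binary(&self) -> &'static str {
        match *self {
            CompressionFormat::Gzip => "gzip",
            CompressionFormat::Bzip2 => "bzip2",
            CompressionFormat::Xz | CompressionFormat::Lzma => "xz",
        }
    }

    /// Arguments that make the command decompress to stdout.
    fn args(&self) -> &'static [&'static str] {
        match *self {
            CompressionFormat::Lzma => &["--format=lzma", "-d", "-c"],
            _ => &["-d", "-c"],
        }
    }
}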
@balajisivaraman Thanks so much for working on this! I really wanted to get this merged and there were some fiddly details to get right here, so I went ahead and based a new PR on your work here: #767. I hope that's OK. Most of your code made it through :-) but I did need to refactor the get_reader method (which is now DecompressionReader::from_path).
I left some comments on this PR that roughly correspond to the changes I made.
use std::process::Command;
use std::process::Stdio;
use std::process::ChildStdout;
use globset::{Glob, GlobSet, GlobSetBuilder};
Imports should be consistent with the rest of the code. They should be grouped in three blocks, where each block is separated by whitespace. The first block is std imports, the second block is external crate imports, and the third block is internal crate imports. Each block should be sorted lexicographically.
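As a small illustration (not a quote from #767), the four imports above regrouped under that convention would look roughly like this; the third block is shown only as a comment because this excerpt needs no internal imports:

// Block 1: std imports, sorted lexicographically.
use std::process::ChildStdout;
use std::process::Command;
use std::process::Stdio;

// Block 2: external crate imports.
use globset::{Glob, GlobSet, GlobSetBuilder};

// Block 3: internal crate imports (e.g. from other ripgrep modules) would
// follow here, separated by a blank line, if any were needed.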
/// necessary CLI args
struct DecompressionCommand {
    cmd: &'static str,
    args: Vec<&'static str>
Since we are building these values in a lazy_static!, we don't actually need the Vec. We can use an &'static [&'static str] instead.
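A minimal sketch of that suggestion; the extensions and commands below are illustrative placeholders, not the exact table from the PR:

use std::collections::HashMap;

// Assumes `#[macro_use] extern crate lazy_static;` at the crate root,
// which the PR already has.
struct DecompressionCommand {
    cmd: &'static str,
    // A borrowed static slice instead of a Vec: no allocation needed.
    args: &'static [&'static str],
}

lazy_static! {
    static ref DECOMPRESSION_COMMANDS: HashMap<&'static str, DecompressionCommand> = {
        let mut m = HashMap::new();
        m.insert("gz", DecompressionCommand { cmd: "gzip", args: &["-d", "-c"] });
        m.insert("bz2", DecompressionCommand { cmd: "bzip2", args: &["-d", "-c"] });
        m.insert("xz", DecompressionCommand { cmd: "xz", args: &["-d", "-c"] });
        m
    };
}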
pub fn get_reader(path: &Path, no_messages: bool) -> Option<ChildStdout> {
    if is_tar_archive(path) {
        debug!("Tar archives are currently unsupported: {}", path.display());
        None
In general, I like to keep control flow straight-forward and avoid gratuitous nesting. Early returns are very useful for this. For example, this code could be written as:
if is_tar_archive(path) {
    debug!(...);
    return None;
}

let extension = ...;
That way, you don't need to indent the entire rest of the function.
Ah, that makes a lot of sense. I did do this in the worker.rs file but missed it here. 👍
            String::from_utf8_lossy(error.as_slice()));
    }
    None
}
OK, so the issue here is that this will deadlock in non-trivial cases. You might not have hit it if you only tested on small files. Basically the issue here is that if you don't read from stdout, then the process you spawned (if it's implemented sanely) will at some point block on writing to stdout because it is connected to a pipe. If the calling process never reads from stdout, then the spawned process will eventually block. If the calling process is in turn blocked on something else from the spawned process---which in this case, we are, because we're trying to read from stderr until the spawned process stops writing to it---then you wind up with a deadlock.
If you only test this on small files then you won't notice this because the spawned process will likely be using buffered output. And since it's connected to a pipe, the buffer is usually biggish (~4K?) as opposed to being line buffered on a tty. As a result, the spawned process will finish doing its thing before ever flushing its buffer, which gives the appearance of code that works. :-)
The "fix" to this is to couple the child with reading stdout and wait to read stderr until the process terminates. This basically means that you can't return ChildStdout directly. You need to return something else that wraps up this logic.
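A minimal sketch of that wrapper idea, under the assumption that stderr output stays small; this is not the actual DecompressionReader from #767, just the general shape of coupling the child with its stdout and deferring stderr until EOF:

use std::io::{self, Read};
use std::path::Path;
use std::process::{Child, Command, Stdio};

/// A reader that owns the spawned child and drains its stdout.
struct ChildReader {
    child: Child,
    done: bool,
}

impl ChildReader {
    /// Spawn `cmd args... path` with both stdout and stderr piped.
    fn spawn(cmd: &str, args: &[&str], path: &Path) -> io::Result<ChildReader> {
        let child = Command::new(cmd)
            .args(args)
            .arg(path)
            .stdout(Stdio::piped())
            .stderr(Stdio::piped())
            .spawn()?;
        Ok(ChildReader { child: child, done: false })
    }
}

impl Read for ChildReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.done {
            return Ok(0);
        }
        // Always drain stdout, so the child never blocks on a full pipe.
        let n = self.child.stdout.as_mut().unwrap().read(buf)?;
        if n == 0 {
            // stdout hit EOF: now it is safe to wait on the child and
            // inspect stderr (assumed small enough to fit the pipe buffer).
            self.done = true;
            let status = self.child.wait()?;
            if !status.success() {
                let mut msg = String::new();
                if let Some(ref mut stderr) = self.child.stderr {
                    stderr.read_to_string(&mut msg)?;
                }
                return Err(io::Error::new(
                    io::ErrorKind::Other,
                    format!("decompression failed: {}", msg.trim()),
                ));
            }
        }
        Ok(n)
    }
}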
As an additional upside, the reader that you return can return normal error values, which are handled by the worker. This means you don't need to thread no_messages through the code.
Ah, I knew it. 😞 When I made this change and looked at the code, I knew there had to be something I was missing.
Thank you for the detailed write-up on the issues with my code. It makes a lot of sense to me and helps me understand it better. It also gives me incentive to do some more reading up on pipes and child processes, which I will. 👍
    let err = String::from_utf8_lossy(&output.stderr);
    assert_eq!(err.contains("not in gzip format"), true);
}
Fantastic tests, thank you. :-)
I merged this PR in #767. Thanks again @balajisivaraman!
@BurntSushi, Thank you for merging this. It is one of my first significant contributions of any kind to an OSS project. I learned a lot from working on it as well as going through the changes you made on top of it. Cheers. :-)
@balajisivaraman It's a helluva first contribution! It will be a highlight feature in the next release. :)
Creating this PR for review purposes. I still haven't added any tests for the newer code, which is one of the main things still pending here.

Notes:

- Added globset as a dependency for the ripgrep module, to easily check for supported file formats. The in-built path.extension() method did not allow me to verify for *.tar.gz files. 😞 (A short illustrative sketch of this follows after these notes.)
- I considered using output from Command instead of spawn, which pipes stdout and stderr by default and gives us access to the process' exit status. However, it would've meant loading the entire file into memory in the form of a Vec<u8>. I've therefore gone with the spawn approach here, coupled with peeking at the stderr to verify whether the command succeeded or failed. It works nicely enough in practice in my testing.
- The spawn method ends up returning an Err immediately if the command is not found. All other errors are obtained from the process' stderr. That's why I've handled it such that if spawn returns an Error, then rg will output the appropriate debug message.
- The new flag is -z/--search-zip. One thing to think of is how rg should behave when the -a argument is provided. Right now, if -a and -z are both provided, rg will search zip files as per the new code, but will also end up treating it as a binary file and try searching it. Are we OK with this?
- I don't particularly like passing the no-messages flag to the decompressor, but there's no better option if we want all error handling to remain within the module itself. In this way, the worker module remains largely unaffected if we choose to change how we do decompression at a later point in time.

Any other thoughts and suggestions are welcome.
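For context on the globset note above, here is a small self-contained sketch (the patterns and code are illustrative, not taken from the PR) of why path.extension() alone can't identify *.tar.gz files, while a glob set can:

use std::ffi::OsStr;
use std::path::Path;

use globset::{Glob, GlobSetBuilder};

fn main() {
    // extension() only looks at the final dot-suffix, so a .tar.gz file
    // reports "gz" and there is no direct way to test for "*.tar.gz".
    let path = Path::new("archive.tar.gz");
    assert_eq!(path.extension(), Some(OsStr::new("gz")));

    // A glob set matches against the whole file name instead.
    let mut builder = GlobSetBuilder::new();
    builder.add(Glob::new("*.tar.gz").unwrap());
    builder.add(Glob::new("*.tgz").unwrap());
    let tar_globs = builder.build().unwrap();

    assert!(tar_globs.is_match("archive.tar.gz"));
    assert!(!tar_globs.is_match("archive.gz"));
}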