[FR] To extract both matched and non-matched lines or strings to 2 separate files or streams at the same time #416
Comments
This can simply be done with one
$ ugrep -yhn -options phrase input_file > temp_file
$ ugrep -P '^[0-9]+:(.*)' --format='%1%~' temp_file > output_file_A
$ ugrep -P '^[0-9]+-(.*)' --format='%1%~' temp_file > output_file_B
The second (Note: this won't work with some options you pick like
Doing any of this only makes any sense if the
For example, searching 1000 words doesn't use much RAM or time. (These are the 1000 words from the ugrep benchmark) and searching a 100MB file:
The VM stats show the internal virtual machine size in opcode words as well as the DFA size in nodes and edges. That is not bad at all for 1000 string patterns to search. By comparison, searching with a tiny set of only four words:
Again, it does not matter what the input file size is. If the set of patterns consisting of regex and/or strings is very large, then memory use will increase. But it is limited only to the main thread that constructs the pattern DFA and VM. The worker threads all share the same pattern space.
With the
For full task parallelism, use a named pipe instead of
I've tested it, but it's too slow. This means it's useless in the typical scenario I asked about: when using a single text word as a phrase, it's x1.72 - x2.73 slower than the typical sequential method (A) (benchmarked on a 661 MB text file with 33 000 000 domains, the same file we were discussing in the other thread).
A typical scenario with a single word, method A (sequential), tested with a 661 MB text file of 33 000 000 lines (domains):
As for RAM, a single ug instance took up to 400 MB for a while (peak); two of them at once can take 800 MB. (I didn't test on larger files like 6 GB or 16 GB.)
As for time (includes removing the temp file):
Also as for disk space:
In summary, for now I'm sticking with the typical sequential method (A) for a typical simple search.
I'll test it a bit later, as I'm busy.
Maybe because you weren't clear enough. Since you created two separate comments, it misled me into thinking that you wanted me to try both methods separately (non-parallel and parallel). Perhaps instead of blaming me immediately, you should have edited your previous comment just like you did before, instead of creating another one, or made it very clear that both comments are directly connected. Optionally, I can agree that the blame for the context/wording misunderstanding lies on both sides.
Please write clearly and don't blame. Optionally, I can agree that the blame lies on both sides. I also understand that you didn't like my reply, as it was based on a wrong assumption and puts your method in a bad light.
Maybe, but theory doesn't always match reality. On my side I can be sure only after doing benchmarks, when I get time, but I don't know whether I'm interested anymore.
Ahem... that's why a named pipe is best. No file writes at all. Please read. You didn't appear to get my point. You should run this task-parallel. Otherwise it will of course run slower with a temp file and separate ugrep search calls. All piped commands are just streaming commands that run as parallel processes that are relatively cheap (simple pattern to match, i.e. little overhead). This scenario does not warrant a new feature. Let me add that using stderr for regular output does not follow best practices. I think this answers your question.
I've deleted the last message posted by @garry-ut99 because it was not in compliance with the code of conduct.
To prove it to you, my suggestion was to use
mkfifo teepipe
ugrep -y -n -F -f words4.txt enwik8 | tee teepipe | ugrep -P '^[0-9]+:(.*)' --format='%1%~' > output1.txt & \
ugrep -P '^[0-9]+-(.*)' --format='%1%~' < teepipe > output2.txt
then this takes 0.33 seconds:
time ./teetest.sh
0.490u 0.160s 0:00.33 196.9% 0+0k 0+0io 0pf+0w
^^^^^^^
when a single ugrep call takes 0.36 seconds:
time ugrep -y -n -F -f words4.txt enwik8 > output1.txt
0.291u 0.060s 0:00.36 97.2% 0+0k 0+0io 0pf+0w
^^^^^^^
The
This runs 2x faster than executing
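For readers who want to try the named-pipe pattern without ugrep installed, here is a minimal sketch of the same tee-into-FIFO idea as a self-contained function. It is an illustration under stated assumptions, not the script that was benchmarked: awk stands in for `ugrep -y -n` and only emulates the `N:line` / `N-line` tagging that the two patterns above rely on, sed stands in for the two `--format` extraction passes, and `split_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: tag every line once, then split the tagged stream into
# matched / non-matched files through a named pipe (no temp file).
set -eu

split_lines() {
    # $1 = fixed string to search for (hypothetical helper, not a ugrep feature)
    # $2 = input file, $3 = output for matched lines, $4 = output for the rest
    pattern=$1 input=$2 out_match=$3 out_rest=$4

    dir=$(mktemp -d)
    fifo=$dir/teepipe
    mkfifo "$fifo"

    # Second consumer: reads the copy that tee writes to the FIFO and
    # keeps only "N-line" records (non-matching lines), stripping the tag.
    sed -n 's/^[0-9][0-9]*-\(.*\)$/\1/p' < "$fifo" > "$out_rest" &

    # Producer: awk is an ASSUMED stand-in for `ugrep -y -n`; it prints
    # "N:line" for lines containing the string and "N-line" otherwise.
    # tee duplicates the stream; the first consumer keeps "N:line" records.
    awk -v pat="$pattern" \
        '{ tag = index($0, pat) ? ":" : "-"; print NR tag $0 }' "$input" |
        tee "$fifo" |
        sed -n 's/^[0-9][0-9]*:\(.*\)$/\1/p' > "$out_match"

    wait            # let the background consumer finish
    rm -rf "$dir"   # remove the FIFO
}
```

Calling `split_lines example domains.txt matched.txt rest.txt` (file names are placeholders) reads the input once, while both extraction steps run as parallel processes on the streamed copy. Note that `awk -v` interprets backslash escapes, so the sketch is only suitable for plain words.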
@garry-ut99 Your replies are in violation of the code of conduct, which states, among other things:
Your replies do not align with all three. If you don't agree with how I write replies, and make accusations telling me that I should edit my replies, or that I should explain better, or that I should read better (like what? It's all pretty simple), or that I should not blame (which I don't), or that I don't like your reply (where did I state anything like that?), then please find help elsewhere. All the best. EDIT: added and edited last sentence with explanation of the non-compliance.
Currently it's needed A or B:
ug -options phrase input_file > output_file_A_.txt (extracts matched lines)
ug -options -v phrase input_file > output_file_B.txt (extracts non-matched/inverted lines)
A: is 2x slower, as it needs to run ug two times
B:
especially when processing heavy files or a big amount of files while doing other tasks on the PC
I know it is possible to redirect both stdout and stderr to separate files on the command line:
command 2>> error 1>> output
as per https://stackoverflow.com/questions/7901517/how-to-redirect-stderr-and-stdout-to-different-files-in-the-same-line-in-script, though I'm not sure whether it is possible to redirect stdout to 2 separate files interchangeably from the command line. If not, then output / file redirection would have to be handled by ugrep itself rather than cmd, or maybe stderr could be used as an additional output stream while using the feature.
Also, if the output redirections were handled by ugrep internally, it would be good to have an option to add patterns to output filenames, so each processed file has two corresponding output files, whose names are input_filename+patternA and input_filename+patternB. For example: ug processes input file file.txt, then if the provided patterns are _A and _B, the output filenames are file_A.txt and file_B.txt. This is useful when processing a large amount of files in different folders.
(Keep in mind that although it's called a "feature request", this is just a proposition / idea from my side. I don't force you to implement this idea; you decide for yourself.)
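The output filename scheme proposed above can be approximated today with a small shell wrapper. The following is only a sketch under assumptions: `out_name` and `split_file` are hypothetical helper names, plain `grep -F` stands in for `ug` so the snippet runs without ugrep, and the input is still read twice as in the sequential method.

```shell
#!/bin/sh
# Sketch: derive file_A.txt / file_B.txt from file.txt and fill them
# with matched and non-matched lines (two sequential passes).
set -eu

out_name() {
    # $1 = input path, $2 = suffix such as _A (hypothetical helper)
    case ${1##*/} in
        # Name has an extension: insert the suffix before it.
        ?*.*) printf '%s%s.%s\n' "${1%.*}" "$2" "${1##*.}" ;;
        # No extension: append the suffix.
        *)    printf '%s%s\n' "$1" "$2" ;;
    esac
}

split_file() {
    # $1 = fixed string to search for, $2 = input file
    # grep exits with 1 when it selects no lines; treat that as success.
    grep -F -- "$1" "$2" > "$(out_name "$2" _A)" || [ $? -eq 1 ]
    grep -F -v -- "$1" "$2" > "$(out_name "$2" _B)" || [ $? -eq 1 ]
}
```

For example, `split_file phrase dir/file.txt` writes `dir/file_A.txt` and `dir/file_B.txt` next to the input, so a loop over many files in different folders keeps each pair of outputs beside its source.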