
[FR] To extract both matched and non-matched lines or strings to 2 separate files or streams at the same time #416

Closed
garry-ut99 opened this issue Aug 8, 2024 · 7 comments
Labels
question A question that has or needs further clarification

Comments


garry-ut99 commented Aug 8, 2024

Currently one must do either A or B:

  • A) run ug twice on each file:
    ug -options phrase input_file > output_file_A.txt (extracts matched lines)
    ug -options -v phrase input_file > output_file_B.txt (extracts non-matched/inverted lines)
  • B) or run two separate instances of ug at the same time

A: is 2x slower, as it needs to run ug two times

B:

  • takes 2x more RAM, as it loads the same file twice into RAM
  • also increases CPU usage 2x, with an additional risk of the CPU being bottlenecked and causing a speed reduction,
    especially when processing heavy files or a big amount of files while doing other tasks on the PC meanwhile

I know it is possible to redirect stdout and stderr to separate files on the command line: command 2>> error 1>> output, as per https://stackoverflow.com/questions/7901517/how-to-redirect-stderr-and-stdout-to-different-files-in-the-same-line-in-script. Though I'm not sure whether it is possible to redirect stdout to 2 separate files interchangeably from the command line; if not, then output/file redirection would have to be handled by ugrep itself rather than by the shell, or maybe stderr could be used as an additional output stream while using the feature.
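
For illustration, that redirection would look like the sketch below, assuming a hypothetical tool that writes matched lines to stdout and non-matched lines to stderr (ugrep does not currently do this):

some_tool phrase input_file 1>> matched.txt 2>> non_matched.txt   # hypothetical: stdout = matches, stderr = non-matches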

Also, if the output redirections were handled by ugrep internally, it would be good to have an option to add patterns to output filenames, so that each processed file gets two corresponding output files named input_filename+patternA and input_filename+patternB. For example: ug processes input file file.txt; if the provided patterns are _A and _B, the output filenames are file_A.txt and file_B.txt. This is useful when processing a large amount of files in different folders.
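
Until such an option exists, this naming scheme can be approximated with a shell loop and two passes per file; a minimal sketch, assuming .txt inputs and phrase as the search word:

for f in *.txt; do
  ug phrase "$f" > "${f%.txt}_A.txt"       # matched lines
  ug -v phrase "$f" > "${f%.txt}_B.txt"    # non-matched lines
done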

(Keep in mind that despite being called a "feature request", this is just a proposition/idea from my side; I don't force you to implement it, you decide for yourself.)


genivia-inc commented Aug 8, 2024

This can simply be done with one ug call using option -y (passthrough: output any line), because non-matching lines are then printed with the - context separator; two cheap follow-up calls separate the two cases after the hard work was done in the first call:

$ ugrep -yhn -options phrase input_file > temp_file
$ ugrep -P '^[0-9]+:(.*)' --format='%1%~' temp_file > output_file_A
$ ugrep -P '^[0-9]+-(.*)' --format='%1%~' temp_file > output_file_B

The second and third ugrep -P calls strip the line numbers and the separator (a : for matching lines, a - for non-matching ones).
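
For illustration, here is what that extraction step does on a tiny artificial sample (two lines, one of each kind):

$ printf '1:matched\n2-not matched\n' | ugrep -P '^[0-9]+:(.*)' --format='%1%~'
matched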

(Note: this won't work with some options, like -o, which conflicts with -y.)

Doing any of this only makes sense if the phrase pattern is very large or complex, e.g. consuming a file with tens of thousands of patterns. Normally ugrep uses only limited memory, with a 256KB sliding window over the input, so the size of the input_file does not matter.

For example, searching 1000 words doesn't use much RAM or time. These are the 1000 words from the ugrep benchmark, searching a 100MB file:

/usr/bin/time -l ugrep -c -F -f benchmarks/words/4.txt benchmarks/corpi/enwik8 --stats=vm
13975

Searched 1 file in 0.186 seconds: 1 matching (100%)
Searched 1128023 lines: 13975 matching (1.239%)
The following pathname selections and search constraints were applied:
  --fixed-strings
  --no-hidden (default)
Lines matched if:
  a string in benchmarks/words/4.txt matches
VM: 8256 nodes (0ms) 8256 edges (0ms) 17513 opcode words (0ms)
        0.19 real         0.18 user         0.01 sys
             7110656  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 532  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   3  involuntary context switches
          1132759035  instructions retired
           561118827  cycles elapsed
             5243584  peak memory footprint

The VM stats show the internal virtual machine size in opcode words as well as the DFA size in nodes and edges.

That is not bad at all for 1000 string patterns to search. By comparison, searching with a tiny set of only four words:

/usr/bin/time -l ugrep -c -F -f benchmarks/words/1.txt benchmarks/corpi/enwik8 --stats=vm
100568

Searched 1 file in 0.0354 seconds: 1 matching (100%)
Searched 1128023 lines: 100568 matching (8.915%)
The following pathname selections and search constraints were applied:
  --fixed-strings
  --no-hidden (default)
Lines matched if:
  a string in benchmarks/words/1.txt matches
VM: 21 nodes (0ms) 21 edges (0ms) 47 opcode words (0ms)
        0.04 real         0.02 user         0.01 sys
             3964928  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 340  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   6  involuntary context switches
           351225050  instructions retired
           108863380  cycles elapsed
             2491008  peak memory footprint

Again, it does not matter what the input file size is.

If the set of patterns, consisting of regex and/or strings, is very large, then memory use will increase. But that increase is limited to the main thread, which constructs the pattern DFA and VM; the worker threads all share the same pattern space.

@genivia-inc genivia-inc added the question A question that has or needs further clarification label Aug 8, 2024
@genivia-inc

With the tee utility you can do this in parallel:

ugrep -yhn -options phrase input_file | tee temp_file | ugrep -P '^[0-9]+:(.*)' --format='%1%~' > output_file_A
ugrep -P '^[0-9]+-(.*)' --format='%1%~' temp_file > output_file_B

For full task-parallelism, use a named pipe instead of temp_file: tee writes to it and the last ugrep call reads from it.
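
A minimal sketch of that named-pipe variant (teepipe is an arbitrary name; remove the pipe when done):

mkfifo teepipe
ugrep -yhn -options phrase input_file | tee teepipe | ugrep -P '^[0-9]+:(.*)' --format='%1%~' > output_file_A &
ugrep -P '^[0-9]+-(.*)' --format='%1%~' < teepipe > output_file_B
rm teepipe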


garry-ut99 commented Aug 9, 2024

genivia-inc : #416 (comment) : This can simply be done with (...) Doing any of this only makes any sense if the phrase pattern is very large or complex (...)

I've tested it, but it's too slow.

Which means it's useless in the typical scenario I asked about: when using a single text word as the phrase, it's 1.72x-2.73x slower than the typical sequential method (A) (benchmarked on a 661 MB text file with 33,000,000 domains, the same file we were discussing in the other thread).

For example, searching 1000 words doesn't use much RAM or time.

A typical scenario with a single word, method A (sequential), tested with a 661MB text file of 33,000,000 lines (domains):

As for RAM, a single ug instance took up to 400MB for a while (peak), so two of them at once can take 800MB (I didn't test on larger files like 6GB or 16GB).

As for time (includes removing temp file):

  • a single word, ~500 hits: method A (sequential): 3.23s; your method: 8.82s
  • a single word, 16 600 000 hits: method A (sequential): 6.67s; your method: 11.45s

Also as for disk space:

  • your method creates an additional temp file which can be 1.5x larger than the input file (or even 2x-3x, depending on the input file's content), for example almost 1GB for a 660MB test file, because it inserts a : or - plus a line number into each line, which grows quickly and takes much additional space. It wastes the disk's lifespan, especially when having dozens of GB of files to process, as it generates an additional 150% of data on disk (or even 200%-300%, depending on the input file's content) on top of the 100% of source data and 100% of output data. Finally, the temp file also has to be removed.

In summary, for now I stick with the typical sequential method (A) for a typical simple search.
As for the parallel method (B), I tried it but failed so far, so I will try again soon.


genivia-inc : #416 (comment) : With the tee utility you can do this in parallel:

I'll test it a bit later, as I'm busy.


genivia-inc : #416 (comment): You didn't appear to get my point.

Maybe because you weren't clear enough: since you created two separate comments, it misled me into thinking that you wanted me to try both methods separately (non-parallel and parallel). Perhaps, instead of blaming me immediately, you should have edited your previous comment just like you did before, instead of creating another one, or made it very clear that the two comments are directly connected. Optionally, I can agree that the blame for the context/wording misunderstanding lies on both sides.

genivia-inc : Please read.

Please write clearly and don't blame. Optionally, I can agree that the blame lies on both sides. I also understand that you didn't like my reply, as it was written under a mistaken assumption and puts your method in a bad light.

genivia-inc : I think this answers your question.

Maybe, but theory doesn't always match reality; on my side I can only be sure after doing benchmarks, when I get time, but I don't know whether I'm still interested.


genivia-inc commented Aug 9, 2024

Also as for disk space:

  • your method creates an additional temp file which can be 1.5x larger than the input file (or even 2x-3x, depending on the input file's content), for example almost 1GB for a 660MB test file, because it inserts a : or - plus a line number into each line, which grows quickly and takes much additional space. It wastes the disk's lifespan, especially when having dozens of GB of files to process, as it generates an additional 150% of data on disk (or even 200%-300%, depending on the input file's content) on top of the 100% of source data and 100% of output data. Finally, the temp file also has to be removed.

Ahem... that's why a named pipe is best. No file writes at all. Please read.

You didn't appear to get my point: you should run this task-parallel. Otherwise it will of course run slower, with a temp file and separate ugrep search calls. Piped commands are streaming commands that run as parallel processes and are relatively cheap here (a simple pattern to match, i.e. little overhead).

This scenario does not warrant a new feature.

Let me add that using stderr for regular output does not follow best practices.

I think this answers your question.

@Genivia Genivia deleted a comment from garry-ut99 Aug 9, 2024
@genivia-inc

I've deleted the last message posted by @garry-ut99 because it was not in compliance with the code of conduct.


genivia-inc commented Aug 9, 2024

To prove it to you: my suggestion was to use tee and a named pipe, put into a script file, e.g. teetest.sh (my example here searches 1000 words in a 100MB file):

mkfifo teepipe
# matched lines (":" separator) go to output1.txt; the full -y stream is teed into the pipe:
ugrep -y -n -F -f words4.txt enwik8 | tee teepipe | ugrep -P '^[0-9]+:(.*)' --format='%1%~' > output1.txt &
# non-matched lines ("-" separator) are read from the pipe in parallel:
ugrep -P '^[0-9]+-(.*)' --format='%1%~' < teepipe > output2.txt

then this takes 0.33 seconds:

time ./teetest.sh
0.490u 0.160s 0:00.33 196.9%	0+0k 0+0io 0pf+0w
              ^^^^^^^

when a single ugrep call takes 0.36 seconds:

time ugrep -y -n -F -f words4.txt enwik8 > output1.txt
0.291u 0.060s 0:00.36 97.2%	0+0k 0+0io 0pf+0w
              ^^^^^^^

The tee and the two extra ugrep -P calls aren't expensive to execute and use less memory than the main work done by the first ugrep when the search pattern is very large (an important assumption!).

This runs 2x faster than executing ugrep and ugrep -v separately, which takes about 0.36s + 0.36s = 0.72 seconds to produce output1.txt and output2.txt. You could also run ugrep and ugrep -v in parallel, in 0.36 seconds. If the search patterns are "normal", that will run very well in parallel without any issues (not much RAM use: 256KB for the input buffer and a few more KB for other stuff, since the executable code is shared among the two cores). If the search pattern specified is huge (a couple of MB), then it consumes memory to the point where it perhaps makes sense to use the approach above. But you don't have to. Note that the input file size does not matter for RAM use, since input files are not stored in memory as a whole.
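
A minimal sketch of that two-process alternative, using the same filenames as the example above:

ugrep -F -f words4.txt enwik8 > output1.txt &
ugrep -v -F -f words4.txt enwik8 > output2.txt &
wait   # both searches run concurrently; wait for them to finish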

Edit: screenshot evidence: [screenshot of the terminal timing runs]


genivia-inc commented Aug 11, 2024

@garry-ut99 Your replies are in violation of the code of conduct, which states, among other things:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism

Your replies do not align with all three.

If you don't agree with how I write replies, and make accusations telling me that I should edit my replies, that I should explain better, that I should read better (like what? It's all pretty simple), that I should not blame (which I don't), or that I don't like your reply (where did I state anything like that?), then please find help elsewhere.

All the best.

EDIT: added and edited last sentence with explanation of the non-compliance.

@Genivia Genivia deleted a comment from garry-ut99 Aug 11, 2024
@Genivia Genivia deleted a comment from garry-ut99 Aug 13, 2024
@Genivia Genivia locked and limited conversation to collaborators Aug 14, 2024