
still amazing yet helm-org-rifle-org-directory and helm-org-rifle-occur-directories really slow #15

Open
zeltak opened this issue Apr 12, 2017 · 28 comments


@zeltak

zeltak commented Apr 12, 2017

Hi again

so I'm loving helm-org-rifle and have been using it a lot over the past year or so. One thing that constantly breaks my fast workflow is the commands that work on directories, such as helm-org-rifle-org-directory and helm-org-rifle-occur-directories. These can take a long time from when you issue the command until it asks for input. I saw today that the latest ivy release includes ripgrep for searching, which is blazing fast. I was wondering if we can have something like that in helm-org-rifle?

thx a lot again for your amazing work!

Z

@alphapapa
Owner

alphapapa commented Apr 12, 2017

Hey Z!

Funny you should mention that, just yesterday I was working on this because I saw that update in ivy! (Great minds...?)

I'd like to try ripgrep, but it still doesn't support multi-line matching, which makes it a bit less suitable. git grep -W with the proper xfuncname in .git/config makes this a lot easier, although it's not perfect either, and of course it only works on Org files stored in a git repo, and requires the user to manually configure that setting.
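For reference, the git configuration meant here looks something like the following; the exact xfuncname pattern is an assumption, a minimal one that treats Org headings as "functions":

```ini
# In .gitattributes (in the repo), assign Org files a custom diff driver:
*.org   diff=org

# In .git/config (or ~/.gitconfig), define that driver's heading pattern:
[diff "org"]
	xfuncname = "^\\*+ .*$"
```

With that in place, `git grep -W -i emacs -- '*.org'` should print each match extended back to its enclosing heading rather than just the matching line.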

The slowness comes from having to open each file and activate org-mode in it, which can be slow depending on the size of the buffer and your personal Org config. So using an external tool to find out which files have matches could speed it up by skipping files that don't have matches, but each matching file would still require opening and activating Org in.
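The cost difference can be checked from Emacs itself. A rough sketch (the file path is a stand-in), using the RAWFILE argument of find-file-noselect, which is what find-file-literally uses under the hood:

```elisp
;; Visiting normally runs org-mode, font-lock, hooks, etc.:
(benchmark-run 1
  (let ((buf (find-file-noselect "~/org/notes.org")))
    (kill-buffer buf)))

;; Visiting "raw" (third arg RAWFILE) skips decoding and mode
;; activation entirely; compare the elapsed times:
(benchmark-run 1
  (let ((buf (find-file-noselect "~/org/notes.org" nil t)))
    (kill-buffer buf)))
```

The buffers are killed between runs so the second call does not simply reuse the first buffer.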

Another angle is to get actual matching nodes from the external tool (which git grep allows), then inserting them into a temporary buffer, running Helm on that, and finally only opening the source files when the user selects a result. This is more complicated, but I think it's doable. But now that I think of it, maybe I should try the other idea first.
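The first idea can be sketched in a few lines of elisp; my/rifle-matching-files is a hypothetical helper (GNU grep flags assumed), not part of the package:

```elisp
(defun my/rifle-matching-files (directory input)
  "Return Org files under DIRECTORY that contain INPUT, found with grep.
Only these files would then need to be opened and searched by rifle."
  (let ((default-directory directory))
    (split-string
     (shell-command-to-string
      ;; -l: list matching files; -Z: NUL-terminate names so paths
      ;; containing newlines survive; --include limits to Org files.
      (format "grep -r -l -Z --include='*.org' -e %s ."
              (shell-quote-argument input)))
     "\0" t)))
```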

Can you give me an idea of how many files you end up searching when you call one of these commands? If it's a really large number, the first method would probably help your situation a lot. If it's a few large files, the second method would probably be the one to try.

Thanks for the feedback!

@alphapapa
Owner

Hey, I just added a branch which may help a lot: https://github.com/alphapapa/helm-org-rifle/tree/find-files-raw This reads unopened files "literally," which avoids activating Org mode unless the user actually chooses a result. This should avoid the slowness caused by activating Org mode in every file before searching it. Could you please test it and let me know how it goes? I'd still like to make use of some external searching tools, but that's proving a bit difficult since rifle searches by nodes rather than lines, so this may be a good solution in the meantime.

@zeltak
Author

zeltak commented Apr 15, 2017

Hi @alphapapa

Happy to test it out! But I'm very nontechnical, as you remember :) How do I update to the new branch?

best

Z

@alphapapa
Owner

Oh, sorry. :) Probably the easiest way to try it is to go here in your browser: https://raw.githubusercontent.com/alphapapa/helm-org-rifle/find-files-raw/helm-org-rifle.el Copy and paste the contents of that file into a buffer in Emacs, then run eval-buffer. Then try it out! :)

@zeltak
Author

zeltak commented Apr 15, 2017

Whoa boy, that's insanely fast now :) :)

I tried it various times over the last ~1 hour or so and it seems to work great.

One related question: can the results be presented in ivy instead of helm (as an option)?

thx again, will be happy to test anything needed!

Z

@alphapapa
Owner

alphapapa commented Apr 15, 2017

Hey, great! Can you give me an idea of how many files you're searching with it? Like, are we talking tens, hundreds, or thousands, or...?

Using Ivy is not a bad idea for an alternative UI. It wouldn't be too difficult to add. However, Ivy doesn't provide as much functionality behind the scenes, so some of the more advanced features (like choosing between multiple actions, sorting in different ways) would either have to be reimplemented from scratch (not appealing) or left out altogether. But a basic version of the command could be done easily enough. Of course, if I were to do that, I might need to rename the package since it wouldn't be just for Helm anymore (and that's something I've been considering anyway). Another thing I don't know about is whether Ivy supports multi-line entries. If it doesn't, that would be a big drawback.

I'll put it on the todo list as a maybe item. ;)

I'm going to hold off on releasing this find-files-raw branch for a while because I wouldn't be surprised if it causes some little bugs here and there. I'll probably tag the 1.4 release without it in the next few days, and then push find-files-raw to master after I polish and test it more, aiming to release it in 1.5.

But I would appreciate it if you could continue testing it and let me know about any issues you may find. If you want to use it automatically, without having to evaluate the buffer, you can replace the helm-org-rifle.el file in your elpa/helm-org-rifle... directory; be sure to delete the helm-org-rifle.elc file if you do.

Thanks.

@zeltak
Author

zeltak commented Apr 16, 2017

Hi, I think it's around 100 files, more or less :)
Cool, I'll keep testing it and report bugs if I find any! So far it's been working flawlessly :)

best

Z

@alphapapa
Owner

Great, thanks.

@priyadarshan

Hi, I have been testing helm-org-rifle on a project with about 8000 files.

find . -type f | wc -l
    8334

After applying the branch as described above, running helm-org-rifle-directories takes a few minutes before the pattern query appears in the minibuffer.

After that, typing a string known to exist fails silently (no results, no error message).

On a different note, have you considered testing sift? It seems to have multiline support.

@alphapapa
Owner

Hi Priyadarshan,

Thank you very much, that's definitely the kind of testing I've been hoping for. That is a lot of files indeed. I am curious to see how Emacs would handle opening that many files in, say, text-mode. I'll see if I can test this myself. I'm guessing that that's simply too many for Emacs to handle quickly, and so the way rifle currently works, opening each one in an Emacs buffer first, is just not suitable for that many files.

As a matter of fact, I stumbled upon sift again last night, and it's on my list of tools to test. I've tried a few others, but each one seems to have some small issue that makes it unsuitable or difficult to use for this project. I'm hoping that sift will be the one!

By the way, can you give me a rough idea of the size of these files, like the average size? I doubt it matters much here, but I'm curious. For that many files, you might want to consider some kind of indexing solution.

If I could impose on you, would you mind running one of your typical queries using helm-do-grep or one of the similar commands on the set of 8000 files, and let me know how it performs? I wonder what I'm up against here. :)

Thanks for your help.

@priyadarshan

priyadarshan commented Apr 22, 2017

Since testing on about 8000 files was too lengthy, I made a selection of 1768 files:

$ cd archive
$ find . -type f -name '*.org' | wc -l
    1768

Files are more or less the same length; total size is 76M:

$ du -ch -- **/*.org | tail -n 1
 76M	total

So, each file is about 42K.

I do use an indexing tool, recoll, but being able to access the files from Emacs would be ideal.

I tried searching for the pattern "please" with helm-do-ag. Emacs immediately displayed some results, but then it stopped responding. CPU was at 100% for a few minutes; I could not even stop it with C-g.

I tested it on Intel i7 2.6 GHz, with 16GB Ram.

I wonder if it would make sense to just "slurp" all files in fundamental-mode, and then use occur or helm-swoop?

Reading a file seems an ideal candidate for an async operation, so Emacs could use all CPU cores.

I would not mind dedicating even 1GB of RAM in order to have the whole archive available through helm-org-rifle.

@alphapapa
Owner

Hm, well, that's a lot fewer files, but I'm guessing Emacs is going to take a while to open 1,768 files, no matter what.

I do use an indexing tool, recoll, but being able to access the files from Emacs would be ideal.

Have you seen helm-recoll? I remember reading about it a while back. Here are a couple of links you might want to check:

https://oremacs.com/2015/07/27/counsel-recoll/
https://github.com/emacs-helm/helm-recoll

I wonder if it would make sense to just "slurp" all files in fundamental-mode, and then use occur or helm-swoop?

The find-files-raw branch does load them in fundamental mode...only in the Helm commands, not the occur commands, though (I'll fix that sometime). I'm guessing that helm-swoop would actually be extremely slow for this use case. IIRC it copies every buffer's content into a new buffer and adds line numbers, before it even starts searching them, so doing that across 1700 files and 76 MB would probably take a while...

Reading a file seems an ideal candidate for async operation, so Emacs could use all CPU cores.

It would be, indeed, but as far as I know, there's no way for Emacs to load files asynchronously. Tools that use async stuff, like Paradox, Magit, etc, run external processes. So, yeah, you could run a second Emacs process in the background and load all the files into it, but then you'd have to pass the results back into the first process, and if you're going to do that, you probably should just use a dedicated searching tool like sift, etc.

I would not mind dedicating even 1GB of RAM in order to have the whole archive available through helm-org-rifle.

Well, that sounds good to me! haha :) I guess you could try loading all of the files you might want to search, then taking a coffee break while they load, and then keeping that Emacs process loaded and all those buffers open while you work. I guess the only problem might be displaying the buffer list when you need to switch buffers, but that could probably be worked around with some kind of custom function that only displays certain ones, or something like that.

I tried searching for the pattern "please" with helm-do-ag. Emacs immediately displayed some results, but then it stopped responding. CPU was at 100% for a few minutes; I could not even stop it with C-g.

That's a little bit surprising to me, but I don't actually have ag, so I haven't tried that command. Maybe the results were coming in so fast that Emacs couldn't process them fast enough to respond? I'm not sure. If you are interested, you might try some of the other commands, like helm-do-grep which just uses plain grep, and I think you can also use other similar searching tools. It's possible the bottleneck is not in the tool but in the way the Emacs command is implemented.

And it's also possible that Emacs is just not able to handle that much data coming in from a process very well. For example, if I use Magit on a git repo containing Firefox, it is...very slow indeed, just to display the status buffer. I guess that's because there are so many lines to read from the external tool, but I'm not completely sure.

Well anyway, thanks for your help. I hope to be able to make rifle more useful for you in the future, but I'm not sure how much of the issue is Emacs itself. If sift turns out to work, then I think that will help a lot.

@priyadarshan

Thank you for the detailed reply.
Thank you also for the recoll links.

I cut down my test files to a subset, since 8000 files were taking way too much time.

Please let me know if I can be of any help testing. I like helm-org-rifle, and I would like to use it as much as possible.

If needed, let me know what kind of elisp functions to use for timing and benchmarking, or how to do replicable testing.

I am submitting a small report on my own "toying" around.

Just for fun, I tried to open those 1768 files with find-file-literally, via elisp snippet. It took several minutes at 100% CPU.

RAM usage went from about 300M to about 900M. Browsing buffers via C-x C-b was fine though.

Then I tried consolidating all files into one:

find . -type f -name '*.org' -exec cat {} > ~/results.org \;

That took about 4 secs (on SSD).

The nice thing was that opening that file with find-file-literally was basically instantaneous. Also, navigation was instantaneous, much faster than, say, SublimeText, which was surprising to me.

Having the 76M file open in a fundamental buffer, I then tried to use helm-swoop, but it was very slow, basically unusable. Using occur was fine.

I then tried to install ivy+swiper. Although I like helm much better than ivy, I must admit ivy+swiper was much faster in this case, making it usable.

Also using counsel-rg (ripgrep interface) was fine, although not as fast as having the big file open and searching with swiper or occur.

@alphapapa
Owner

Please let me know if I can be of any help testing. I like helm-org-rifle, and I would like to use it as much as possible.

Great, I am very thankful for testers like you!

If needed, let me know what kind of elisp functions to use for timing and benchmarking, or how to do replicable testing.

As a matter of fact, there is a macro in the notes.org file in this repo called profile-rifle that makes it pretty easy to test the functions that underlie the interactive commands. You can use it to test the interactive ones too, but you have to manually end the command with C-g, which makes it less accurate, of course.

So, for example, evaluate that macro and then you can run:

(profile-rifle 1 (helm-org-rifle-directories "~/org" t))

That will instrument most of the relevant commands, then run helm-org-rifle-directories with those args 1 time. You can try typing a query as soon as the prompt appears, and then C-g as soon as the results appear, and then you'll get a result showing which functions run the most and take the most time.

If you run the macro from an Org source block with C-c C-c, the report will be output into an Org results block.

You can also profile the internal functions, like:

(profile-rifle 10 (helm-org-rifle--get-candidates-in-buffer (get-buffer "~/org/something.org") "please"))

And that will search that file for that input 10 times, then display the profiling results.

Just for fun, I tried to open those 1768 files with find-file-literally, via elisp snippet. It took several minutes at 100% CPU.

Thanks, that confirms my suspicion that Emacs simply can't open that many buffers quickly.

RAM usage went from about 300M to about 900M. Browsing buffers via C-x C-b was fine though.

That's a lot of memory, too, but not terribly surprising. Glad to hear that it's usable once it's loaded, though.

The nice thing was that opening that file with find-file-literally was basically instantaneous. Also, navigation was instantaneous, much faster than, say, SublimeText, which was surprising to me.

Yeah, I guess Emacs handles one large file better than many smaller ones. I doubt many people even try to open that many files in Emacs. :)

Having the 76M file open in a fundamental buffer, I then tried to use helm-swoop, but it was very slow, basically unusable. Using occur was fine.

Yeah, helm-swoop is slow by nature. It's okay for smaller files, but...

I then tried to install ivy+swiper. Although I like helm much better than ivy, I must admit ivy+swiper was much faster in this case, making it usable.

That's interesting! Sometime I'll have to take a look at how it works. Maybe helm-swoop can be made faster.

Did you happen to try helm-org-rifle-current-buffer on the 76 MB file? I'm curious to see how well that would work. I imagine it would be pretty slow still, but maybe faster than helm-swoop.

Thanks for all your help. I'm going to be busy here for a while, but maybe in a few weeks we can work on improving this.

@priyadarshan

Regarding sift, sift.el may be of some use.

@alphapapa
Owner

Thanks, I'll check it out.

@alphapapa
Owner

Well, sift is almost good enough, but not quite. For example:

sift -n -i -m -e '^\*+ +[^*]+emacs[^*]+' --only-matching  main.org

That produces what looks like good output: the heading and entry contents for every entry that contains "emacs". But the problem is that the negated character sequences [^*]+ will cause entries to be truncated if the entry contains a * anywhere in it (e.g. for bold text, or plain lists). A PCRE negative lookahead would probably fix this (e.g. (?:(?!^\*).)+ to match anything except a * at the beginning of a line, which should match entry contents but not the next heading)...but, of course:

Error: cannot parse pattern: error parsing regexp: invalid or unsupported Perl syntax: `(?!`

I think using only negated character classes would be a bad idea, because it would result in truncated matches, and that might cause false negatives as well; I'm not sure.

Anyway, it's another case of a tool being 95% of what we need, but that last 5% is really important. :(

So, a few possibilities that I can think of for going on from here:

  1. Use insert-file-contents-literally to read every file into a single temporary buffer, one at a time, and search it for matches, then present the results. That would avoid opening a buffer for every file. It would be almost like making Emacs work like grep. It might be fast enough to be useful.

  2. Use grep or whatever for line-matching instead of entry-matching. Then open a buffer for every file in the results and get each entry from that file. The benefit over existing behavior would be that it would only open files that have matches in them (potential matches, at least; negations aside). But if a common term were searched for, it would still open a lot of buffers, which would defeat the purpose.

  3. I wonder if awk could be used to get what we need. I don't relish writing Awk scripts, but I guess using it would be faster than doing it in Emacs, and it might be possible to do exactly what we need with it.
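For what it's worth, the awk idea can be sketched in a few lines of portable awk. This is a hypothetical sketch, not the package's method; the pattern "emacs" is hard-coded and the sample file is a stand-in:

```shell
# Sample Org file to search (stand-in for real notes):
printf '* Alpha\nuses Emacs daily\n* Beta\nnothing here\n' > main.org

# Print every Org entry (heading line through the line before the next
# heading) whose text contains "emacs", case-insensitively.  On each new
# heading, flush the previous entry if it matched; flush the last at END.
awk '/^\*+ / { if (hit) printf "%s", buf; buf = ""; hit = 0 }
     { buf = buf $0 "\n" }
     tolower($0) ~ /emacs/ { hit = 1 }
     END { if (hit) printf "%s", buf }' main.org
# prints the "* Alpha" entry only
```

Note that text before the first heading is treated as one pseudo-entry; a real version would handle that and take the pattern as an awk variable.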

@alphapapa
Owner

alphapapa commented Apr 29, 2017

Well, I've been experimenting, and I've found that the biggest speed problem is fontification.

(defun helm-org-rifle--get-source-for-literal-results (results)
  "Return Helm source for RESULTS."
  (let ((source (helm-build-sync-source (car results)
                  :after-init-hook helm-org-rifle-after-init-hook
                  :candidates (cdr results)
                  :candidate-transformer helm-org-rifle-transformer
                  :match 'identity
                  :multiline helm-org-rifle-multiline
                  :volatile t
                  :action (helm-make-actions
                           "Show entry" 'helm-org-rifle--show-candidates
                           "Show entry in indirect buffer" 'helm-org-rifle-show-entry-in-indirect-buffer
                           "Show entry in real buffer" 'helm-org-rifle-show-entry-in-real-buffer)
                  :keymap helm-org-rifle-map)))
    source))
(let ((helm-candidate-separator " ")
      ;; Bind fontify-fn to one of the two: `identity' is fast but plain,
      ;; while the rifle fontifier looks right but is much slower.
      ;; (fontify-fn #'identity)
      (fontify-fn #'helm-org-rifle-fontify-like-in-org-mode))
  (helm :sources (cl-loop for r in (let ((case-fold-search t)
                                         (input "emacs")
                                         (outline-regexp "\\*+ "))
                                     (with-current-buffer (get-buffer "*test*")
                                       (cl-loop for file in org-agenda-files
                                                do (progn
                                                     (insert-file-contents-literally file nil nil nil t)
                                                     (goto-char (point-min)))
                                                collect (cons file
                                                              (cl-loop while (re-search-forward input nil t)
                                                                       collect (progn
                                                                                 (outline-back-to-heading)
                                                                                 (cons (funcall fontify-fn
                                                                                                (buffer-substring-no-properties (point)
                                                                                                                                (progn
                                                                                                                                  (outline-next-heading)
                                                                                                                                  (point))))
                                                                                       (point))))))))
                          collect (helm-org-rifle--get-source-for-literal-results r))))

That will show results for emacs in all of org-agenda-files, but by inserting them literally into a temp buffer one-by-one instead of opening every file in a new buffer.

Now if you set fontify-fn to 'identity, it's fast. But when you set it to the fontification function, it's much slower. So if you don't care about the appearance of the results, you can have the faster version, but it looks like plain text, not an Org buffer. If you want the result fontified, it's slow.

And I don't see any way to fix that. Emacs has to do the fontification itself, so no matter how we feed it entries, whether from an external tool or from within Emacs, the fontification is going to be the bottleneck.

So, if you are interested in a non-fontified version, I can add some code to do that. The only advantage it would have over a plain grep command is that it would show the whole entry instead of just matching lines, but that's some benefit.

Let me know what you think. :)

@priyadarshan

Thank you, very useful comments.

I do not mind at all trading fontification for more speed; I would be very interested in testing and using it.

I see Emacs more as a platform for many applications, and I think it is fine to rely on lower-level tools like find, awk, ag, rg, etc. to leverage their speed.

For example, two packages that can deal with hundreds of thousands of text files are mu4e and notmuch.

They both use Xapian to index the messages. Perhaps in the future that could be leveraged as well.

@alphapapa
Owner

Ok, I'll try to push a branch with that soon. Thanks.

@priyadarshan

I was intrigued by your hint of combining a "search engine" like recoll. In the meantime, I have found beagrep, which could perhaps offer some additional ideas.

@alphapapa
Owner

Thanks, that looks very interesting. The author says that it only supports whole-word matches, so we'd need to test it to see how it matches Org syntax, non-alphabetic characters, etc. It might be useful.

@JohnJohnstone

@alphapapa I have been testing the find-files-raw branch with great success; the time between the keybinding and the helm buffer appearing is significantly quicker. It shaved off about 5 seconds on my setup; I have around 200 org-mode files across 60 directories. I wasn't able to get the benchmarking macro to work; should it just be a case of C-c C-c on the src block?

Finally, thanks for all the work you have put into this and your other Emacs packages; they are extremely helpful and much appreciated.

@alphapapa
Owner

@Johnstone-Tech

I have been testing the find-files-raw branch with great success; the time between the keybinding and the helm buffer appearing is significantly quicker. It shaved off about 5 seconds on my setup; I have around 200 org-mode files across 60 directories.

Thanks for the feedback. How long was the total time? Were any of the files already open in Emacs?

I wasn't able to get the benchmarking macro to work; should it just be a case of C-c C-c on the src block?

I'm not sure which one you mean. What happened when you tried?

Finally, thanks for all the work you have put into this and your other Emacs packages; they are extremely helpful and much appreciated.

Thanks for the kind words.

You sound like someone who might be interested in some early code I have for indexing Org files in a SQLite database. It's not very user-friendly yet, but you can look at the org-rifle branch if you are interested. Most of the db-related code is in that branch in the sandbox directory. So far I think there are a few issues with the idea:

  1. Indexing happens in a child Emacs process, and we need some way to launch the indexing process and ensure that the same file isn't indexed simultaneously by multiple indexing processes. So we need some kind of locking, probably a queue that files can be added to, etc.
  2. Indexing is rather slow, but probably fast enough to be useful. And it can probably be improved to some extent.
  3. However, re-indexing files is very slow because of the way old rows are deleted from the db before re-indexing. There might be some SQLite tricks we could use to improve that. Or maybe we could forgo SQLite altogether and use something like MySQL or Postgres. (Or even other indexers, like Recoll, but I know little about them.) But I would like it to be as user-friendly as possible and avoid requiring manual configuration of databases, etc, even though I'm sure there are some users who wouldn't mind doing that.
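For illustration only, one possible shape for such an index; all names here are made up, and the actual sandbox code may look quite different:

```sql
-- One row per Org entry; re-indexing a file would then be a
-- "DELETE ... WHERE file = ?" followed by fresh INSERTs.
CREATE TABLE entries (
  file    TEXT    NOT NULL,  -- source file path
  pos     INTEGER NOT NULL,  -- buffer position of the heading
  heading TEXT,
  body    TEXT
);

-- An index on `file` keeps those per-file deletes from scanning the table.
CREATE INDEX entries_file ON entries (file);
```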

Thanks for your feedback.

@JohnJohnstone

I will do some additional tests before and after opening the org files, to verify I was getting a performance increase. It certainly feels a lot faster.

Brilliant, I will check out the org-rifle branch. SQLite seems like an appropriate choice considering how popular it is and the single-file nature of the format.

@x-ji x-ji mentioned this issue Sep 19, 2020
@dustinlacewell

I noticed that on the find-files-raw branch, all of the files that are searched are still open buffers after the search is complete. I was thinking that this branch implemented the "draw all content into a single file and fontify/search that" approach.

Is this not the case? If it is never going to be a thing, what do you think about a parameter that would track which files were opened as a result of the search, and then close them after the search is complete?

@dustinlacewell

This also produces the prompt:

The file gulp.org is already visited literally,
meaning no coding system decoding, format conversion, or local variables.
You have asked to visit it normally,
but Emacs can visit a file in only one way at a time.

Do you want to revisit the file normally now? (y or n) y

When you go to open the file normally.

@alphapapa
Owner

I noticed that on the find-files-raw branch, all of the files that are searched are still open buffers after the search is complete. I was thinking that this branch implemented the "draw all content into a single file and fontify/search that" approach.

Is this not the case?

No, because that would present the results as all being from a single buffer rather than being from their individual source files. One could try using text properties on each buffer's text to keep track of that; it would require changes in a few places, and it would need benchmarking.
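A minimal sketch of the text-property idea, assuming a single consolidated buffer; my/insert-file-tagged is a hypothetical helper, not existing rifle code:

```elisp
(defun my/insert-file-tagged (file)
  "Insert FILE literally at point, tagging the text with its source file.
A later step could read the `org-rifle-source-file' property at a match
position to recover which file the result came from."
  (let* ((start (point))
         ;; `insert-file-contents-literally' returns (FILENAME CHAR-COUNT).
         (inserted (cadr (insert-file-contents-literally file))))
    (put-text-property start (+ start inserted)
                       'org-rifle-source-file file)))
```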

If it is never going to be a thing, what do you think about a parameter that would track which files were opened as a result of the search, and then close them after the search is complete?

That would make sense.

This also produces the prompt:

The branch is experimental, and probably needs rebasing by now, being a few years old.
