Questions on the first stage of make #121

Closed
kendonB opened this issue Oct 31, 2017 · 29 comments

@kendonB
Contributor

kendonB commented Oct 31, 2017

Hi again!

I have a large project with ~30000 targets. When I run make(), it takes several minutes for the first piece of visual feedback to print (I see packages loading after these first few minutes).

drake is clearly doing something during these minutes.

  1. Out of curiosity, what is it doing?
  2. Is there visual feedback, like a progress bar, that might make sense for this first stage?
  3. Are there any potential efficiency gains you could think of that wouldn't compromise robustness?
@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Did some digging and it seems to be the line:

command_deps <- lapply(plan$command, command_dependencies) in build_graph()

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

One optimization might be to first pull out the unique functions and files, and then process only those. You might be doing this already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

@kendonB I was hoping for a huge test case like this!

You may be happy to know that that line of code now reads

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

Since you're not using a Windows machine, you can just set jobs to whatever you want mc.cores to be so that mclapply() speeds things up. And since you're using the "future_lapply" backend for your targets, this will not affect how the meat of your workflow is processed.
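For intuition, here is a rough sketch of what a lightly_parallelize()-style helper does. This is my own reconstruction, not drake's actual source; the name lightly_parallelize_sketch is made up.

```r
# Sketch: fall back to serial lapply() on Windows or when jobs = 1;
# otherwise fork workers with parallel::mclapply().
lightly_parallelize_sketch <- function(X, FUN, jobs = 1, ...) {
  if (.Platform$OS.type == "windows" || jobs < 2) {
    lapply(X, FUN, ...)
  } else {
    parallel::mclapply(X, FUN, ..., mc.cores = jobs)
  }
}

# Serial and parallel runs return the same list of results, e.g.:
# lightly_parallelize_sketch(plan$command, command_dependencies, jobs = 4)
lightly_parallelize_sketch(1:3, function(x) x^2)
```

The point is that jobs only controls how the work is fanned out, not what gets computed.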

To speed things up further, I think we would need a C++-accelerated static code analysis tool to parse commands and look for dependencies (targets/imports). That would be a giant undertaking. Drake uses codetools plus some customizations, and the state of the art is CodeDepends. Neither has compiled code.

Do you think drake should log what it's doing to the console? It might be too much to print one line per target, but I was thinking something like resolve dependencies of imports and resolve dependencies of targets.

You should also check out the new sub-graphing functionality I implemented over the weekend. As long as you don't have too many imports, the interactive visNetwork graph should be sane, even with a 30000-target project. You might even speed things up by using the targets argument to plot_graph().

@wlandau-lilly
Collaborator

Your comment here is an interesting thought. Currently, drake does not try to pull out unique code fragments or unique imports to analyze. Intuitively, I don't think keeping only unique values would noticeably increase speed, and unique() could take a long time on 30000 targets. On the other hand, I do not have projects as large as yours, so feel free to prove me wrong.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I think printing a line like resolve dependencies of imports or resolve dependencies of targets makes sense.

Regarding pulling out unique functions: I imagine most large projects are, like mine, just many small chunks with few operations each. For example, I have 6 user-written functions generating the 30000 targets. On mine, it would greatly increase speed!

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I think I see. In a workflow plan with 30000 targets

##    target       command
## 1  target1  rnorm(1000)
## 2  target2  rnorm(1000)
...

"rnorm(1000)" should really only be processed once.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

You could go finer than that, though. Even with commands my_fun(obj1) and my_fun(obj2), you could check the body of my_fun just once.

@wlandau-lilly
Collaborator

Actually, drake already does that last part just fine. It doesn't waste any time processing duplicate imports. Each code analysis only goes one level deep at a time (the body of one function or the code fragment of one command).

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

You could also consider caching results so that future runs of make() don't do the same work twice, kind of like memoise but using the disk?

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Another potential usability speedup could be to implement different levels of checking, like remake does. This is really nice while testing.

@wlandau-lilly
Collaborator

drake does already cache imports, but it always needs to check for changes somehow. I would assume storr does not re-cache items whose hashes have not changed, but that is a question for @richfitz.

What levels of checking would you like to see? If you have custom file targets/imports like report.md and report.Rmd in the basic example, drake uses file modification times to check whether to bother computing a new hash (#4), but that's about it.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I quite liked remake's different levels: "exists" (current if the target exists), "depends" (current if it exists and dependencies are unchanged), "code" (current if it exists and code is unchanged), and "all" (current if it exists and both dependencies and code are unchanged).
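Roughly, I mean something like this. It is my own sketch, not remake's or drake's actual API, and the three predicate helpers are made-up stand-ins for real checks.

```r
# Toy stand-ins for real checks (hypothetical, not drake internals).
target_exists  <- function(target) TRUE
deps_unchanged <- function(target) TRUE
code_unchanged <- function(target) FALSE  # pretend the code changed

# Level-based up-to-date check: each level adds stricter conditions.
is_current <- function(target, level = c("all", "exists", "depends", "code")) {
  level <- match.arg(level)
  ok <- target_exists(target)
  if (level %in% c("depends", "all")) ok <- ok && deps_unchanged(target)
  if (level %in% c("code", "all"))    ok <- ok && code_unchanged(target)
  ok
}

is_current("small", level = "depends")  # code changes are ignored here
is_current("small", level = "all")      # the code check now fails
```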

Regarding caching, I was referring to the output of command_dependencies which seems different from caching imports?

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I did not know remake exposed those levels of checking to the user. To be honest, I still do not think I see the full practical use.

For drake, dependency checking is mostly in current.R and hash.R.

I was not actually sure about caching all the individual command dependencies for every target. I opted for a small manageable dependency hash, thinking it would cut down on storage and time spent caching.

@wlandau-lilly
Collaborator

By the way, b18b43c and beyond have those console messages. Now more people will be able to guess why build_graph() takes so long.

> load_basic_example()
> make(my_plan)
interconnect 7 items: ld, td, simulate, reg1, my_plan, reg2, dl
interconnect 15 items: 'report.md', small, large, regression1_small, regressi...
import knit
import 'report.Rmd'
...

@wlandau-lilly
Collaborator

See this new Stack Overflow post. This seems like such a common problem, and I hope someone has solved it already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

@kendonB, in 87de915, see lightly_parallelize(), which now calls lightly_parallelize_atomic(). It should avoid duplicated effort when processing commands. I posted the general solution here.
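The idea can be sketched like this. This is an illustration of the technique, not the actual lightly_parallelize_atomic() source; parallelize_unique is a made-up name.

```r
# Sketch: analyze each unique command once, then index the results back
# out so every element of X gets its answer.
parallelize_unique <- function(X, FUN, ...) {
  keys  <- unique(X)
  index <- match(X, keys)         # map each element to its unique value
  out   <- lapply(keys, FUN, ...) # could be mclapply() off Windows
  out[index]                      # expand: one result per element of X
}

# With 30000 copies of "rnorm(1000)" in X, FUN runs only once.
```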

@wlandau-lilly
Collaborator

I think the last thing on this thread is the possible memoization of processing commands here. memoise with the file system cache is probably the best option. I am a bit reluctant because this memoization cache belongs inside the .drake/ storr cache, and passing the right path to build_graph() would be a pain.

Also, the code to memoise is this:

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

It depends on both the commands and the number of jobs. I would want to memoize in a way that ignores jobs.
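One way around that is to memoize a wrapper whose only argument is the commands, so jobs never enters the cache key. A sketch with memoise's filesystem cache, using a toy one-line analyzer as a stand-in for command_dependencies() and an illustrative temporary cache path:

```r
library(memoise)

# Toy stand-in for command analysis: the variables a command mentions.
find_deps <- function(commands) {
  lapply(commands, function(cmd) all.vars(parse(text = cmd)[[1]]))
}

# Memoize on the commands alone. Because jobs does not appear in the
# memoized function's arguments, changing it cannot bust the cache.
memo_find_deps <- memoise(
  find_deps,
  cache = cache_filesystem(tempfile("memo"))  # path is illustrative
)

memo_find_deps(c("rnorm(n)", "my_fun(obj1)"))  # computed once, then cached
```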

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

See r-lib/memoise#54. There is resistance to optionally ignoring function parameters.

@wlandau-lilly
Collaborator

I will memoize only if the memoise package handles this use case reasonably well.

@wlandau-lilly
Collaborator

I think I have done all I can do on this thread at the moment. Will reopen if/when there are new possibilities or suggestions.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Ran this again, and it runs much faster with the parallelization. I noticed that the interconnect message came up twice for the same set of targets. Is this what you expect? It appears first, then I see a bunch of import ... messages, then interconnect again, then check ..., then the making happens.

@wlandau-lilly
Collaborator

Glad to hear the parallelism was an improvement. I expect your idea behind lightly_parallelize_atomic() may have sped things up too (avoiding analyzing the same workflow command twice). I do expect interconnect to show up twice: once for the imports only, and again for the targets listed in your workflow plan. There are two calls to console_dependencies(), one here and another here. I suppose I should make the console messages different for each. Do you see two interconnects for the same make()?

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

FYI: I just made the distinction between the two interconnect (now connect) logs clearer. Now, one says imports and the other says targets.

load_basic_example()
make(my_plan)
## connect 9 imports: ld, lint_summary, gwd, td, simulate, reg1, my_plan, reg2, dl
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
...

Do you still get import ... messages before the connect ones?

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I actually see more than two (names removed):

cache ....
interconnect 42 items: ...
interconnect 24306 items: ...
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
import ... lots
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
check 7 items: ...

Note this was before your most recent change. I'm in the middle of building the whole project, so I'm not sure how long it'll be before I can test the latest change.

@wlandau-lilly
Collaborator

I understand.

I actually reproduced this on one of my larger projects. Happens with distributed parallelism only. Will explain tonight or tomorrow.

@kendonB
Contributor Author

kendonB commented Nov 1, 2017

I also see the same behavior when running drake::outdated(my_plan, jobs = 8), so it seems to also happen with vanilla parallelism?

@wlandau-lilly
Collaborator

Yes, I am seeing redundant work in outdated() through config(). Currently working on refactoring the internals.

@wlandau-lilly
Collaborator

@kendonB you just rooted out major inefficiencies in drake. The fixes are effective as of 29f6d29. Not only did those extra connect messages go away, but the unit tests became blazing fast! This makes my life so much easier, and I cannot thank you enough! If it sounds good to you, please submit a pull request to add yourself as a contributor above this line in the DESCRIPTION.

@kendonB
Contributor Author

kendonB commented Nov 1, 2017

No need to add me as a contributor. Thanks for the fixes.
