Questions on the first stage of make #121

Closed
kendonB opened this issue Oct 31, 2017 · 29 comments

@kendonB
Contributor

kendonB commented Oct 31, 2017

Hi again!

I have a large project with ~30000 targets. When I run make(), it takes several minutes for the first piece of visual feedback to print (I see packages loading after these first few minutes).

drake is clearly doing something during these minutes.

  1. Out of curiosity, what is it doing?
  2. Is there visual feedback, like a progress bar, that might make sense for this first stage?
  3. Are there any potential efficiency gains you could think of that wouldn't compromise robustness?
@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Did some digging and it seems to be the line:

command_deps <- lapply(plan$command, command_dependencies) in build_graph()

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

One optimization might be to first pull out the unique functions and files, and then process only those. You might be doing this already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

@kendonB I was hoping for a huge test case like this!

You may be happy to know that that line of code now reads

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

Since you're not using a Windows machine, you can just set jobs to whatever you want mc.cores to be so that mclapply() speeds things up. And since you're using the "future_lapply" backend for your targets, this will not affect how the meat of your workflow is processed.
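For intuition, here is a rough sketch of what a lightly_parallelize()-style helper does. This is my own reconstruction, not drake's actual source; the name lightly_parallelize_sketch is made up.

```r
# Sketch: fall back to serial lapply() on Windows or when jobs = 1;
# otherwise fork workers with parallel::mclapply().
lightly_parallelize_sketch <- function(X, FUN, jobs = 1, ...) {
  if (.Platform$OS.type == "windows" || jobs < 2) {
    lapply(X, FUN, ...)
  } else {
    parallel::mclapply(X, FUN, ..., mc.cores = jobs)
  }
}

# Serial and parallel runs return the same list of results, e.g.:
# lightly_parallelize_sketch(plan$command, command_dependencies, jobs = 4)
lightly_parallelize_sketch(1:3, function(x) x^2)
```

The point is that jobs only controls how the work is fanned out, not what gets computed.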

To speed things up further, I think we would need a C++-accelerated static code analysis tool to parse commands and look for dependencies (targets/imports). That would be a giant undertaking. Drake uses codetools plus some customizations, and the state of the art is CodeDepends. Neither has compiled code.

Do you think drake should log what it's doing to the console? It might be too much to print one line per target, but I was thinking something like resolve dependencies of imports and resolve dependencies of targets.

You should also check out the new sub-graphing functionality I implemented over the weekend. As long as you don't have too many imports, the interactive visNetwork graph should be sane, even with a 30000-target project. You might even speed things up by using the targets argument to plot_graph().

@wlandau-lilly
Collaborator

Your comment here is an interesting thought. Currently, drake does not try to pull out unique code fragments or unique imports to analyze. Intuitively, I don't think keeping only unique values would noticeably increase speed, and unique() could take a long time on 30000 targets. On the other hand, I do not have projects as large as yours, so feel free to prove me wrong.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I think printing a line like resolve dependencies of imports or resolve dependencies of targets makes sense.

Regarding pulling out unique functions: I imagine most large projects are, like mine, just many small chunks with few operations each. For example, I have 6 user-written functions generating the 30000 targets. On mine, it would greatly increase speed!

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I think I see. In a workflow plan with 30000 targets

##    target       command
## 1  target1  rnorm(1000)
## 2  target2  rnorm(1000)
...

"rnorm(1000)" should really only be processed once.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

You could go finer than that, though. Even with commands my_fun(obj1) and my_fun(obj2), you could check the body of my_fun just once.

@wlandau-lilly
Collaborator

Actually, drake already does that last part just fine. It doesn't waste any time processing duplicate imports. Each code analysis only goes one level deep at a time (the body of one function or the code fragment of one command).

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

You could also consider caching results so that future runs of make() don't do the same work twice, kind of like memoise but using the disk?

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Another potential usability speedup could be to implement different levels of checking, like remake does. This is really nice while testing.

@wlandau-lilly
Collaborator

drake does already cache imports, but it always needs to check for changes somehow. I would assume storr does not re-cache items whose hashes have not changed, but that is a question for @richfitz.

What levels of checking would you like to see? If you have custom file targets/imports like report.md and report.Rmd in the basic example, drake uses file modification times to check whether to bother computing a new hash (#4), but that's about it.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I quite liked remake's different levels: "exists" (current if the target exists), "depends" (current if it exists and dependencies are unchanged), "code" (current if it exists and code is unchanged), and "all" (current if it exists and both dependencies and code are unchanged).
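Roughly, I mean something like this. It is my own sketch, not remake's or drake's actual API, and the three predicate helpers are made-up stand-ins for real checks.

```r
# Toy stand-ins for real checks (hypothetical, not drake internals).
target_exists  <- function(target) TRUE
deps_unchanged <- function(target) TRUE
code_unchanged <- function(target) FALSE  # pretend the code changed

# Level-based up-to-date check: each level adds stricter conditions.
is_current <- function(target, level = c("all", "exists", "depends", "code")) {
  level <- match.arg(level)
  ok <- target_exists(target)
  if (level %in% c("depends", "all")) ok <- ok && deps_unchanged(target)
  if (level %in% c("code", "all"))    ok <- ok && code_unchanged(target)
  ok
}

is_current("small", level = "depends")  # code changes are ignored here
is_current("small", level = "all")      # the code check now fails
```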

Regarding caching, I was referring to the output of command_dependencies which seems different from caching imports?

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

I did not know remake exposed those levels of checking to the user. To be honest, I still do not think I see the full practical use.

For drake, dependency checking is mostly in current.R and hash.R.

I was not actually sure about caching all the individual command dependencies for every target. I opted for a small manageable dependency hash, thinking it would cut down on storage and time spent caching.

@wlandau-lilly
Collaborator

By the way, b18b43c and beyond have those console messages. Now more people will be able to guess why build_graph() takes so long.

> load_basic_example()
> make(my_plan)
interconnect 7 items: ld, td, simulate, reg1, my_plan, reg2, dl
interconnect 15 items: 'report.md', small, large, regression1_small, regressi...
import knit
import 'report.Rmd'
...

@wlandau-lilly
Collaborator

See this new Stack Overflow post. This seems like such a common problem, and I hope someone has solved it already.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

@kendonB, in 87de915, see lightly_parallelize(), which now calls lightly_parallelize_atomic(). It should avoid duplicated effort when processing commands. I posted the general solution here.
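The idea can be sketched like this. This is an illustration of the technique, not the actual lightly_parallelize_atomic() source; parallelize_unique is a made-up name.

```r
# Sketch: analyze each unique command once, then index the results back
# out so every element of X gets its answer.
parallelize_unique <- function(X, FUN, ...) {
  keys  <- unique(X)
  index <- match(X, keys)         # map each element to its unique value
  out   <- lapply(keys, FUN, ...) # could be mclapply() off Windows
  out[index]                      # expand: one result per element of X
}

# With 30000 copies of "rnorm(1000)" in X, FUN runs only once.
```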

@wlandau-lilly
Collaborator

I think the last thing on this thread is the possible memoization of processing commands here. memoise with the file system cache is probably the best option. I am a bit reluctant because this memoization cache belongs inside the .drake/ storr cache, and passing the right path to build_graph() would be a pain.

Also, the code to memoise is this:

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

It depends on both the commands and the number of jobs. I would want to memoize in a way that ignores jobs.
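One way around that is to memoize a wrapper whose only argument is the commands, so jobs never enters the cache key. A sketch with memoise's filesystem cache, using a toy one-line analyzer as a stand-in for command_dependencies() and an illustrative temporary cache path:

```r
library(memoise)

# Toy stand-in for command analysis: the variables a command mentions.
find_deps <- function(commands) {
  lapply(commands, function(cmd) all.vars(parse(text = cmd)[[1]]))
}

# Memoize on the commands alone. Because jobs does not appear in the
# memoized function's arguments, changing it cannot bust the cache.
memo_find_deps <- memoise(
  find_deps,
  cache = cache_filesystem(tempfile("memo"))  # path is illustrative
)

memo_find_deps(c("rnorm(n)", "my_fun(obj1)"))  # computed once, then cached
```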

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

See r-lib/memoise#54. There is resistance to optionally ignoring function parameters.

@wlandau-lilly
Collaborator

I will memoize only if the memoise package handles this use case reasonably well.

@wlandau-lilly
Collaborator

I think I have done all I can do on this thread at the moment. Will reopen if/when there are new possibilities or suggestions.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

Ran this again, and it runs much faster with the parallelization. I noticed that the interconnect message came up twice for the same set of targets. Is this what you expect? It appears first, then I see a bunch of import ... messages, then interconnect again, then check ..., then the making happens.

@wlandau-lilly
Collaborator

Glad to hear the parallelism was an improvement. I expect your idea behind lightly_parallelize_atomic() may have sped things up too (avoiding analyzing the same workflow command twice). I do expect interconnect to show up twice: once for the imports only, and again for the targets listed in your workflow plan. There are two calls to console_dependencies(), one here and another here. I suppose I should make the console messages different for each. Do you see two interconnects for the same make()?

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 31, 2017

FYI: I just made the distinction between the two interconnect (now connect) logs clearer. Now, one says imports and the other says targets.

load_basic_example()
make(my_plan)
## connect 9 imports: ld, lint_summary, gwd, td, simulate, reg1, my_plan, reg2, dl
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
...

Do you still get import ... messages before the connect ones?

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I actually see more than two (names removed):

cache ....
interconnect 42 items: ...
interconnect 24306 items: ...
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
import ... lots
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
check 7 items: ...

Note this was before your most recent change. I'm in the middle of building the whole project, so I'm not sure how long it'll be before I can test the latest change.

@wlandau-lilly
Collaborator

I understand.

I actually reproduced this on one of my larger projects. Happens with distributed parallelism only. Will explain tonight or tomorrow.

@kendonB
Contributor Author

kendonB commented Nov 1, 2017

I also see the same behavior when running drake::outdated(my_plan, jobs = 8), so it seems to also happen with vanilla parallelism?

@wlandau-lilly
Collaborator

Yes, I am seeing redundant work in outdated() through config(). Currently working on refactoring the internals.

@wlandau-lilly
Collaborator

@kendonB you just rooted out major inefficiencies in drake. The fixes are effective as of 29f6d29. Not only did those extra connect messages go away, but the unit tests became blazing fast! This makes my life so much easier, and I cannot thank you enough! If it sounds good to you, please submit a pull request to add yourself as a contributor above this line in the DESCRIPTION.

@kendonB
Contributor Author

kendonB commented Nov 1, 2017

No need to add me as a contributor. Thanks for the fixes.
