Questions on the first stage of make
#121
Comments
Did some digging and it seems to be the line:
One thing might be to first pull out unique functions and files, then process them. You might be doing this already.
@kendonB I was hoping for a huge test case like this! You may be happy to know that that line of code now reads:

```r
command_deps <- lightly_parallelize(
  plan$command, command_dependencies, jobs = jobs)
```

Since you're not using a Windows machine, you can just set `jobs` to a value greater than 1. To speed things up further, I think we would need a C++-accelerated static code analysis tool to parse commands and look for dependencies (targets/imports). That would be a giant undertaking. Do you think it would be worth it?

You should also check out the new sub-graphing functionality I implemented over the weekend. As long as you don't have too many imports, the interactive visNetwork graph should be sane, even with a 30000-target project. You might even speed things up by using the sub-graphing functionality.
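For intuition, here is a minimal sketch of what a `lightly_parallelize()`-style helper could look like. This is a hypothetical reimplementation, not drake's actual internals; the `mclapply()` fallback and the `jobs` argument are assumptions inferred from the line of code quoted above.

```r
# Hypothetical sketch of a lightly_parallelize()-style helper:
# fork workers when jobs > 1 (forking is unavailable on Windows),
# and fall back to a plain lapply() otherwise.
lightly_parallelize_sketch <- function(X, FUN, jobs = 1, ...) {
  if (jobs > 1 && .Platform$OS.type != "windows") {
    parallel::mclapply(X, FUN, ..., mc.cores = jobs)
  } else {
    lapply(X, FUN, ...)
  }
}

# Usage: analyze each workflow command, possibly in parallel.
# command_deps <- lightly_parallelize_sketch(
#   plan$command, command_dependencies, jobs = 4)
```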
Your comment here is an interesting thought. Currently,
I think that printing a line like that would help.

Regarding pulling out unique functions, I imagine most large projects are, like mine, just many small chunks of few operations. For example, I have 6 user-written functions generating the 30000 targets. On mine, it would greatly increase speed!
I think I see. In a workflow plan with 30000 targets
You could go finer than that though. Even with commands
Actually,
You could also consider caching results so that future runs of make don't do the same work twice? Kind of like
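For reference, the caching idea above could be sketched with the memoise package and its filesystem cache, so repeated calls survive across R sessions. The function `analyze_command()` and the cache path are purely illustrative, not part of drake:

```r
library(memoise)

# Cache results on disk so repeated calls across sessions are free.
# The cache directory below is just an example path.
fs_cache <- cache_filesystem(".analysis_cache")

# Suppose analyze_command() is an expensive pure function of its input,
# e.g. a simple static analysis of one workflow command.
analyze_command <- function(cmd) {
  all.vars(parse(text = cmd)[[1]])  # variables mentioned in the command
}
analyze_command_memoised <- memoise(analyze_command, cache = fs_cache)

# First call computes; later calls with the same cmd hit the disk cache.
analyze_command_memoised("y <- f(x) + g(z)")
```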
Another potential usability speed-up could be to implement different levels of checking, like
What levels of checking would you like to see? If you have custom file targets/imports like
I quite liked remake's different levels: "exists" (current if the target exists), "depends" (current if exists and dependencies unchanged), "code" (current if exists and code unchanged) or "all" (current if exists and both dependencies and code unchanged). Regarding caching, I was referring to the output of
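Those four levels could be sketched as a simple dispatch. The helper predicates `target_exists()`, `deps_changed()`, and `code_changed()` are assumed names for illustration, not remake's or drake's API:

```r
# Hypothetical sketch of remake-style check levels.
is_current <- function(target, level = c("all", "exists", "depends", "code")) {
  level <- match.arg(level)
  if (!target_exists(target)) return(FALSE)  # every level requires existence
  switch(level,
    exists  = TRUE,
    depends = !deps_changed(target),
    code    = !code_changed(target),
    all     = !deps_changed(target) && !code_changed(target)
  )
}
```

Higher levels strictly tighten the check: "all" implies "depends" and "code", which each imply "exists".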
I did not know about those levels in remake.

For

I was not actually sure about caching all the individual command dependencies for every target. I opted for a small, manageable dependency hash, thinking it would cut down on storage and time spent caching.
By the way, b18b43c and beyond have those console messages. Now, more people will be able to guess why build_graph() takes so long.

```r
> load_basic_example()
> make(my_plan)
interconnect 7 items: ld, td, simulate, reg1, my_plan, reg2, dl
interconnect 15 items: 'report.md', small, large, regression1_small, regressi...
import knit
import 'report.Rmd'
...
```
See this new Stack Overflow post. This seems like such a common problem, and I hope someone has solved it already.
@kendonB, in 87de915, see lightly_parallelize(), which now calls lightly_parallelize_atomic(). This should avoid duplication of effort in the processing of commands. I posted the general solution here.
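The general trick behind a `lightly_parallelize_atomic()`-style helper can be sketched as follows: process each unique value once, then expand the results back to the original order. This is an illustrative reimplementation, not the exact code in 87de915:

```r
# Sketch: avoid duplicated work when X contains repeated atomic values.
process_unique <- function(X, FUN, ...) {
  ux <- unique(X)
  out_unique <- lapply(ux, FUN, ...)  # or mclapply() for real parallelism
  out_unique[match(X, ux)]            # map results back to all of X
}

# Example: FUN runs 3 times instead of 5, since "a" and "bb" repeat.
process_unique(c("a", "bb", "a", "ccc", "bb"), nchar)
```

In a plan where many targets share the same command template, this turns per-target analysis cost into per-unique-command cost.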
I think the last thing on this thread is the possible memoization of processing commands here. memoise with the file system cache is probably the best option. I am a bit reluctant because this memoization cache belongs inside the

Also, the code to memoise is this:

```r
command_deps <- lightly_parallelize(
  plan$command, command_dependencies, jobs = jobs)
```

It depends on both the commands and the number of jobs. I would want to memoize in a way that ignores `jobs`.
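One way to memoize while ignoring `jobs` is to key the cache on the commands alone and keep `jobs` outside the memoised signature. This is a sketch with assumed names (`command_dependencies` is drake's internal analyzer referenced above; the wrapper is hypothetical):

```r
library(memoise)

# Memoise a pure function of the commands only, so the cache key
# never includes the worker count.
deps_of_commands <- memoise(function(commands) {
  lapply(commands, command_dependencies)
})

# jobs stays a plain argument of the outer function: changing it can
# steer a parallel backend without invalidating or fragmenting the cache.
get_command_deps <- function(plan, jobs = 1) {
  deps_of_commands(plan$command)
}
```

This sidesteps the need for memoise itself to support ignoring individual parameters.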
See r-lib/memoise#54. There is resistance to optionally ignoring function parameters. |
I will memoize only if the
I think I have done all I can do on this thread at the moment. Will reopen if/when there are new possibilities or suggestions. |
Ran this again and it runs much faster with the parallelization. I noticed that the
Glad to hear the parallelism was an improvement. I expect your idea behind lightly_parallelize_atomic() may have sped things up too (avoiding analyzing the same workflow command twice). I do expect
FYI: I just made the distinction between the two

```r
load_basic_example()
make(my_plan)
```

Do you still get
I actually see more than two (names removed):
Note this was before your most recent change. I'm in the middle of building the whole project, so I'm not sure how long it'll be before I can test the latest change.
I understand. I actually reproduced this on one of my larger projects. Happens with distributed parallelism only. Will explain tonight or tomorrow.
I also see the same behavior when running
Yes, I am seeing redundant work in
@kendonB you just rooted out major inefficiencies in
No need to add me as a contributor. Thanks for the fixes. |
Hi again!
I have a large project with ~30000 targets. When I run make, it takes several minutes for the first piece of visual feedback to print (I see packages loading after these first few minutes). drake is doing something in these minutes.