On-demand branch creation #1364

Closed
wlandau opened this issue Nov 1, 2024 · 11 comments

@wlandau
Member

wlandau commented Nov 1, 2024

Dynamic branches take up a lot of memory in the main session of a large pipeline. Instead of a full branch object, it may be possible to store a lightweight reference to the branch until it is actually needed. If this works, we may see a large reduction in memory consumption.

@wlandau wlandau self-assigned this Nov 1, 2024
@wlandau
Member Author

wlandau commented Nov 1, 2024

Bud targets probably also need a similar treatment.

@wlandau
Member Author

wlandau commented Nov 1, 2024

Actually, I think it's more efficient to just go with the original serialization idea proposed in #1352. Even after subtracting pedigree creation time, it is much faster to deserialize a branch on demand than to create one from scratch.

command <- command_init()
settings <- settings_init()
cue <- cue_init()
value <- value_init()
branch <- branch_init(command, settings, cue, value)
serialized_branch_high <- qs::qserialize(branch, preset = "high")
serialized_branch_balanced <- qs::qserialize(branch, preset = "balanced")
serialized_branch_fast <- qs::qserialize(branch, preset = "fast")

microbenchmark::microbenchmark(
  create_branch = branch_init(command, settings, cue, value),
  create_pedigree = pedigree_new(parent = branch$settings$name, index = 1L),
  deserialize_high = qs::qdeserialize(serialized_branch_high),
  deserialize_balanced = qs::qdeserialize(serialized_branch_balanced),
  deserialize_fast = qs::qdeserialize(serialized_branch_fast),
  times = 1e4,
  control = list(warmup = 100)
)
#> Unit: microseconds
#>                  expr    min     lq      mean median      uq      max neval  cld
#>         create_branch 56.621 58.876 68.514920 60.721 64.9235 8143.133 10000 a   
#>       create_pedigree  1.558  2.009  2.270166  2.173  2.2960   34.440 10000  b  
#>      deserialize_high 17.917 18.942 27.222868 21.689 34.3170 7516.366 10000   c 
#>  deserialize_balanced 14.801 15.785 23.145619 18.204 30.8730  403.645 10000    d
#>      deserialize_fast 14.760 15.662 22.881981 17.917 30.7090  204.795 10000    d

@wlandau wlandau closed this as not planned Nov 1, 2024
@wlandau
Member Author

wlandau commented Nov 1, 2024

And to summarize the storage sizes of the various options:

command <- command_init()
settings <- settings_init(name = "target_name")
cue <- cue_init()
value <- value_init()
branch <- branch_init(command, settings, cue, value, index = 1L)
serialized_branch_high <- qs::qserialize(branch, preset = "high")
serialized_branch_balanced <- qs::qserialize(branch, preset = "balanced")
serialized_branch_fast <- qs::qserialize(branch, preset = "fast")

library(lobstr)
obj_size(qs::qserialize(branch$pedigree))
#> 176 B
obj_size(branch$pedigree)
#> 456 B
obj_size(serialized_branch_high)
#> 680 B
obj_size(serialized_branch_balanced)
#> 840 B
obj_size(serialized_branch_fast)
#> 1.17 kB
obj_size(branch)
#> 9.54 kB

The "high" preset on the branch looks like a good tradeoff (#1365).

@wlandau
Member Author

wlandau commented Nov 7, 2024

After optimizing with #1368, branch creation got much faster. It will also be much easier now to create branches on demand and store only lightweight references whenever possible. I will need to refactor the junction class and add fancy checking to pipeline_set_target() and pipeline_get_target(), but #1368 also makes this part easier.

target <- tar_target(y, x, pattern = map(x))
name <- "x_branch"
command <- target$command
store <- target$store
cue <- target$cue
settings <- target$settings
index <- 1L
deps_parent <- character(0L)
deps_child <- character(0L)
branch <- branch_init(
  name = name,
  command = command,
  deps_parent = deps_parent,
  deps_child = deps_child,
  settings = settings,
  cue = cue,
  store = store,
  index = index
)
serialized_branch_high <- qs::qserialize(branch, preset = "high")
serialized_branch_balanced <- qs::qserialize(branch, preset = "balanced")
serialized_branch_fast <- qs::qserialize(branch, preset = "fast")

microbenchmark::microbenchmark(
  create_branch = branch_init(
    name = name,
    command = command,
    deps_parent = deps_parent,
    deps_child = deps_child,
    settings = settings,
    cue = cue,
    store = store,
    index = index
  ),
  deserialize_high = qs::qdeserialize(serialized_branch_high),
  deserialize_balanced = qs::qdeserialize(serialized_branch_balanced),
  deserialize_fast = qs::qdeserialize(serialized_branch_fast),
  times = 1e4,
  control = list(warmup = 100)
)
#> Unit: microseconds
#>                  expr    min     lq     mean median     uq      max neval cld
#>         create_branch 15.170 16.400 19.10258 16.974 18.204 5561.978 10000  a 
#>      deserialize_high 17.835 19.024 26.08239 20.623 28.618 6385.299 10000   b
#>  deserialize_balanced 14.678 15.785 21.73968 16.851 25.092 6344.299 10000  a 
#>      deserialize_fast 14.555 15.662 20.98209 16.687 24.928 5613.187 10000  a 

@wlandau wlandau reopened this Nov 7, 2024
@wlandau
Member Author

wlandau commented Nov 8, 2024

Notes to self on the next steps for the implementation:

  • Ensure values when the subpipeline is created. For transient-memory targets, ensure the value after making the subpipeline copy. For persistent-memory targets, do it before.
  • The lightweight reference in the pipeline object should include the parent name, the file path if known, and the file stage if known.
  • Transient-memory branches should be converted back to references once they run. Persistent-memory branches should not.
  • Need to look at if/how buds keep persistent values in memory.
  • Need to look at target_load_value() on patterns. Can those targets be converted back into references?
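
A rough sketch of the idea in the notes above, assuming a toy pipeline stored as an environment. Every name prefixed with toy_ is hypothetical and invented for illustration; none of it is the actual targets implementation. The reference holds only the parent name, file path, and file stage, and the full branch is built only when it is requested.

```r
# Hypothetical sketch: the toy_* helpers are NOT part of the targets API.

# A reference is a small named character vector: parent, path, stage.
toy_reference <- function(parent, path = NA_character_, stage = NA_character_) {
  c(parent = parent, path = path, stage = stage)
}

toy_is_reference <- function(x) is.character(x)

# Stand-in for the real (much heavier) branch object.
toy_branch <- function(name, reference) {
  list(
    name = name,
    parent = reference[["parent"]],
    path = reference[["path"]],
    stage = reference[["stage"]],
    value = NULL
  )
}

toy_pipeline_new <- function() new.env(parent = emptyenv())

toy_pipeline_set <- function(pipeline, name, entry) {
  assign(name, entry, envir = pipeline)
  invisible(pipeline)
}

# Materialize the full branch only when someone asks for it.
toy_pipeline_get <- function(pipeline, name) {
  entry <- get(name, envir = pipeline, inherits = FALSE)
  if (toy_is_reference(entry)) {
    entry <- toy_branch(name, entry)
    assign(name, entry, envir = pipeline)
  }
  entry
}

# Transient-memory branches shrink back to references; persistent ones stay.
toy_pipeline_unload <- function(pipeline, name, memory = "transient") {
  entry <- get(name, envir = pipeline, inherits = FALSE)
  if (identical(memory, "transient") && !toy_is_reference(entry)) {
    reference <- toy_reference(entry$parent, entry$path, entry$stage)
    assign(name, reference, envir = pipeline)
  }
  invisible(pipeline)
}

pipeline <- toy_pipeline_new()
toy_pipeline_set(pipeline, "y_branch1", toy_reference(parent = "y"))
branch <- toy_pipeline_get(pipeline, "y_branch1")  # full branch built here
toy_pipeline_unload(pipeline, "y_branch1")         # back to a reference
```

In this sketch, the pipeline entry is a character vector before the first toy_pipeline_get() call and again after toy_pipeline_unload(), so the heavier list exists only while the branch is in use.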

@wlandau
Member Author

wlandau commented Nov 9, 2024

targets already has a sophisticated mechanism for transient memory via pipeline_unload_transient(). I think it's just a matter of converting the target definition object back to a reference during pipeline_unload_target().

@wlandau
Member Author

wlandau commented Nov 9, 2024

In 23652fd (branch 1364), I added a new reference class in R/class_reference.R and unit tests in tests/testthat/test-class_reference.R. This is the new machinery for converting branches and buds to and from lightweight references. At this point, I just need to plug reference_produce_target() into pipeline_get_target() and target_produce_reference() into pipeline_unload_target(). (pipeline_set_target() should not change.) And for efficiency, I should reduce excessive calls to pipeline_get_target(), which I have treated thus far as inexpensive.

@wlandau
Member Author

wlandau commented Nov 9, 2024

I also need to go into the pattern and stem classes and make sure they store references and not whole targets in the pipeline object when they create branches and buds.

@wlandau
Member Author

wlandau commented Nov 9, 2024

83b706c converts branches and buds to and from lightweight references using the existing machinery of target_load_value() and pipeline_unload_target(). Next, I will ensure branches and buds are created as references from the start.

@wlandau
Member Author

wlandau commented Nov 9, 2024

Branch 1364 implements the above, and branch 1364-ref3 takes the extra step of creating branches and buds as lightweight references from the beginning. The latter is failing at the moment and I have to step away for now, but I will return to it soon.

@wlandau
Member Author

wlandau commented Nov 10, 2024

Implemented in #1370

@wlandau wlandau closed this as completed Nov 10, 2024