High memory usage in large pipelines #1349

Closed
wlandau opened this issue Oct 22, 2024 · 3 comments

wlandau commented Oct 22, 2024

cf. #1347 and #1329. I tried the following pipeline on a RHEL9 node:

library(autometric)
library(crew)
library(targets)

controller <- crew_controller_local(
  workers = 1L,
  garbage_collection = TRUE,
  options_metrics = crew_options_metrics(
    path = "logs",
    seconds_interval = 1
  )
)

if (tar_active()) {
  controller$start()
  log_start(
    path = "logs/main.txt",
    seconds = 1,
    pids = controller$pids()
  )
}

tar_option_set(
  memory = "transient",
  garbage_collection = TRUE,
  controller = controller
)

write_file <- function(x) {
  fs::dir_create("files")
  path <- file.path("files", paste0(x, ".rds"))
  saveRDS(x, path)
  path
}

list(
  tar_target(x, seq_len(2e4)),
  tar_target(y, write_file(x), pattern = map(x), format = "file"),
  tar_target(z, readRDS(y), pattern = map(y))
)
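
To reproduce, assuming the script above is saved as _targets.R, the pipeline runs with the usual entry point:

library(targets)
tar_make()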

Then I read and visualized the autometric logs:

library(autometric)
log <- log_read("logs", units_memory = "megabytes")
names <- unique(log$name)
log_plot(log, name = names[1], metric = "resident")
log_plot(log, name = names[2], metric = "resident")
log_plot(log, name = names[3], metric = "resident")

[Plots: resident memory over time for the crew worker, the mirai dispatcher, and the local targets process.]

The crew worker and mirai dispatcher are efficient with memory, consuming no more than a few megabytes. But the memory consumption of the local targets process kept increasing without apparent bound. 3 GB isn't necessarily alarming by itself, but I will need to look into what is responsible for most of this memory.
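
One possible way to attribute this memory (a sketch, not something run here; utils::Rprofmem() requires R compiled with --enable-memory-profiling, and the output file name is arbitrary) is to log R-level allocations in the main process while the pipeline runs:

utils::Rprofmem("allocations.out", threshold = 1e6) # log allocations over 1 MB
targets::tar_make(callr_function = NULL)            # run in this process so Rprofmem sees it
utils::Rprofmem(NULL)                               # stop logging
head(readLines("allocations.out"))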

I wonder if this could explain #1347 or #1329, and I wonder what would happen without crew.

@wlandau wlandau self-assigned this Oct 22, 2024

wlandau commented Oct 22, 2024

I tried a similar pipeline without crew:

library(autometric)
library(targets)

if (tar_active()) {
  log_start(
    path = "logs/main.txt",
    seconds = 1
  )
}

tar_option_set(
  memory = "transient",
  garbage_collection = TRUE
)

write_file <- function(x) {
  fs::dir_create("files")
  path <- file.path("files", paste0(x, ".rds"))
  saveRDS(x, path)
  path
}

list(
  tar_target(x, seq_len(2e4)),
  tar_target(y, write_file(x), pattern = map(x), format = "file"),
  tar_target(z, readRDS(y), pattern = map(y))
)

The pipeline took a lot longer to run (~7 hr), but memory usage looked more reasonable:

[Plot: resident memory of the main targets process over the ~7-hour run without crew.]

There is a mild surge at the beginning, another around 10,000 seconds (presumably when all the dynamic branches of z are defined), and another at the end. A maximum of 800 MB is pretty good.

Takeaways:

  1. Something about crew + targets guzzles memory.
  2. Something about targets alone is slow for this type of pipeline, and the slowness does not appear to have anything to do with crew or (1).

So we actually have two different, unrelated performance problems.

wlandau commented Oct 22, 2024

For (2), the slowness just comes from garbage collection 😆. I should have known.

library(targets)

tar_option_set(
  memory = "transient",
  garbage_collection = TRUE
)

write_file <- function(x) {
  fs::dir_create("files")
  path <- file.path("files", paste0(x, ".rds"))
  saveRDS(x, path)
  path
}

list(
  tar_target(x, seq_len(1000)),
  tar_target(y, write_file(x), pattern = map(x), format = "file"),
  tar_target(z, readRDS(y), pattern = map(y))
)

# Profile the pipeline from a separate interactive session:
library(proffer)
library(targets)
tar_destroy()
pprof(tar_make(callr_function = NULL, reporter = "summary"))

[Profiling flame graph: the bulk of the runtime is spent in garbage collection.]
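
As a rough sanity check on the scale of that overhead (a sketch; the target count is approximate), timing gc() directly and extrapolating:

# Time 100 explicit garbage collections, then extrapolate to the
# ~2000 targets in the profiled pipeline above.
elapsed <- system.time(for (i in seq_len(100)) gc())[["elapsed"]]
(elapsed / 100) * 2000 # rough seconds of pure gc() overhead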

wlandau commented Oct 22, 2024

As best I can tell for now, most of the memory is consumed by the internal data structures targets needs for bookkeeping. targets has an internal object-oriented programming system that uses environments with S3 classes. With 32k targets, there are 32k+ nested environments, and those take up a lot of memory in aggregate. Unless I am missing something in scaled-up examples, improving memory efficiency here would be a huge undertaking and may involve converting many of the internal data structures into compact C structs. Converting this thread to a discussion.
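
To illustrate the scale (a minimal sketch, assuming the lobstr package; "target_stub" is a hypothetical class, not a real targets class):

library(lobstr)
# 32k small classed environments, mimicking per-target bookkeeping.
envs <- lapply(seq_len(32000L), function(i) {
  e <- new.env(parent = emptyenv())
  e$name <- paste0("target_", i)
  class(e) <- "target_stub"
  e
})
obj_size(envs) # aggregate size: each environment carries fixed overhead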

@ropensci ropensci locked and limited conversation to collaborators Oct 22, 2024
@wlandau wlandau converted this issue into discussion #1352 Oct 22, 2024
