Codify processor RAM usage into nomad job specs #54

Closed
kurtwheeler opened this issue Sep 19, 2017 · 8 comments

Labels: future thoughts, nomad-integration

Comments


kurtwheeler commented Sep 19, 2017

Context

#55 creates different Nomad job specifications for different processor job types. One benefit of this is that we can specify the resource requirements (probably just RAM/CPU) for each job type so that Nomad can schedule the work in a (hopefully) intelligent way.

Problem or idea

Assuming this works well, we'll want to do this for all job types. However we should make sure that Nomad does in fact do a good job of scheduling work before we do this for everything. Therefore we should start with just one, so why not SCAN.UPC?

Solution or next step

We need to test out SCAN.UPC on a variety of file sizes and see what a reasonable upper bound for RAM and CPU is. Once we've determined these, they should be encoded into the Nomad job specification for SCAN.UPC jobs created in #55.
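
For illustration, once we've determined those bounds they would go in the resources stanza of the job spec. The job name, image, and numbers below are placeholders rather than the actual spec from #55; the nomad CLI can at least validate the file:

cat > scan_upc.nomad <<'EOF'
job "scan-upc" {
  datacenters = ["dc1"]
  type        = "batch"

  group "processors" {
    task "scan-upc" {
      driver = "docker"

      config {
        image = "example/scan-upc:latest"  # placeholder image
      }

      resources {
        cpu    = 500   # MHz; memory is expected to be the real constraint
        memory = 4096  # MB; the pessimistic upper bound measured for this job type
      }
    }
  }
}
EOF

nomad validate scan_upc.nomad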

@kurtwheeler kurtwheeler added this to the Create Nomad job specifications milestone Sep 19, 2017
@kurtwheeler kurtwheeler added backlog and removed next labels Jan 24, 2018
@kurtwheeler

I emailed Steven Piccolo (the creator of SCAN.UPC) about this a while ago. This was his response:

The numbers below are for peak memory usage (in Gigabytes). It usually doesn't include downloading the CEL file from GEO (but downloading doesn't take much memory). In many cases, it doesn't include downloading annotation packages from Bioconductor, but I did throw in some to show the effect of doing that (doesn't seem to increase peak memory usage). I used default parameters for SCAN (0.01 convergence threshold), except when noted otherwise. None of these tests used BrainArray mapping. Are you planning to do that?

U133A = 1.119 gigabytes
U133 (SCANfast) = 1.06
Gene ST 1.0 = 2.48
Gene ST 2.1 (also downloading annotations) = 2.93
Gene ST 2.1 (not downloading annotations) = 2.92
Exon 1.0 (also downloading from GEO and downloading annotations) = 3.28
HTA 2 arrays (also downloading annotations) = 11.77

I executed the above tests on a MacBook Pro with 16 GB memory. I ran two of the tests on a Linux server, and the performance was similar.

Below is an example command that I used to test the peak memory usage on my MacBook Pro.

/usr/bin/time -l <test command>

On Linux, you can use a command like this:

/usr/bin/time -v <test command>

These numbers are for peak memory usage. Memory usage is often less than this while the data are being processed.

Does this help?


cgreene commented Mar 18, 2018

Can tag @srp33 too. 😄

@cgreene cgreene modified the milestones: Create Nomad job specifications, VJackieP3 - Autoimmune/rheumatic disease compendium Apr 8, 2018
@cgreene cgreene modified the milestones: VJackieP3 - Autoimmune/rheumatic disease compendium, Data Refinery Version 1 Apr 26, 2018
@jaclyn-taroni jaclyn-taroni changed the title from "Codify SCAN.UPC RAM usage into nomad job specs" to "Codify processor RAM usage into nomad job specs" Apr 26, 2018
@jaclyn-taroni

I've updated the title to reflect the fact that this should probably be addressing multiple processor types. @kurtwheeler can you go back through and note what additional testing/experimentation, if any, needs to be done?

@kurtwheeler

Thanks for updating the title; this should in fact address multiple processor types. I do not have a definitive answer as to the full extent of testing we _should_ do, but I can provide a lot of information that will hopefully facilitate a discussion that gets us to a good-enough solution. The reason for this is that this issue is partly an optimization and partly extremely necessary for things to run smoothly.

So let me start by explaining the extremely necessary part of this. Nomad jobs have memory specifications on them, which Nomad uses to bin-pack jobs onto instances that have the resources available to run them. This is a nice feature because it lets us put as many jobs onto a machine as possible without running out of memory and having to switch to swap, which would be an enormous performance degradation. However, this feature also has a downside: Nomad disables swap for those containers and will not let them exceed their memory requirement. If a job does exceed it, Nomad kills the job.

The key takeaway from that is that we have to tell Nomad how much memory to allocate for a job, and if the job exceeds that amount of RAM then it will be killed. Therefore underestimating the amount of memory to specify for a type of job (possibly done on a per-processor basis, but even different platforms can have different memory requirements as demonstrated above) will mean that any jobs which exceed it will never complete, until we notice and raise the memory limit.

So what this means for us is that we need to have a pessimistic estimation of how much memory each type of job can take. By pessimistic estimation I mean that we need an estimation that is probably higher than the highest amount of memory that type of job will ever use. That's the extremely necessary part of this issue.

The part of this issue that relates to optimization is the granularity of the job types and the accuracy of our estimates. The more granular our job types are, the more "personalized" each job type will be to the requirements of the jobs that fall under it. For example, consider the numbers Dr. Piccolo provided above for SCAN. He broke them out by platform type and found that while U133 needs only a little more than 1 GB of RAM, HTA 2 arrays need close to 12 GB. This means that if we use processor type as the way to differentiate job types, we would have to assume that all SCAN jobs need 12 GB of RAM (and we'd probably want to pad that by 1.5x or so to account for the fact that Dr. Piccolo probably did not happen to run the sample with the very highest memory usage for that platform). So we would allocate 12 GB of RAM for every SCAN job when the average job probably uses less than 4 GB (a rough estimate based on the other platforms he provided data for). Doing things that way therefore has the potential to increase the cost of processing by ~3x!!! This is not a minor optimization.

However there are a lot of ways that we can tackle this, which is why I'm writing all of this out instead of just providing a definitive answer to this. I have a few rough ideas to start with, but I think that as a team we potentially could come up with even better ones:

  1. 80/20 it. We could create specific job types for the most popular platforms, which would cover ~80% of jobs, and then use a processor-specific catch-all job type for the rest. This would minimize the testing we need to do to generate these specs, while only leaving ~20% of jobs running in an unoptimized way.

  2. Use a tiered system. We could create various tiers of memory requirements such as 1 GB, 2 GB, 4 GB, 8 GB, etc. We could create a lookup table that maps platforms to tiers, start it off with some conservative estimates, and then build logic into the Foreman to promote platforms to higher tiers when it detects that a job was killed by the OOM killer (a rough sketch of this appears after this list).

  3. Create a heuristic based on organism. My understanding is that the RAM requirements differ between platforms because of the size of the reference genome/transcriptome that has to be read into memory. If we can detect that size, then perhaps we can multiply it by some constant to get an estimate that is reliably higher than the maximum without being very wasteful.
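
To make option 2 a bit more concrete, here is a rough sketch of the lookup-table-plus-promotion idea. The platform names, tier sizes, and function names are made up for illustration, and the real logic would live in the Foreman rather than in a shell script:

#!/usr/bin/env bash
# Sketch of option 2: map platforms to memory tiers and promote a platform to
# the next tier when one of its jobs is killed by the OOM killer.

declare -A PLATFORM_TIER_MB=(
  [U133A]=2048
  [GENE_ST_1_0]=4096
  [HTA_2_0]=16384
)

TIERS_MB=(1024 2048 4096 8192 16384 32768)

next_tier() {
  local current=$1 tier
  for tier in "${TIERS_MB[@]}"; do
    if (( tier > current )); then
      echo "$tier"
      return
    fi
  done
  echo "$current"  # already at the highest tier
}

# The Foreman would call something like this when it detects an OOM-killed job.
promote_platform() {
  local platform=$1
  local current=${PLATFORM_TIER_MB[$platform]:-1024}
  PLATFORM_TIER_MB[$platform]=$(next_tier "$current")
  echo "Promoted $platform to ${PLATFORM_TIER_MB[$platform]} MB"
}

promote_platform GENE_ST_1_0   # prints: Promoted GENE_ST_1_0 to 8192 MB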

Finally, it appears to be possible to rebuild Nomad so that it accepts a memory limit which isn't used for bin packing. This would allow the memory limit used for allocation planning and the actual hard memory limit on the docker container to be different. I would say that rebuilding Nomad is definitely not the best long term solution, but it may be a reasonable way to do the initial million samples without sinking an immense amount of time into optimization. We could potentially even keep track of the peak memory usage while running those to give us data to inform a more robust solution later.

So to finally directly address @jaclyn-taroni's question of "note what additional testing/experimentation, if any, needs to be done": it depends on how we decide to specify job types. That being said, the following will output the maximum memory usage in KB of $command.

/usr/bin/time -v $command 2>&1 >/dev/null | grep Maximum | awk '{print $NF}'

Really I think CPU is not that big of a deal given that memory is the constraining factor for us. We probably won't run into that issue as much, so we can probably just specify fairly low CPU numbers so Nomad basically ignores them. However I have put some thought into this so I will record it below. Feel free to stop reading here.

Testing CPU is more difficult because even a command like ls will use 100% of the available CPU, if it can, for a very short period of time. Therefore we can't just look at the peak CPU usage and get a number. However, Docker describes CpuShares as "an integer value containing the container's CPU shares (i.e. the relative weight vs. other containers)". Since they are relative, we don't need hard numbers, just relative numbers compared to other job types. This means that we can decide how we want to split up job types based on the RAM considerations, then assign them relative weights for CPU.
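
As a quick, hypothetical illustration of what "relative" means here, two containers pinned to the same core and given shares of 1024 and 512 should get roughly a 2:1 split of CPU time while both are running; the busybox image and busy-loop workload are arbitrary:

# Both containers contend for core 0; the 1024-share one should get roughly
# twice as much CPU time as the 512-share one while both are running.
workload='i=0; while [ "$i" -lt 5000000 ]; do i=$((i+1)); done'

( time docker run --rm --cpuset-cpus=0 --cpu-shares=1024 busybox sh -c "$workload" ) &
( time docker run --rm --cpuset-cpus=0 --cpu-shares=512  busybox sh -c "$workload" ) &
wait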

@Miserlou suggested that the best way to determine the CPU requirements for a certain process is to limit how much CPU it can use and see where performance drops off. Since these processors are memory-bound, they should have relatively constant performance until they don't have enough CPU power, at which point their performance will drop off drastically. It appears that this could be done using cpulimit, although that enforces the CPU limit by sending SIGSTOP and SIGCONT, so I'm not actually sure it would end up having the desired effect after all.
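
If we do go down that road, a rough sketch (with a placeholder command) would be to run the same job under decreasing CPU limits and watch where wall-clock time starts to blow up:

#!/usr/bin/env bash
# Run the same (placeholder) processor command under decreasing CPU limits and
# record wall-clock time to see where performance starts to fall off.
TEST_COMMAND="Rscript run_scan.R sample.CEL"   # placeholder command

for pct in 100 75 50 25 10; do
  start=$(date +%s)
  cpulimit -l "$pct" $TEST_COMMAND >/dev/null 2>&1
  end=$(date +%s)
  echo "CPU limit ${pct}%: $(( end - start )) seconds"
done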

@kurtwheeler

So I discussed this some with @cgreene and a few relevant things came up:

  • The variability in RAM usage for MicroArray data is not related to the organism, but rather the platform.
  • All of our processors (except tximport) can be run one sample at a time. Doing so should keep the variability in RAM usage across samples from the same platform very minimal.
  • The variability in RAM usage for RNASeq data is related to the organism.

Therefore it should be possible to loop through these two lists:

  • The MicroArray platforms we support.
  • The organisms we have transcriptome indices for.

For each item of each list, query for a few samples (maybe 5 or 6), record the peak memory usage for processing each one, average them together, and use that as an estimate of the RAM needs for that platform or organism.
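
Something along these lines per platform or organism, with placeholder file names and a placeholder processor command:

#!/usr/bin/env bash
# For one platform (or organism): process a handful of samples, record the
# peak resident set size of each, and average them.
SAMPLES=(sample1.CEL sample2.CEL sample3.CEL sample4.CEL sample5.CEL)  # placeholders
total_kb=0

for sample in "${SAMPLES[@]}"; do
  peak_kb=$(/usr/bin/time -v Rscript run_scan.R "$sample" 2>&1 >/dev/null \
      | grep "Maximum resident set size" | awk '{print $NF}')
  echo "$sample peak: ${peak_kb} KB"
  total_kb=$(( total_kb + peak_kb ))
done

echo "average peak: $(( total_kb / ${#SAMPLES[@]} )) KB"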

Therefore two things need to happen for this issue to be resolved:

  • Any processors running on multiple samples at a time should be changed to run on a single sample at a time (can just be changed to iterating over each sample in the same processor job).
  • Two tables should be generated, one mapping MicroArray platforms to average RAM usage and one mapping organisms to RAM usage.

@Miserlou

Noting here that MultiQC can also use a large amount of memory (or at least more than we currently allocate it, as I had a few of these in the logs):

data-refinery-log-group-rich-zebra9-dev-prod log-stream-processor-docker-rich-zebra9-dev-prod 2018-05-27 10:13:59,063 i-0c749b6231acb5ac3/MainProcess data_refinery_workers.processors.utils ERROR [processor_job: 10582] [failure_reason: Shell call to MultiQC failed because: Error: [Errno 12] Cannot allocate memory\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/usr/local/bin/multiqc", line 442, in mul]: Processor function _run_multiqc failed. Terminating pipeline.


Miserlou commented Jun 15, 2018

I've submitted a PR to get better telemetry out of Nomad. We should be able to use this to reevaluate our job requirements after we run a staging crunch.

@jaclyn-taroni

Per our discussion in tech team meeting this morning — when we’re running a subset of experiments to get more information about cost, let’s make sure we do the zebrafish samples (#227), the imatinib samples (#335) and a random subset of others.

@Miserlou Miserlou self-assigned this Jul 12, 2018
@jaclyn-taroni jaclyn-taroni added the future thoughts label Jul 30, 2018
@ghost ghost removed the review label Jul 30, 2018