question - resource allocation in docker #2082

Closed
OferE opened this issue Dec 12, 2016 · 23 comments

OferE commented Dec 12, 2016

Hi,
I see that there is a strict memory allocation policy in Nomad.
Is there a way to work around it? I would like containers launched by Nomad to run as regular Docker containers without limitations (just like in Docker Swarm, for example).

Also, swap can act as a safety net (especially in containers running things like Spark, which has complicated memory allocation).
Is there a way to allow swap usage?
Enforcing resource allocation is a great feature, but it would be nice if it could be disabled for certain types of containers.


jippi commented Dec 12, 2016

AFAIK oversubscription is planned for a later release - since Nomad does bin packing, oversubscription is incredibly hard to mix in with that :)

Today there is no workaround for this behavior, and AFAIK it's a release or two (at least) away at this time.


OferE commented Dec 12, 2016

So, does this mean that I cannot use swap in my infrastructure if I choose to work with Docker and Nomad?


jippi commented Dec 12, 2016

For now, yes, I think so. I'll leave any additional comments and clarification to someone from HashiCorp :)


OferE commented Dec 12, 2016

A.
If I understand correctly, oversubscription will not solve this issue - it is more like AWS spot instances than a real solution like Docker Swarm.

B.
I see that Mesos has the same behavior - it looks like everyone copied from Google's Borg...
This design is great when you have a cluster of physical machines, but it has some disadvantages when working in the cloud.
The scheduling should be much simpler when we are in the cloud and have dedicated machines for each service. Let the user choose which machine type to use for which service, and let the containers just run there at the Docker and OS level - there is no need for Nomad to allocate resources, since the user has already decided on this. QoS is a nice feature only in data centers, not in the cloud - there it is just a limitation.

For example:
I have a Spark cluster, a Kafka cluster, etc.
For each cluster I chose different instance types up front, and I just run one container on each machine. Why should I specify resources in Nomad? Why can't I use swap for Spark, where it keeps my app stable at peak moments? It doesn't make any sense...

Docker Swarm, for example, is much simpler and solves this out of the box - there is no hardware allocation, just constraints and affinities.
This approach is better for most cloud usages.


OferE commented Dec 12, 2016

I think that adding a flag to remove all cgroup limitations would solve all the problems.
This flag should generate a warning regarding QoS and that's it - in this mode the user is in charge of QoS, not Nomad.


dadgar commented Dec 12, 2016

Hey @OferE,

Your use case is slightly different if you are doing static partitioning of nodes to types of jobs, and that is not the design goal of Nomad. Nomad is designed to be run in as resource-agnostic a way as possible. Jobs should declare what they need, and the decision of where to place them should be made by the scheduler. In order to guarantee both that there are enough resources on the chosen machine and that the placed jobs get the runtime performance they need, we do resource isolation and disable swap.

If the machine is swapping, the performance loss is significant, and in a system designed to be multi-tenant with bin-packed machines, that is unacceptable.

If you would like finer-grained control, we provide the raw_exec driver, which allows you to make these decisions. In the future there will also be pluggable drivers, so you could build your own which is less restrictive. However, for the built-in drivers we won't be making that concession.

Thanks,
Alex
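
For context on what "resource isolation and disable swap" means at the plain Docker level: a hard memory cap with no swap corresponds roughly to giving --memory and --memory-swap the same value, since --memory-swap is the combined memory+swap total. This is only an illustration with arbitrary values and an arbitrary image, not the exact flags the Nomad docker driver passes:

# Illustrative only: a 256 MB hard cap with swap effectively disabled,
# because the combined memory+swap limit equals the memory limit.
docker run -d \
  --memory 256m \
  --memory-swap 256m \
  --cpu-shares 500 \
  redis:3.2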

dadgar closed this as completed Dec 12, 2016

OferE commented Dec 12, 2016

Thanks for the info - I'll try raw_exec instead of the docker driver.


OferE commented Dec 12, 2016

I urge you to rethink the "static partitioning" use case. Dynamically allocating resources across the cluster is not suitable for all use cases. Kafka and Spark are great examples - it just won't work there.
You need dedicated machines for them, and there is no point in limiting their processes. You want to get the full efficiency of the cloud machine without worrying that you specified a resource incorrectly.


dadgar commented Dec 12, 2016

@OferE I am not sure why you think those require dedicated machines? They require dedicated resources. In those cases I would specify large resource requirements such that they are guaranteed enough resources, and as such in most cases they won't be multi-tenant.

I agree that there are use cases that require whole machines (databases, for example). To support that case we will add a resource option in the future to reserve the whole node. But for most applications this is not the case.


OferE commented Dec 12, 2016

Spark has internal logic and defaults, for example using all the cores of the machine.
Running in a cgroup environment will confuse it.

I understand your vision, and I also think that someday we will get to the point where many major products understand containerized environments and align their internal logic accordingly.

Tuning Spark's (or worse, PySpark's) memory allocation is not a trivial thing. Making it work under cgroups is too much at this time.

Also, reserving the entire instance for just one container is not good for all cases - there is always another container/process that needs to run on the instance, agents for log collection and monitoring for example. In fact, it would be very nice to limit these agents and let the main container/process run wild :-)

One more thing I would like to point out: there is also the matter of dev vs. production.
Dev environments are significantly weaker than production ones, since you want to save money.
Writing two versions of the Nomad files (to allow two different resource isolation settings) for each of them is too limiting.
For development it would also be nice to not specify resources, since developers change the instance type all the time.

Nomad is a great project - I like it much more than Mesos/Swarm/Kubernetes.


OferE commented Dec 12, 2016

If I found a magic fish that would grant me a wish, I'd ask for the following types of resource isolation:

Minimal resource allocation - make sure my container runs in a strong environment.
Maximal resource allocation - limit infra containers (monitoring/log collection).
Minimal + maximal - replicate my logic according to your design.
None - dev + static partitioning...

I would also have declarative dev and production resource isolation.

This is what a perfect world looks like :-)


OferE commented Dec 13, 2016

The raw_exec driver is not working correctly for me: stopping the job doesn't kill the container, so it is not a good workaround.


jippi commented Dec 13, 2016

Can you please share your job file and how the script is executed, if you use any shell to wrap it? It's hard to help debug without any information :)

raw_exec does work just fine, so it's probably that you need to trap a signal to make sure Docker stops the container :)


OferE commented Dec 13, 2016

Thanks - I just realized that. I didn't trap it, lol.


OferE commented Dec 13, 2016

BTW - which signal should I trap?


jippi commented Dec 13, 2016


OferE commented Dec 13, 2016

Thank you so much for your help on this!


jippi commented Dec 13, 2016

I just did a test and I got SIGTERM, though - better test for yourself :)
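
One rough way to test this for yourself (a sketch; the log file path is arbitrary) is a throwaway raw_exec script that just records every common termination signal it receives:

#!/bin/bash
# Debugging aid: append the name of each trapped signal to a log file.
for sig in SIGHUP SIGINT SIGQUIT SIGTERM; do
   trap "echo got $sig >> /tmp/nomad-signal-test.log" "$sig"
done
# Sleep in the background and wait on it, so trapped signals are handled
# as soon as they arrive instead of after the sleep finishes.
while true; do
   sleep 1 &
   wait $!
done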


OferE commented Dec 14, 2016

It seems like it doesn't work. I trapped SIGTERM and SIGINT and my script never got them.
When I send the signal myself, the script is stopped.

This is my script - the trap never gets any signal from Nomad :-(

#!/bin/bash
# Handler for the signals sent from Nomad to stop the container
my_exit()
{
   echo "killing $CID"
   docker stop --time=5 "$CID"   # try to stop it gracefully
   docker rm -f "$CID"           # remove the stopped container
}

trap 'my_exit; exit' SIGHUP SIGTERM SIGINT

# Build the docker run command from the script arguments
CMD="docker run -d"
for a in "$@"; do
   CMD="$CMD $a"
done

echo "docker wrapper: the docker command that will run is: $CMD"
echo "from here on it is the docker output:"
echo
# Actually run the command and capture the container ID
CID=$($CMD)

# docker logs is printed in the background
docker logs -f "$CID" &

# Wake up every 3 seconds so the script can react to signals
while :
   do
      sleep 3
   done


OferE commented Dec 14, 2016

I found the problem :-)
The problem is the combination of my sleep and the grace period.
I have sleep 3, and the kill_timeout default is 5 seconds - this causes my script to be killed before it can handle the signal.
Changing the sleep in my script to 1 solved the issue.

I think I will stay with sleep 3 and explicitly set kill_timeout to 45 seconds.
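
For what it's worth, the delay can also be removed entirely: bash only runs a trap handler after the current foreground command returns, so a foreground sleep 3 can postpone the handler by up to 3 seconds, while a backgrounded sleep plus wait is interrupted by the signal immediately. A sketch of an alternative wait loop for the wrapper script above (just an option alongside lowering the sleep or raising kill_timeout):

# Alternative wait loop: `wait` returns as soon as a trapped signal arrives,
# so the handler runs without waiting out the sleep interval.
while true; do
   sleep 3 &
   wait $!
done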


drscre commented Dec 16, 2016

If you don't mind building Nomad from source, there is a trivial patch for Nomad 0.5.1.

It adds a "memory_mb" docker driver option which, if set to non-zero, overrides the memory limit specified in the task resources.

https://gist.github.com/drscre/4b40668bb96081763f079085617e6056

You can allow swap in a similar way.
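
For reference, "allowing swap" at the plain Docker level is a matter of the --memory-swap flag, which sets the combined memory+swap limit; the patch above is what controls what the driver actually passes. An illustration with arbitrary values:

# Illustrative values: 256 MB of memory plus up to 256 MB of swap
# (--memory-swap is the combined memory + swap total).
docker run -d --memory 256m --memory-swap 512m redis:3.2
# --memory-swap -1 would allow unlimited swap on top of the memory limit.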


OferE commented Dec 16, 2016 via email
