Significant slow down when executed in quick succession #34
Comments
Hey @tncks0121, why do you care about the run time of isolate itself? Isn't just the run time of the user's program what matters? 😕
@hermanzdosilovic, yes, that is correct, but suppose you are running a contest where each submission needs at least 0.4 × 100 seconds just because of the sandbox (0.4 s of overhead for each of roughly 100 test cases), even when the solution itself is trivial. In that situation the queue grows very long very quickly unless you have a lot of workers.
@tncks0121, thanks for clarifying that. Now that I think about it, if you are planning to run a contest where a 0.4 s sandbox overhead is a problem, you should probably think about scaling out more, but I understand your concern. To be honest, I really hope this is a real isolate issue and not a kernel issue, because it is probably easier to fix isolate than the kernel. 😛
I will try to reproduce your results on a few of my machines - some VMs (Digital Ocean and AWS) and some bare metal, with the latest isolate version. 😊
In the meantime you can try using my API for your platform, https://api.judge0.com, and stress test it to see if it fits your needs.
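If you want to stress test the hosted API, a rough sketch is a simple loop of submissions via curl; the endpoint path, JSON field names, and language_id below are assumptions based on the public Judge0 documentation, not details from this thread, so check the API docs for the exact parameters:

```bash
# Submit a trivial C program 50 times and print the total request time.
# Endpoint, fields and language_id are assumptions -- see the Judge0 docs.
for i in $(seq 1 50); do
    curl -s -X POST "https://api.judge0.com/submissions?wait=true" \
         -H "Content-Type: application/json" \
         -d '{"source_code": "int main(){return 0;}", "language_id": 50, "stdin": ""}' \
         -o /dev/null -w "submission $i: %{time_total}s\n"
done
```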
Hi! I can't reproduce this on any of my machines, VMs or physical ones. Can you try the following:
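The concrete suggestions aren't quoted here; one plausible way to capture this kind of per-syscall timing (an assumption, not necessarily what was actually suggested) is to trace a full isolate cycle with strace:

```bash
# Trace one full isolate cycle, following children (-f), with wall-clock
# timestamps (-tt) and the time spent inside each syscall (-T).
isolate --init >/dev/null
sudo strace -f -tt -T -o isolate-run.trace isolate --run -- /bin/true
isolate --cleanup

# The slow syscalls show up as large <seconds> values at the end of lines
grep -o '<[0-9.]*>' isolate-run.trace | tr -d '<>' | sort -n | tail
```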
Or, if you are able to provide a shell to a machine that can reproduce this, I'd be happy to investigate. (It probably requires root to diagnose, though.)
@bblackham, I did what you said, and it seems this is the bottleneck. Unfortunately I don't know how to read these files, so I'm not sure what the problem is.
@tncks0121, that clone() definitely looks to be the culprit. No user code is executing between the start of the clone() and its completion, which points at the Linux kernel and something about cloning a task into a new namespace. Are you able to run operf on the affected machines? I'm not certain that operf will work in certain types of VM (I think it requires direct hardware access to the MSRs).
That sort of latency would have to be either some kind of network traffic (maybe there is a small amount of buffering, which is why the first one is okay?), a lot of memory zeroing (like, gigabytes, which would be strange), evicting something to swap, or dropping some caches. Perhaps it is triggering a call out to a really slow userspace helper.
If operf doesn't provide any information, I don't know how to diagnose further without being able to reproduce it locally. Can you help me reproduce it locally, or provide a shell to somewhere that it is reproducible?
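For reference, a rough way to do that (assuming OProfile's operf/opreport are installed and the VM exposes hardware performance counters; the isolate invocation below is illustrative, not from the thread):

```bash
# Profile the whole system while slow isolate runs are happening
sudo operf --system-wide &
OPERF_PID=$!
sleep 1   # give operf a moment to start sampling

for i in $(seq 1 10); do
    isolate --init >/dev/null
    isolate --run -- /bin/true
    isolate --cleanup
done

# Stop profiling and look at which kernel/user symbols took the time
sudo kill -INT "$OPERF_PID"
wait "$OPERF_PID" 2>/dev/null
sudo opreport --symbols 2>/dev/null | head -n 40
```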
@bblackham, I tried, but it doesn't seem to give any useful information. Maybe I did it the wrong way, since I don't know much about operf.
Anyway, I'll try to find a way to reproduce this on a fresh machine (both machines I tested are already in use).
I think that the execution speed depends on the machine; only the first one or two iterations are significantly faster (by a factor of about 6). This can be seen on both the "good" and the "bad" machine - the good one takes 0.01 s for the first iteration and then about 0.06 s, while the bad one takes 0.06 s first and then about 0.48 s, which is almost the same relative slowdown for both. My measurements on a CentOS VPS:
Right, I can reproduce it under Docker here. The killer is creating a separate networking namespace. If you pass --share-net, so that isolate keeps the parent's network namespace instead of creating a new one, the problem does not occur.
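A quick way to compare the two modes (a sketch; it assumes GNU time at /usr/bin/time, the default box, and an isolate version that supports --share-net):

```bash
# Default behaviour: a new network namespace is created for every run
for i in $(seq 1 5); do
    isolate --init >/dev/null
    /usr/bin/time -f "run $i (new netns): %e s" isolate --run -- /bin/true
    isolate --cleanup
done

# Same loop, but sharing the parent's network namespace
for i in $(seq 1 5); do
    isolate --init >/dev/null
    /usr/bin/time -f "run $i (shared netns): %e s" isolate --run --share-net -- /bin/true
    isolate --cleanup
done
```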
Confirming that --share-net avoids the slowdown for me as well.
Thanks for confirming @stefano-maggiolo. Some extra data points: I never saw it in my VM tests earlier because I was running a single-CPU VM (where the RCU slowdown issues never occur). On a dual-CPU VM, without any iptables modules loaded, there is no issue, but as soon as the iptables modules (which pull in nf_conntrack) are loaded, the slowdown appears. For @tncks0121 and anyone else affected, try blacklisting nf_conntrack (add a "blacklist nf_conntrack" line to a file under /etc/modprobe.d/ and reboot).
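A minimal sketch of that workaround (the file name under /etc/modprobe.d/ is arbitrary):

```bash
# See whether the connection-tracking module is currently loaded
lsmod | grep conntrack

# Prevent it from being auto-loaded in future
echo "blacklist nf_conntrack" | sudo tee /etc/modprobe.d/blacklist-nf_conntrack.conf

# Optionally try to unload it right away; this fails if something
# (e.g. loaded iptables rules) still depends on connection tracking
sudo modprobe -r nf_conntrack
```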
Closing this issue as I believe it is definitely a Linux kernel bug and there's nothing isolate (or any sandbox that uses Linux network namespaces for network isolation) can do about it. A potential workaround is given in my previous comment (blacklisting nf_conntrack). If this workaround solves the issue for you, please confirm here for posterity. Thanks!
I am reopening issue #29 since I am really concerned about this problem. On my server, the delay is about 0.4-0.5 s, which is significant for contests where the usual time limit is 1 second.
Before I saw that issue, I had decided to use another sandbox just for C/C++ (while keeping isolate for the other languages, since those are slow by themselves anyway), but it would be much nicer to use isolate for all languages.
I'm not sure whether this is a kernel issue. We measured the time of successive isolate executions on two machines on Linode, using this bash script:
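A minimal sketch of such a timing loop (the original script isn't shown here; the iteration count and the /bin/true payload are assumptions):

```bash
#!/bin/bash
# Time successive isolate init/run/cleanup cycles
for i in $(seq 1 20); do
    isolate --cleanup >/dev/null 2>&1
    isolate --init    >/dev/null
    start=$(date +%s.%N)
    isolate --run -- /bin/true >/dev/null 2>&1
    end=$(date +%s.%N)
    echo "run $i: $(echo "$end - $start" | bc) s"
done
isolate --cleanup >/dev/null 2>&1
```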
Machine 1: working badly
Machine 2: working nicely
I also ran the script in a Docker container (on my Mac): working badly as well.
Maybe the difference is just Linode's fault, but I think the difference is notable.