Constant Memory Growth Regardless of Snapshot #875

Closed
james-andrewsmith opened this issue Jul 3, 2014 · 31 comments

@james-andrewsmith

Hello Etcd Team/Community!

Firstly, how awesome is etcd - I feel it's life changing awesome.

But I seem to have found a bug on win64 using the latest build: when the same key is updated constantly, memory continues to grow, even after the log / snapshot is written. I only have about 3 records in etcd, but they are updated with an extended TTL every 60 seconds. After a couple of days I noticed etcd was using a lot of RAM.

I wrote a quick test to show the scenario:

First I start etcd:
etcd -v -snapshot=true -snapshot-count=100 -name=testsnapshot -data-dir=testsnapshot

Then run an exe which:

  1. Creates a key with a ttl and a value.
  2. Loops constantly on the same key, extending the TTL and updating the value (keeping it the same)

I notice the snapshot file just replaces itself and the log continues to grow and, most unfortunately, the RAM usage just keeps growing.

Here is a link to:

  1. The contents of the data directory
  2. The test exe
  3. The C# source

https://www.dropbox.com/sh/fh8m091aip66fwz/AACo3OFrxs_iXVu3dXjPmm9Ha

Let me know if there is any further information I can provide.

@james-andrewsmith
Author

FYI - I have retested this with the latest release (https://github.com/coreos/etcd/releases/tag/v0.4.5) and have found it to still be an issue.

Although it does appear that memory usage grows slightly slower, it still does grow indefinitely. I have added the logs and snapshot directory for this test to the dropbox link above.

@willejs

willejs commented Jul 17, 2014

+1 on this. What versions are people running in production that are stable?

@bmizerany
Contributor

@james-andrewsmith Thank you for reporting this. I'm working on reproducing. Can you throw the source in a gist http://gist.github.com or in something other than a 7z file? I can't open that without installing dubious apps on my mac.

@bmizerany
Contributor

@willejs What OS are you running etcd on?

@james-andrewsmith
Author

@bmizerany Thank you for looking into it! I totally understand why you wouldn't want to install said software.

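Here are some alternative links to the zips within:

http://archfashionsea.blob.core.windows.net/public/etcd/etcd-4.5-testsnapshot.7z
http://archfashionsea.blob.core.windows.net/public/etcd/EtcdWaitMemoryLeakSource.7z
http://archfashionsea.blob.core.windows.net/public/etcd/MemoryLeakTest.7z
http://archfashionsea.blob.core.windows.net/public/etcd/testsnapshot.7z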

All the tests have been built and run on Windows 7 / Windows Server 2012, let me know if there is any further information I can provide.

@bmizerany
Contributor

I've been looking into this and found there appears to be a slight leak. I'm
still confirming. There is a bench running that I will check the output
of in the morning. Stay tuned.

On Thursday, July 17, 2014, James Andrew-Smith notifications@github.com
wrote:

@bmizerany https://github.com/bmizerany Thank you for looking into it!
I totally understand why you wouldn't want to install said software.

Here are some alternative links to the zips within:

http://archfashionsea.blob.core.windows.net/public/etcd/etcd-4.5-testsnapshot.7z

http://archfashionsea.blob.core.windows.net/public/etcd/EtcdWaitMemoryLeakSource.7z

http://archfashionsea.blob.core.windows.net/public/etcd/MemoryLeakTest.7z

http://archfashionsea.blob.core.windows.net/public/etcd/testsnapshot.7z

All the tests have been built and run on Windows 7 / Windows Server 2012,
let me know if there is any further information I can provide.



@james-andrewsmith
Author

@bmizerany That's awesome news! Thank you - I have such an ugly hack in production to work around this.

@bmizerany
Contributor

We tracked it down! This should be fixed by #900. Please run your tests again and let us know if we nailed it.

@james-andrewsmith
Author

@bmizerany Great stuff! I think you've nailed the main leak: usage grows much more slowly now, but unfortunately it still grows indefinitely.

Below are the results as I ran against the current master. (For watchers each iteration is a set on the same key but with a new TTL)

75k = 165mb
100k = 215mb
165k = 340mb
215k = 442mb

Let me know if there is anything else I can do to assist!
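
As a rough check on these figures, the growth rate per iteration can be worked out directly. A minimal Go sketch, assuming "mb" means megabytes of 1024 KB and that the left-hand numbers are completed sets; the `sample` type and `kbPerIteration` helper are names made up for this illustration:

```go
package main

import "fmt"

// sample is one reading from the list above (hypothetical type for this sketch).
type sample struct {
	iterations int     // number of set operations completed
	memoryMB   float64 // memory in use at that point, in MB
}

// kbPerIteration returns the average memory growth between two samples
// in KB per set, assuming 1 MB = 1024 KB.
func kbPerIteration(a, b sample) float64 {
	return (b.memoryMB - a.memoryMB) * 1024 / float64(b.iterations-a.iterations)
}

func main() {
	samples := []sample{
		{75000, 165},
		{100000, 215},
		{165000, 340},
		{215000, 442},
	}

	// Growth rate over each consecutive interval.
	for i := 1; i < len(samples); i++ {
		a, b := samples[i-1], samples[i]
		fmt.Printf("%dk -> %dk: %.2f KB per set\n",
			a.iterations/1000, b.iterations/1000, kbPerIteration(a, b))
	}

	// Growth rate over the whole run.
	first, last := samples[0], samples[len(samples)-1]
	fmt.Printf("overall: %.2f KB per set\n", kbPerIteration(first, last))
}
```

On the numbers above this comes out at roughly 2 KB of additional memory per set, and the rate is about the same in every interval, which is consistent with steady per-request growth rather than a one-off allocation.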

@bmizerany
Contributor

Can you provide new code? My tests were based on the code you attached - in which the watch was commented out iirc.

@james-andrewsmith
Author

@bmizerany It's the same code, here is the relevant snippet from the previous attachment

static void UpdateRecordWithTtl()
{
    if (source.IsCancellationRequested) return;

    // overwriting an existing key with a new TTL
    etcd.Set("update/somethingelse", "testing", 60, true, "testing");
    lock (sync)
    {
        count++;
        Console.Clear();
        Console.WriteLine("TTL Iteration: " + count);
    }
    // wait for 10 milliseconds, then update again
    Task.Delay(10, source.Token)
        .ContinueWith((_) => UpdateRecordWithTtl());
}

@bmizerany
Contributor

@james-andrewsmith, you said: "For watchers each iteration is a set on the same key but with a new TTL". Who/what are the watchers? It's not clear in your code.

@bmizerany
Contributor

Also, we haven't seen that kind of growth in memory in our tests. Is there anything else you're doing in your environment/setup that you may have left out?

@james-andrewsmith
Author

@bmizerany Sorry Blake. I am suffering from man flu and have a cloudy mind.

I meant watchers as in people watching this thread. I see now how confusing that must have been. My apologies.

I did a build from master yesterday, perhaps I am stuffing something up? Can you attach a zip of the etcd.exe you did the tests on so I can verify I am testing against the right version?

@bmizerany
Contributor

@james-andrewsmith I do not have an exe. I don't have easy access to Windows. The SHA I built was 5072772 which can be found at https://github.com/bmizerany/etcd-team/tree/benchwip

@james-andrewsmith
Author

@bmizerany Just downloaded that build - we've still got a leak there.

To rule out the test exe / .NET client code, I've created the following curl commands to give the same effect. After they ran, the etcd process had grown to use 160mb+.

curl -L http://127.0.0.1:4001/v2/keys/update -XPUT -d dir=true
curl -L http://127.0.0.1:4001/v2/keys/update/somethingelse?[1-100000] -XPUT -d value=bar -d ttl=60

Hope this makes it easier to test! (And the test shows the same thing on your non-windows environment). Otherwise I am happy to fire up a VM on Azure with the binary installed and send you the credentials.
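
For anyone without curl to hand, the second command can be approximated in Go. This is a minimal sketch, assuming etcd is listening on 127.0.0.1:4001 as in the commands above. `newSetRequest` is a helper name invented here, and unlike the curl URL range it does not append a `?N` query string to each request:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// newSetRequest builds a PUT against the v2 keys endpoint that sets key to
// value with the given TTL in seconds, form-encoded like curl's -d flags.
func newSetRequest(baseURL, key, value string, ttl int) (*http.Request, error) {
	form := url.Values{
		"value": {value},
		"ttl":   {strconv.Itoa(ttl)},
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/v2/keys/"+key,
		strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	const baseURL = "http://127.0.0.1:4001" // address used by the curl commands
	const iterations = 100000               // same count as the curl URL range

	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; i <= iterations; i++ {
		req, err := newSetRequest(baseURL, "update/somethingelse", "bar", 60)
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			// Stop at the first failure, e.g. when etcd is not running.
			fmt.Println("request failed:", err)
			return
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	fmt.Println("completed", iterations, "updates")
}
```

Watching the etcd process's memory while this runs should show whether the growth depends on the client being used.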

@bmizerany
Contributor

I fired up:

$ ./bin/etcd --version
etcd version 0.4.5

and ran:

curl -L http://127.0.0.1:4001/v2/keys/update/somethingelse?[1-100000] -XPUT -d value=bar -d ttl=60

And only hit ~40mb of RSS, max:

[image: screenshot, 2014-07-23 10:59 am]

@james-andrewsmith
Copy link
Author

@bmizerany Why can't it ever be easy? Where to from here? Should I set up a demo VM and send you the credentials? Is there further diagnostic information I can send from a Windows environment? Happy to help in any way that I can.

@bmizerany
Contributor

@james-andrewsmith Can you run gcvis etcd and hit it with the curl command above, and post a screenshot of the results here?

https://github.com/davecheney/gcvis

@james-andrewsmith
Author

@bmizerany Done!

[image: gcvis screenshot]

@bmizerany
Contributor

What event causes it to flatline? Is it the completion of the curls?

@james-andrewsmith
Author

@bmizerany That's right (I got distracted by intense office foosball). Even after the release it's using 140mb.

@philips
Contributor

philips commented Jul 29, 2014

I am going to release 0.4.6 that fixes the timer leak in the next hour or so. It would be great to see if this resolves the leak for you!

@james-andrewsmith
Author

@philips Thank you - unfortunately it hasn't resolved the leak.

After the above curls ETCD has 160mb allocated. Attached is a gcvis for the run (the flatline at the end is because the curls ended and I was away from desk).

[image: gcvis output for the run]

@philips
Contributor

philips commented Jul 30, 2014

@james-andrewsmith Hrm, OK. Is there any chance you can build from source on Windows with Go 1.3? It is a crazy idea but maybe something is wrong with the cross compiler setup or go 1.2?

@james-andrewsmith
Author

@philips Happy to! I sort of wished it was a magic bullet like that, but unfortunately not: we're still seeing the growth with Go 1.3 using commit 5072772.

[image]

Happy to try anything else!

@bmizerany
Contributor

@james-andrewsmith Can you try patching main.go with:

diff --git a/main.go b/main.go
index e7283bc..b722581 100644
--- a/main.go
+++ b/main.go
@@ -7,6 +7,8 @@ import (
    "net"
    "net/http"
    "os"
+   "os/signal"
+   "runtime/debug"
    "time"

    "github.com/coreos/etcd/config"
@@ -68,7 +70,20 @@ func serve(who string, addr string, tinfo *config.TLSInfo, cinfo *ehttp.CORSInfo
        log.Fatal("unsupported http scheme", tinfo.Scheme())
    }

-   h := &ehttp.CORSHandler{handler, cinfo}
-   s := &http.Server{Handler: h, ReadTimeout: readTimeout, WriteTimeout: writeTimeout}
-   log.Fatal(s.Serve(l))
+   done := make(chan os.Signal)
+   signal.Notify(done, os.Interrupt, os.Kill)
+
+   go func() {
+       h := &ehttp.CORSHandler{handler, cinfo}
+       s := &http.Server{Handler: h, ReadTimeout: readTimeout, WriteTimeout: writeTimeout}
+       log.Fatal(s.Serve(l))
+       done <- os.Kill
+   }()
+
+   <-done
+   f, err := os.Create("heapdump")
+   if err != nil {
+       log.Fatal(err)
+   }
+   debug.WriteHeapDump(f.Fd())
 }

and zip up the heapdump and exe for me?
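
The same idea can be tried outside etcd. This is a minimal standalone sketch, assuming Go 1.3 or later, where `runtime/debug.WriteHeapDump` is available. The `writeHeapDump` helper and the `-after` flag are illustrative choices, not part of etcd. Unlike the patch, it uses a buffered signal channel, as the `os/signal` documentation asks for, and listens only for `os.Interrupt`, since `os.Kill` cannot be caught:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/signal"
	"runtime/debug"
	"time"
)

// writeHeapDump writes a dump of the Go heap to path and returns the
// size of the resulting file in bytes.
func writeHeapDump(path string) (int64, error) {
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	// WriteHeapDump suspends all goroutines until the dump is fully written.
	debug.WriteHeapDump(f.Fd())

	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	// -after is a hypothetical flag: dump after this delay unless Ctrl-C comes first.
	after := flag.Duration("after", time.Second, "write the dump after this delay")
	flag.Parse()

	// Buffered so a signal delivered before we start receiving is not dropped.
	done := make(chan os.Signal, 1)
	signal.Notify(done, os.Interrupt)

	// Wait for Ctrl-C or for the delay to elapse, whichever happens first.
	select {
	case <-done:
	case <-time.After(*after):
	}

	n, err := writeHeapDump("heapdump")
	if err != nil {
		fmt.Fprintln(os.Stderr, "heap dump failed:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote heapdump (%d bytes)\n", n)
}
```

In a long-running server the wait would sit alongside the serving goroutine, as in the patch, so the dump is taken only once the process has been left running long enough to show the growth.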

@bmizerany
Contributor

Please let it run until you see the bloat before killing.

@james-andrewsmith
Author

@bmizerany Sure thing - will let you know how it goes

@yichengq added the bug label Aug 28, 2014

@james-andrewsmith
Author

Hi Guys,

I just tested the 0.5.0 alpha release and this appears to be resolved!

Thank you for your assistance debugging and for continuing to throw such effort into improving ETCD.

Greatly appreciated!

Regards,
James

@jonboulle
Contributor

@james-andrewsmith glad to hear it! Thanks for following up.


6 participants