
Classes loaded by Pods are leaked #268

Closed

stephenbrady opened this issue Aug 7, 2015 · 20 comments

@stephenbrady
Contributor

This repo reproduces a class leak on each invocation of test by boot watch:

https://github.com/stephenbrady/boot-pod-leak

In writing custom boot tasks, I see the same kind of class leak when using either boot.pod/make-pod with boot.pod/destroy-pod or boot.pod/pod-pool with :refresh.
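
For context, a minimal sketch of the two patterns described above, assuming nothing beyond the documented boot.pod API (the println bodies are purely illustrative):

(require '[boot.pod :as pod]
         '[boot.core :as core])

;; pattern 1: create a fresh pod, do the work, destroy it
(let [p (pod/make-pod (core/get-env))]
  (pod/with-eval-in p
    (println "work happens inside the pod"))
  (pod/destroy-pod p))

;; pattern 2: keep a pod pool and ask it for a fresh pod with :refresh
(def pool (pod/pod-pool (core/get-env)))

(let [p (pool :refresh)]
  (pod/with-eval-in p
    (println "work happens inside the pooled pod")))

Either way, every new pod loads its own copy of Clojure, which is why a leaked pod shows up as thousands of retained classes.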

@martinklepsch
Member

Hey @stephenbrady thanks for the report!

How exactly did you detect this memory leak? Did you get java.lang.OutOfMemoryError: Java heap space when your heap filled up or did you get java.lang.OutOfMemoryError: PermGen space errors?

@stephenbrady
Contributor Author

Sure. I detected it by profiling with jvisualvm: with each invocation of test, the number of classes in memory grows by roughly the number of classes you would expect from loading the Clojure library again. The biggest symptom is that each successive invocation of test takes longer to run as the class count grows and slows down the JVM.

This is with JDK 8, so no PermGen memory issues.
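
As a lighter-weight alternative to a profiler, the loaded-class count can also be polled from inside the process via the class-loading MXBean; a minimal sketch, not part of the linked repo:

(import 'java.lang.management.ManagementFactory)

;; print the JVM's class-loading counters; call between `test` invocations
(defn print-class-counts []
  (let [bean (ManagementFactory/getClassLoadingMXBean)]
    (println "currently loaded:" (.getLoadedClassCount bean)
             "total loaded:"     (.getTotalLoadedClassCount bean)
             "unloaded:"         (.getUnloadedClassCount bean))))

If the "currently loaded" number keeps climbing across invocations while "unloaded" stays flat, classes are being retained.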

@martinklepsch
Member

@stephenbrady Thanks, that confirms my understanding.

@martinklepsch
Member

@alandipert:

@martinklepsch, @micha re: mem core.async might be a culprit, https://github.com/projectodd/shimdandy#preventing-memory-leaks

I read that page, but as far as I can tell Boot does not use/depend on core.async, and searching for classes in the YourKit class browser doesn't list any containing "async".

The program I use for testing is also very minimal and does not load any additional deps:

#!/usr/bin/env boot

(require '[boot.pod :as pod])

(defn -main [& args]
  (loop [i 0]
    (let [c-pod (pod/make-pod (get-env))]
      (boot.util/info "New pod taken: %s\n"  c-pod)
      (pod/with-eval-in c-pod
        (boot.util/info "Hi from pod no. %s.\n" ~i))
      (pod/destroy-pod c-pod)
      (recur (inc i)))))

Used with these $BOOT_JVM_OPTIONS, it causes the program to quit after about 16 loops:

-Xmx100m -XX:MaxMetaspaceSize=200m -client -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xverify:none

@tobias
Contributor

tobias commented Oct 23, 2015

I believe I have figured out what is causing this: boot is triggering a call to either clojure.lang.Var.cloneThreadBindingFrame (which happens if clojure.core/binding-conveyor-fn is called, either via clojure.core/future or one of the clojure.core/send* fns) or clojure.lang.Var.getThreadBindings (which happens if clojure.core/get-thread-bindings is called, most likely via clojure.core/bound-fn). If either of those Var methods is called, Var.dvals.get() is called, which sets a ThreadLocal on the calling thread. In the test case, we're doing all of the pod work on a single thread (the main thread), so we end up creating those thread locals on that thread. If we had a single copy of clojure.lang.Var loaded, this would be fine, since there would only be one thread local, but we end up with one per pod, which causes retention of the pod's class loader, preventing any classes loaded in the pod from being unloaded.

Code that triggers either Var method call is perfectly fine Clojure, and we can't prevent it (nor do we want to). One fix would be to wrap each shim in a thread, where all invokes against the shim occur on that thread. Then we dispose of the thread once we are done with the shim.

The following test case demonstrates that, and runs indefinitely for me:

#!/usr/bin/env boot

(require '[boot.pod :as pod])

(defn -main [& args]
  (loop [i 0]
    (doto
        (Thread.
          #(let [c-pod (pod/make-pod (get-env))]
             (boot.util/info "New pod taken: %s\n"  c-pod)
             (pod/with-eval-in c-pod
               (boot.util/info "Hi from pod no. %s.\n" ~i))
             (pod/destroy-pod c-pod)))
      .start
      .join)
    (recur (inc i))))

The reason I haven't seen this with ShimDandy before is likely that I never created many shims on the same thread.

I think I can fix this internally in ShimDandy by adding a thread per shim, with a queue passing actions to that thread for each invoke. I don't want to add that context-switch overhead to all usages of ShimDandy, but I could add an option to newRuntime() that lets you use the thread-isolated implementation. I'll also need to add a shutdown() method to the shim interface that stops the shim's thread. I'd probably add a call to clojure.core/shutdown-agents there as well, just so users don't have to do that themselves.
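
To make the proposed design concrete, here is a hedged Clojure sketch of the thread-per-shim idea described above (not ShimDandy's actual implementation): every invoke is handed to a dedicated worker thread through a queue, so any ThreadLocals end up on a thread that can be discarded along with the shim.

(import 'java.util.concurrent.LinkedBlockingQueue)

;; a worker whose thread runs every submitted zero-arg fn; shutting it down
;; lets the thread (and any ThreadLocals created on it) be garbage collected
(defn isolated-worker []
  (let [queue  (LinkedBlockingQueue.)
        thread (doto (Thread.
                       (fn []
                         (loop []
                           (let [action (.take queue)]
                             (when-not (= ::shutdown action)
                               (action)
                               (recur))))))
                 .start)]
    {:submit!   (fn [f] (.put queue f))
     :shutdown! (fn []  (.put queue ::shutdown) (.join thread))}))

In this scheme, all calls into a given shim would go through :submit!, and :shutdown! would run when the shim is disposed of.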

@micha
Contributor

micha commented Oct 23, 2015

@tobias that sounds great, thanks for the detailed explanation! In the destroy-pod function in boot we also call .close() on the shim classloader; I guess that's not really necessary?

@tobias
Contributor

tobias commented Oct 23, 2015

@micha I'll see if I can get out a ShimDandy release today. It's probably a good idea to call .close(), since it closes any open files (according to the URLClassLoader javadoc).

@tobias
Contributor

tobias commented Oct 23, 2015

#314 should fix it. I went with the simpler solution of ensuring Var.dvals was removed when we are done with an invoke call. That PR also uses the new .close() method on the shim, which invokes shutdown-agents for you.
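
For readers curious what removing Var.dvals amounts to, here is a hedged Clojure sketch of the idea; the actual fix lives in ShimDandy's invoke path, not in user code. It assumes the pod object exposes its class loader (boot's destroy-pod closes it, per the discussion above); the copy of clojure.lang.Var that matters is the one loaded by that class loader.

;; after calling into a pod on this thread, clear the pod's Var.dvals
;; ThreadLocal so the thread no longer pins the pod's class loader
(defn remove-pod-dvals! [pod]
  (let [var-class   (.loadClass (.getClassLoader pod) "clojure.lang.Var")
        dvals-field (doto (.getDeclaredField var-class "dvals")
                      (.setAccessible true))
        ^ThreadLocal dvals (.get dvals-field nil)]
    (.remove dvals)))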

martinklepsch pushed a commit that referenced this issue Oct 24, 2015
1.2.0 removes an avenue for leaking memory, and provides a close method
to call `shutdown-agents` on your behalf.
@martinklepsch
Member

Closed via #314

martinklepsch pushed a commit that referenced this issue Oct 26, 2015
1.2.0 removes an avenue for leaking memory, and provides a close method
to call `shutdown-agents` on your behalf.
micha added a commit that referenced this issue Oct 27, 2015
@stephenbrady
Contributor Author

Thanks for the attention on this, and sorry I didn't follow up before the release. I tested this again with boot 2.4.2. The class leak has been significantly mitigated but not resolved: new classes are still being created at a rate of ~2,000 per watch invocation, compared to ~10,000 before. The minimal cases above probably help isolate aspects of the leak, but they don't fully model the case I reported, which is of the form boot watch foo-task, where foo-task creates and destroys a pod (whether via a pod pool or not).

Please see this repo again: https://github.com/stephenbrady/boot-pod-leak.

The difference between the reproduction cases suggests that watch may be directly or indirectly involved.

@martinklepsch
Member

@stephenbrady unless you supply a JVM option like -XX:MaxMetaspaceSize=100m, the number of loaded classes will grow until it hits the system's natural memory limits.

Probably good to document this on the JVM Options page.
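
A hedged example of what such documentation might show, e.g. in boot.properties (the values are illustrative, combining the metaspace cap mentioned above with a generous heap):

BOOT_JVM_OPTIONS=-Xmx2g -XX:MaxMetaspaceSize=200m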

@stephenbrady
Contributor Author

Sorry, but that's not what I observe in the repo I published. All setting that JVM option does is trigger an OOM exception in Java 8 after enough classes have been created and not released. I still see significant class leakage; GC is not releasing those classes when under pressure. Since the JVM chooses to OOM instead of pausing to GC those classes, there's a class leak.

@martinklepsch
Member

I can replicate the behaviour I described (i.e. classes are GCd) with your project (ref 9a1593a).

Also just to double check - is the output of boot -V correctly reporting v2.4.2?


EDIT: No need to try the steps below unless you want to test without having to run touch a bunch of times.

Can you try putting this into a file pod-loop and running chmod +x pod-loop; ./pod-loop?

#!/usr/bin/env boot

(require '[boot.pod :as pod])

(defn -main [& args]
  (println "Hello")
  (loop [i 0]
    (let [c-pod (pod/make-pod (get-env))]
      (boot.util/info "New pod taken: %s\n"  c-pod)
      (pod/with-eval-in c-pod
        (boot.util/info "Hi from pod no. %s.\n" ~i))
      (pod/destroy-pod c-pod)
      (recur (inc i)))))

@stephenbrady
Contributor Author

Hmm, an old version of boot was sneaking in as I was switching between repos. Sigh...sorry for the churn.

@martinklepsch
Member

No worries! 👍

@tobias
Contributor

tobias commented Nov 23, 2015

@stephenbrady - I ran your test app with a while loop in the shell touching core.clj every half second, and it quickly reached a steady state and did not run out of memory (at least not yet):

[screenshot: profiler graph of class count/memory reaching a steady state]

This was with:

$ boot --version
#https://github.com/boot-clj/boot
#Mon Nov 23 17:51:23 EST 2015
BOOT_JVM_OPTIONS=-client -XX\:+TieredCompilation -XX\:TieredStopAtLevel\=1 -Xmx2g -XX\:+UseConcMarkSweepGC -XX\:+CMSClassUnloadingEnabled -Xverify\:none
BOOT_CLOJURE_NAME=org.clojure/clojure
BOOT_VERSION=2.4.2
BOOT_CLOJURE_VERSION=1.7.0
#App version: 2.4.2

$ java -version
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)

I'll let it run for a few hours and see if the behavior changes. Is there anything else I could try to help recreate it?

@tobias
Contributor

tobias commented Nov 23, 2015

Oops, I see y'all have figured it out. Ignore my last comment then :)

@stephenbrady
Contributor Author

Yep, I finally got the sawtooth graph too after I pinned down the boot configuration on my machine. Thanks again.

@arichiardi
Contributor

Hello folks, the problem is still there when it comes to tasks that use pods (like boot-test) and don't properly clean up executor threads and friends. In the linked PR, Micha and I were thinking of adding destroy-hooks in order to allow custom code to run at pod destruction.

Does it sound like a good idea?

The problem for me happens when I run boot watch test in my app; I quickly reach OutOfMemoryError: Metaspace.
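
Pending such an API, a hedged sketch of what a destroy-hook could look like in task code today (destroy-pod-with-hooks is a hypothetical helper, not part of boot):

(require '[boot.pod :as pod])

;; evaluate cleanup forms inside the pod, then destroy it; hook forms are
;; quoted expressions such as '(shutdown-agents) or a task-specific,
;; hypothetical '(my.task/stop-executors!)
(defn destroy-pod-with-hooks [p & hook-forms]
  (doseq [form hook-forms]
    (pod/with-eval-in p ~form))
  (pod/destroy-pod p))

;; usage:
;; (destroy-pod-with-hooks c-pod '(shutdown-agents))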
