Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing large object to file is slow #43

Open
borkdude opened this issue Mar 29, 2019 · 15 comments
Open

Writing large object to file is slow #43

borkdude opened this issue Mar 29, 2019 · 15 comments

Comments

@borkdude
Copy link

borkdude commented Mar 29, 2019

Problem:

Writing a large object to a file with transit-clj was slower than I expected. It takes about 1 second with transit-clj and roughly 1/10th of that time with regular spit.

Test data:

test.edn.zip
Extract to test.edn.

Repro:

clj -Sdeps '{:deps {com.cognitect/transit-clj {:mvn/version "0.8.313"}}}'
(require '[clojure.edn :as edn])
(def edn (edn/read-string (slurp "test.edn")))
(count (keys edn)) ;;=> 799
(require '[cognitect.transit :as transit])
(require '[clojure.java.io :as io])
(def writer (transit/writer (io/output-stream (io/file "transit.json")) :json))
(time (transit/write writer edn)) ;; prints:
"Elapsed time: 1151.116438 msecs"

While writing the same EDN to a file (printing with str is much faster):

(time (with-open [fos (java.io.FileOutputStream. "/tmp/foo.edn")] (let [w (io/writer fos)] (.write w (str edn)) (.flush fos)))) ;; prints:
"Elapsed time: 73.316135 msecs"

Flamegraphs

Created with https://github.com/clojure-goes-fast/clj-async-profiler

Flamegraph of transit:
transit-flamegraph.svg.zip

Flamegraph of EDN to file:
edn-flamegraph.svg.zip

@strssndktn
Copy link

strssndktn commented Jul 3, 2020

@borkdude I investigated this a bit and it seems that transit calls .flush overly aggressive, thus forcing the data to be flushed out to disk on every element (I am not completely sure how and when .flush gets called or what are the reasons for that). As an experiment, could you put a proxy in between your output-stream and transit/writer overwriting .flush with a noop function?

I think it makes sense if transit-java calls .flush only once for each top-level write.

@borkdude
Copy link
Author

borkdude commented Jul 3, 2020

@strssndktn I worked around this by first writing to a ByteArrayOutputStream for which .flush is a no-op and that helped a lot.

@RokLenarcic
Copy link

@borkdude writing to BAOS first will require that you hold the whole output in memory which is undesireable. Instead you can try wrapping the output stream into your own OutputStream subclass that makes .flush a no-op.

@tonsky
Copy link

tonsky commented Jul 21, 2023

I’m hitting the same issue. This did the trick for me:

(defn no-flush-output-stream [^OutputStream os]
  (proxy [BufferedOutputStream] [os]
    (flush [])
    (close []
      (.flush os)
      (.close os))))

@tonsky
Copy link

tonsky commented Jul 25, 2023

UPD: that actually should’ve been

(defn no-flush-output-stream [^OutputStream os]
  (proxy [BufferedOutputStream] [os]
    (flush [])
    (close []
      (proxy-super flush)
      (proxy-super close))))

Works like a charm

@ericdallo
Copy link

@tonsky I just updated clojure-lsp to use your suggestion and improved memory usage indeed for huge maps, thanks for that!

@borkdude
Copy link
Author

Thanks. I needed to add a couple more type hints to make it work without reflection:

(defn no-flush-output-stream
  "See https://github.com/cognitect/transit-clj/issues/43#issuecomment-1650341353"
  ^java.io.OutputStream [^java.io.OutputStream os]
  (proxy [java.io.BufferedOutputStream] [os]
    (flush [])
    (close []
      (let [^java.io.BufferedOutputStream this this]
        (proxy-super flush)
        (proxy-super close)))))

@tonsky
Copy link

tonsky commented Jul 27, 2023

Did the exact same change yesterday myself! Look at that, two senior Clojure devs got 10-line function right after just 3 attempts!

@timewald
Copy link
Contributor

timewald commented Jul 28, 2023 via email

@tonsky
Copy link

tonsky commented Jul 28, 2023

Well, the problem is not that transit flushes at the end of writing an entity. Problem is that it flushes multiple times in the middle of writing entity. I don’t think that could serve any purpose, it must’ve been an implementation bug

@timewald
Copy link
Contributor

timewald commented Jul 28, 2023 via email

@tonsky
Copy link

tonsky commented Jul 28, 2023

Why is it preferable from network perspective? All these maps are wrapped in a list, and you can’t read part of the list on the other side, can you?

@timewald
Copy link
Contributor

timewald commented Jul 28, 2023 via email

@tonsky
Copy link

tonsky commented Jul 28, 2023

Sure, but that’s why you wrap your OutputStream in BufferedOutputStream. It’ll buffer enough for you and flush itself when buffer runs out

@timewald
Copy link
Contributor

timewald commented Jul 28, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants