-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: overflowing qual value in vcf #268
Conversation
Codecov Report
@@ Coverage Diff @@
## master #268 +/- ##
=======================================
Coverage 88.85% 88.85%
=======================================
Files 78 78
Lines 6442 6444 +2
Branches 454 454
=======================================
+ Hits 5724 5726 +2
Misses 264 264
Partials 454 454
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your PR, @nuggetoriniku !
Before reviewing the code, I'd like to ask for your help with the followings:
- Please add a test code for the modification.
- Please publish your public key which you used to sign your commits.
- Would you provide more detailed information in your PR comments? For example, how this field is specified in the hts-specs, what is the difference between the spec and the current implementation, and whether there's anything else needs to be modified, etc.
bedbed0
to
7fb069a
Compare
Thank you for your comment. @alumi
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the issue!
I agree that format
is convenient here, but it is somewhat too generic and I'm concerned about the overhead of calling it in a tight loop (especially, the overhead for parsing the format string and the memory allocation of arrays allocated for varargs invocation). I think cljam carefully avoids using such a "rich" function in tight loops.
How about using DecimalFormat
instead of format
? It’s more performant in time and is also less memory consuming than format
:
(require '[criterium.core :as cr])
(cr/quick-bench (format "%.0f" 6e9))
;; Evaluation count : 1777980 in 6 samples of 296330 calls.
;; Execution time mean : 336.902228 ns
;; Execution time std-deviation : 15.432298 ns
;; Execution time lower quantile : 321.841997 ns ( 2.5%)
;; Execution time upper quantile : 352.636071 ns (97.5%)
;; Overhead used : 1.831502 ns
(import '[java.text DecimalFormat])
(def format-decimal
(let [fmt (DecimalFormat. "0")]
(fn [^String s]
(.format fmt s))))
(cr/quick-bench (format-decimal 6e9))
;; Evaluation count : 2697042 in 6 samples of 449507 calls.
;; Execution time mean : 223.626072 ns
;; Execution time std-deviation : 3.167582 ns
;; Execution time lower quantile : 219.338954 ns ( 2.5%)
;; Execution time upper quantile : 226.908426 ns (97.5%)
;; Overhead used : 1.831502 ns
As for memory consumption, you can see how much memory will be used with format
by enabling -XX:+PrintGCDetails
or so:
;; format causes GC pretty much
(dotimes [_ 1e6] (format "%.0f" 6e9))
;; [104.742s][info ][gc,start ] GC(4) Pause Young (Normal) (G1 Evacuation Pause)
;; [104.742s][info ][gc,task ] GC(4) Using 9 workers of 9 for evacuation
;; [104.745s][info ][gc,phases ] GC(4) Pre Evacuate Collection Set: 0.1ms
;; [104.745s][info ][gc,phases ] GC(4) Merge Heap Roots: 0.1ms
;; [104.745s][info ][gc,phases ] GC(4) Evacuate Collection Set: 1.5ms
;; [104.745s][info ][gc,phases ] GC(4) Post Evacuate Collection Set: 0.3ms
;; [104.745s][info ][gc,phases ] GC(4) Other: 0.0ms
;; [104.745s][info ][gc,heap ] GC(4) Eden regions: 75->0(75)
;; [104.745s][info ][gc,heap ] GC(4) Survivor regions: 1->1(10)
;; [104.745s][info ][gc,heap ] GC(4) Old regions: 1->1
;; [104.745s][info ][gc,heap ] GC(4) Archive regions: 2->2
;; [104.745s][info ][gc,heap ] GC(4) Humongous regions: 0->0
;; [104.745s][info ][gc,metaspace] GC(4) Metaspace: 19398K(21056K)->19398K(21056K) NonClass: 15132K(16192K)->15132K(16192K) Class: 4265K(4864K)->4265K(4864K)
;; [104.745s][info ][gc ] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 618M->18M(1040M) 2.101ms
;; [104.745s][info ][gc,cpu ] GC(4) User=0.01s Sys=0.00s Real=0.00s
;=> nil
;; whereas DecimalFormat barely does
(dotimes [_ 1e6] (format-decimal 6e9))
;=> nil
I reviewed the issue and got a different idea. I am still not sure if we should serialize a qual of (str 1e6)
;=> "1000000.0"
(str 1e7)
;=> "1.0E7" This condition ( (defn- stringify-data-line-qual
[x]
(when x
- (if (zero? (mod x 1))
+ (if (and (zero? (mod x 1)) (< x 1e7))
(str (int x))
(str x)))) |
@athos I changed implementation to use exponential notation for numbers above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the patch!
I’m concerned that we cannot guarantee the precision of 31 bits since the fraction part of floats is only 24 bits wide, according to IEEE 754. This means that conversions like "float to int to float" may produce a different value from the original input value:
(Float/intBitsToFloat 0x4effffff)
;=> 2.1474835E9
(stringify-data-line-qual (Float/intBitsToFloat 0x4effffff))
;=> "2147483520"
;; ^-- This '2' comes from the precision error
Although I suggested offline that it's fine to use any value as the threshold as long as it fits into an int, I think it's better not to be greater than
(defn- stringify-data-line-qual'
[x]
(when x
(if (and (zero? (mod x 1))
(< x (bit-shift-left 1 (dec Float/PRECISION))))
(str (int x))
(str x))))
(stringify-data-line-qual' (Float/intBitsToFloat 0x4effffff))
;=> "2.1474835E9"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the patch! LGTM 👍
Sorry, I only saw the updated code. I see some tests failing. |
Thank you for your comment! @athos
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the code! Now the tests got to pass again 🤝
I think the current implementation is totally fine. You could also define the whole (bit-shift-left 1 23)
as a constant if you like, though. Either way, I approve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nuggetoriniku, Thank you for updating the PR. LGTM 👍
Would you squash your commits into one before merging?
cad8521
to
381e3d0
Compare
@alumi |
Use
format
instead ofint
to solve the problem of stringifying huge QUAL values.Fixes #267