fix: overflowing qual value in vcf #268

nuggetoriniku · 2023-02-22T02:01:44Z

Use format instead of int to solve the problem of stringifying huge QUAL values.

According to hts-specs, the value of the QUAL field in the VCF is only considered to be a float, and there is no mention of an upper limit.
The old implementation of cljam was limited to the Integer range.

Fixes #267

codecov · 2023-02-22T02:03:22Z

Codecov Report

Merging #268 (cad8521) into master (2cfa13e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head cad8521 differs from pull request most recent head 381e3d0. Consider uploading reports for the commit 381e3d0 to get more accurate results

@@           Coverage Diff           @@
##           master     #268   +/-   ##
=======================================
  Coverage   88.85%   88.85%           
=======================================
  Files          78       78           
  Lines        6442     6444    +2     
  Branches      454      454           
=======================================
+ Hits         5724     5726    +2     
  Misses        264      264           
  Partials      454      454

Impacted Files	Coverage Δ
src/cljam/io/vcf/writer.clj	`85.31% <100.00%> (+0.20%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

alumi

Thank you for your PR, @nuggetoriniku !

Before reviewing the code, I'd like to ask for your help with the followings:

Please add a test code for the modification.
Please publish your public key which you used to sign your commits.
Would you provide more detailed information in your PR comments? For example, how this field is specified in the hts-specs, what is the difference between the spec and the current implementation, and whether there's anything else needs to be modified, etc.

nuggetoriniku · 2023-03-02T10:22:20Z

Thank you for your comment. @alumi
The following was done.

Add test code
signed commit
Append details to description

athos

Thanks for fixing the issue!

I agree that format is convenient here, but it is somewhat too generic and I'm concerned about the overhead of calling it in a tight loop (especially, the overhead for parsing the format string and the memory allocation of arrays allocated for varargs invocation). I think cljam carefully avoids using such a "rich" function in tight loops.

How about using DecimalFormat instead of format? It’s more performant in time and is also less memory consuming than format:

(require '[criterium.core :as cr])

(cr/quick-bench (format "%.0f" 6e9))
;; Evaluation count : 1777980 in 6 samples of 296330 calls.
;;             Execution time mean : 336.902228 ns
;;    Execution time std-deviation : 15.432298 ns
;;   Execution time lower quantile : 321.841997 ns ( 2.5%)
;;   Execution time upper quantile : 352.636071 ns (97.5%)
;;                   Overhead used : 1.831502 ns

(import '[java.text DecimalFormat])

(def format-decimal
  (let [fmt (DecimalFormat. "0")]
    (fn [^String s]
      (.format fmt s))))

(cr/quick-bench (format-decimal 6e9))
;; Evaluation count : 2697042 in 6 samples of 449507 calls.
;;             Execution time mean : 223.626072 ns
;;    Execution time std-deviation : 3.167582 ns
;;   Execution time lower quantile : 219.338954 ns ( 2.5%)
;;   Execution time upper quantile : 226.908426 ns (97.5%)
;;                   Overhead used : 1.831502 ns

As for memory consumption, you can see how much memory will be used with format by enabling -XX:+PrintGCDetails or so:

;; format causes GC pretty much
(dotimes [_ 1e6] (format "%.0f" 6e9))

;; [104.742s][info   ][gc,start    ] GC(4) Pause Young (Normal) (G1 Evacuation Pause)
;; [104.742s][info   ][gc,task     ] GC(4) Using 9 workers of 9 for evacuation
;; [104.745s][info   ][gc,phases   ] GC(4)   Pre Evacuate Collection Set: 0.1ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Merge Heap Roots: 0.1ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Evacuate Collection Set: 1.5ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Post Evacuate Collection Set: 0.3ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Other: 0.0ms
;; [104.745s][info   ][gc,heap     ] GC(4) Eden regions: 75->0(75)
;; [104.745s][info   ][gc,heap     ] GC(4) Survivor regions: 1->1(10)
;; [104.745s][info   ][gc,heap     ] GC(4) Old regions: 1->1
;; [104.745s][info   ][gc,heap     ] GC(4) Archive regions: 2->2
;; [104.745s][info   ][gc,heap     ] GC(4) Humongous regions: 0->0
;; [104.745s][info   ][gc,metaspace] GC(4) Metaspace: 19398K(21056K)->19398K(21056K) NonClass: 15132K(16192K)->15132K(16192K) Class: 4265K(4864K)->4265K(4864K)
;; [104.745s][info   ][gc          ] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 618M->18M(1040M) 2.101ms
;; [104.745s][info   ][gc,cpu      ] GC(4) User=0.01s Sys=0.00s Real=0.00s
;=> nil

;; whereas DecimalFormat barely does
(dotimes [_ 1e6] (format-decimal 6e9))
;=> nil

athos · 2023-03-07T05:56:18Z

I reviewed the issue and got a different idea.

I am still not sure if we should serialize a qual of 1e10 as 10000000000 (in that it would take up unnecessary width).
In Java (and in Clojure), a floating point number d is serialized as a raw decimal if d $< 10^7$ while it will be serialized in the scientific notation if d $\geq 10^7$ (see the document):

(str 1e6)
;=> "1000000.0"

(str 1e7)
;=> "1.0E7"

This condition (d $< 10^7$) seems convenient and reasonable in our case too, and even better, $10^7$ fits in int.
So, I think something like the implementation below will suffice to resolve the issue. This implementation also doesn't have the performance issues mentioned in the above comment. What do you think?

 (defn- stringify-data-line-qual
   [x]
   (when x
-    (if (zero? (mod x 1))
+    (if (and (zero? (mod x 1)) (< x 1e7))
       (str (int x))
       (str x))))

nuggetoriniku · 2023-03-09T00:26:03Z

@athos
Thank you for your review comments.
I too think it is a good idea.

I changed implementation to use exponential notation for numbers above $2^{31}-1$ (the Integer maximum).

athos

Thanks for updating the patch!

I’m concerned that we cannot guarantee the precision of 31 bits since the fraction part of floats is only 24 bits wide, according to IEEE 754. This means that conversions like "float to int to float" may produce a different value from the original input value:

(Float/intBitsToFloat 0x4effffff)
;=> 2.1474835E9

(stringify-data-line-qual (Float/intBitsToFloat 0x4effffff))
;=> "2147483520"
;;           ^-- This '2' comes from the precision error

Although I suggested offline that it's fine to use any value as the threshold as long as it fits into an int, I think it's better not to be greater than $2^{24}$:

(defn- stringify-data-line-qual'
  [x]
  (when x
    (if (and (zero? (mod x 1))
             (< x (bit-shift-left 1 (dec Float/PRECISION))))
      (str (int x))
      (str x))))

(stringify-data-line-qual' (Float/intBitsToFloat 0x4effffff))
;=> "2.1474835E9"

athos

Thanks for updating the patch! LGTM 👍

athos · 2023-03-16T06:31:39Z

Sorry, I only saw the updated code. I see some tests failing.
It seems like Float/PRECISION is a fairly new field introduced in Java 19. So, I guess we have to hardcode the value 🤔

nuggetoriniku · 2023-03-17T06:24:09Z

Thank you for your comment! @athos

Defined a constant float-precision to replace Float/PRECISION.
Added a test case for boundary values.

athos

Thanks for fixing the code! Now the tests got to pass again 🤝
I think the current implementation is totally fine. You could also define the whole (bit-shift-left 1 23) as a constant if you like, though. Either way, I approve.

alumi

@nuggetoriniku, Thank you for updating the PR. LGTM 👍
Would you squash your commits into one before merging?

nuggetoriniku · 2023-04-04T02:14:04Z

@alumi
Thank you for your comment!
I have squashed the commits:pray:

nuggetoriniku self-assigned this Mar 1, 2023

nuggetoriniku marked this pull request as ready for review March 1, 2023 06:00

nuggetoriniku requested review from alumi and a team as code owners March 1, 2023 06:00

nuggetoriniku requested review from athos and removed request for a team March 1, 2023 06:00

alumi reviewed Mar 1, 2023

View reviewed changes

nuggetoriniku force-pushed the fix/overflow-qual branch from bedbed0 to 7fb069a Compare March 2, 2023 09:45

nuggetoriniku requested a review from alumi March 2, 2023 10:22

athos reviewed Mar 6, 2023

View reviewed changes

nuggetoriniku requested a review from athos March 9, 2023 02:17

athos reviewed Mar 10, 2023

View reviewed changes

athos approved these changes Mar 16, 2023

View reviewed changes

nuggetoriniku requested a review from athos March 17, 2023 06:24

athos approved these changes Mar 17, 2023

View reviewed changes

alumi approved these changes Mar 27, 2023

View reviewed changes

Expand argument range of stringify-data-line-qual beyond Integers

381e3d0

nuggetoriniku force-pushed the fix/overflow-qual branch from cad8521 to 381e3d0 Compare March 29, 2023 01:59

alumi merged commit a2d5fe5 into master Apr 11, 2023

alumi deleted the fix/overflow-qual branch April 11, 2023 08:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: overflowing qual value in vcf #268

fix: overflowing qual value in vcf #268

nuggetoriniku commented Feb 22, 2023 •

edited

Loading

codecov bot commented Feb 22, 2023 •

edited

Loading

alumi left a comment

nuggetoriniku commented Mar 2, 2023

athos left a comment

athos commented Mar 7, 2023 •

edited

Loading

nuggetoriniku commented Mar 9, 2023

athos left a comment

athos left a comment

athos commented Mar 16, 2023

nuggetoriniku commented Mar 17, 2023

athos left a comment

alumi left a comment

nuggetoriniku commented Apr 4, 2023

fix: overflowing qual value in vcf #268

fix: overflowing qual value in vcf #268

Conversation

nuggetoriniku commented Feb 22, 2023 • edited Loading

codecov bot commented Feb 22, 2023 • edited Loading

Codecov Report

alumi left a comment

Choose a reason for hiding this comment

nuggetoriniku commented Mar 2, 2023

athos left a comment

Choose a reason for hiding this comment

athos commented Mar 7, 2023 • edited Loading

nuggetoriniku commented Mar 9, 2023

athos left a comment

Choose a reason for hiding this comment

athos left a comment

Choose a reason for hiding this comment

athos commented Mar 16, 2023

nuggetoriniku commented Mar 17, 2023

athos left a comment

Choose a reason for hiding this comment

alumi left a comment

Choose a reason for hiding this comment

nuggetoriniku commented Apr 4, 2023

nuggetoriniku commented Feb 22, 2023 •

edited

Loading

codecov bot commented Feb 22, 2023 •

edited

Loading

athos commented Mar 7, 2023 •

edited

Loading