Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: overflowing qual value in vcf #268

Merged
merged 1 commit into from
Apr 11, 2023
Merged

fix: overflowing qual value in vcf #268

merged 1 commit into from
Apr 11, 2023

Conversation

nuggetoriniku
Copy link
Contributor

@nuggetoriniku nuggetoriniku commented Feb 22, 2023

Use format instead of int to solve the problem of stringifying huge QUAL values.

  • According to hts-specs, the value of the QUAL field in the VCF is only considered to be a float, and there is no mention of an upper limit.
  • The old implementation of cljam was limited to the Integer range.

Fixes #267

@codecov
Copy link

codecov bot commented Feb 22, 2023

Codecov Report

Merging #268 (cad8521) into master (2cfa13e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head cad8521 differs from pull request most recent head 381e3d0. Consider uploading reports for the commit 381e3d0 to get more accurate results

@@           Coverage Diff           @@
##           master     #268   +/-   ##
=======================================
  Coverage   88.85%   88.85%           
=======================================
  Files          78       78           
  Lines        6442     6444    +2     
  Branches      454      454           
=======================================
+ Hits         5724     5726    +2     
  Misses        264      264           
  Partials      454      454           
Impacted Files Coverage Δ
src/cljam/io/vcf/writer.clj 85.31% <100.00%> (+0.20%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@nuggetoriniku nuggetoriniku self-assigned this Mar 1, 2023
@nuggetoriniku nuggetoriniku marked this pull request as ready for review March 1, 2023 06:00
@nuggetoriniku nuggetoriniku requested review from alumi and a team as code owners March 1, 2023 06:00
@nuggetoriniku nuggetoriniku requested review from athos and removed request for a team March 1, 2023 06:00
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR, @nuggetoriniku !

Before reviewing the code, I'd like to ask for your help with the followings:

  • Please add a test code for the modification.
  • Please publish your public key which you used to sign your commits.
  • Would you provide more detailed information in your PR comments? For example, how this field is specified in the hts-specs, what is the difference between the spec and the current implementation, and whether there's anything else needs to be modified, etc.

@nuggetoriniku
Copy link
Contributor Author

Thank you for your comment. @alumi
The following was done.

  • Add test code
  • signed commit
  • Append details to description

@nuggetoriniku nuggetoriniku requested a review from alumi March 2, 2023 10:22
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the issue!

I agree that format is convenient here, but it is somewhat too generic and I'm concerned about the overhead of calling it in a tight loop (especially, the overhead for parsing the format string and the memory allocation of arrays allocated for varargs invocation). I think cljam carefully avoids using such a "rich" function in tight loops.

How about using DecimalFormat instead of format? It’s more performant in time and is also less memory consuming than format:

(require '[criterium.core :as cr])

(cr/quick-bench (format "%.0f" 6e9))
;; Evaluation count : 1777980 in 6 samples of 296330 calls.
;;             Execution time mean : 336.902228 ns
;;    Execution time std-deviation : 15.432298 ns
;;   Execution time lower quantile : 321.841997 ns ( 2.5%)
;;   Execution time upper quantile : 352.636071 ns (97.5%)
;;                   Overhead used : 1.831502 ns

(import '[java.text DecimalFormat])

(def format-decimal
  (let [fmt (DecimalFormat. "0")]
    (fn [^String s]
      (.format fmt s))))

(cr/quick-bench (format-decimal 6e9))
;; Evaluation count : 2697042 in 6 samples of 449507 calls.
;;             Execution time mean : 223.626072 ns
;;    Execution time std-deviation : 3.167582 ns
;;   Execution time lower quantile : 219.338954 ns ( 2.5%)
;;   Execution time upper quantile : 226.908426 ns (97.5%)
;;                   Overhead used : 1.831502 ns

As for memory consumption, you can see how much memory will be used with format by enabling -XX:+PrintGCDetails or so:

;; format causes GC pretty much
(dotimes [_ 1e6] (format "%.0f" 6e9))

;; [104.742s][info   ][gc,start    ] GC(4) Pause Young (Normal) (G1 Evacuation Pause)
;; [104.742s][info   ][gc,task     ] GC(4) Using 9 workers of 9 for evacuation
;; [104.745s][info   ][gc,phases   ] GC(4)   Pre Evacuate Collection Set: 0.1ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Merge Heap Roots: 0.1ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Evacuate Collection Set: 1.5ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Post Evacuate Collection Set: 0.3ms
;; [104.745s][info   ][gc,phases   ] GC(4)   Other: 0.0ms
;; [104.745s][info   ][gc,heap     ] GC(4) Eden regions: 75->0(75)
;; [104.745s][info   ][gc,heap     ] GC(4) Survivor regions: 1->1(10)
;; [104.745s][info   ][gc,heap     ] GC(4) Old regions: 1->1
;; [104.745s][info   ][gc,heap     ] GC(4) Archive regions: 2->2
;; [104.745s][info   ][gc,heap     ] GC(4) Humongous regions: 0->0
;; [104.745s][info   ][gc,metaspace] GC(4) Metaspace: 19398K(21056K)->19398K(21056K) NonClass: 15132K(16192K)->15132K(16192K) Class: 4265K(4864K)->4265K(4864K)
;; [104.745s][info   ][gc          ] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 618M->18M(1040M) 2.101ms
;; [104.745s][info   ][gc,cpu      ] GC(4) User=0.01s Sys=0.00s Real=0.00s
;=> nil

;; whereas DecimalFormat barely does
(dotimes [_ 1e6] (format-decimal 6e9))
;=> nil

@athos
Copy link
Member

athos commented Mar 7, 2023

I reviewed the issue and got a different idea.

I am still not sure if we should serialize a qual of 1e10 as 10000000000 (in that it would take up unnecessary width).
In Java (and in Clojure), a floating point number d is serialized as a raw decimal if d $&lt; 10^7$ while it will be serialized in the scientific notation if d $\geq 10^7$ (see the document):

(str 1e6)
;=> "1000000.0"

(str 1e7)
;=> "1.0E7"

This condition (d $&lt; 10^7$) seems convenient and reasonable in our case too, and even better, $10^7$ fits in int.
So, I think something like the implementation below will suffice to resolve the issue. This implementation also doesn't have the performance issues mentioned in the above comment. What do you think?

 (defn- stringify-data-line-qual
   [x]
   (when x
-    (if (zero? (mod x 1))
+    (if (and (zero? (mod x 1)) (< x 1e7))
       (str (int x))
       (str x))))

@nuggetoriniku
Copy link
Contributor Author

@athos
Thank you for your review comments.
I too think it is a good idea.

I changed implementation to use exponential notation for numbers above $2^{31}-1$ (the Integer maximum).

@nuggetoriniku nuggetoriniku requested a review from athos March 9, 2023 02:17
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the patch!

I’m concerned that we cannot guarantee the precision of 31 bits since the fraction part of floats is only 24 bits wide, according to IEEE 754. This means that conversions like "float to int to float" may produce a different value from the original input value:

(Float/intBitsToFloat 0x4effffff)
;=> 2.1474835E9

(stringify-data-line-qual (Float/intBitsToFloat 0x4effffff))
;=> "2147483520"
;;           ^-- This '2' comes from the precision error

Although I suggested offline that it's fine to use any value as the threshold as long as it fits into an int, I think it's better not to be greater than $2^{24}$:

(defn- stringify-data-line-qual'
  [x]
  (when x
    (if (and (zero? (mod x 1))
             (< x (bit-shift-left 1 (dec Float/PRECISION))))
      (str (int x))
      (str x))))

(stringify-data-line-qual' (Float/intBitsToFloat 0x4effffff))
;=> "2.1474835E9"

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the patch! LGTM 👍

@athos
Copy link
Member

athos commented Mar 16, 2023

Sorry, I only saw the updated code. I see some tests failing.
It seems like Float/PRECISION is a fairly new field introduced in Java 19. So, I guess we have to hardcode the value 🤔

@nuggetoriniku
Copy link
Contributor Author

Thank you for your comment! @athos

  • Defined a constant float-precision to replace Float/PRECISION.
  • Added a test case for boundary values.

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the code! Now the tests got to pass again 🤝
I think the current implementation is totally fine. You could also define the whole (bit-shift-left 1 23) as a constant if you like, though. Either way, I approve.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nuggetoriniku, Thank you for updating the PR. LGTM 👍
Would you squash your commits into one before merging?

@nuggetoriniku
Copy link
Contributor Author

@alumi
Thank you for your comment!
I have squashed the commits:pray:

@alumi alumi merged commit a2d5fe5 into master Apr 11, 2023
@alumi alumi deleted the fix/overflow-qual branch April 11, 2023 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QUAL value overflows when writing VCF
3 participants