Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for generating CSI file from bgzip compressed VCF. #190

Merged
merged 1 commit into from
Apr 17, 2020

Conversation

niyarin
Copy link
Contributor

@niyarin niyarin commented Feb 26, 2020

Add support for generating CSI file from VCF.

@niyarin niyarin requested a review from alumi as a code owner February 26, 2020 11:38
@niyarin niyarin requested a review from a team February 26, 2020 11:38
@alumi alumi requested review from a team and athos and removed request for a team February 27, 2020 09:20
src/cljam/algo/vcf_indexer.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
test/cljam/io/util/bin_test.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on another feature! I just added some comments on relatively superficial matters.

src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf.clj Outdated Show resolved Hide resolved
src/cljam/algo/vcf_indexer.clj Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 2, 2020

Codecov Report

Merging #190 into master will increase coverage by 0.14%.
The diff coverage is 92.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #190      +/-   ##
==========================================
+ Coverage   86.67%   86.81%   +0.14%     
==========================================
  Files          76       77       +1     
  Lines        6057     6205     +148     
  Branches      502      516      +14     
==========================================
+ Hits         5250     5387     +137     
+ Misses        305      302       -3     
- Partials      502      516      +14     
Impacted Files Coverage Δ
src/cljam/io/protocols.clj 100.00% <ø> (ø)
src/cljam/io/csi.clj 91.46% <89.06%> (-8.54%) ⬇️
src/cljam/io/util/bin.clj 94.33% <93.33%> (-3.17%) ⬇️
src/cljam/algo/vcf_indexer.clj 100.00% <100.00%> (ø)
src/cljam/io/bam_index/writer.clj 84.37% <100.00%> (+0.08%) ⬆️
src/cljam/io/sam/util.clj 92.30% <100.00%> (+11.22%) ⬆️
src/cljam/io/vcf.clj 96.00% <100.00%> (+0.16%) ⬆️
src/cljam/io/vcf/reader.clj 88.75% <100.00%> (+2.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d961d7...2444372. Read the comment docs.

@niyarin niyarin force-pushed the feature/csi-writer-for-vcf branch from 9ae9ce7 to 803b2d5 Compare March 2, 2020 06:59
src/cljam/io/vcf.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Outdated Show resolved Hide resolved
test/cljam/io/util/bin_test.clj Outdated Show resolved Hide resolved
test/cljam/algo/vcf_indexer_test.clj Outdated Show resolved Hide resolved
@niyarin
Copy link
Contributor Author

niyarin commented Mar 2, 2020

Thank you for reviewing.
I fixed bad expressions and added some test cases about vcf not ascending order chroms.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! I added a few more comments.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/algo/vcf_indexer.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
test/cljam/algo/vcf_indexer_test.clj Show resolved Hide resolved
@niyarin
Copy link
Contributor Author

niyarin commented Mar 16, 2020

Sorry for late, I fixed.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like lein test :all is failing 🔥

@niyarin
Copy link
Contributor Author

niyarin commented Mar 17, 2020

I'm so sorry,I will fix test cases same as others.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
test/cljam/algo/vcf_indexer_test.clj Outdated Show resolved Hide resolved
test/cljam/io/csi_test.clj Outdated Show resolved Hide resolved
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more trivial comments.

[bin (->> offsets
(map #(chunk/->Chunk (:file-beg %) (:file-end %)))
reverse)]))
sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sort

[offsets shift depth]
(let [chr-offsets (->> (partition-by :chr offsets)
(map #(vector (:chr (first %)) %))
sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sort

from offsets {:file-beg :file-end :beg :end :chr }"
[offsets shift depth]
(let [chr-offsets (->> (partition-by :chr offsets)
(map #(vector (:chr (first %)) %))
Copy link
Member

@athos athos Mar 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines seem to be equivalent to (group-by :chr offsets) in this case.


(defn create-index
"Creates a CSI index file from the VCF file."
[in-bgziped-vcf out-csi {:keys [shift depth] :or {shift 14 depth 5}}]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bgziped/bgzipped/

@niyarin niyarin force-pushed the feature/csi-writer-for-vcf branch 3 times, most recently from d55f8f2 to 79451d1 Compare March 23, 2020 07:18
@niyarin
Copy link
Contributor Author

niyarin commented Mar 23, 2020

Thank you for pointing.

I made the following changes.

  • Loffset in CSI decided not to match the implementation of hts_lib.

    • Specifically, the loffset of the key that does not exist during the beg-end of vcf came closer to the target data.
  • Added an operation to move smaller chunks to higher binning in bidx calculation,same as hts_lib.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niyarin
Copy link
Contributor Author

niyarin commented Mar 24, 2020

Sorry, I applied linter.

@niyarin
Copy link
Contributor Author

niyarin commented Mar 25, 2020

Bcftools (hts_lib) will generate an error if some tabix data (such as format,col_seq , names and so on) is not inserted into CSI aux.
So I fixed CSI for portability.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The indexing looks good. Added some more trivial comments.

src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/vcf/reader.clj Outdated Show resolved Hide resolved
test/cljam/algo/vcf_indexer_test.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/csi.clj Show resolved Hide resolved
src/cljam/io/csi.clj Outdated Show resolved Hide resolved
src/cljam/io/util/bin.clj Show resolved Hide resolved
@niyarin niyarin force-pushed the feature/csi-writer-for-vcf branch 2 times, most recently from 5d3a3aa to 985fed3 Compare April 6, 2020 02:39
@niyarin
Copy link
Contributor Author

niyarin commented Apr 6, 2020

Thank you for pointing.I fixed them.
I think that arguments of reg->bin is [beg,end) (one-based, half-close-half-open).

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes. I've added some suggestions around reg->bin. Would you check them, please?

@@ -68,4 +48,4 @@
[aln]
(let [beg (dec (:pos aln))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(let [beg (dec (:pos aln))
(let [beg (:pos aln)

Comment on lines 76 to 77
(let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln))
linear-index-shift linear-index-depth)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln))
linear-index-shift linear-index-depth)
(let [bin (util-bin/reg->bin
(.pos aln) (.end aln) linear-index-shift linear-index-depth)

Comment on lines 74 to 75
beg (Math/min max-pos (Math/max 0 (dec beg)))
end (Math/min max-pos (Math/max 0 (dec end)))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
beg (Math/min max-pos (Math/max 0 (dec beg)))
end (Math/min max-pos (Math/max 0 (dec end)))]
beg (dec (Math/max 1 (Math/min max-pos beg)))
end (dec (Math/max 1 (Math/min max-pos end)))]

Comment on lines 61 to 66
(deftest reg->bin-test
(is (= (util-bin/reg->bin 1 1 14 6) 37449))
(is (= (util-bin/reg->bin 1 1 14 7) 299593))
(is (= (util-bin/reg->bin 1 32769 15 6) 4681))
(is (= (util-bin/reg->bin 1 2 0 2) 1))
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(deftest reg->bin-test
(is (= (util-bin/reg->bin 1 1 14 6) 37449))
(is (= (util-bin/reg->bin 1 1 14 7) 299593))
(is (= (util-bin/reg->bin 1 32769 15 6) 4681))
(is (= (util-bin/reg->bin 1 2 0 2) 1))
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150)))
(deftest reg->bin-test
(testing "BAI-compatible"
(are [?start ?end ?bin]
(= ?bin (util-bin/reg->bin ?start ?end 14 5))
0 0 4681
0 1 4681
1 1 4681
1 16384 4681
16384 16384 4681
16384 16385 585
16385 16385 4682
131072 131072 4688
131072 131073 73
8388608 8388609 1
67108864 67108865 0
536870912 536870912 37448
536870912 536870913 37448))
(testing "1-by-1"
(are [?start ?end ?bin]
(= ?bin (util-bin/reg->bin ?start ?end 0 2))
0 1 9
1 1 9
2 2 10
1 2 1
1 8 1
1 9 0
8 9 0
9 9 17
63 64 8
64 64 72
64 64 72))
(testing "various min-shfits and depths"
(is (= (util-bin/reg->bin 1 1 14 6) 37449))
(is (= (util-bin/reg->bin 1 1 14 7) 299593))
(is (= (util-bin/reg->bin 1 32769 15 6) 4681))
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150))))

@niyarin
Copy link
Contributor Author

niyarin commented Apr 10, 2020

Thank you,I fixed.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

I found a bug in the random reading of VCF files. Could you fix it in another PR?

;; with test-resources/vcf/test-chr-skipped.vcf.gz.csi
(with-open [r (vcf/reader "test-resources/vcf/test-chr-skipped.vcf.gz")]
        (doall (vcf/read-variants-randomly r {:chr "chr3"} {})))
Execution error (NullPointerException) at cljam.io.csi.CSI/get_chunks (csi.clj:19).

@alumi
Copy link
Member

alumi commented Apr 14, 2020

@niyarin Please don't forget to squash the commits before merging this branch 😀

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niyarin Thanks for your persistent effort to improve the code for this feature! 🙏Now the implementation works fine 👍

Just one more thing, can you add a mention to CSI, as well as Tabix, in the docstring of cljam.io.vcf/read-variants-randomly as another supported index format? The current description looks to me like it only supports the Tabix format. Also, you may want to avoid using the name tabix-data in the body of that function.

Once you make a commit for the above point and squash all the commits into one (or a few), I will merge the patch soon.

@niyarin
Copy link
Contributor Author

niyarin commented Apr 16, 2020

Thank you,I added a comment that VCF can also use CSI and renamed 'tabix-data' to 'index-data'.

@athos athos merged commit fa2fa54 into master Apr 17, 2020
@athos athos deleted the feature/csi-writer-for-vcf branch April 17, 2020 02:01
@alumi
Copy link
Member

alumi commented Apr 17, 2020

🎉 🎉 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants