-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for generating CSI file from bgzip compressed VCF. #190
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on another feature! I just added some comments on relatively superficial matters.
Codecov Report
@@ Coverage Diff @@
## master #190 +/- ##
==========================================
+ Coverage 86.67% 86.81% +0.14%
==========================================
Files 76 77 +1
Lines 6057 6205 +148
Branches 502 516 +14
==========================================
+ Hits 5250 5387 +137
+ Misses 305 302 -3
- Partials 502 516 +14
Continue to review full report at Codecov.
|
9ae9ce7
to
803b2d5
Compare
Thank you for reviewing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! I added a few more comments.
Sorry for late, I fixed. |
cf1ce51
to
cddf8c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like lein test :all
is failing 🔥
I'm so sorry,I will fix test cases same as others. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more trivial comments.
src/cljam/io/csi.clj
Outdated
[bin (->> offsets | ||
(map #(chunk/->Chunk (:file-beg %) (:file-end %))) | ||
reverse)])) | ||
sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort |
src/cljam/io/csi.clj
Outdated
[offsets shift depth] | ||
(let [chr-offsets (->> (partition-by :chr offsets) | ||
(map #(vector (:chr (first %)) %)) | ||
sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort |
src/cljam/io/csi.clj
Outdated
from offsets {:file-beg :file-end :beg :end :chr }" | ||
[offsets shift depth] | ||
(let [chr-offsets (->> (partition-by :chr offsets) | ||
(map #(vector (:chr (first %)) %)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines seem to be equivalent to (group-by :chr offsets)
in this case.
src/cljam/algo/vcf_indexer.clj
Outdated
|
||
(defn create-index | ||
"Creates a CSI index file from the VCF file." | ||
[in-bgziped-vcf out-csi {:keys [shift depth] :or {shift 14 depth 5}}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/bgziped/bgzipped/
d55f8f2
to
79451d1
Compare
Thank you for pointing. I made the following changes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Run formatter and linters every time before you push your commits
- It seems like generated CSI files are not compatible with bcftools
- Please make sure that both
cljam.io.vcf.read-variants-randomly
andbcftools view
works as well
- Please make sure that both
- Add a test case for VCF files with skipped contigs as in Add support for generating CSI file from bgzip compressed VCF. #190 (comment)
79451d1
to
3bf6899
Compare
Sorry, I applied linter. |
e655e42
to
4fd65bd
Compare
Bcftools (hts_lib) will generate an error if some tabix data (such as format,col_seq , names and so on) is not inserted into CSI aux. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The indexing looks good. Added some more trivial comments.
5d3a3aa
to
985fed3
Compare
Thank you for pointing.I fixed them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes. I've added some suggestions around reg->bin
. Would you check them, please?
src/cljam/io/sam/util.clj
Outdated
@@ -68,4 +48,4 @@ | |||
[aln] | |||
(let [beg (dec (:pos aln)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(let [beg (dec (:pos aln)) | |
(let [beg (:pos aln) |
src/cljam/io/bam_index/writer.clj
Outdated
(let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln)) | ||
linear-index-shift linear-index-depth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(let [bin (util-bin/reg->bin (.pos aln) (inc (.end aln)) | |
linear-index-shift linear-index-depth) | |
(let [bin (util-bin/reg->bin | |
(.pos aln) (.end aln) linear-index-shift linear-index-depth) |
src/cljam/io/util/bin.clj
Outdated
beg (Math/min max-pos (Math/max 0 (dec beg))) | ||
end (Math/min max-pos (Math/max 0 (dec end)))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beg (Math/min max-pos (Math/max 0 (dec beg))) | |
end (Math/min max-pos (Math/max 0 (dec end)))] | |
beg (dec (Math/max 1 (Math/min max-pos beg))) | |
end (dec (Math/max 1 (Math/min max-pos end)))] |
test/cljam/io/util/bin_test.clj
Outdated
(deftest reg->bin-test | ||
(is (= (util-bin/reg->bin 1 1 14 6) 37449)) | ||
(is (= (util-bin/reg->bin 1 1 14 7) 299593)) | ||
(is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | ||
(is (= (util-bin/reg->bin 1 2 0 2) 1)) | ||
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(deftest reg->bin-test | |
(is (= (util-bin/reg->bin 1 1 14 6) 37449)) | |
(is (= (util-bin/reg->bin 1 1 14 7) 299593)) | |
(is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | |
(is (= (util-bin/reg->bin 1 2 0 2) 1)) | |
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150))) | |
(deftest reg->bin-test | |
(testing "BAI-compatible" | |
(are [?start ?end ?bin] | |
(= ?bin (util-bin/reg->bin ?start ?end 14 5)) | |
0 0 4681 | |
0 1 4681 | |
1 1 4681 | |
1 16384 4681 | |
16384 16384 4681 | |
16384 16385 585 | |
16385 16385 4682 | |
131072 131072 4688 | |
131072 131073 73 | |
8388608 8388609 1 | |
67108864 67108865 0 | |
536870912 536870912 37448 | |
536870912 536870913 37448)) | |
(testing "1-by-1" | |
(are [?start ?end ?bin] | |
(= ?bin (util-bin/reg->bin ?start ?end 0 2)) | |
0 1 9 | |
1 1 9 | |
2 2 10 | |
1 2 1 | |
1 8 1 | |
1 9 0 | |
8 9 0 | |
9 9 17 | |
63 64 8 | |
64 64 72 | |
64 64 72)) | |
(testing "various min-shfits and depths" | |
(is (= (util-bin/reg->bin 1 1 14 6) 37449)) | |
(is (= (util-bin/reg->bin 1 1 14 7) 299593)) | |
(is (= (util-bin/reg->bin 1 32769 15 6) 4681)) | |
(is (= (util-bin/reg->bin 240877561 240877568 14 6) 52150)))) |
Thank you,I fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
I found a bug in the random reading of VCF files. Could you fix it in another PR?
;; with test-resources/vcf/test-chr-skipped.vcf.gz.csi
(with-open [r (vcf/reader "test-resources/vcf/test-chr-skipped.vcf.gz")]
(doall (vcf/read-variants-randomly r {:chr "chr3"} {})))
Execution error (NullPointerException) at cljam.io.csi.CSI/get_chunks (csi.clj:19).
@niyarin Please don't forget to squash the commits before merging this branch 😀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niyarin Thanks for your persistent effort to improve the code for this feature! 🙏Now the implementation works fine 👍
Just one more thing, can you add a mention to CSI, as well as Tabix, in the docstring of cljam.io.vcf/read-variants-randomly
as another supported index format? The current description looks to me like it only supports the Tabix format. Also, you may want to avoid using the name tabix-data
in the body of that function.
Once you make a commit for the above point and squash all the commits into one (or a few), I will merge the patch soon.
2c71be5
to
2444372
Compare
Thank you,I added a comment that VCF can also use CSI and renamed 'tabix-data' to 'index-data'. |
🎉 🎉 🎉 |
Add support for generating CSI file from VCF.