-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some BED manipulation APIs #131
Conversation
Codecov Report
@@ Coverage Diff @@
## master #131 +/- ##
==========================================
+ Coverage 85.24% 85.43% +0.19%
==========================================
Files 69 69
Lines 4359 4416 +57
Branches 420 420
==========================================
+ Hits 3716 3773 +57
Misses 223 223
Partials 420 420
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
Looks like working fine with common cases but not for some edge cases because BED file format permits overlapping regions.
Please add some test cases for that.
src/cljam/io/bed.clj
Outdated
The input sequence will first be sorted with sort-fields, which may cause | ||
an extensive memory use for ones with a large number of elements." | ||
[chrom-lengths xs] | ||
(let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lengths of chromosomes can be obtained by cljam.io.sam/read-refs or cljam.io.sequence/read-indices, but to make up a map we need to write something like (into {} (map (juxt :name :len)) ~~~)
which I think is a bit redundant.
How about changing (keys chrom-lengths)
to (map first chrom-lengths)
or accepting a sequence of maps containing :name
and :len
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, considering cooperation with those other functions, it would be more useful to accept such a data structure, right?
OK, I'm going to change the function's signature to accept a sequence of maps like {:name ..., :len ...}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel ({:name :len})
is more consistent.
src/cljam/io/bed.clj
Outdated
(subtract (cons r1 (next xs)) (next ys))) | ||
(if (fields-< x y) | ||
(subtract (next xs) ys) | ||
(subtract xs (next ys)))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the following case.
(cljam.io.bed/subtract-fields
[{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}]
[{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}])
=> ({:chr "chr1", :start 1, :end 8} {:chr "chr1", :start 1, :end 8})
src/cljam/io/bed.clj
Outdated
(update :end min (:end y))))) | ||
[]) | ||
[])))] | ||
(intersect (sort-fields xs) (sort-fields ys)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the following case.
(cljam.io.bed/intersect-fields
[{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}]
[{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}])
=> ({:chr "chr1", :start 1, :end 10} {:chr "chr1", :start 9, :end 10} {:chr "chr1", :start 9, :end 11})
Though there're already some fns throwing exceptions from lazy sequences in cljam, I think pre-checking is better in this particular case because all arguments are realized. |
Thank you for pointing it out! I didn't know that. I'll fix it soon.
I see. I'll give it a try in that way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I added a few trivial comments.
I found that result is strange when illegal regions are supplied, such as :start > :end
.
(complement-fields {"chr1" 1000, "chr2" 800}
[{:chr "chr1" :start 2 :end 1}])
;;=> ({:chr "chr1", :start 1, :end 1}
;; {:chr "chr1", :start 2, :end 1000}
;; {:chr "chr2", :start 1, :end 800})
merge-regions
has similar strange behaviors. I wonder these fns should check regions or not. There is no problem in the case of using them via BED reader because BED reader checks regions while reading.
test/cljam/io/bed_test.clj
Outdated
[{:chr "chr1" :start 301 :end 899} {:chr "chr2" :start 301 :end 800}]) | ||
|
||
(is (thrown? IllegalArgumentException | ||
(doall (bed/complement-fields {"chr1" 1000} [{:chr "chr2" :start 1 :end 100}]))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value is not used, so dorun
is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, exactly. The prechecking discussed above will remove this explicit realization, though.
src/cljam/util/region.clj
Outdated
;;; ---------- | ||
|
||
(defn overlapped-regions? | ||
"Returns true iff two regions are overlapped with each other." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iff
-> if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the word in the sense of "if and only if", but it might be somewhat confusing. I'll fix it later.
src/cljam/io/bed.clj
Outdated
The input sequence will first be sorted with sort-fields, which may cause | ||
an extensive memory use for ones with a large number of elements." | ||
[chrom-lengths xs] | ||
(let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel ({:name :len})
is more consistent.
Thank you for your review.
Exactly, these fns have a precondition that the input sequences contain only valid regions, so once the condition is violated they can probably produce garbage. I'm not sure it's reasonable to check region's validity in all these fns. If doing so, and then if you want to use some of these fns combined (say, you want the complement of the intersection of BED sequences I would rather propose adding a new API to help users to check the validity of a BED sequence on their own, something like the sequence version of |
If not checking, I think you don't have to add a new validator function. It can be easily written by |
I just fixed the issues other than the one about overlapping regions and the one about the region validity. As for the region validity, I'll update the docstrings for these fns to describe their limitation of failing to correctly handle sequences containing invalid regions, as @totakke suggested immediately above. As for overlapping regions, I'm still wondering how I should resolve the issue. Is it a good approach to first apply |
It's just my opinion but, I think merging
$ bedtools --version
bedtools v2.27.1
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t1\t10\tC\nchr1\t8\t11\tD')
chr1 0 1 A
chr1 0 1 B
$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
chr1 0 10 A
chr1 8 10 A
chr1 0 10 B
chr1 8 11 B
$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC') -b <(echo -e 'chr1\t8\t11\tD')
chr1 0 10 A
chr1 8 10 A
chr1 0 10 B
chr1 8 11 B merging $ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools merge -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1 0 10 A
chr1 0 11 B and I think this is more consistent with: $ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1 0 10 A
chr1 0 11 B
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools complement -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD') -g <(echo -e 'chr1\t20'))
chr1 0 10 A
chr1 0 11 B |
That may be because If sorting is included in a process, I feel merging input regions is intuitive and reasonable. |
Thank you for the detailed explanation! The results from merged inputs look reasonable to me as well. So, I'm going to change the fns so that they apply |
Done. Along with the fix, I modified |
src/cljam/io/bed.clj
Outdated
(if (fields-<= r1 y) | ||
(cons r1 (subtract (next xs) ys)) | ||
(subtract (cons r1 (next xs)) (next ys))) | ||
(if (fields-<= x y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be always true because x
is totally covered by y
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Good catch! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! Thanks!! 😸
Thank you for good enhancement. |
Thank you for a lot of advice and comments! I couldn't do this alone 😂 |
This PR adds the following new APIs to
cljam.io.bed
:intersect-fields
subtract-fields
complement-fields
Each implementation reflects the behavior of
bedtools
' corresponding subcommand. See the links below for details:There is one thing to be confirmed.
complement-fields
now throws an exception if the map passed as the first argument (chrom-lengths
) lacks the entry for some chromosome appeared in the sequence of BED fields. However, the exception will be deferred until the resulting sequence is realized becausecomplement-fields
returns a lazy sequence and the body ofcomplement-fields
is wrapped withlazy-seq
.The design of throwing exceptions from lazy sequences doesn't look very good to me. Is it better to scan the input sequences and check if every chromosome appeared in them can be found in
chrom-lengths
before processing the sequences? Any suggestions?