Add some BED manipulation APIs #131

athos · 2018-05-29T04:55:04Z

This PR adds the following new APIs to cljam.io.bed:

intersect-fields
subtract-fields
complement-fields

Each implementation reflects the behavior of bedtools' corresponding subcommand. See the links below for details:

intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
subtract: http://bedtools.readthedocs.io/en/latest/content/tools/subtract.html
complement: http://bedtools.readthedocs.io/en/latest/content/tools/complement.html

There is one thing to be confirmed.

complement-fields now throws an exception if the map passed as the first argument (chrom-lengths) lacks the entry for some chromosome appeared in the sequence of BED fields. However, the exception will be deferred until the resulting sequence is realized because complement-fields returns a lazy sequence and the body of complement-fields is wrapped with lazy-seq.

The design of throwing exceptions from lazy sequences doesn't look very good to me. Is it better to scan the input sequences and check if every chromosome appeared in them can be found in chrom-lengths before processing the sequences? Any suggestions?

codecov · 2018-05-29T05:09:09Z

Codecov Report

Merging #131 into master will increase coverage by 0.19%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #131      +/-   ##
==========================================
+ Coverage   85.24%   85.43%   +0.19%     
==========================================
  Files          69       69              
  Lines        4359     4416      +57     
  Branches      420      420              
==========================================
+ Hits         3716     3773      +57     
  Misses        223      223              
  Partials      420      420

Impacted Files	Coverage Δ
src/cljam/util/region.clj	`100% <100%> (ø)`	⬆️
src/cljam/io/bed.clj	`91.42% <100%> (+3.72%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 626320b...7408f3b. Read the comment docs.

alumi

Thank you for the PR!
Looks like working fine with common cases but not for some edge cases because BED file format permits overlapping regions.
Please add some test cases for that.

alumi · 2018-05-29T10:37:44Z

src/cljam/io/bed.clj

+  The input sequence will first be sorted with sort-fields, which may cause
+  an extensive memory use for ones with a large number of elements."
+  [chrom-lengths xs]
+  (let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))]


Lengths of chromosomes can be obtained by cljam.io.sam/read-refs or cljam.io.sequence/read-indices, but to make up a map we need to write something like (into {} (map (juxt :name :len)) ~~~) which I think is a bit redundant.

How about changing (keys chrom-lengths) to (map first chrom-lengths) or accepting a sequence of maps containing :name and :len?

So, considering cooperation with those other functions, it would be more useful to accept such a data structure, right?

OK, I'm going to change the function's signature to accept a sequence of maps like {:name ..., :len ...}.

I also feel ({:name :len}) is more consistent.

alumi · 2018-05-29T10:49:38Z

src/cljam/io/bed.clj

+                         (subtract (cons r1 (next xs)) (next ys)))
+                       (if (fields-< x y)
+                         (subtract (next xs) ys)
+                         (subtract xs (next ys))))))


Please check the following case.

(cljam.io.bed/subtract-fields [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}] [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}]) => ({:chr "chr1", :start 1, :end 8} {:chr "chr1", :start 1, :end 8})

alumi · 2018-05-29T10:50:48Z

src/cljam/io/bed.clj

+                             (update :end min (:end y)))))
+                 [])
+               [])))]
+    (intersect (sort-fields xs) (sort-fields ys))))


Please check the following case.

(cljam.io.bed/intersect-fields [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}] [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}]) => ({:chr "chr1", :start 1, :end 10} {:chr "chr1", :start 9, :end 10} {:chr "chr1", :start 9, :end 11})

alumi · 2018-05-29T11:12:29Z

The design of throwing exceptions from lazy sequences doesn't look very good to me. Is it better to scan the input sequences and check if every chromosome appeared in them can be found in chrom-lengths before processing the sequences? Any suggestions?

Though there're already some fns throwing exceptions from lazy sequences in cljam, I think pre-checking is better in this particular case because all arguments are realized.

athos · 2018-05-30T02:11:12Z

Looks like working fine with common cases but not for some edge cases because BED file format permits overlapping regions.

Thank you for pointing it out! I didn't know that. I'll fix it soon.

I think pre-checking is better in this particular case because all arguments are realized.

I see. I'll give it a try in that way.

totakke

Thanks. I added a few trivial comments.

I found that result is strange when illegal regions are supplied, such as :start > :end.

(complement-fields {"chr1" 1000, "chr2" 800}
                   [{:chr "chr1" :start 2 :end 1}])
;;=> ({:chr "chr1", :start 1, :end 1}
;;    {:chr "chr1", :start 2, :end 1000}
;;    {:chr "chr2", :start 1, :end 800})

merge-regions has similar strange behaviors. I wonder these fns should check regions or not. There is no problem in the case of using them via BED reader because BED reader checks regions while reading.

totakke · 2018-05-30T02:22:51Z

test/cljam/io/bed_test.clj

+    [{:chr "chr1" :start 301 :end 899} {:chr "chr2" :start 301 :end 800}])
+
+  (is (thrown? IllegalArgumentException
+               (doall (bed/complement-fields {"chr1" 1000} [{:chr "chr2" :start 1 :end 100}])))))


The return value is not used, so dorun is better.

Ah, exactly. The prechecking discussed above will remove this explicit realization, though.

totakke · 2018-05-30T02:29:23Z

src/cljam/util/region.clj

+;;; ----------
+
+(defn overlapped-regions?
+  "Returns true iff two regions are overlapped with each other."


iff -> if

I used the word in the sense of "if and only if", but it might be somewhat confusing. I'll fix it later.

totakke · 2018-05-30T03:25:14Z

src/cljam/io/bed.clj

+  The input sequence will first be sorted with sort-fields, which may cause
+  an extensive memory use for ones with a large number of elements."
+  [chrom-lengths xs]
+  (let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))]


I also feel ({:name :len}) is more consistent.

athos · 2018-05-30T04:55:12Z

Thank you for your review.

I found that result is strange when illegal regions are supplied, such as :start > :end.

Exactly, these fns have a precondition that the input sequences contain only valid regions, so once the condition is violated they can probably produce garbage.

I'm not sure it's reasonable to check region's validity in all these fns. If doing so, and then if you want to use some of these fns combined (say, you want the complement of the intersection of BED sequences xs and ys), the validity check will be run more than once unnecessarily.

I would rather propose adding a new API to help users to check the validity of a BED sequence on their own, something like the sequence version of util.region/valid-region?.

totakke · 2018-05-30T05:40:15Z

If not checking, I think you don't have to add a new validator function. It can be easily written by every? util.region/valid-region?. Instead, you should add a statement to the docstring that behavior is not supported when illegal regions are supplied.

athos · 2018-05-30T07:09:11Z

I just fixed the issues other than the one about overlapping regions and the one about the region validity.

As for the region validity, I'll update the docstrings for these fns to describe their limitation of failing to correctly handle sequences containing invalid regions, as @totakke suggested immediately above.

As for overlapping regions, I'm still wondering how I should resolve the issue. Is it a good approach to first apply merge-fields to the input sequences instead of just doing sort-fields? I have no idea what the expected result should be for overlapping cases.

alumi · 2018-05-30T09:10:00Z

As for overlapping regions, I'm still wondering how I should resolve the issue. Is it a good approach to first apply merge-fields to the input sequences instead of just doing sort-fields? I have no idea what the expected result should be for overlapping cases.

It's just my opinion but, I think merging ys is reasonable.

bedtools subtract acts like that:

$ bedtools --version
bedtools v2.27.1
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t1\t10\tC\nchr1\t8\t11\tD')
chr1	0	1	A
chr1	0	1	B

bedtools intersect produces results that are unintuitive to me:

$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
chr1	0	10	A
chr1	8	10	A
chr1	0	10	B
chr1	8	11	B
$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC') -b <(echo -e 'chr1\t8\t11\tD')
chr1	0	10	A
chr1	8	10	A
chr1	0	10	B
chr1	8	11	B

merging ys gives followings:

$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools merge -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1	0	10	A
chr1	0	11	B

and I think this is more consistent with:

$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1	0	10	A
chr1	0	11	B
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools complement -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD') -g <(echo -e 'chr1\t20'))
chr1	0	10	A
chr1	0	11	B

totakke · 2018-05-30T10:19:20Z

That may be because bedtools intersect accepts unsorted regions. bedtools clearly distinguishes sort and other processes.

If sorting is included in a process, I feel merging input regions is intuitive and reasonable.

athos · 2018-05-31T02:39:31Z

Thank you for the detailed explanation!

The results from merged inputs look reasonable to me as well. So, I'm going to change the fns so that they apply merge-fields instead of sort-fields to their second argument.

athos · 2018-05-31T03:55:13Z

Done.

Along with the fix, I modified fields-< to tweak which sequence should be taken next when the current head elements of them have the same :end.

alumi · 2018-05-31T04:09:29Z

src/cljam/io/bed.clj

+                       (if (fields-<= r1 y)
+                         (cons r1 (subtract (next xs) ys))
+                         (subtract (cons r1 (next xs)) (next ys)))
+                       (if (fields-<= x y)


Seems to be always true because x is totally covered by y 🤔

Ah, right. Good catch! 🙏

alumi

Perfect! Thanks!! 😸

totakke · 2018-05-31T05:08:11Z

Thank you for good enhancement.

athos · 2018-05-31T05:12:58Z

Thank you for a lot of advice and comments! I couldn't do this alone 😂

athos added 3 commits May 29, 2018 13:49

Implement BED intersection/subtraction/complement

6ac3e25

Add unit tests for BED intersection/subtraction/complement

e3e5ada

Fix off-by-one bugs found in complement-fields

7681c5f

athos assigned totakke May 29, 2018

athos requested review from totakke and alumi May 29, 2018 04:55

alumi requested changes May 29, 2018

View reviewed changes

athos added 2 commits May 30, 2018 12:10

Change signature of complemet-fields to ease cooperation with other APIs

c17efb8

Precheck if length of every related chromosome is specified

499f1c0

totakke reviewed May 30, 2018

View reviewed changes

Fix typo

baeff03

Add notes to describe the precondition around region validity

04a5ab3

Merge input sequences to handle overlapped regions correctly

b3cbe80

alumi reviewed May 31, 2018

View reviewed changes

Remove redundant conditional

7408f3b

alumi approved these changes May 31, 2018

View reviewed changes

totakke approved these changes May 31, 2018

View reviewed changes

totakke merged commit 966ae4f into master May 31, 2018

totakke deleted the feature/bed-manipulation-fns branch May 31, 2018 05:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add some BED manipulation APIs #131

Add some BED manipulation APIs #131

athos commented May 29, 2018

codecov bot commented May 29, 2018 •

edited

Loading

alumi left a comment

alumi May 29, 2018

athos May 30, 2018

totakke May 30, 2018

alumi May 29, 2018

alumi May 29, 2018

alumi commented May 29, 2018

athos commented May 30, 2018

totakke left a comment

totakke May 30, 2018

athos May 30, 2018

totakke May 30, 2018

athos May 30, 2018

totakke May 30, 2018

athos commented May 30, 2018

totakke commented May 30, 2018

athos commented May 30, 2018 •

edited

Loading

alumi commented May 30, 2018

totakke commented May 30, 2018

athos commented May 31, 2018

athos commented May 31, 2018

alumi May 31, 2018

athos May 31, 2018

alumi left a comment

totakke commented May 31, 2018

athos commented May 31, 2018

Add some BED manipulation APIs #131

Add some BED manipulation APIs #131

Conversation

athos commented May 29, 2018

codecov bot commented May 29, 2018 • edited Loading

Codecov Report

alumi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alumi commented May 29, 2018

athos commented May 30, 2018

totakke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

athos commented May 30, 2018

totakke commented May 30, 2018

athos commented May 30, 2018 • edited Loading

alumi commented May 30, 2018

totakke commented May 30, 2018

athos commented May 31, 2018

athos commented May 31, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alumi left a comment

Choose a reason for hiding this comment

totakke commented May 31, 2018

athos commented May 31, 2018

codecov bot commented May 29, 2018 •

edited

Loading

athos commented May 30, 2018 •

edited

Loading