Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some BED manipulation APIs #131

Merged
merged 9 commits into from
May 31, 2018
Merged

Conversation

athos
Copy link
Member

@athos athos commented May 29, 2018

This PR adds the following new APIs to cljam.io.bed:

  • intersect-fields
  • subtract-fields
  • complement-fields

Each implementation reflects the behavior of bedtools' corresponding subcommand. See the links below for details:

There is one thing to be confirmed.

complement-fields now throws an exception if the map passed as the first argument (chrom-lengths) lacks the entry for some chromosome appeared in the sequence of BED fields. However, the exception will be deferred until the resulting sequence is realized because complement-fields returns a lazy sequence and the body of complement-fields is wrapped with lazy-seq.

The design of throwing exceptions from lazy sequences doesn't look very good to me. Is it better to scan the input sequences and check if every chromosome appeared in them can be found in chrom-lengths before processing the sequences? Any suggestions?

@athos athos requested review from totakke and alumi May 29, 2018 04:55
@codecov
Copy link

codecov bot commented May 29, 2018

Codecov Report

Merging #131 into master will increase coverage by 0.19%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #131      +/-   ##
==========================================
+ Coverage   85.24%   85.43%   +0.19%     
==========================================
  Files          69       69              
  Lines        4359     4416      +57     
  Branches      420      420              
==========================================
+ Hits         3716     3773      +57     
  Misses        223      223              
  Partials      420      420
Impacted Files Coverage Δ
src/cljam/util/region.clj 100% <100%> (ø) ⬆️
src/cljam/io/bed.clj 91.42% <100%> (+3.72%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 626320b...7408f3b. Read the comment docs.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!
Looks like working fine with common cases but not for some edge cases because BED file format permits overlapping regions.
Please add some test cases for that.

The input sequence will first be sorted with sort-fields, which may cause
an extensive memory use for ones with a large number of elements."
[chrom-lengths xs]
(let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lengths of chromosomes can be obtained by cljam.io.sam/read-refs or cljam.io.sequence/read-indices, but to make up a map we need to write something like (into {} (map (juxt :name :len)) ~~~) which I think is a bit redundant.

How about changing (keys chrom-lengths) to (map first chrom-lengths) or accepting a sequence of maps containing :name and :len?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, considering cooperation with those other functions, it would be more useful to accept such a data structure, right?

OK, I'm going to change the function's signature to accept a sequence of maps like {:name ..., :len ...}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel ({:name :len}) is more consistent.

(subtract (cons r1 (next xs)) (next ys)))
(if (fields-< x y)
(subtract (next xs) ys)
(subtract xs (next ys))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the following case.

(cljam.io.bed/subtract-fields
  [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}]
  [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}])
=> ({:chr "chr1", :start 1, :end 8} {:chr "chr1", :start 1, :end 8})

(update :end min (:end y)))))
[])
[])))]
(intersect (sort-fields xs) (sort-fields ys))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the following case.

(cljam.io.bed/intersect-fields
  [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 1 :end 11}]
  [{:chr "chr1" :start 1 :end 10} {:chr "chr1" :start 9 :end 11}])
=> ({:chr "chr1", :start 1, :end 10} {:chr "chr1", :start 9, :end 10} {:chr "chr1", :start 9, :end 11})

@alumi
Copy link
Member

alumi commented May 29, 2018

The design of throwing exceptions from lazy sequences doesn't look very good to me. Is it better to scan the input sequences and check if every chromosome appeared in them can be found in chrom-lengths before processing the sequences? Any suggestions?

Though there're already some fns throwing exceptions from lazy sequences in cljam, I think pre-checking is better in this particular case because all arguments are realized.

@athos
Copy link
Member Author

athos commented May 30, 2018

Looks like working fine with common cases but not for some edge cases because BED file format permits overlapping regions.

Thank you for pointing it out! I didn't know that. I'll fix it soon.

I think pre-checking is better in this particular case because all arguments are realized.

I see. I'll give it a try in that way.

Copy link
Member

@totakke totakke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I added a few trivial comments.

I found that result is strange when illegal regions are supplied, such as :start > :end.

(complement-fields {"chr1" 1000, "chr2" 800}
                   [{:chr "chr1" :start 2 :end 1}])
;;=> ({:chr "chr1", :start 1, :end 1}
;;    {:chr "chr1", :start 2, :end 1000}
;;    {:chr "chr2", :start 1, :end 800})

merge-regions has similar strange behaviors. I wonder these fns should check regions or not. There is no problem in the case of using them via BED reader because BED reader checks regions while reading.

[{:chr "chr1" :start 301 :end 899} {:chr "chr2" :start 301 :end 800}])

(is (thrown? IllegalArgumentException
(doall (bed/complement-fields {"chr1" 1000} [{:chr "chr2" :start 1 :end 100}])))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value is not used, so dorun is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, exactly. The prechecking discussed above will remove this explicit realization, though.

;;; ----------

(defn overlapped-regions?
"Returns true iff two regions are overlapped with each other."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iff -> if

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the word in the sense of "if and only if", but it might be somewhat confusing. I'll fix it later.

The input sequence will first be sorted with sort-fields, which may cause
an extensive memory use for ones with a large number of elements."
[chrom-lengths xs]
(let [chrs (sort-by chr/chromosome-order-key (keys chrom-lengths))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel ({:name :len}) is more consistent.

@athos
Copy link
Member Author

athos commented May 30, 2018

Thank you for your review.

I found that result is strange when illegal regions are supplied, such as :start > :end.

Exactly, these fns have a precondition that the input sequences contain only valid regions, so once the condition is violated they can probably produce garbage.

I'm not sure it's reasonable to check region's validity in all these fns. If doing so, and then if you want to use some of these fns combined (say, you want the complement of the intersection of BED sequences xs and ys), the validity check will be run more than once unnecessarily.

I would rather propose adding a new API to help users to check the validity of a BED sequence on their own, something like the sequence version of util.region/valid-region?.

@totakke
Copy link
Member

totakke commented May 30, 2018

If not checking, I think you don't have to add a new validator function. It can be easily written by every? util.region/valid-region?. Instead, you should add a statement to the docstring that behavior is not supported when illegal regions are supplied.

@athos
Copy link
Member Author

athos commented May 30, 2018

I just fixed the issues other than the one about overlapping regions and the one about the region validity.

As for the region validity, I'll update the docstrings for these fns to describe their limitation of failing to correctly handle sequences containing invalid regions, as @totakke suggested immediately above.

As for overlapping regions, I'm still wondering how I should resolve the issue. Is it a good approach to first apply merge-fields to the input sequences instead of just doing sort-fields? I have no idea what the expected result should be for overlapping cases.

@alumi
Copy link
Member

alumi commented May 30, 2018

As for overlapping regions, I'm still wondering how I should resolve the issue. Is it a good approach to first apply merge-fields to the input sequences instead of just doing sort-fields? I have no idea what the expected result should be for overlapping cases.

It's just my opinion but, I think merging ys is reasonable.

bedtools subtract acts like that:

$ bedtools --version
bedtools v2.27.1
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t1\t10\tC\nchr1\t8\t11\tD')
chr1	0	1	A
chr1	0	1	B

bedtools intersect produces results that are unintuitive to me:

$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD')
chr1	0	10	A
chr1	8	10	A
chr1	0	10	B
chr1	8	11	B
$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC') -b <(echo -e 'chr1\t8\t11\tD')
chr1	0	10	A
chr1	8	10	A
chr1	0	10	B
chr1	8	11	B

merging ys gives followings:

$ bedtools intersect -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools merge -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1	0	10	A
chr1	0	11	B

and I think this is more consistent with:

$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD'))
chr1	0	10	A
chr1	0	11	B
$ bedtools subtract -a <(echo -e 'chr1\t0\t10\tA\nchr1\t0\t11\tB') -b <(bedtools complement -i <(echo -e 'chr1\t0\t10\tC\nchr1\t8\t11\tD') -g <(echo -e 'chr1\t20'))
chr1	0	10	A
chr1	0	11	B

@totakke
Copy link
Member

totakke commented May 30, 2018

That may be because bedtools intersect accepts unsorted regions. bedtools clearly distinguishes sort and other processes.

If sorting is included in a process, I feel merging input regions is intuitive and reasonable.

@athos
Copy link
Member Author

athos commented May 31, 2018

Thank you for the detailed explanation!

The results from merged inputs look reasonable to me as well. So, I'm going to change the fns so that they apply merge-fields instead of sort-fields to their second argument.

@athos
Copy link
Member Author

athos commented May 31, 2018

Done.

Along with the fix, I modified fields-< to tweak which sequence should be taken next when the current head elements of them have the same :end.

(if (fields-<= r1 y)
(cons r1 (subtract (next xs) ys))
(subtract (cons r1 (next xs)) (next ys)))
(if (fields-<= x y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be always true because x is totally covered by y 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Good catch! 🙏

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Thanks!! 😸

@totakke totakke merged commit 966ae4f into master May 31, 2018
@totakke
Copy link
Member

totakke commented May 31, 2018

Thank you for good enhancement.

@totakke totakke deleted the feature/bed-manipulation-fns branch May 31, 2018 05:08
@athos
Copy link
Member Author

athos commented May 31, 2018

Thank you for a lot of advice and comments! I couldn't do this alone 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants