Speed up breadth first search #99

gpratt · 2017-03-25T07:54:12Z

This should significantly increase the speed of breadth first search without using the recursive approach, which may or may not be broken. It takes care of my problems from issue #31. I get a massive speed increase compared to the previous version.
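For readers unfamiliar with the idea, a non-recursive breadth first search over an adjacency list can be sketched as below. This is only an illustration of the general technique, not the code in this PR: the function name `connected_components`, the `adj_list` layout, and the example UMIs are all assumptions made for the sketch.

```python
from collections import deque


def connected_components(adj_list):
    """Group nodes into connected components with an iterative BFS.

    adj_list is a hypothetical dict mapping each node to a list of its
    neighbours; every neighbour is assumed to also be a key.
    """
    seen = set()
    components = []
    for start in sorted(adj_list):
        if start in seen:
            continue
        # Mark nodes as seen when they are queued, so each node is
        # queued at most once and no recursion is needed.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adj_list[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Made-up UMIs, purely for illustration.
adj_list = {
    "ATAT": ["ATAA"],
    "ATAA": ["ATAT", "TTAA"],
    "TTAA": ["ATAA"],
    "GGGG": [],
}
components = connected_components(adj_list)
```

Using a `deque` keeps each pop from the front of the queue at constant cost, and the explicit queue avoids Python's recursion limit on large networks.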

I tested this update on one full bam file, and saw no changes in the output. Given that this is a change to a core algorithm, it might deserve some additional testing on your end before merging it into master.

Merging master updates with patch speedup

TomSmithCGAT · 2017-03-25T10:26:48Z

Thanks @gpratt! It looks like the regression test that includes unmapped reads failed. I'll have a look into this soon and merge when all the tests pass.

gpratt · 2017-03-26T05:25:37Z

Ok, I figured out what the problem is, but I'm not sure how you want to deal with it.

The issue lies in the sort command of _group_directional. Specifically the command:

cluster = sorted(cluster, key=lambda x: counts[x], reverse=True)

According to this helpful wiki on Python sorts, sorting is stable: if two items have the same key, they keep their original relative order in the sorted output.

Because the cluster set is built in a slightly different order by the two versions of breadth first search, the sort at this step, and therefore the final output, come out in a slightly different order. One easy way to deal with the issue would be to sort by both count and UMI, which guarantees a stable order from here on out, but it wouldn't be backwards compatible, and the existing tests will probably still fail. I've made that fix in the new commit. It causes the tests for both group_directional and group_directional_unmapped to fail, but if we adjust the tests they should remain more stable in the future. Let me know what you think.
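To make the ordering effect concrete, here is a small sketch with made-up counts. The `counts` values and the two input orders are assumptions for illustration, and the `(count, UMI)` key is only one possible form of the tie-break described above, not necessarily what the commit does.

```python
# Hypothetical counts: two UMIs tie on count.
counts = {"CATCC": 1, "TATCC": 1, "GGGGG": 5}

# The same cluster, built in two different orders.
order_a = ["CATCC", "TATCC", "GGGGG"]
order_b = ["TATCC", "CATCC", "GGGGG"]

# Sorting on count alone: the sort is stable, so tied UMIs keep their
# input order and the two results differ.
by_count_a = sorted(order_a, key=lambda x: counts[x], reverse=True)
by_count_b = sorted(order_b, key=lambda x: counts[x], reverse=True)

# Sorting on (descending count, UMI): ties are broken alphabetically,
# so the result no longer depends on the input order.
by_both_a = sorted(order_a, key=lambda x: (-counts[x], x))
by_both_b = sorted(order_b, key=lambda x: (-counts[x], x))
```

With the count-only key, the two results agree on the top UMI but disagree on the order of the tied pair; with the two-part key they are identical.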

TomSmithCGAT · 2017-03-27T07:32:59Z

This sounds like a sensible solution. Any thoughts @IanSudbery?

I'll pull from your fork later to profile dedup and convince myself that the output is identical but just in a different order. Then we can merge and regenerate the test files.

IanSudbery · 2017-03-27T10:07:49Z

I suspect it won’t be identical up to sort order, because sorting the UMIs differently at that point will lead to a different UMI being selected as the representative one for that group. What I did when I introduced the sort was to check that the groups assigned were the same, i.e. that the UG tag from the group output was the same once the records had been sorted (which I did by cutting the last column off the SAM file with cut). As for the sort itself, I think the proposed solution looks good. I can’t imagine it causes much in the way of a slowdown, which would be my only worry.
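The check described here (drop the last column, sort, compare) could be sketched in Python roughly as follows. The helper name `strip_last_column` and the abbreviated records are made up for illustration; the sketch assumes the representative-UMI tag is the final tab-separated column and that group IDs are numbered the same way in both outputs.

```python
def strip_last_column(sam_lines):
    """Drop the final tab-separated column from each record, then sort."""
    return sorted(line.rstrip("\n").rsplit("\t", 1)[0] for line in sam_lines)


# Abbreviated, made-up records (not complete SAM lines): the two outputs
# differ only in record order and in the final BX column.
old_output = [
    "read1_CATCC\t16\tchr19\t100\tUG:i:7\tBX:Z:CATCC",
    "read2_TATCC\t16\tchr19\t90\tUG:i:7\tBX:Z:CATCC",
]
new_output = [
    "read2_TATCC\t16\tchr19\t90\tUG:i:7\tBX:Z:TATCC",
    "read1_CATCC\t16\tchr19\t100\tUG:i:7\tBX:Z:TATCC",
]

same_groups = strip_last_column(old_output) == strip_last_column(new_output)
```

If the group numbering itself could change between versions, comparing the partition of read names rather than the raw UG values would be the safer check.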

TomSmithCGAT · 2017-03-27T10:14:29Z

Ah yes, I meant with respect to the UMI sequence rather than the read, which as you say will be different. I was going to check that the set of UMIs and group tags was identical. I should have time this afternoon/tomorrow.

IanSudbery · 2017-03-27T11:13:59Z

I was thinking that you might have two separate UMIs with the same count, and which one was selected would change... But of course this won't happen for directional, because two UMIs with the same count would never be the best UMI in a group!

Might this be a problem for adjacency?

TomSmithCGAT · 2017-03-27T12:15:08Z

@IanSudbery Actually they can be in the same group if the counts are very low due to the '-1' in the threshold. For example, see the output below from the group directional test where the output order and group UMI are different. My one concern about sorting on counts and UMI is that this will slightly bias the composition of the final UMIs. I guess this is probably a minor concern though. The only other option would be to break ties randomly.

Ignore my comment regarding testing this with dedup. These changes only affect the output of group.

master:

SRR2057595.13577605_CATCC       16      chr19   4078437 255     42M     *       0       0       *       *       XA:i:1  MD:Z:5A36       NM:i:1  UG:i:118        BX:Z:CATCC
SRR2057595.6226935_TATCC        16      chr19   4078412 255     67M     *       0       0       *       *       XA:i:2  MD:Z:3G26A36    NM:i:2  UG:i:118        BX:Z:CATCC

gpratt_fork:

SRR2057595.6226935_TATCC        16      chr19   4078412 255     67M     *       0       0       *       *       XA:i:2  MD:Z:3G26A36    NM:i:2  UG:i:118        BX:Z:TATCC
SRR2057595.13577605_CATCC       16      chr19   4078437 255     42M     *       0       0       *       *       XA:i:1  MD:Z:5A36       NM:i:1  UG:i:118        BX:Z:TATCC
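The effect of the '-1' can be seen with a small sketch. It assumes the directional rule has the form count_a >= 2 * count_b - 1 for UMIs one edit apart, which is how the method is usually described but is not quoted in this thread; the function name `directional_edge` is made up.

```python
def directional_edge(count_a, count_b):
    """True if UMI a may absorb UMI b under the assumed threshold rule."""
    return count_a >= 2 * count_b - 1


# For equal counts n the rule reads n >= 2n - 1, which only holds for n <= 1.
tie_at_one = directional_edge(1, 1)
tie_at_two = directional_edge(2, 2)
tie_at_ten = directional_edge(10, 10)
```

Under this assumption, two UMIs that each have a count of 1 can link to each other in both directions, so which one becomes the group UMI depends on the order in which they are visited.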

IanSudbery · 2017-03-27T13:15:24Z

Well, not sorting is effectively picking one at random. As long as we are happy that the new output is just as valid as the old, we could just overwrite the old tests.

TomSmithCGAT · 2017-03-27T15:20:30Z

I was thinking of something random but stable irrespective of the order of the UMIs when the hash seed is set, so that even if we change the search function in the future, the tests would not fail. This would require breaking ties by, e.g., randomly selecting from the hashed sequences, assuming Python's method for hashing strings isn't going to change any time soon. This is probably going too far though, given the issue here is just getting the tests to pass.
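A rough sketch of that idea is below. Note that it swaps in `hashlib` for the built-in `hash()`: the built-in string hash varies between processes unless PYTHONHASHSEED is fixed, whereas a SHA-256 digest is the same everywhere. The helper name `stable_key` and the counts are made up for illustration.

```python
import hashlib


def stable_key(umi, counts):
    """Sort key: highest count first, ties broken by a digest of the UMI."""
    digest = hashlib.sha256(umi.encode("ascii")).hexdigest()
    return (-counts[umi], digest)


# Hypothetical counts with a tie between two UMIs.
counts = {"CATCC": 1, "TATCC": 1, "GGGGG": 5}

# Two different input orders give the same sorted result.
order_a = sorted(["CATCC", "TATCC", "GGGGG"], key=lambda u: stable_key(u, counts))
order_b = sorted(["TATCC", "GGGGG", "CATCC"], key=lambda u: stable_key(u, counts))
```

The winner among tied UMIs is arbitrary with respect to sequence content but reproducible, so it avoids a systematic alphabetical preference while keeping the output independent of traversal order.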

I'm inclined to just accept the new output as valid and avoid the bias arising from the additional level of sorting.

TomSmithCGAT · 2017-03-27T16:09:21Z

@gpratt would you mind rolling back the last commit (e7d3b28) and then I'll merge your fork into master and regenerate the test files.

This reverts commit e7d3b28. Allows for unordered, but correct groups in the group command

gpratt · 2017-03-27T17:41:07Z

@TomSmithCGAT Done

TomSmithCGAT · 2017-03-27T19:38:07Z

Thanks!

gpratt · 2017-03-27T19:48:58Z

Any time! You guys have a great, easy-to-read code base and a really useful tool. Always happy to try and help make it a bit better.

IanSudbery · 2017-03-28T10:23:57Z

Thanks gpratt. Pleasure to work with you. Do you mind me asking how much it sped up your processing by?

gpratt · 2017-03-28T14:42:14Z

I'm reprocessing all of the ENCODE eCLIP data (I'm in the lab that generated that data). There were ~20 datasets that didn't finish after running for 48 hours; they now run in less than one hour. So depending on how you want to count it, days or weeks :) Same thing with the first fix I submitted.

IanSudbery · 2017-03-29T17:36:08Z

Cheers. That's quite some improvement!

Gabriel Pratt added 2 commits March 25, 2017 00:33

much faster serial breath first search

d21327e

Merge remote-tracking branch 'upstream/master'

e261ae3

Merging master updates with patch speedup

created more stable directional sort

e7d3b28

Gabriel Pratt added 2 commits March 27, 2017 10:23

Revert "created more stable directional sort"

7e1de9c

This reverts commit e7d3b28. Allows for unordered, but correct groups in the group command

fixing style issues

03076e8

TomSmithCGAT merged commit b8843fd into CGATOxford:master Mar 27, 2017
