Speed up breath first search #99
Merging master updates with patch speedup
Thanks @gpratt! It looks like the regression test including unmapped reads failed. I'll have a look into this soon and merge when the tests all pass |
Ok, I figured out what the problem is, not sure how you want to deal with it. The issue lies in the sort command of
According to this helpful wiki on Python sorts, if two items have the same key, they keep their original relative order after sorting. Because the two versions of breadth-first search build the cluster set in a slightly different order, the sort at this step, and therefore the final output, also come out in a slightly different order. One easy way to deal with the issue would be to sort by both count and UMI, which guarantees a deterministic order from here on out, but it wouldn't be backwards compatible, and the existing tests will still probably fail. I've made that fix in the new commit. This causes the tests for both group_directional and group_directional_unmapped to fail, but if we adjust the tests they should remain more stable in the future. Let me know what you think. |
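To illustrate the point about sort stability, here is a minimal sketch. The UMI sequences, counts, and function names are made up for the example and are not taken from the code base; it only shows how a count-only key lets input order leak into the output, while a (count, UMI) key does not.

```python
from collections import Counter

# Hypothetical UMI counts; "ACGT" and "TCGT" tie on count.
counts = Counter({"TCGT": 5, "ACGT": 5, "GGGT": 9})


def sort_by_count(umis):
    # Python's sort is stable, so tied UMIs keep their input order.
    return sorted(umis, key=lambda u: counts[u], reverse=True)


def sort_by_count_then_umi(umis):
    # Ties on count are broken by the UMI sequence itself,
    # so the result no longer depends on the input order.
    return sorted(umis, key=lambda u: (-counts[u], u))


# The same UMIs, as two search implementations might emit them.
order_a = ["TCGT", "ACGT", "GGGT"]
order_b = ["ACGT", "TCGT", "GGGT"]

print(sort_by_count(order_a), sort_by_count(order_b))
print(sort_by_count_then_umi(order_a), sort_by_count_then_umi(order_b))
```

With the count-only key the two input orders give different results; with the compound key they agree.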
This sounds like a sensible solution. Any thoughts @IanSudbery? I'll pull from your fork later to profile dedup and convince myself that the output is identical but just in a different order. Then we can merge and regenerate the test files. |
I suspect it won’t be identical up to sort order, because sorting the UMIs
differently at that point will lead to a different UMI being selected as
the representative one for that group. What I did when I introduced the
sort was to check that the groups assigned were the same, i.e. that the
UG tag from the group output was the same once the records had been sorted
(which I did by cutting the last column off the SAM file with cut).
As for the sort itself, I think the proposed solution looks good. I can’t
imagine it causes much in the way of a slowdown, which would be my only
worry.
|
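One way to check that two runs assign the same groups, regardless of output order or of which label each group receives, is to compare them as partitions of the reads. This is only a sketch of that idea, not the cut-based check described above; the read IDs, group labels, and function names are hypothetical.

```python
def as_partition(assignments):
    """Turn {read_id: group_label} into a set of frozensets of read IDs."""
    groups = {}
    for read_id, label in assignments.items():
        groups.setdefault(label, set()).add(read_id)
    return {frozenset(members) for members in groups.values()}


def same_grouping(a, b):
    # Equal partitions mean the same reads were grouped together,
    # even if the group labels or the record order differ.
    return as_partition(a) == as_partition(b)


# Hypothetical group assignments from two versions of the tool.
old_run = {"read1": 0, "read2": 0, "read3": 1}
new_run = {"read3": 0, "read1": 1, "read2": 1}    # same groups, relabelled
other_run = {"read1": 0, "read2": 1, "read3": 1}  # genuinely different

print(same_grouping(old_run, new_run))
print(same_grouping(old_run, other_run))
```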
Ah yes, I meant with respect to the UMI sequence rather than the read, which as you say will be different. I was going to check that the set of UMIs and group tags was identical. I should have time this afternoon/tomorrow. |
I was thinking that you might have two separate UMIs with the same count, and which one was selected would change... But of course this won't happen for directional, because two UMIs with the same count would never both be the best UMI in a group! Might this be a problem for adjacency? |
@IanSudbery Actually they can be in the same group if the counts are very low due to the '-1' in the threshold. For example, see the output below from the group directional test where the output order and group UMI are different. My one concern about sorting on counts and UMI is that this will slightly bias the composition of the final UMIs. I guess this is probably a minor concern though. The only other option would be to break ties randomly. Ignore my comment regarding testing this with dedup. These changes only affect the output of group. master:
gpratt_fork:
|
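The effect of the '-1' can be worked through with a small sketch. It assumes the directional count rule has the form count_a >= 2 * count_b - 1 (the edit-distance check is left out), and the function names are hypothetical.

```python
def directional_edge(count_a, count_b):
    # Assumed form of the directional count rule:
    # UMI a can absorb UMI b when count_a >= 2 * count_b - 1.
    return count_a >= 2 * count_b - 1


def directional_edge_no_offset(count_a, count_b):
    # The same rule without the '-1', for comparison only.
    return count_a >= 2 * count_b


# Two UMIs that each have a single read satisfy the rule: 1 >= 2*1 - 1.
print(directional_edge(1, 1))
# Without the '-1' they would not: 1 >= 2 is false.
print(directional_edge_no_offset(1, 1))
# Equal counts above 1 never satisfy it: 2 >= 3 is false.
print(directional_edge(2, 2))
```

Under this assumed rule, tied counts can only end up in the same group when both counts are 1, which matches the "very low counts" case described above.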
Well, not sorting is effectively picking one at random. As long as we are
happy that the new output is just as valid as the old, we could just
overwrite the old tests.
|
I was thinking random but stable irrespective of the order of the UMIs when the hash seed is set, so that even if we change the search function in the future, the tests would not fail. This would require breaking ties by, e.g., randomly selecting from the hashed sequences, assuming Python's method for hashing strings isn't going to change any time soon. This is probably going too far, though, given that the issue here is just getting the tests to pass. I'm inclined to just accept the new output as valid and avoid the bias arising from the additional level of sorting. |
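A rough sketch of an order-independent tie-break follows. Note that it swaps in a hashlib digest rather than Python's built-in hash(), because a digest does not depend on the hash seed; the function names, UMIs, and counts are all hypothetical.

```python
import hashlib


def stable_tiebreak(umi):
    # A digest of the sequence is arbitrary with respect to the sequence
    # itself, but identical on every run and every machine.
    return hashlib.sha256(umi.encode("ascii")).hexdigest()


def pick_representative(umis, counts):
    # Highest count wins; ties are broken by the digest, not by input order.
    return min(umis, key=lambda u: (-counts[u], stable_tiebreak(u)))


# Hypothetical group of UMIs in which two tie on count.
counts = {"ACGT": 1, "TCGT": 1, "GGGT": 1}

first = pick_representative(["ACGT", "TCGT", "GGGT"], counts)
second = pick_representative(["GGGT", "TCGT", "ACGT"], counts)
print(first == second)
```

This avoids favouring alphabetically early UMIs, at the cost of an extra hash per comparison key.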
This reverts commit e7d3b28. Allows for unordered, but correct groups in the group command
@TomSmithCGAT Done |
Thanks! |
Any time! You guys have a great, easy-to-read code base and a really useful tool. Always happy to try and help make it a bit better. |
Thanks gpratt. Pleasure to work with you. Do you mind me asking how much it sped up your processing by? |
I'm reprocessing all of the ENCODE eCLIP data (I'm in the lab that
generated that data).
There were ~20 datasets that didn't finish after running for 48 hours; they
now run in less than one hour. So depending on how you want to count it,
days or weeks :)
Same thing with the first fix I submitted.
…On Mar 28, 2017 3:23 AM, "Ian Sudbery" ***@***.***> wrote:
Thanks gpratt. Pleasure to work with you. Do you mind me asking how much
it sped up your processing by?
|
Cheers. That's quite some improvement! |
This should increase the speed of breadth-first search significantly without using the recursive approach, which may or may not be broken. It takes care of my problems from issue #31. I get a massive speed increase compared to the previous version.
I tested this update on one full bam file, and saw no changes in the output. Given that this is a change to a core algorithm, it might deserve some additional testing on your end before merging it into master.
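For readers unfamiliar with the technique, a non-recursive breadth-first search over an adjacency list can be sketched as below. This is a generic illustration, not the patch in this pull request; the function name, the adjacency-list layout, and the example graph are assumptions.

```python
from collections import deque


def connected_components(adj_list):
    """Find connected components with an iterative breadth-first search.

    adj_list maps each node to an iterable of its neighbours and is
    assumed to describe an undirected graph.
    """
    seen = set()
    components = []
    for start in adj_list:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adj_list.get(node, ()):
                # Mark nodes as seen when they are queued, so each node
                # enters the queue at most once.
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical UMI graph: A-B-C are linked, D stands alone.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
print(connected_components(graph))
```

Using a deque keeps each dequeue O(1), and tracking visited nodes in a set means every node and edge is processed once, with no recursion depth limit to worry about.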