Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up breath first search #99

Merged
merged 5 commits into from
Mar 27, 2017
Merged

Conversation

gpratt
Copy link
Contributor

@gpratt gpratt commented Mar 25, 2017

This should increase the speed of breath first search significantly, without using the recursive approach, that may or may not be broken. Taking care of my problems from issue #31. I get a massive speed increase compared to the previous version.

I tested this update on one full bam file, and saw no changes in the output. Given that this is a change to a core algorithm, it might deserve some additional testing on your end before merging it into master.

Gabriel Pratt added 2 commits March 25, 2017 00:33
@TomSmithCGAT
Copy link
Member

Thanks @gpratt! It looks like the regression test including unmapped reads failed. I'll have a look into this soon and merge when the tests all pass

@gpratt
Copy link
Contributor Author

gpratt commented Mar 26, 2017

Ok, I figured out what the problem is, not sure how you want to deal with it.

The issue lies in the sort command of _group_directional . Specifically the command:

cluster = sorted(cluster, key=lambda x: counts[x], reverse=True)

According to this helpful wiki on python sorts if the key for two items is the same, the first item in the list will be the first sorted as well.

Because the cluster set gets created in a slightly different order between the two the different versions of breath first search, the sorting at this step, and the final output will all be in a slightly different order. One easy way to deal with the issue would be to sort by both count, and umi, guaranteeing stability from here on out, but it wouldn't be backwards compatible, and the tests will still probably fail. I've made that fix in the new commit. This causes the tests for both group_directional and group_directional_unmapped to fail, but if we adjust the tests they should remain more stable in the future. Let me know what you think.

@TomSmithCGAT
Copy link
Member

This sounds like a sensible solution. Any thoughts @IanSudbery?

I'll pull from your fork later to profile dedup and convince myself that the output is identical but just in a different order. Then we can merge and regenerate the test files.

@IanSudbery
Copy link
Member

IanSudbery commented Mar 27, 2017 via email

@TomSmithCGAT
Copy link
Member

As yes, I meant with respect to the UMI sequence rather than the read which as you say will be different. I was going to check the set of UMIs and groups tags was identical. I should have time this afternoon/tomorrow.

@IanSudbery
Copy link
Member

I was thinking that you might have two seperate UMIs that might have the same count, and which one was selected would change.... But of course this won't happen for directional, because two UMIs with the same count would never be the best UMI in a group!

Might this be a problem for adjacency?

@TomSmithCGAT
Copy link
Member

@IanSudbery Actually they can be in the same group if the counts are very low due to the '-1' in the threshold. For example, see the output below from the group directional test where the output order and group UMI are different. My one concern about sorting on counts and UMI is that this will slightly bias the composition of the final UMIs. I guess this is probably a minor concern though. The only other option would be to break ties randomly.

Ignore my comment regarding testing this with dedup. These changes only affect the output of group.

master:

SRR2057595.13577605_CATCC       16      chr19   4078437 255     42M     *       0       0       *       *       XA:i:1  MD:Z:5A36       NM:i:1  UG:i:118        BX:Z:CATCC
SRR2057595.6226935_TATCC        16      chr19   4078412 255     67M     *       0       0       *       *       XA:i:2  MD:Z:3G26A36    NM:i:2  UG:i:118        BX:Z:CATCC

gpratt_fork:

SRR2057595.6226935_TATCC        16      chr19   4078412 255     67M     *       0       0       *       *       XA:i:2  MD:Z:3G26A36    NM:i:2  UG:i:118        BX:Z:TATCC
SRR2057595.13577605_CATCC       16      chr19   4078437 255     42M     *       0       0       *       *       XA:i:1  MD:Z:5A36       NM:i:1  UG:i:118        BX:Z:TATCC

@IanSudbery
Copy link
Member

IanSudbery commented Mar 27, 2017 via email

@TomSmithCGAT
Copy link
Member

I was thinking random but stable irrespective of the order of UMIs when the hash seed is set, so even if we change the search function in the future, the tests would not fail. This would require the breaking of ties to e.g randomly select from the hashed sequences, assuming pythons method for hashing strings isn't going to change any time soon. This is probably going to far though given the issue here is just getting the tests to pass.

I'm inclined to just accept the new output as valid and avoid the bias arising from the additional level of sorting.

@TomSmithCGAT
Copy link
Member

@gpratt would you mind rolling back the last commit (e7d3b28) and then I'll merge your fork into master and regenerate the test files.

Gabriel Pratt added 2 commits March 27, 2017 10:23
This reverts commit e7d3b28. Allows for unordered, but correct groups in the group command
@gpratt
Copy link
Contributor Author

gpratt commented Mar 27, 2017

@TomSmithCGAT Done

@TomSmithCGAT
Copy link
Member

Thanks!

@TomSmithCGAT TomSmithCGAT merged commit b8843fd into CGATOxford:master Mar 27, 2017
@gpratt
Copy link
Contributor Author

gpratt commented Mar 27, 2017

Any time! You guys a great easy to read code base, and a really useful tool. Always happy to try and help make it a bit better.

@IanSudbery
Copy link
Member

Thanks gpratt. Pleasure to work with you. Do you mind me asking how much it sped up your processing by?

@gpratt
Copy link
Contributor Author

gpratt commented Mar 28, 2017 via email

@IanSudbery
Copy link
Member

Cheers. That's quite some improvement!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants