Deduper_Peer_Review #2

Open
MaxHills opened this issue Oct 27, 2018 · 0 comments

Your pseudocode is very readable, which made it easy for me to confirm that it is also logical and sensible. Regarding the beginning of your script: sets are unordered but fast to search. So, if you have a set of the 96 known UMIs, you can check whether a read's UMI is known with a conditional such as:
if UMI in set_of_UMIs:
    pass  # do some stuff
To track chromosome, position, and strandedness, however, you'll need a dictionary. Sets are unordered, so they cannot be indexed, and they have no "keys" the way dictionaries do. If chromosome number and position were stored as separate values in a set, they would be indistinguishable, since both are simply numbers.
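As a rough illustration of the point above, here is a minimal sketch that keeps the known UMIs in a set and records each read's chromosome, position, and strand in a dictionary. All names (`known_umis`, `seen_reads`, `is_duplicate`) and the three example UMIs are hypothetical, not taken from your script, and keying the dictionary on a tuple is only one possible layout.

```python
# Hypothetical stand-in for the 96 known UMIs.
known_umis = {"AACGCCAT", "AAGGTACG", "AATTCCGG"}

# Maps (umi, chrom, pos, strand) to the name of the first read seen with it.
# Each slot of the tuple has a fixed meaning, so chromosome "2" and
# position 2 can never be confused with each other.
seen_reads = {}


def is_duplicate(umi, chrom, pos, strand, read_name):
    """Return True if an earlier read shared this UMI, chromosome, position and strand."""
    key = (umi, chrom, pos, strand)
    if key in seen_reads:
        return True
    seen_reads[key] = read_name  # remember the first read with this key
    return False


umi = "AACGCCAT"
if umi in known_umis:  # fast membership test on the set
    first = is_duplicate(umi, "2", 100, "+", "read_1")
    second = is_duplicate(umi, "2", 100, "+", "read_2")
```

With this layout, `first` is `False` and `second` is `True`, because the second read repeats every field of the key.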

You may also want some statistics variables, such as a counter for the number of duplicates. That way you can write an informative output file telling the user how many duplicates were removed, how many reads were low quality, how many had unknown UMIs, and so on.
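One way to keep those tallies, sketched under the assumption that each read ends up in exactly one category, is a `collections.Counter`. The category names and the `format_summary` helper are hypothetical and only meant to show the idea.

```python
from collections import Counter

# Hypothetical outcome labels; rename to match the categories your script uses.
CATEGORIES = ("kept", "duplicate", "low_quality", "unknown_umi")


def format_summary(stats):
    """Build a tab-separated summary with one line per category."""
    return "\n".join(f"{name}\t{stats[name]}" for name in CATEGORIES)


stats = Counter()
# Stand-in for the main loop: one outcome label per processed read.
for outcome in ["kept", "duplicate", "kept", "unknown_umi"]:
    stats[outcome] += 1

summary = format_summary(stats)
print(summary)
```

A `Counter` returns 0 for categories that never occurred, so the summary always lists every category.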

Also, consider the special case of reverse-strand reads when parsing the CIGAR string. You may need conditionals for N, I, and D operations to adjust POS before testing for true duplicates.
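The reverse-strand adjustment could look something like the sketch below. It relies on the SAM convention that M, D, N, = and X consume reference bases while I does not, so insertions are skipped rather than added. The function name `five_prime_position` is hypothetical. The handling of soft clipping (S) goes beyond what I mentioned above and is an assumption about how you want to define a read's true start.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_position(pos, cigar, reverse):
    """Estimate the 1-based reference position of a read's 5' end.

    pos is the SAM POS field (leftmost aligned base).
    Assumption: soft-clipped bases count toward the true start.
    """
    # Hard clips (H) are dropped so that soft clips sit at the ends of the list.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not reverse:
        # Forward strand: only leading soft clipping moves the start.
        if ops and ops[0][1] == "S":
            return pos - ops[0][0]
        return pos
    # Reverse strand: the 5' end is at the right-hand side of the alignment.
    # M, D, N, = and X consume reference bases; I does not, so it is ignored.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trailing_clip = ops[-1][0] if ops and ops[-1][1] == "S" else 0
    return pos + ref_len + trailing_clip - 1
```

For example, a reverse-strand read with POS 100 and CIGAR `5M2I3M` spans 8 reference bases, so this sketch places its 5' end at position 107.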

Otherwise, very thorough job, and very easy to follow. Well done!
