Deduper_Peer_Review #2

MaxHills · 2018-10-27T02:35:29Z

Your pseudo-code is very readable. This made it easier for me to determine that it was also logical and sensible. Concerning the beginning of your script, sets are unordered, but quickly searchable. So, to determine if the UMI for a read is known, if you have a set of 96 known UMIs, you can use a conditional such as:
    if UMI in set_of_UMIs:
        pass  # do some stuff
To hold chromosome, position, and strandedness, though, you'll need a dictionary. Sets are unordered, so they cannot be indexed, and they have no "keys" the way dictionaries do. Chromosome number and position would be indistinguishable in a set, since both are simply numbers.
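Here is a minimal sketch of that split, assuming Python: a set for the known-UMI check, and a dictionary with explicit keys for the alignment fields. All names (`known_umis`, `is_known`, `make_record`) and the example UMIs and coordinates are made up for illustration and are not from your script.

```python
# Hypothetical example values; a real run would load the 96 known UMIs from a file.
known_umis = {"AACGCCAT", "AAGGTACG", "AATTCCGG"}


def is_known(umi, umi_set=known_umis):
    """Set membership test: fast lookup, no ordering or indexing needed."""
    return umi in umi_set


def make_record(chrom, pos, strand):
    """Store each field under its own key so chrom and pos cannot be confused."""
    return {"chrom": chrom, "pos": pos, "strand": strand}


record = make_record("2", 76814284, "+")

if is_known("AACGCCAT"):
    print(record["chrom"], record["pos"], record["strand"])
```

With the keys spelled out, a lookup like `record["pos"]` says which number you mean, which a bare set of values cannot do.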

You may also want some statistics variables, such as a counter for the number of duplicates. Then you can print an informative output file telling the user how many duplicates were removed, how many reads were low-quality, how many had unknown UMIs, etc.
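One possible way to keep those tallies is `collections.Counter` from the standard library. This is only a sketch: the category names and the `tally` and `format_summary` helpers are hypothetical, and the input list stands in for whatever your main loop decides about each read.

```python
from collections import Counter


def tally(outcomes):
    """Count how many reads fell into each outcome category."""
    stats = Counter()
    for outcome in outcomes:
        stats[outcome] += 1
    return stats


def format_summary(stats):
    """Build one 'category<TAB>count' line per category, sorted by name."""
    return [f"{name}\t{count}" for name, count in sorted(stats.items())]


# Made-up outcomes for five reads, purely for illustration.
stats = tally(["kept", "duplicate", "unknown_umi", "duplicate", "kept"])

for line in format_summary(stats):
    print(line)
```

The lines from `format_summary` could just as easily be written to a separate summary file instead of printed.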

Also, consider the special case of reverse-strand reads when parsing the CIGAR string. You may need conditionals for Ns, Is, and Ds to correct POS before testing for true duplicates.
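As a rough sketch of the CIGAR handling, here is one common convention for finding the 5' position of a read. It assumes 1-based POS, that forward reads only need the leading soft clip subtracted, and that reverse reads need the reference-consuming operations (M, D, N, =, X) plus any trailing soft clip added. In the SAM specification, I does not consume reference, so it is parsed but not added here. The function name `five_prime_position` is hypothetical, hard clips are ignored, and you may prefer a different convention.

```python
import re

# CIGAR operations that advance along the reference (per the SAM specification).
REF_CONSUMING = {"M", "D", "N", "=", "X"}


def five_prime_position(pos, cigar, reverse):
    """Return a strand-aware 5' position under the assumptions stated above."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if not ops:
        return pos
    if not reverse:
        # Forward strand: undo a leading soft clip, if there is one.
        length, op = ops[0]
        return pos - length if op == "S" else pos
    # Reverse strand: the 5' end is at the right-hand side of the alignment.
    ref_span = sum(length for length, op in ops if op in REF_CONSUMING)
    length, op = ops[-1]
    trailing_clip = length if op == "S" and len(ops) > 1 else 0
    return pos + ref_span + trailing_clip - 1


print(five_prime_position(100, "2S8M", False))
print(five_prime_position(100, "5M10N5M2S", True))
```

Two reads would then count as candidate duplicates only if they share UMI, chromosome, strand, and this adjusted position.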

Otherwise, very thorough job, and very easy to follow. Well done!
