/1 and /2 suffixes in paired-end reads #325

Closed
gcorre opened this issue Mar 13, 2019 · 17 comments

Comments

@gcorre

gcorre commented Mar 13, 2019

Hi,
Can umi-tools analyze paired-end reads with /1 and /2 suffixes instead of 1:N and 2:N?

I'm trying v0.5.5 and get the error message below with umi-tools extract:

umi-tools extract \
--bc-pattern=XXXXXX \
--bc-pattern2 CCCCCCNNNN \
--stdin H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R1_001.fastq.gz \
--stdout H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R1_001_extracted.fastq.gz \
--read2-in H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R2_001.fastq.gz \
--read2-out H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R2_001_extracted.fastq.gz \
--filter-cell-barcode \
--whitelist=whitelist.txt

Read pairs do not match
H3:1:CC9HDACXX:6:2209:1046:2076/1 != H3:1:CC9HDACXX:6:2209:1046:2076/2

Thanks,

@TomSmithCGAT
Member

Hmm, yeah, there is an expectation that the read names will be exactly the same. We could easily add an option to either skip this check or strip the suffix before checking. Thoughts @IanSudbery?

@IanSudbery
Member

We can set to ignore in extract, or set a specific allowance for /1 and /2.

@TomSmithCGAT
Member

OK. I'll set this up on a branch for @gcorre to test out

@TomSmithCGAT
Member

TomSmithCGAT commented Mar 13, 2019

Hi @gcorre - Could you try installing umi_tools from the {TS}-IgnoreReadPairSuffix branch and use the --read-name-suffix-strip option? This will strip the suffixes from the read names before checking that they are identical. Let me know if you need instructions for installation from the branch.

This branch also introduces --read1-suffix and --read2-suffix options, which default to /1 and /2, respectively, so you don't need to set these.
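
As an illustration of the check described above, here is a minimal Python sketch of stripping a per-mate suffix before comparing read names. The helper names `strip_pair_suffix` and `pair_names_match` are hypothetical and are not taken from the UMI-tools source; the read names come from the error message earlier in this thread.

```python
def strip_pair_suffix(read_name, suffix):
    """Return read_name without a trailing suffix such as '/1'.

    Hypothetical helper for illustration only.
    """
    if suffix and read_name.endswith(suffix):
        return read_name[: -len(suffix)]
    return read_name


def pair_names_match(name1, name2, suffix1="/1", suffix2="/2"):
    """Compare two read names after removing their per-mate suffixes."""
    return strip_pair_suffix(name1, suffix1) == strip_pair_suffix(name2, suffix2)


r1 = "H3:1:CC9HDACXX:6:2209:1046:2076/1"
r2 = "H3:1:CC9HDACXX:6:2209:1046:2076/2"

print(r1 == r2)                  # exact comparison fails: False
print(pair_names_match(r1, r2))  # comparison after stripping: True
```

Names without the suffix pass through unchanged, so the same check would still work on reads that use the 1:N / 2:N style in the comment field.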

@gcorre
Author

gcorre commented Mar 14, 2019

Hi,
I tried the latest branch and it works for the extract part.

@h3:1:CC9HDACXX:7:1109:1204:2091/1
NACTGGGAGATGCGTGGGCTGACATCTGCAGGCCGAAAGAGCCGTGGCCTT
@h3:1:CC9HDACXX:7:1109:1204:2091/1_AGCAGG_AGGT
NACTGGGAGATGCGTGGGCTGACATCTGCAGGCCGAAAGAGCCGTGGCCTT

@h3:1:CC9HDACXX:7:1109:1204:2091/2
AGCAGGAGGTTTTTTTTTTTTTTTTTTTTTGCTGGTCTTAATTGGTTTTTA
@h3:1:CC9HDACXX:7:1109:1204:2091/2_AGCAGG_AGGT
TTTTTTTTTTTTTTTTTTTTGCTGGTCTTAATTGGTTTTTA

but (there is always one !)

When mapping with STAR, the suffix is trimmed together with the UMI and cell barcode to its right, so that information is lost for the next steps:
H3:1:CC9HDACXX:7:1109:1163:2144 99 chrM 9505 255 5S46M = ........
H3:1:CC9HDACXX:7:1109:1163:2144 147 chrM 9630 255 41M = ..........

Would it be possible to add the UMI-cell-barcode before the /1 /2 suffix like:
@h3:1:CC9HDACXX:7:1109:1204:2091_AGCAGG_AGGT/1
@h3:1:CC9HDACXX:7:1109:1204:2091_AGCAGG_AGGT/2

best

@TomSmithCGAT
Member

hi @gcorre. Thanks for testing this out. The STAR manual does indeed state that these suffixes are removed (copied below)

--outSAMreadID
    default: Standard
    string: read ID record type
        Standard
            first word (until space) from the FASTx read ID line, removing /1,/2 from the end
        Number
            read number (index) in the FASTx file

I'll update the branch today to add the UMI before the suffix.

@IanSudbery
Member

Or UMI-tools could just remove the suffixes. I don't think they are needed for anything, and I'm pretty sure all aligners remove them.

@gcorre
Author

gcorre commented Mar 14, 2019

Thanks,
Indeed, it works without the suffixes with STAR, but for compatibility with other programs (cutadapt, trimmomatic, ?) it may be safer to keep a standard format.

best

@TomSmithCGAT
Member

Yeah, this would be my concern about removing the suffixes too. I don't know of any tool which actually uses the suffixes but I'm loath to remove them!

@gcorre - The latest version of the branch should insert the UMIs between the read name and the suffix.
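
A minimal sketch of the read-name layout discussed here, with the cell barcode and UMI placed between the read name and the /1 or /2 suffix. The function `add_barcodes_before_suffix` is a hypothetical name used only for illustration, not the branch's actual code; the example values are taken from the FASTQ records quoted earlier in the thread.

```python
def add_barcodes_before_suffix(read_name, cell, umi, suffix):
    """Append '_<cell>_<umi>' to read_name, keeping any pair suffix last.

    Hypothetical helper: if read_name ends with suffix (e.g. '/1'),
    the barcodes are inserted in front of it; otherwise they are
    simply appended.
    """
    if suffix and read_name.endswith(suffix):
        base = read_name[: -len(suffix)]
        return f"{base}_{cell}_{umi}{suffix}"
    return f"{read_name}_{cell}_{umi}"


name1 = "h3:1:CC9HDACXX:7:1109:1204:2091/1"
name2 = "h3:1:CC9HDACXX:7:1109:1204:2091/2"

print(add_barcodes_before_suffix(name1, "AGCAGG", "AGGT", "/1"))
print(add_barcodes_before_suffix(name2, "AGCAGG", "AGGT", "/2"))
```

With this layout, a tool that trims a trailing /1 or /2 leaves the barcodes attached to the read name.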

@gcorre
Author

gcorre commented Mar 15, 2019

Hi, after a test on 4 single-cell libraries (140 million reads each), everything looks OK in terms of read names, and processing time is not significantly longer (around 2h each).
best,

@TomSmithCGAT
Member

Great. Let me know when you've run the alignment and deduplication and confirmed this all works OK. Then I'll merge this into the master and close the issue.

@IanSudbery
Member

IanSudbery commented Mar 15, 2019 via email

@TomSmithCGAT
Member

TomSmithCGAT commented Mar 15, 2019

Yeah, I figured you might prefer this route. I hadn't considered that we could just use space by default, but this does seem like adding an unnecessary call to strip().

Optimally, the check for stripping should occur once in the call to __init__ and not with each read. So how about two options (for the read1 and read2 suffixes), default both to None. If the suffix options are provided, switch to a read extractor function which takes the suffix argument(s). Ditto for the function to make the new read name which is also currently re-checking the options for each read. There's already a test file so we can make any changes safely when we have confirmation that the current output is OK.
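
The idea of deciding once in `__init__` rather than per read could look roughly like the sketch below. `ReadNameChecker` and its methods are hypothetical names chosen for illustration; this is not the UMI-tools extractor class, only an example of picking the comparison function at construction time.

```python
class ReadNameChecker:
    """Hypothetical class illustrating init-time dispatch."""

    def __init__(self, read1_suffix=None, read2_suffix=None):
        self.read1_suffix = read1_suffix
        self.read2_suffix = read2_suffix
        # Choose the comparison once, so the per-read call does not
        # need to re-check the options.
        if read1_suffix is None and read2_suffix is None:
            self.names_match = self._match_exact
        else:
            self.names_match = self._match_without_suffix

    @staticmethod
    def _match_exact(name1, name2):
        return name1 == name2

    def _match_without_suffix(self, name1, name2):
        return (self._strip(name1, self.read1_suffix)
                == self._strip(name2, self.read2_suffix))

    @staticmethod
    def _strip(name, suffix):
        if suffix and name.endswith(suffix):
            return name[: -len(suffix)]
        return name


strict = ReadNameChecker()
lenient = ReadNameChecker(read1_suffix="/1", read2_suffix="/2")

print(strict.names_match("read/1", "read/2"))   # False
print(lenient.names_match("read/1", "read/2"))  # True
```

With both suffix options left as None, the default path is a plain equality test with no extra string handling per read.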

@IanSudbery
Member

Sorry, my point wasn't that we could use a space as the default delimiter, but that effectively we already do. That is, if the read name is @read1:2343:3243 1:N:0 then the UMI is added to @read1:2343:3243
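
To make the point about the space concrete, here is a small sketch that appends a UMI to the first whitespace-delimited word of a FASTQ header and leaves the rest of the line alone. The function name `add_umi_to_identifier` and the UMI value are hypothetical, and keeping the trailing comment field is an assumption of this sketch.

```python
def add_umi_to_identifier(header, umi):
    """Append '_<umi>' to the part of header before the first space.

    Hypothetical helper: anything after the first space (for example
    '1:N:0') is kept unchanged.
    """
    identifier, sep, comment = header.partition(" ")
    return f"{identifier}_{umi}{sep}{comment}"


print(add_umi_to_identifier("@read1:2343:3243 1:N:0", "AGGT"))
print(add_umi_to_identifier("@read1:2343:3243", "AGGT"))
```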

@TomSmithCGAT
Member

Ah, right, got it.

@gcorre
Author

gcorre commented Mar 15, 2019

Indeed, a choice of the delimiter may be the solution (space by default and user defined).

here is a log of the complete pipeline on a subset of my libraries (2.5M reads) with leading rows of outputs:
log_UMI-tools_git.txt

best
th

@TomSmithCGAT
Member

This is now resolved in the master branch with the option --ignore-read-pair-suffixes and will be available in the next release shortly.
