/1 and /2 suffixes in paired-end reads #325

gcorre · 2019-03-13T16:06:52Z

Hi,
Can umi-tools analyze paired-end reads with /1 and /2 suffixes instead of 1:N and 2:N ?

Trying with v0.5.5, I get the error message below with umi_tools extract:

umi_tools extract \
  --bc-pattern=XXXXXX \
  --bc-pattern2 CCCCCCNNNN \
  --stdin H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R1_001.fastq.gz \
  --stdout H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R1_001_extracted.fastq.gz \
  --read2-in H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R2_001.fastq.gz \
  --read2-out H619_RG_C000QSP_6_1_CC9HDACXX_S0_L001_R2_001_extracted.fastq.gz \
  --filter-cell-barcode \
  --whitelist=whitelist.txt

Read pairs do not match
H3:1:CC9HDACXX:6:2209:1046:2076/1 != H3:1:CC9HDACXX:6:2209:1046:2076/2

Thanks,


TomSmithCGAT · 2019-03-13T16:12:18Z

Hmm, yeah, there is an expectation that the read names will be exactly the same. We could easily add an option to either skip this check or strip the suffix before checking. Thoughts @IanSudbery?
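For illustration, the two behaviours discussed here (an exact name comparison versus stripping the suffix first) can be sketched as below. This is a minimal sketch only: `strip_pair_suffix` and `names_match` are hypothetical helper names, not functions from the UMI-tools code base.

```python
def strip_pair_suffix(read_name, suffix):
    """Return read_name without a trailing mate suffix such as '/1'.

    Hypothetical helper for illustration; not part of UMI-tools.
    """
    if suffix and read_name.endswith(suffix):
        return read_name[: -len(suffix)]
    return read_name


def names_match(name1, name2, suffix1="/1", suffix2="/2"):
    """Compare paired read names after removing their mate suffixes."""
    return strip_pair_suffix(name1, suffix1) == strip_pair_suffix(name2, suffix2)


# Read names taken from the error message reported above
read1 = "H3:1:CC9HDACXX:6:2209:1046:2076/1"
read2 = "H3:1:CC9HDACXX:6:2209:1046:2076/2"

print(read1 == read2)             # exact comparison: the names differ
print(names_match(read1, read2))  # suffix-aware comparison: the names agree
```

Skipping the check entirely would also avoid the error, but stripping the suffix keeps the protection against genuinely mismatched read pairs.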

IanSudbery · 2019-03-13T17:33:48Z

We can set to ignore in extract, or set a specific allowance for /1 and /2.

TomSmithCGAT · 2019-03-13T17:36:41Z

OK. I'll set this up on a branch for @gcorre to test out

TomSmithCGAT · 2019-03-13T18:35:40Z

Hi @gcorre - Could you try installing umi_tools from the {TS}-IgnoreReadPairSuffix branch and using the --read-name-suffix-strip option? This will strip the suffixes from the read names before checking that they are identical. Let me know if you need instructions for installation from the branch.

This branch also introduces --read1-suffix and --read2-suffix options, which default to /1 and /2, respectively, so you don't need to set these.

gcorre · 2019-03-14T08:44:43Z

Hi,
I tried the latest branch and it works for the extract part.

@H3:1:CC9HDACXX:7:1109:1204:2091/1
NACTGGGAGATGCGTGGGCTGACATCTGCAGGCCGAAAGAGCCGTGGCCTT
@H3:1:CC9HDACXX:7:1109:1204:2091/1_AGCAGG_AGGT
NACTGGGAGATGCGTGGGCTGACATCTGCAGGCCGAAAGAGCCGTGGCCTT

@H3:1:CC9HDACXX:7:1109:1204:2091/2
AGCAGGAGGTTTTTTTTTTTTTTTTTTTTTGCTGGTCTTAATTGGTTTTTA
@H3:1:CC9HDACXX:7:1109:1204:2091/2_AGCAGG_AGGT
TTTTTTTTTTTTTTTTTTTTGCTGGTCTTAATTGGTTTTTA

But (there is always one!)

When mapping with STAR, the suffix is trimmed along with the UMI and cell barcode to its right, so this information is lost for the next steps:
H3:1:CC9HDACXX:7:1109:1163:2144 99 chrM 9505 255 5S46M = ........
H3:1:CC9HDACXX:7:1109:1163:2144 147 chrM 9630 255 41M = ..........

Would it be possible to add the UMI and cell barcode before the /1 and /2 suffixes, like this:
@H3:1:CC9HDACXX:7:1109:1204:2091_AGCAGG_AGGT/1
@H3:1:CC9HDACXX:7:1109:1204:2091_AGCAGG_AGGT/2

best

TomSmithCGAT · 2019-03-14T10:09:06Z

Hi @gcorre. Thanks for testing this out. The STAR manual does indeed state that these suffixes are removed (copied below):

--outSAMreadID
    default: Standard
    string: read ID record type
        Standard
            first word (until space) from the FASTx read ID line, removing /1,/2 from the end
        Number
            read number (index) in the FASTx file

I'll update the branch today to add the UMI before the suffix.

IanSudbery · 2019-03-14T10:57:31Z

Or UMI-tools could just remove the suffixes. I don't think they are needed for anything, and I'm pretty sure all aligners remove them.

gcorre · 2019-03-14T13:32:06Z

Thanks,
Indeed, it works without the suffixes with STAR, but for compatibility with other programs (cutadapt, trimmomatic, ?) it may be safer to keep a standard format.

best

TomSmithCGAT · 2019-03-14T16:25:33Z

Yeah, this would be my concern about removing the suffixes too. I don't know of any tool which actually uses the suffixes but I'm loath to remove them!

@gcorre - The latest version of the branch should insert the UMIs between the read name and the suffix.
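As a rough sketch of the behaviour described here (barcodes inserted between the read name and the mate suffix), something like the following would produce the requested names. `insert_barcodes` is a hypothetical function written for illustration and is not the branch's actual code; the `_CELL_UMI` layout follows the example read names earlier in the thread.

```python
def insert_barcodes(read_name, cell, umi, suffix=None):
    """Add '_<cell>_<umi>' to a read name, keeping a mate suffix at the end.

    Hypothetical helper for illustration; not the UMI-tools implementation.
    """
    if suffix and read_name.endswith(suffix):
        stem = read_name[: -len(suffix)]
        return f"{stem}_{cell}_{umi}{suffix}"
    # No suffix given (or not present): append the barcodes at the end
    return f"{read_name}_{cell}_{umi}"


name = "H3:1:CC9HDACXX:7:1109:1204:2091/1"
print(insert_barcodes(name, "AGCAGG", "AGGT", suffix="/1"))
# H3:1:CC9HDACXX:7:1109:1204:2091_AGCAGG_AGGT/1
```

With this ordering an aligner that drops a trailing /1 or /2 leaves the barcodes intact in the read ID.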

gcorre · 2019-03-15T14:16:27Z

Hi, after a test on 4 single-cell libraries (140 million reads each), everything looks OK in terms of read names, and processing time is not significantly longer (around 2h each).
best,

TomSmithCGAT · 2019-03-15T16:16:13Z

Great. Let me know when you've run the alignment and deduplication and confirmed this all works OK. Then I'll merge this into the master and close the issue.

IanSudbery · 2019-03-15T16:36:14Z

I wonder if instead of having two options we should just have one, that takes the delimiter and only uses it if one is provided. In fact I reckon that we already have a default, which is space.


TomSmithCGAT · 2019-03-15T16:47:33Z

Yeah, I figured you might prefer this route. I hadn't considered that we could just use space by default, but this does seem like adding an unnecessary call to strip().

Optimally, the check for stripping should occur once in the call to __init__ and not with each read. So how about two options (for the read1 and read2 suffixes), default both to None. If the suffix options are provided, switch to a read extractor function which takes the suffix argument(s). Ditto for the function to make the new read name which is also currently re-checking the options for each read. There's already a test file so we can make any changes safely when we have confirmation that the current output is OK.
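To make the idea concrete, here is a minimal sketch of choosing the read-name function once at construction time rather than re-checking the options for every read. The class and method names (`ReadNameBuilder`, `build`) are hypothetical and do not correspond to the actual UMI-tools classes.

```python
class ReadNameBuilder:
    """Pick the read-name function once, in __init__ (hypothetical sketch)."""

    def __init__(self, suffix=None):
        self.suffix = suffix
        # The option is inspected a single time here, not once per read.
        if suffix is None:
            self.build = self._build_plain
        else:
            self.build = self._build_with_suffix

    def _build_plain(self, name, cell, umi):
        # Default behaviour: append the barcodes to the end of the name
        return f"{name}_{cell}_{umi}"

    def _build_with_suffix(self, name, cell, umi):
        # Suffix-aware behaviour: keep e.g. '/1' as the last part of the name
        if name.endswith(self.suffix):
            stem = name[: -len(self.suffix)]
            return f"{stem}_{cell}_{umi}{self.suffix}"
        return self._build_plain(name, cell, umi)


builder = ReadNameBuilder(suffix="/1")
print(builder.build("H3:1:CC9HDACXX:7:1109:1204:2091/1", "AGCAGG", "AGGT"))
```

Defaulting the suffix to None means runs that do not use the option keep the original code path and pay no per-read cost for the new feature.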

IanSudbery · 2019-03-15T17:02:16Z

Sorry, my point wasn't that we could use space as the default delimiter, but that effectively we already do. If the read name is @read1:2343:3243 1:N:0, then the UMI is added to @read1:2343:3243.
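The space-delimited case can be sketched like this. `add_umi_to_header` is a hypothetical name, and keeping the part after the space unchanged is an assumption made for the example rather than something stated in this thread.

```python
def add_umi_to_header(header, umi):
    """Append '_<umi>' to the first space-delimited word of a FASTQ header.

    Hypothetical helper; the text after the first space is kept as-is
    (an assumption for this sketch).
    """
    identifier, sep, comment = header.partition(" ")
    return f"{identifier}_{umi}{sep}{comment}"


print(add_umi_to_header("@read1:2343:3243 1:N:0", "AGGT"))
# @read1:2343:3243_AGGT 1:N:0
```

Seen this way, a /1 or /2 suffix is just another delimiter that marks where the read identifier ends.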

TomSmithCGAT · 2019-03-15T17:03:25Z

Ah, right, got it.

gcorre · 2019-03-15T17:36:20Z

Indeed, a choice of the delimiter may be the solution (space by default and user defined).

here is a log of the complete pipeline on a subset of my libraries (2.5M reads) with leading rows of outputs:
log_UMI-tools_git.txt

best
th

TomSmithCGAT · 2020-09-14T20:07:44Z

This is now resolved in the master branch with the option --ignore-read-pair-suffixes and will be available in the next release shortly.

TomSmithCGAT mentioned this issue Mar 13, 2019

adds option to strip suffix from read names before identity test #326

Closed

TomSmithCGAT mentioned this issue Mar 2, 2020

read names and barcode positions #391

Closed

PierreBSC mentioned this issue Jul 3, 2020

Problems preprocessing COVID-19 Sample from Paper PierreBSC/Viral-Track#9

Closed

yeredh mentioned this issue Jul 3, 2020

Error preprocessing COVID-19 sample from SRA #418

Closed

TomSmithCGAT mentioned this issue Jul 3, 2020

{ts} ignore read pair suffixes #421

Merged

TomSmithCGAT closed this as completed Sep 14, 2020

TomSmithCGAT mentioned this issue Apr 21, 2022

Custom whitelist #525

Closed

TomSmithCGAT mentioned this issue Apr 11, 2023

bug for handling /1 suffix in single-ended reads #580

Closed
