bug for handling /1 suffix in single-ended reads #580

anoronh4 · 2023-03-01T21:09:45Z

I have noticed inconsistent behavior in how umi_tools extract handles the /1 suffix depending on whether the inputs are paired end or single end. when the input is paired-end read id is changed from @SRR5665260.1.36873006/1 to @SRR5665260.1.36873006_CAG, basically removing the read number as it separates them into two distinct files. If I take only the read1 fastq file and run umi_tools extract, the same read becomes @SRR5665260.1.36873006/1_CAG, even though i am using --ignore-read-pair-suffixes in both cases. I am using v1.1.2.

This is causing issues downstream for me when i run alignment with STAR. STAR discards everything after / in the read ID. just wondering if this can be fixed on the umi_tools end.

The text was updated successfully, but these errors were encountered:

TomSmithCGAT · 2023-03-02T10:19:38Z

Hmmm.. That's strange and I can't reproduce with umi_tools 1.1.3. From the tests directory, if I run the following for paired end and then single end extraction.

umi_tools extract  --extract-method=string --read2-in=scrb_seq_fastq.2.gz --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz --read2-out=delete_me.2.fastq > delete_me.1.fastq

umi_tools extract  --extract-method=string --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz > delete_me.fastq

The headers have the expected format in both cases:

==> delete_me.1.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17

==> delete_me.2.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34

==> delete_me.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17

TomSmithCGAT · 2023-03-02T10:21:46Z

Could you please try updating to 1.1.3 to see if this resolves the issue. There was a small update to allow a user-defined umi separator in the output (#548), so it's possible the behaviour has changed, but our interated regression testing didn't pick that up if so 🤷

anoronh4 · 2023-04-10T16:01:00Z

Sorry for not seeing your reply earlier. I continue to see the issue with both 1.1.3 and 1.1.4. my command is

umi_tools extract \
    -I test_rnaseq_1.fastq.gz \
    -S SAMPLE_SINGLE_END_UMI@SRR5665260.1.36873006@@SAMPLE_SINGLE_END_UMI_T1.umi_extract.fastq.gz \
    --bc-pattern="NNN"  --ignore-read-pair-suffixes

You can grab this same input to test here: https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz . i think in the example you used the header might have been formatted differently than in mine -- there is no flowcell information in the header in my case.

TomSmithCGAT · 2023-04-11T07:46:10Z

Thanks for the example 😁 I get the same output.

While I look into it, you can try using the --readNameSeparator option with STAR. The manual states:

default: /
string(s): character(s) separating the part of the read names that will be
trimmed in output (read name after space is always trimmed)```
 
So setting that to a space if possible, or some other character that shouldn't exist (e.g ':') should work as an immediate solution.

TomSmithCGAT · 2023-04-11T08:17:13Z

umi_tools needs to ensure that the read names for paired reads are the same after adding the UMI or else downstream tools will throw errors. The information after the first space in field 1 of a fastq entry is not considered part of the read name as is optional (fastq format). Hence, umi_tools adds the UMIs to the end of the first part of field 1.

I think I'm right in saying there is no strict convention for denoting read 1/2 in field 1. Usually it's e.g /1 at the very end of the field with at least one space in between the read name and the read number. However, sometimes, field 1 can contain the read number at the end of the read name. This is why star removes everything after the '/' and why umi_tools allows the user to ignore the read suffix when checking that the two reads have the same name (--ignore-read-pair-suffixes; #325). This option will also remove the read pair suffixes (/1 / /2) from the read name to ensure they are identical.

In your case, you have a paired end read file with the read numbers as part of the read name. As you show above, this is handled appropriately with --ignore-read-pair-suffixes when processing both reads, but not when processing just one read. The obvious question is why do you want to use just one read? Given that you do though, it would be possible to make umi_tools remove the read pair suffix in single end mode as well.

@IanSudbery, would just need to add the equivalent of

UMI-tools/umi_tools/umi_methods.py

Lines 119 to 132 in d98ebac

    
           def getReadIDNoSuffix(read): 
        
               return(read.identifier.split(' ')[0]) 
        
           def getReadIDSuffix(read): 
        
               read_id = read.identifier.split(' ')[0] 
        
               if not read_id[-2:] in ['/1', '/2']: 
        
                   raise ValueError( 
        
                       'read suffix must be /1 or /2. Observed: %s' % read_id[-2:]) 
        
               return(read_id[:-2]) 
        
           if has_suffix: 
        
               getReadID = getReadIDSuffix 
        
           else: 
        
               getReadID = getReadIDNoSuffix

to the single end fastq parser. Wouldn't add any overhead when the option is not used and mimimal when used. What do you think?

IanSudbery · 2023-04-11T09:59:34Z

Not quite that simple. There isn't a single end fastq parser that is the equivalent of the joinedFastqParser. This is presumably where the problem is coming from. The structure here is that the joinedParser is called with two fastq iteratrors as parameters. In the single-end case there is no parser, just the raw iterator.

I don't think there should be any problem with moving the suffix code to the iterator. Except that I'll need to make sure it always gets called correctly.

TomSmithCGAT · 2023-04-11T11:40:39Z

Ah, yeah, I was scanning the code too lightly. Nice work on the PR 👍

anoronh4 · 2023-04-13T16:22:24Z

i installed the latest updates including this PR and unfortunately it errored out.

$ umi_tools extract     -I test_rnaseq_1.fastq.gz     -S extract.fastq.gz     --bc-pattern="NNN"  --ignore-read-pair-suffixes
# UMI-tools version: 1.1.4
# output generated by extract -I test_rnaseq_1.fastq.gz -S extract.fastq.gz --bc-pattern=NNN --ignore-read-pair-suffixes
# job started at Thu Apr 13 12:19:49 2023 on terra -- 786c956e-8418-4a4a-a3e9-493934d944b7
# pid: 15240, system: Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : True
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : NNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='test_rnaseq_1.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='extract.fastq.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_separator                           : _
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2023-04-13 12:19:49,025 INFO Starting barcode extraction
Traceback (most recent call last):
  File "/home/noronhaa/anaconda3/envs/umitools_test/bin/umi_tools", line 33, in <module>
    sys.exit(load_entry_point('umi-tools==1.1.4', 'console_scripts', 'umi_tools')())
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
    module.main(sys.argv)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/extract.py", line 439, in main
    for read in read1s:
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 116, in fastqIterate
    line1 = removeReadIDSuffix(line1)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 92, in removeReadIDSuffix
    'read suffix must be /1 or /2. Observed: %s' % read_id[-2:])
ValueError: read suffix must be /1 or /2. Observed: 1

anoronh4 · 2023-11-07T20:47:14Z

is there any update here? i've tested out the master branch by installing from a clone into my conda environment (python setup.py install) and then tried to compare the behaviors of the two. seems like there's a problem now with paired end fastqs as well. here's a breakdown of what i see:

version	type	error	correct header	command
1.1.4	PE	none	true	umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
1.1.4	SE	none	false	umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master	PE	ValueError: read suffix must be /1 or /2. Observed: 1	N/A	umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master	SE	ValueError: read suffix must be /1 or /2. Observed: 1	N/A	umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes

so now both PE and SE are failing in the latest code.

to use the same data as me, you can grab them using the URLs:

wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz 
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_2.fastq.gz

IanSudbery · 2023-11-08T11:06:28Z

Hi Anna,

Looks like I fixed this back in april and then never merged the fix :(. Sorry.

anoronh4 · 2023-11-10T17:25:33Z

ok great! i thought the fix was in master because #591 was merged already, and didn't realize any more updates had been planned. i tested out the unmerged branch, looks like it resolves the issue i was experiencing at least.

anoronh4 mentioned this issue Mar 1, 2023

UMI retention with single-end reads bug mskcc/forte#13

Open

IanSudbery mentioned this issue Apr 11, 2023

Read suffixes removed from single end data #591

Merged

IanSudbery mentioned this issue Nov 8, 2023

{is} fix read suffix line end #611

Merged

IanSudbery closed this as completed in #611 Nov 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug for handling /1 suffix in single-ended reads #580

bug for handling /1 suffix in single-ended reads #580

anoronh4 commented Mar 1, 2023 •

edited

Loading

TomSmithCGAT commented Mar 2, 2023 •

edited

Loading

TomSmithCGAT commented Mar 2, 2023

anoronh4 commented Apr 10, 2023 •

edited

Loading

TomSmithCGAT commented Apr 11, 2023

TomSmithCGAT commented Apr 11, 2023 •

edited

Loading

IanSudbery commented Apr 11, 2023

TomSmithCGAT commented Apr 11, 2023

anoronh4 commented Apr 13, 2023

anoronh4 commented Nov 7, 2023 •

edited

Loading

IanSudbery commented Nov 8, 2023

anoronh4 commented Nov 10, 2023 •

edited

Loading

bug for handling /1 suffix in single-ended reads #580

bug for handling /1 suffix in single-ended reads #580

Comments

anoronh4 commented Mar 1, 2023 • edited Loading

TomSmithCGAT commented Mar 2, 2023 • edited Loading

TomSmithCGAT commented Mar 2, 2023

anoronh4 commented Apr 10, 2023 • edited Loading

TomSmithCGAT commented Apr 11, 2023

TomSmithCGAT commented Apr 11, 2023 • edited Loading

IanSudbery commented Apr 11, 2023

TomSmithCGAT commented Apr 11, 2023

anoronh4 commented Apr 13, 2023

anoronh4 commented Nov 7, 2023 • edited Loading

IanSudbery commented Nov 8, 2023

anoronh4 commented Nov 10, 2023 • edited Loading

anoronh4 commented Mar 1, 2023 •

edited

Loading

TomSmithCGAT commented Mar 2, 2023 •

edited

Loading

anoronh4 commented Apr 10, 2023 •

edited

Loading

TomSmithCGAT commented Apr 11, 2023 •

edited

Loading

anoronh4 commented Nov 7, 2023 •

edited

Loading

anoronh4 commented Nov 10, 2023 •

edited

Loading