Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug for handling /1 suffix in single-ended reads #580

Closed
anoronh4 opened this issue Mar 1, 2023 · 11 comments · Fixed by #611
Closed

bug for handling /1 suffix in single-ended reads #580

anoronh4 opened this issue Mar 1, 2023 · 11 comments · Fixed by #611

Comments

@anoronh4
Copy link

anoronh4 commented Mar 1, 2023

I have noticed inconsistent behavior in how umi_tools extract handles the /1 suffix depending on whether the inputs are paired end or single end. when the input is paired-end read id is changed from @SRR5665260.1.36873006/1 to @SRR5665260.1.36873006_CAG, basically removing the read number as it separates them into two distinct files. If I take only the read1 fastq file and run umi_tools extract, the same read becomes @SRR5665260.1.36873006/1_CAG, even though i am using --ignore-read-pair-suffixes in both cases. I am using v1.1.2.

This is causing issues downstream for me when i run alignment with STAR. STAR discards everything after / in the read ID. just wondering if this can be fixed on the umi_tools end.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Mar 2, 2023

Hmmm.. That's strange and I can't reproduce with umi_tools 1.1.3. From the tests directory, if I run the following for paired end and then single end extraction.

umi_tools extract  --extract-method=string --read2-in=scrb_seq_fastq.2.gz --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz --read2-out=delete_me.2.fastq > delete_me.1.fastq

umi_tools extract  --extract-method=string --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz > delete_me.fastq

The headers have the expected format in both cases:

==> delete_me.1.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17

==> delete_me.2.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34

==> delete_me.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17

@TomSmithCGAT
Copy link
Member

Could you please try updating to 1.1.3 to see if this resolves the issue. There was a small update to allow a user-defined umi separator in the output (#548), so it's possible the behaviour has changed, but our interated regression testing didn't pick that up if so 🤷

@anoronh4
Copy link
Author

anoronh4 commented Apr 10, 2023

Sorry for not seeing your reply earlier. I continue to see the issue with both 1.1.3 and 1.1.4. my command is

umi_tools extract \
    -I test_rnaseq_1.fastq.gz \
    -S SAMPLE_SINGLE_END_UMI@SRR5665260.1.36873006@@SAMPLE_SINGLE_END_UMI_T1.umi_extract.fastq.gz \
    --bc-pattern="NNN"  --ignore-read-pair-suffixes

You can grab this same input to test here: https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz . i think in the example you used the header might have been formatted differently than in mine -- there is no flowcell information in the header in my case.

@TomSmithCGAT
Copy link
Member

Thanks for the example 😁 I get the same output.

While I look into it, you can try using the --readNameSeparator option with STAR. The manual states:

default: /
string(s): character(s) separating the part of the read names that will be
trimmed in output (read name after space is always trimmed)```
 
So setting that to a space if possible, or some other character that shouldn't exist (e.g ':') should work as an immediate solution.

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Apr 11, 2023

umi_tools needs to ensure that the read names for paired reads are the same after adding the UMI or else downstream tools will throw errors. The information after the first space in field 1 of a fastq entry is not considered part of the read name as is optional (fastq format). Hence, umi_tools adds the UMIs to the end of the first part of field 1.

I think I'm right in saying there is no strict convention for denoting read 1/2 in field 1. Usually it's e.g /1 at the very end of the field with at least one space in between the read name and the read number. However, sometimes, field 1 can contain the read number at the end of the read name. This is why star removes everything after the '/' and why umi_tools allows the user to ignore the read suffix when checking that the two reads have the same name (--ignore-read-pair-suffixes; #325). This option will also remove the read pair suffixes (/1 / /2) from the read name to ensure they are identical.

In your case, you have a paired end read file with the read numbers as part of the read name. As you show above, this is handled appropriately with --ignore-read-pair-suffixes when processing both reads, but not when processing just one read. The obvious question is why do you want to use just one read? Given that you do though, it would be possible to make umi_tools remove the read pair suffix in single end mode as well.

@IanSudbery, would just need to add the equivalent of

def getReadIDNoSuffix(read):
return(read.identifier.split(' ')[0])
def getReadIDSuffix(read):
read_id = read.identifier.split(' ')[0]
if not read_id[-2:] in ['/1', '/2']:
raise ValueError(
'read suffix must be /1 or /2. Observed: %s' % read_id[-2:])
return(read_id[:-2])
if has_suffix:
getReadID = getReadIDSuffix
else:
getReadID = getReadIDNoSuffix
to the single end fastq parser. Wouldn't add any overhead when the option is not used and mimimal when used. What do you think?

@IanSudbery
Copy link
Member

Not quite that simple. There isn't a single end fastq parser that is the equivalent of the joinedFastqParser. This is presumably where the problem is coming from. The structure here is that the joinedParser is called with two fastq iteratrors as parameters. In the single-end case there is no parser, just the raw iterator.

I don't think there should be any problem with moving the suffix code to the iterator. Except that I'll need to make sure it always gets called correctly.

@TomSmithCGAT
Copy link
Member

Ah, yeah, I was scanning the code too lightly. Nice work on the PR 👍

@anoronh4
Copy link
Author

i installed the latest updates including this PR and unfortunately it errored out.

$ umi_tools extract     -I test_rnaseq_1.fastq.gz     -S extract.fastq.gz     --bc-pattern="NNN"  --ignore-read-pair-suffixes
# UMI-tools version: 1.1.4
# output generated by extract -I test_rnaseq_1.fastq.gz -S extract.fastq.gz --bc-pattern=NNN --ignore-read-pair-suffixes
# job started at Thu Apr 13 12:19:49 2023 on terra -- 786c956e-8418-4a4a-a3e9-493934d944b7
# pid: 15240, system: Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : True
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : NNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='test_rnaseq_1.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='extract.fastq.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_separator                           : _
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2023-04-13 12:19:49,025 INFO Starting barcode extraction
Traceback (most recent call last):
  File "/home/noronhaa/anaconda3/envs/umitools_test/bin/umi_tools", line 33, in <module>
    sys.exit(load_entry_point('umi-tools==1.1.4', 'console_scripts', 'umi_tools')())
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
    module.main(sys.argv)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/extract.py", line 439, in main
    for read in read1s:
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 116, in fastqIterate
    line1 = removeReadIDSuffix(line1)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 92, in removeReadIDSuffix
    'read suffix must be /1 or /2. Observed: %s' % read_id[-2:])
ValueError: read suffix must be /1 or /2. Observed: 1

@anoronh4
Copy link
Author

anoronh4 commented Nov 7, 2023

is there any update here? i've tested out the master branch by installing from a clone into my conda environment (python setup.py install) and then tried to compare the behaviors of the two. seems like there's a problem now with paired end fastqs as well. here's a breakdown of what i see:

version type error correct header command
1.1.4 PE none true umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
1.1.4 SE none false umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master PE ValueError: read suffix must be /1 or /2. Observed: 1 N/A umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master SE ValueError: read suffix must be /1 or /2. Observed: 1 N/A umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes

so now both PE and SE are failing in the latest code.

to use the same data as me, you can grab them using the URLs:

wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz 
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_2.fastq.gz 

@IanSudbery
Copy link
Member

Hi Anna,

Looks like I fixed this back in april and then never merged the fix :(. Sorry.

@anoronh4
Copy link
Author

anoronh4 commented Nov 10, 2023

ok great! i thought the fix was in master because #591 was merged already, and didn't realize any more updates had been planned. i tested out the unmerged branch, looks like it resolves the issue i was experiencing at least.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants