UMI FASTQ file #703

Open
wants to merge 2 commits into
base: modules

@adamrtalbot adamrtalbot commented Nov 29, 2022

UMI FASTQ file composed of random 9bp synthetic oligos, all with uniform quality.

Generated by stripping the UMI sequence from the existing FASTQ and turning it into a separate file. This will be a valid reference format for sequencing kits where the UMI is embedded in the index.

Created synthetically to match existing UMI fastq file(s)
@adamrtalbot adamrtalbot requested a review from lescai November 29, 2022 18:42

@lescai lescai left a comment

Hi :)
Could you please describe this file in a little more detail?
If this is the use case where UMIs are present in a third FASTQ, then the test dataset should include three files: forward and reverse reads (without UMIs in the sequence) and a UMI file.

@lescai

lescai commented Nov 30, 2022

Also, the UMI structure is needed in order to process the sequences.

@adamrtalbot
Author

adamrtalbot commented Nov 30, 2022

Yep no problem.

The entire read is the UMI sequence, and it matches the existing FASTQs in the repository. Here are the existing FASTQ files:

# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_1.fastq.gz | head -8
@922332/1
ATTTCAGAGAGAGGATCTCGTGTAGAAATTGCTTTGAGCTGTTCTTTGTCATTTTCCCTTAATTCATTGTCTCTAGCTAGTCTGTTACTCTGTAAAATAAAATAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTAAGGTCAGTG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEAEAEE6AAEEEEE/EAAAA<AEEEEAAEEAAAA<EEE/
@928177/1
ACATAAACAAAAGTATATAAGTAATACATATTTATAAATCTATTAAGAAAGCAAGTAATATGTACCTTAAGAATTTAATGGGAAAATAATTAGACTTACTTTAAATGCCAAAAGAAAAAGTGCCCAATCCTTTGATTAGTCAATGCTTTCT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEE<EE<EEEEEEEEEEEEEEEEEEEEEEEEEAEEE<EEEEEEEAEAAEEEE<EAEAAAAE<<AAEEAEEAEEE
# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_2.fastq.gz | head -8
@922332/2
TATTATTTTATTTTACAGAGTAACAGACTAGCTAGAGACAATGAATTAAGGGAAAATGACAAAGAACAGCTCAAAGCAATTTCTACACGAGATCCTCTCTCTGAAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAACCGCGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEA<AEEE<<<<AAEEEEEEEEEEA<EEEAEAE//A<AAE<6
@928177/2
TGAGATTTTTACTGAAGAAAGCATTGACTAATCAAAGGATTGGGCACTTTTTCTTTTGGCATTTAAAGTAAGTCTAATTATTTTCCCATTAAATTCTTAAGGTACATATTACTTGCTTTCTTAATAGATTTATAAATATGTATTACTTATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEEE/EEEE<EEEEEEEEEEEEE<AAEEEAEAEEEEAEE<AAAEA/AEEEEAEAEEEEEAEEAE/

and here is the new one:

# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_umi.fastq.gz | head -8
@922332
TGACCATTT
+
FFFFFFFFF
@928177
TTTGAACAG
+
FFFFFFFFF

As you can see, the read names in the UMI FASTQ file match those in the existing FASTQ files, which saves us some storage. I generated the FASTQs by:

  • Aligning to the human genome
  • Grouping reads by position
  • Randomly assigning them to a UMI family (poisson distribution, lambda 2)
  • Creating a FASTQ file based on those families.
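
The actual script is not shown here, so the following is only a rough sketch of one way to read the grouping and assignment steps above. It assumes reads arrive as hypothetical `(read_name, chrom, pos)` tuples, that each family size is `1 + Poisson(lambda)` so that no family is empty, and that each family gets a random 9 bp UMI. The function names are illustrative and are not taken from the real script.

```python
import math
import random
from collections import defaultdict


def poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def assign_umi_families(reads, lam=2.0, umi_len=9, seed=0):
    """Map each read name to a synthetic UMI (illustrative only).

    reads: iterable of (read_name, chrom, pos) tuples (assumed input shape).
    Reads sharing (chrom, pos) are split into consecutive families whose
    sizes are 1 + Poisson(lam); the +1 is an assumption to avoid empty families.
    """
    rng = random.Random(seed)

    # Group read names by alignment position.
    by_position = defaultdict(list)
    for name, chrom, pos in reads:
        by_position[(chrom, pos)].append(name)

    umi_of = {}
    for names in by_position.values():
        i = 0
        while i < len(names):
            size = 1 + poisson(rng, lam)
            # Random UMI per family; collisions between families are possible
            # but unlikely, and are not handled in this sketch.
            umi = "".join(rng.choice("ACGT") for _ in range(umi_len))
            for name in names[i:i + size]:
                umi_of[name] = umi
            i += size
    return umi_of
```

Writing the result out as a FASTQ would then be a matter of emitting one four-line record per read name, with the assigned UMI as the sequence and a uniform quality string of the same length.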

I'll upload the script later today and update here. I've checked the method and it seems to work fine in our pipeline.

The bases mask is +T +T +M, where the inputs are test.umi_1.fastq.gz, test.umi_2.fastq.gz and test.umi_umi.fastq.gz. You could be more explicit with 150T 150T 9M, or use the bases mask to cut out the existing UMIs from those files.
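
To make the mask notation concrete, here is a small illustrative parser. It assumes the usual read-structure convention: each segment is a length (or `+` for "the rest of the read", allowed only on the last segment) followed by an operator letter, with `T` for template, `M` for molecular barcode (UMI), `S` for skipped bases and `B` for sample barcode. It is a simplified sketch, not the implementation used by any particular tool, and the function names are hypothetical.

```python
import re

# One segment: a positive integer or '+', followed by an operator letter.
SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Split e.g. '23S+T' into [(23, 'S'), (None, 'T')]; None means 'rest of read'."""
    segments = []
    pos = 0
    for match in SEGMENT.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"invalid read structure: {structure!r}")
        length, kind = match.groups()
        segments.append((None if length == "+" else int(length), kind))
        pos = match.end()
    if not segments or pos != len(structure):
        raise ValueError(f"invalid read structure: {structure!r}")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return segments


def split_read(bases, structure):
    """Return (operator, subsequence) pairs for one read."""
    parts = []
    offset = 0
    for length, kind in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        if end > len(bases):
            raise ValueError("read is shorter than its read structure")
        parts.append((kind, bases[offset:end]))
        offset = end
    return parts
```

With this sketch, `split_read("TGACCATTT", "+M")` treats the whole read as the UMI, which is the situation described for the third FASTQ file.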

@adamrtalbot
Author

I've just checked your development branch, and I think the syntax would be:
ext.args = "--read-structures +T 23S+T +M"

This means it will have the same UMI sequences.
@adamrtalbot
Author

Slight change: I've extracted the first 12 bp of each read in test.umi_2.fastq.gz and put them in the UMI FASTQ file. It should now have exactly the same UMI sequences as the existing FASTQ and should create almost identical consensus reads.

# test-datasets/data/genomics/homo_sapiens/illumina/fastq$ zcat test.umi_umi.fastq.gz | head -8
@922332
TATTATTTTATT
+
AAAAAEEEEEEE
@928177
TGAGATTTTTAC
+
AAAAAEEEEEEE
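
The extraction itself can be sketched in a few lines. This is an illustration rather than the script that produced the file: it assumes plain four-line FASTQ records, keeps the first 12 bases and qualities of each record, and drops a trailing `/1` or `/2` from the read name so that names look like those in the output above. The function name `extract_umi_records` is hypothetical.

```python
def extract_umi_records(fastq_lines, umi_len=12):
    """Yield (name, bases, plus, quals) tuples truncated to the first umi_len bases.

    fastq_lines: iterable of text lines from an uncompressed four-line FASTQ.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Keep only the read name and drop a trailing /1 or /2 mate suffix.
        name = header.split()[0]
        if name[-2:] in ("/1", "/2"):
            name = name[:-2]
        yield name, bases[:umi_len], "+", quals[:umi_len]
```

For a gzipped input, the lines could come from `gzip.open(path, "rt")`, and each yielded tuple can be written back out with `"\n".join(record)`.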

@lescai I've checked your subworkflow in development and it already works with three FASTQ files nicely! We just have to add an additional test.

@adamrtalbot
Author

@lescai did you have a chance to check this?

@adamrtalbot adamrtalbot mentioned this pull request May 14, 2024
10 tasks
3 participants