FreeBayes requires unique lane IDs #311

Closed
szymonwieloch opened this issue Nov 30, 2020 · 7 comments
Labels
help wanted Extra attention is needed input validation

@szymonwieloch

While running the FreeBayes tool, I get the following error message:

ERROR(freebayes): multiple samples (SM) map to the same read group (RG)

  samples e8f63b36-a0b9-406e-905e-69b7621f33ba_sample and e8f63b36-a0b9-406e-905e-69b7621f33ba_control map to 1

  As freebayes operates on a virtually merged stream of its input files,
  it will not be possible to determine what sample an alignment belongs to
  at runtime.

  To resolve the issue, ensure that RG ids are unique to one sample
  across all the input files to freebayes.

  See bamaddrg (https://github.com/ekg/bamaddrg) for a method which can
  add RG tags to alignments.

I've done some experimentation and consultation and the reason for this problem seems to be my input file configuration:

e8f63b36-a0b9-406e-905e-69b7621f33ba    XX      1       e8f63b36-a0b9-406e-905e-69b7621f33ba_sample     1       e8f63b36-a0b9-406e-905e-69b7621f33ba_R1.fastq.gz        e8f63b36-a0b9-406e-905e-69b7621f33ba_R2.fastq.gz
e8f63b36-a0b9-406e-905e-69b7621f33ba    XX      0       e8f63b36-a0b9-406e-905e-69b7621f33ba_control    1       d7caf077-cf7c-4272-a74e-3d31ab61852c_R1.fastq.gz        d7caf077-cf7c-4272-a74e-3d31ab61852c_R2.fastq.gz
a35760da-3f13-40a3-8537-e7e841baa6a1    XX      1       a35760da-3f13-40a3-8537-e7e841baa6a1_sample     1       a35760da-3f13-40a3-8537-e7e841baa6a1_R1.fastq.gz        a35760da-3f13-40a3-8537-e7e841baa6a1_R2.fastq.gz
a35760da-3f13-40a3-8537-e7e841baa6a1    XX      0       a35760da-3f13-40a3-8537-e7e841baa6a1_control    1       a80d728b-8dc0-428d-acdb-07d03bfe19a3_R1.fastq.gz        a80d728b-8dc0-428d-acdb-07d03bfe19a3_R2.fastq.gz

Strictly speaking, the problem is related to the lane column (5th). After replacing the lane values 1, 1, 1, 1 with 1, 2, 3, 4, the pipeline works fine.
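
The collision described above can be checked before launching a run. The following is a minimal sketch, not part of Sarek: `find_lane_collisions` is a hypothetical helper, the shortened names `P1` and `C1` are made up, and it assumes the sample name sits in the 4th column and the lane in the 5th, as the TSV rows above suggest.

```python
import csv
import io
from collections import defaultdict


def find_lane_collisions(tsv_text):
    """Return {lane: sorted sample names} for every lane value shared by more than one sample.

    Assumed column layout (an inference from the TSV above, not a documented API):
    subject, sex, status, sample, lane, fastq1, fastq2.
    """
    samples_by_lane = defaultdict(set)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        sample, lane = row[3], row[4]
        samples_by_lane[lane].add(sample)
    return {lane: sorted(names) for lane, names in samples_by_lane.items() if len(names) > 1}


# Shortened, made-up stand-in for the first two rows of the TSV above.
example = "\n".join([
    "P1\tXX\t1\tP1_sample\t1\tP1_R1.fastq.gz\tP1_R2.fastq.gz",
    "P1\tXX\t0\tP1_control\t1\tC1_R1.fastq.gz\tC1_R2.fastq.gz",
])
print(find_lane_collisions(example))
```

An empty result means every lane value belongs to a single sample, which is the condition that made the pipeline work here.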

@apeltzer suggested that this may be a feature, not a bug:

I suspect that Sarek merges the two resulting BAM files (one for the tumor, one for normal) into a single BAM file and thus needs unique read groups to do so.

Still, we would like to confirm this, because the documentation is not clear about it (at least I haven't found anything that explains this behavior). A fix in the docs would be really nice too. At the moment, my observation is that the FreeBayes tool requires a unique lane column value for each sample. I was told that @maxulysse may be the right person to ask about it.

Big thanks in advance!

@maxulysse maxulysse added the help wanted Extra attention is needed label Jan 26, 2021
@marchoeppner

marchoeppner commented Jul 7, 2021

Just to add to this, since I stumbled across the exact same problem.

Basically, Sarek currently uses the lane as the read group ID. That does not seem to be a good solution, as evidenced by the error raised by FreeBayes.

A lane is commonly understood to be a number between 1 and 4 (at most). The read group ID, by definition, must be a unique identifier of that particular collection of reads. The most unambiguous solution would be to use flowcell ID + library ID + lane. Technically, that information can be derived from the FASTQ input files. A somewhat crummy workaround may be to use sampleID + lane, which are already present in the TSV format. That is technically not a truly unique ID, but it would suppress the error and, within a given Sarek run, probably not cause any problems.

@marchoeppner

To clarify further, the first line of a FASTQ file usually looks like this:

@A00686:168:H3HGMDSX2:3:1101:2501:1000 1:N:0:GTCTAATGGC+CCTGACCACT

So that could be translated into the following read group id:
H3HGMDSX2.3.GTCTAATGGC+CCTGACCACT

Or, when using a (made-up) library ID (usually part of the fastq file name) instead of the barcode:

H3HGMDSX2.3.J0367871
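
The translation above could be sketched roughly as follows. This is an illustration only, not Sarek code: `read_group_id_from_header` is a hypothetical helper, and it assumes the header layout shown above, with the flowcell as the 3rd and the lane as the 4th colon-separated field of the read name, and the barcode as the last colon-separated field after the space.

```python
def read_group_id_from_header(header, library_id=None):
    """Build a 'flowcell.lane.suffix' read group ID from the first line of a FASTQ file.

    The suffix is library_id if given, otherwise the barcode found in the header.
    Assumes a header like '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode'.
    """
    name, _, comment = header.strip().partition(" ")
    fields = name.lstrip("@").split(":")
    if len(fields) < 4:
        raise ValueError(f"unexpected FASTQ header layout: {header!r}")
    flowcell, lane = fields[2], fields[3]
    # The barcode is the last colon-separated field of the comment, if there is one.
    barcode = comment.split(":")[-1] if comment else ""
    suffix = library_id or barcode
    return ".".join(part for part in (flowcell, lane, suffix) if part)


header = "@A00686:168:H3HGMDSX2:3:1101:2501:1000 1:N:0:GTCTAATGGC+CCTGACCACT"
print(read_group_id_from_header(header))
print(read_group_id_from_header(header, library_id="J0367871"))
```

Headers produced by older or non-standard pipelines may not follow this layout, so a real implementation would need a fallback (for example sampleID + lane).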

@cjfields

cjfields commented Apr 19, 2022

Just to note, I also see this with Sarek-generated BAMs when running outside of Sarek (v2.7.1), using Sentieon TNscope:

D1 XY 0 D1 L00M D1_L00M_R1_001.fastq.gz D1_L00M_R2_001.fastq.gz
D1 XY 0 D1 L00M D5_L00M_R1_001.fastq.gz D5_L00M_R2_001.fastq.gz
sentieon driver -t 12 -r Homo_sapiens_assembly38.fasta -i D1.deduped.bam -q D1.recal.table -i D5.deduped.bam -q D5.recal.table --algo TNscope --tumor_sample D1 --normal_sample D5 --dbsnp dbsnp_146.hg38.vcf.gz TNscope_D1_vs_D5.vcf
....
Readgroup L00M with different attributes is present in multiple bam files: D1.deduped.bam, D5.deduped.bam 
Error: Invalid input BAM files

Relevant read groups in BAM headers:

$ samtools view -H D1.deduped.bam | grep '^@RG'
@RG	ID:L00M	PU:L00M	SM:D1	LB:D1	PL:illumina
$ samtools view -H D5.deduped.bam | grep '^@RG'
@RG	ID:L00M	PU:L00M	SM:D5	LB:D5	PL:illumina

NOTE: there is a work-around for Sentieon, so not horribly pressing, but I'm sure this might bite others a bit more
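
For anyone who wants to catch this before handing BAMs to a caller, here is a small sketch that compares the `@RG` lines of several headers and reports read group IDs that map to more than one sample. The helper names are hypothetical, and the header lines are assumed to have been collected beforehand, e.g. with the `samtools view -H` command shown above.

```python
from collections import defaultdict


def parse_rg_line(line):
    """Turn one tab-separated '@RG' header line into a dict of tag -> value."""
    return dict(field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:])


def conflicting_read_groups(headers):
    """Return {read group ID: sorted sample names} for IDs shared by more than one sample.

    headers maps a BAM file name to a list of its header lines.
    """
    samples_by_id = defaultdict(set)
    for lines in headers.values():
        for line in lines:
            if not line.startswith("@RG"):
                continue  # ignore @HD, @SQ, @PG and other header lines
            tags = parse_rg_line(line)
            samples_by_id[tags.get("ID", "")].add(tags.get("SM", ""))
    return {rg: sorted(names) for rg, names in samples_by_id.items() if len(names) > 1}


# The two @RG lines from the BAM headers above.
headers = {
    "D1.deduped.bam": ["@RG\tID:L00M\tPU:L00M\tSM:D1\tLB:D1\tPL:illumina"],
    "D5.deduped.bam": ["@RG\tID:L00M\tPU:L00M\tSM:D5\tLB:D5\tPL:illumina"],
}
print(conflicting_read_groups(headers))
```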

@FriederikeHanssen
Contributor

So the flowcell ID was previously retrieved here, but only when the FASTQ files were provided as a wildcard (no TSV input). @maxulysse, do you recall why you didn't run this piece of code on all FASTQ files?

sarek/main.nf

Line 4246 in 68b9930

def flowcellLaneFromFastq(path) {

And then a random number was added for good measure, so in this case the problem should never occur:

sarek/main.nf

Line 4150 in 68b9930

rgId = "${flowcell}.${sampleId}.${lane}.${random}"

@maxulysse
Member

Yes, this was only done when we had no idea how many FASTQ pairs we had (so folder input).

@FriederikeHanssen
Contributor

Any reason not to always retrieve this info? (Although sampleID-laneID for read group ID works as expected)

@FriederikeHanssen
Contributor

Fixed in #549

5 participants