Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-standard chromosome names containing [:-] characters #1630

Merged
merged 4 commits into from
Jun 27, 2023

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Jun 5, 2023

Note hts_parse_region() cannot be used because it requires the header and without the header the caller does not learn the contig name.

Resolves #1620

Note hts_parse_region() cannot be used because it requires the header
and without the header the caller does not learn the contig name.

Resolves samtools#1620
@jkbonfield
Copy link
Contributor

Apparently I had misremembered the public hts_parse_region API as having the header as an optional field. Pity :/

I'm inclined to think that if we have a requirement to parse regions without a header being present, then perhaps this API should be modified to cope with it. My preference would be to avoid code duplication where possible, but I haven't really looked in detail to know how much work is involved and whether we just replace duplication in one place with it in another (eg a separate code-path for headerless parsing).

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 6, 2023

Can you give me an example of how to trigger this? I'm unable to do so, so cannot test the fix.

I tried modifying test/test-bcf-sr.pl as follows:

ubuntu@jkb-dev-pmu:~/htslib$ git diff
diff --git a/htscodecs b/htscodecs
--- a/htscodecs
+++ b/htscodecs
@@ -1 +1 @@
-Subproject commit d4aed585929e2dab9dd8e6a2b74484dfc347c0f2
+Subproject commit d4aed585929e2dab9dd8e6a2b74484dfc347c0f2-dirty
diff --git a/test/test-bcf-sr.pl b/test/test-bcf-sr.pl
index cd9859c..f65d120 100755
--- a/test/test-bcf-sr.pl
+++ b/test/test-bcf-sr.pl
@@ -119,7 +119,7 @@ sub save_vcf
     open(my $fh,"| $FindBin::Bin/../bgzip -c > $fname") or error("$FindBin::Bin/../bgzip -c > $fname: !");
     print $fh qq[##fileformat=VCFv4.3\n];
     print $fh qq[##FILTER=<ID=PASS,Description="All filters passed">\n];
-    print $fh qq[##contig=<ID=1>\n];
+    print $fh qq[##contig=<ID=1:2>\n];
     print $fh qq[##contig=<ID=2>\n];
     print $fh '#'. join("\t", qw(CHROM POS ID  REF ALT QUAL    FILTER  INFO))."\n";
     for my $var (@$vars)
@@ -133,7 +133,7 @@ sub save_vcf
             $ref = $xref;
             push @alts,$alt;
         }
-        print $fh join("\t", (1,100,'.',$ref,join(',',@alts),'.','.','.'))."\n";
+        print $fh join("\t", ("1:2",100,'.',$ref,join(',',@alts),'.','.','.'))."\n";
     }
     for my $var (@$vars)
     {
@@ -146,7 +146,7 @@ sub save_vcf
             $ref = $xref;
             push @alts,$alt;
         }
-        print $fh join("\t", (1,300,'.',$ref,join(',',@alts),'.','.','.'))."\n";
+        print $fh join("\t", ("1:2",300,'.',$ref,join(',',@alts),'.','.','.'))."\n";
     }
     for my $var (@$vars)
     {

So my files look like this:

ubuntu@jkb-dev-pmu:~/htslib$ zcat /tmp/a/1.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1:2>
##contig=<ID=2>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1:2	100	.	A	T	.	.	.
1:2	100	.	A	G,T	.	.	.
1:2	300	.	A	T	.	.	.
1:2	300	.	A	G,T	.	.	.
2	100	.	A	T	.	.	.
2	100	.	A	G,T	.	.	.

Running ./test/test-bcf-sr.pl -t /tmp/a -s 0 works still. Although I see lots of errors in the output files, but it returns 0 exit status. The original also produced the same errors and also succeeded. I don't understand what's happening here in the tests. Are the failures "expected failures" and hence passes? I'm assuming so, as most checks don't redirect stderr, so the ones that do are presumably expecting an error and deliberately silencing it.

[Edit: yes, I see it now in "Failed to detect $badness: $cmd". It's specifically checking on duff input to make sure it doesn't pass. That's good. :-) ]

Is it simply that test-bcf-sr.c doesn't test the region query part of the API, so colons in names never crops up as an issue? I don't particularly like having the test for this function in bcftools, as it means we can't do CI for this code.

@pd3
Copy link
Member Author

pd3 commented Jun 15, 2023

I just checked the code and can confirm that test-bcf-sr does not test indexes at all. That's done properly only through the bcftools test suite, and there are hundreds of them. If you want to test it natively in htslib, best might be to write a simplified version of vcfview.c. Tests for this pull requests are https://github.com/samtools/bcftools/pull/1938/files.

pd3 and others added 3 commits June 15, 2023 10:18
Add -O,--output-fmt option so it can write vcf or bcf as well as
its original summary format.

Add -o,--output option so it's possible to write to a file without
shell redirection.

Add --args option so input files can be listed directly on the
command line instead of via a fofn, to make basic tests easier.

Add -r,--regions and -t,--targets options, which behave the
same as the equivalents in `bcftools view`.

Add the --no-index option to the usage text.

Simplify writing the original format.  Everything can be sent
directly to the output file without going via a kstring.  The
output writing parts are also moved into separate functions to
keep main() from getting too big.

Add a few extra error checks.

Call exit(EXIT_FAILURE) on failure, not exit(-1).  Make the -h
option return success.
Add some tests to exercise the --regions / --targets synced reader
options.  Currently this only includes tests for the chromosomes
with [:-] characters in the name, but it could be expanded easily
to do others.  Test files have been borrowed from pull request
samtools/bcftools#1938.

Move the synced reader no-index tests from test-bcf-sr.pl to
test.pl.  The former isn't a good place for them as it gets
called 10 times, but the no-index test only needs to run once.
It also allows the code running the test to be simplified a bit.

Also fix the exit code on test-bcf-sr.pl failure from -1 to 1.

Co-authored-by: Petr Danecek <pd3@sanger.ac.uk>
@daviesrob
Copy link
Member

Merge commit removed...

@jkbonfield jkbonfield merged commit e86126b into samtools:develop Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Synced reader does not support regions with contigs including colons
3 participants