-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
htsget: specify how to request unmapped (unplaced) reads #320
Conversation
@saupchurch interest |
Is it worth talking about requesting a subset of unmapped reads here? Also, guarantees about the ordering of unmapped reads? |
htsget.md
Outdated
@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma | |||
`referenceName` | |||
_optional_ | |||
</td><td> | |||
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b] | |||
The reference sequence name, for example "chr1", "1", or "chrX". If "*", unmapped (unplaced) reads are returned. If unspecified, all reads (mapped and unmapped) are returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for correctly formatting in markdown you should scape the *
(using \*
)
htsget.md
Outdated
@@ -175,7 +175,7 @@ _optional 32-bit unsigned integer_ | |||
</td><td> | |||
The start position of the range on the reference, 0-based, inclusive. | |||
|
|||
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`). | |||
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is unspecified or * (see `referenceName`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quotation marks around *
here, and below too.
htsget.md
Outdated
@@ -185,7 +185,7 @@ _optional 32-bit unsigned integer_ | |||
</td><td> | |||
The end position of the range on the reference, 0-based exclusive. | |||
|
|||
The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is not specified (see `referenceName`). | |||
The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is unspecified or * (see `referenceName`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
I'm not sure I'm loving In any case you should say what you're using unplaced to mean, maybe by adapting the footnote in the SAM spec. |
Having thought about it a little longer: both this and #311 (header-only) are wanting to add something to be used instead of the existing referenceName/start/end query parameters. Perhaps we can come up with a common parameter that would suit both of these use cases? For example, |
I'm not sure of the value of subsets of unmapped reads. The genesis of this request comes from fulfilling an RNA-seq use case. For users who wish to re-align a dataset they need to regenerate the full set of initial reads from the BAM/CRAM file (as the fastq is not in scope of htsget). As long as |
The use case I was imagining was using htsget to serve a distributed computation. Since the unmapped reads can be a significant chunk of the bam, if you want to process them as part of your job you'll want to divide them into multiple shards, otherwise you end up with one extremely long running task dealing with the unmapped reads. For that purpose it would be very useful to be able to
|
This depends on the nature of the index. For cram it's possible to do random access at the block level and I wrote a tool to do this way back when in the early days of the streaming API: https://groups.google.com/forum/#!searchin/ga4gh-dwg-directory-api-and-streaming/cram_filter|sort:date/ga4gh-dwg-directory-api-and-streaming/oWYF1ASAZ0Y/T82vyqMcDQAJ (see the cram_filter comment). However I'm not sure this is easily possible in any of the BAI, CSI style indices as they're indexed positions rather than records. That said, as far as the protocol / language goes, we can define this now and do implementations as and when they mature. |
@jkbonfield That's a good point. The new splitting index can potentially help. If the exact granularity is specified you can figure out what read number the unmapped reads start at by finding the first split before the unmapped, counting to the start of that position, and then using that as a starting point for jumping into the unmapped ones. An exact "give me this number of reads" might not be necessary either. Something like "I want the first of N chunks of reads, do your best to split them evenly for me." would be sufficient for my use case. |
htsget.md
Outdated
@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma | |||
`referenceName` | |||
_optional_ | |||
</td><td> | |||
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b] | |||
The reference sequence name, for example "chr1", "1", or "chrX". If "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced* reads are a subset of unmapped reads; see the [SAM specification](https://github.com/samtools/hts-specs/raw/master/SAMv1.pdf) for details of this concept) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer a description in place here rather than a link. But if it's going to be a link, the URL to use is http://samtools.github.io/hts-specs/SAMv1.pdf.
Our 'htsnexus' server implements this, example usage from its tests: https://github.com/dnanexus-rnd/htsnexus/blob/master/test/htsnexus_integration.t#L68 The asterisk appears not to be an issue in URLs and can be provided on a bash shell by double-quoting it as in that example. |
Seems to work fine in the Python client:
Note that the '*' is URL encoded as
Which looks like it's working, right? We get a few records for the last reference name because of block overlaps before getting the '*' records. |
I guess if you're typing |
LGTM, I vote to merge. I don't think we need to state that '*' must be URL encoded as %2A --- clients should do that automatically before sending off the request anyway. |
That question of URL-encoding is exactly the sort of thing I was concerned about. Do the HTTP standards require I see |
I think the client should be expected to automatically URL encode any |
@jmarshall I understand your doubts about overloading referenceName and the URL encoding; equally I hope it's obvious the approach has an expedient rationale (reflecting SAM). Could I press you to please declare if we should consider these doubts as blockers? (which I'm sure we'd respect of course) |
Jerome is correct that clients will need to percent-encode caller-supplied referenceNames as a matter of course. I was thinking also of the convenience of doing things manually, e.g. when hand-typing a request for unplaced unmapped reads. However I have now tested a basic FORM GET and note that both Safari and Firefox in fact don't percent-encode There remain design considerations. Samtools et al use a We don't have to be succinct in our URL query parameters, so we have split that out into I am ambivalent on this — there are pros and cons on both sides. It would be good to hear other group members' opinions. (I have further editorial comments on the proposed text, should the consensus be to go with |
@lbergelson has suggested extending this in the future to be able to request a shard of the unplaced reads (see #320 (comment)). For example, the protocol could pretend that there is always a scaled total of 10,000 unplaced entries, so that Then, under this PR's proposal, Or under this other idea, IMHO the in-band signalling inherent in servers needing to look at the value of one parameter to figure out how to interpret other parameters is a bad URI design and will make it harder to extend the protocol in future. (I'm happy to be outvoted on this, but that would require other group members to actually declare an opinion!) |
Editorial comments on the text as proposed:
|
@jmarshall I've incorporated these most recent suggestions. Thanks! |
Thanks. Some notes on how the suggestions have been incorporated:
|
LAST CALL on #320 please! |
htsget.md
Outdated
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b] | ||
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all records are returned regardless of position. | ||
|
||
For the reads endpoint: if "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced unmapped* reads are the subset of unmapped reads completely without location information, i.e., with RNAME and POS field values of "*" and 0 respectively. See the [SAM specification](http://samtools.github.io/hts-specs/SAMv1.pdf) for details of placed and unplaced unmapped reads.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I missed this in the discussion, but why the backslash in "*"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha. Markdown, that's why. My bad; I should get some more coffee I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in the first round of review, @magicDGS figured * needed escaping to avoid it being interpreted as markdown emphasis/bold/whatever. I've just tested this in GH Pages's kramdown and GFM and both appear happy either way in this context. ¯\_(ツ)_/¯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting @jmarshall; I just realized in my markdown editor. By the way, there is still an unescaped *
, which also renders in pure markdown as emphasis...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is indeed, later in that line. I've now tried commenting out let html_no_rendering=1
in my .vimrc, and see that vim also highlights that wrongly without the backslash. So I'm now in @magicDGS's camp and agree that the lowest common denominator is that we should escape all of these literal "\*"
characters 😄
I've already weighed in here, but for wrapping up purposes I'm in favour of the change. I agree with the points that @jmarshall makes, but since |
I don't think we should demand an error from the server for specifying |
@dkj: The purpose of requiring an error in this case now is to enable the protocol to define a meaning for it in the future. (This is the same argument as here, about MAPQ for unmapped reads.) If particular implementations wanted to experiment with defining it to mean e.g. such sharding, they should do so in a way that's obviously experimental — cf GitHub's preview APIs being enabled by particular |
@jmarshall, thanks: the use of |
I hate to advocate for complexity, but in the interests of being a grown-up protocol… 😄 |
I escaped the last *, thank you for pointing it out! @magicDGS @jmarshall Any further comments anyone? |
LGTM: squash and merge? |
There's whitespace on the blank line before “For the reads endpoint” that should be tidied up while squash/rebasing. All previously noted editorial trivia has been fixed — thanks. |
so...this thread has grown large...and I'm not sure this is an appropriate place for the following idea....but it might be: in GATK we use "unmapped" to specific the reads that have no sequence location (so neither they nor their mates are mapped). We can run tools only on the unmapped reads by specifying This will break is someone has a sequence named "unmapped"....so I propose adding |
Final agreement for this PR to be merged on the call today. Squash and merge? |
@yfarjoun the idea is relevant to this htsget-specific PR in the way you mention, but it also pertains to the whole SAM spec so I think it should be discussed at top level. The use of the |
@mlin: Please delete the branch when it is no longer relevant. |
@mlin: Please delete this branch if it is no longer relevant. |
...by setting
referenceName=*
. This is kind of the obvious approach following SAM where unmapped reads have RNAME=*, but other ideas are welcome!