Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htsget: specify how to request unmapped (unplaced) reads #320

Merged
merged 14 commits into from
Jan 23, 2019

Conversation

mlin
Copy link
Member

@mlin mlin commented Jun 1, 2018

...by setting referenceName=*. This is kind of the obvious approach following SAM where unmapped reads have RNAME=*, but other ideas are welcome!

@mlin mlin added the htsget label Jun 1, 2018
@mlin
Copy link
Member Author

mlin commented Jun 1, 2018

@saupchurch interest

@lbergelson
Copy link
Member

Is it worth talking about requesting a subset of unmapped reads here? Also, guarantees about the ordering of unmapped reads?

htsget.md Outdated
@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma
`referenceName`
_optional_
</td><td>
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
The reference sequence name, for example "chr1", "1", or "chrX". If "*", unmapped (unplaced) reads are returned. If unspecified, all reads (mapped and unmapped) are returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for correctly formatting in markdown you should scape the * (using \*)

htsget.md Outdated
@@ -175,7 +175,7 @@ _optional 32-bit unsigned integer_
</td><td>
The start position of the range on the reference, 0-based, inclusive.

The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified (see `referenceName`).
The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is unspecified or * (see `referenceName`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quotation marks around * here, and below too.

htsget.md Outdated
@@ -185,7 +185,7 @@ _optional 32-bit unsigned integer_
</td><td>
The end position of the range on the reference, 0-based exclusive.

The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is not specified (see `referenceName`).
The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is unspecified or * (see `referenceName`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@jmarshall
Copy link
Member

I'm not sure I'm loving referenceName=* — this is a textual URL and we have plenty of space to invent a new key like whatIwantis=unmapped (to be used instead of referenceName/start/end), but I haven't thought about it long enough to have a specific idea.

In any case you should say what you're using unplaced to mean, maybe by adapting the footnote in the SAM spec.

@jmarshall
Copy link
Member

jmarshall commented Jun 5, 2018

Having thought about it a little longer: both this and #311 (header-only) are wanting to add something to be used instead of the existing referenceName/start/end query parameters. Perhaps we can come up with a common parameter that would suit both of these use cases?

For example, class=unplaced and class=headers. Other name suggestions are welcome; I tried to think of something generic and similar to the existing orthogonal to referenceName/start/end parameters (namely fields and tags).

@saupchurch
Copy link

I'm not sure of the value of subsets of unmapped reads. The genesis of this request comes from fulfilling an RNA-seq use case. For users who wish to re-align a dataset they need to regenerate the full set of initial reads from the BAM/CRAM file (as the fastq is not in scope of htsget). As long as referenceName=* or equivalent returns everything we will be able to support this use case.

@lbergelson
Copy link
Member

lbergelson commented Jun 14, 2018

The use case I was imagining was using htsget to serve a distributed computation. Since the unmapped reads can be a significant chunk of the bam, if you want to process them as part of your job you'll want to divide them into multiple shards, otherwise you end up with one extremely long running task dealing with the unmapped reads.

For that purpose it would be very useful to be able to

  1. find out how many unmapped reads there are
  2. request a subset of them at a time

@jkbonfield
Copy link
Contributor

This depends on the nature of the index.

For cram it's possible to do random access at the block level and I wrote a tool to do this way back when in the early days of the streaming API: https://groups.google.com/forum/#!searchin/ga4gh-dwg-directory-api-and-streaming/cram_filter|sort:date/ga4gh-dwg-directory-api-and-streaming/oWYF1ASAZ0Y/T82vyqMcDQAJ (see the cram_filter comment).

However I'm not sure this is easily possible in any of the BAI, CSI style indices as they're indexed positions rather than records.

That said, as far as the protocol / language goes, we can define this now and do implementations as and when they mature.

@lbergelson
Copy link
Member

lbergelson commented Jun 15, 2018

@jkbonfield That's a good point. The new splitting index can potentially help. If the exact granularity is specified you can figure out what read number the unmapped reads start at by finding the first split before the unmapped, counting to the start of that position, and then using that as a starting point for jumping into the unmapped ones.

An exact "give me this number of reads" might not be necessary either. Something like "I want the first of N chunks of reads, do your best to split them evenly for me." would be sufficient for my use case.

htsget.md Outdated
@@ -165,7 +165,7 @@ The server SHOULD reply with an `UnsupportedFormat` error if the requested forma
`referenceName`
_optional_
</td><td>
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
The reference sequence name, for example "chr1", "1", or "chrX". If "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced* reads are a subset of unmapped reads; see the [SAM specification](https://github.com/samtools/hts-specs/raw/master/SAMv1.pdf) for details of this concept)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a description in place here rather than a link. But if it's going to be a link, the URL to use is http://samtools.github.io/hts-specs/SAMv1.pdf.

@mlin
Copy link
Member Author

mlin commented Jul 11, 2018

Our 'htsnexus' server implements this, example usage from its tests: https://github.com/dnanexus-rnd/htsnexus/blob/master/test/htsnexus_integration.t#L68

The asterisk appears not to be an issue in URLs and can be provided on a bash shell by double-quoting it as in that example.

@jeromekelleher
Copy link
Member

Seems to work fine in the Python client:

$ htsget http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878 -r '*' -O tmp.bam -vv
2018-07-11 17:34:23,958 handle_ticket_request(url=http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A, headers={})
2018-07-11 17:34:23,961 Starting new HTTP connection (1): htsnexus.rnd.dnanex.us
2018-07-11 17:34:24,157 http://htsnexus.rnd.dnanex.us:80 "GET /v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A HTTP/1.1" 200 None
2018-07-11 17:34:24,160 handle_data_uri(application/octet-stream;base64, length=8850)
2018-07-11 17:34:24,160 handle_http_url(url=https://dl.dnanex.us/F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam, headers={'referer': 'http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878?referenceName=%2A', 'range': 'bytes=140474753596-140786687527'})
2018-07-11 17:34:24,163 Starting new HTTPS connection (1): dl.dnanex.us
2018-07-11 17:34:31,820 https://dl.dnanex.us:443 "GET /F/D/Pb1QjgQx9j2bZ8Q44x50xf4fQV3YZBgkvkz23FFB/NA12878_recompressed.bam HTTP/1.1" 206 311933932

Note that the '*' is URL encoded as referenceName=%2A here. Looking at the downloaded file we get

$ samtools view tmp.bam | cut -f 1,2,3 | head -n 40
HK3T5CCXX160204:8:2215:26545:9607	97	NC_007605
HJYFJCCXX160204:8:2123:12763:31441	73	NC_007605
HJYFJCCXX160204:8:2123:12763:31441	133	NC_007605
HK35MCCXX160204:5:2221:26778:11242	345	NC_007605
HK35NCCXX160204:2:1206:6664:3577	409	NC_007605
HK3T5CCXX160204:6:2203:15666:37348	345	NC_007605
HK3T5CCXX160204:4:2120:5041:37260	409	NC_007605
HK3T5CCXX160204:6:2204:30218:57319	165	NC_007605
HK3T5CCXX160204:6:2204:30218:57319	1113	NC_007605
HJYFJCCXX160204:8:1121:9993:6636	161	NC_007605
HK35MCCXX160204:2:2117:14387:54612	345	NC_007605
HK3T5CCXX160204:2:1106:25296:72983	409	NC_007605
HK3T5CCXX160204:5:2117:11201:31195	97	NC_007605
HK35MCCXX160204:6:1217:10348:49127	147	NC_007605
HK35MCCXX160204:3:1117:17289:55069	163	NC_007605
HK35NCCXX160204:1:2221:21542:59833	409	NC_007605
HJYFJCCXX160204:4:1120:12936:70346	97	NC_007605
HK3T5CCXX160204:2:2114:21937:40899	409	NC_007605
HK35MCCXX160204:5:2108:25702:26290	161	NC_007605
HK3T5CCXX160204:3:2117:8237:46560	147	NC_007605
HJYFJCCXX160204:4:1101:10003:15162	77	*
HJYFJCCXX160204:4:1101:10003:15162	141	*
HJYFJCCXX160204:4:1101:10013:22423	77	*
HJYFJCCXX160204:4:1101:10013:22423	141	*
HJYFJCCXX160204:4:1101:10135:54700	77	*
HJYFJCCXX160204:4:1101:10135:54700	141	*
HJYFJCCXX160204:4:1101:10155:51711	77	*
HJYFJCCXX160204:4:1101:10155:51711	141	*
HJYFJCCXX160204:4:1101:10257:22739	77	*
HJYFJCCXX160204:4:1101:10257:22739	141	*
HJYFJCCXX160204:4:1101:10277:44292	77	*
HJYFJCCXX160204:4:1101:10277:44292	141	*
HJYFJCCXX160204:4:1101:10368:6126	77	*
HJYFJCCXX160204:4:1101:10368:6126	141	*
HJYFJCCXX160204:4:1101:10429:16217	77	*
HJYFJCCXX160204:4:1101:10429:16217	141	*
HJYFJCCXX160204:4:1101:10510:41462	77	*
HJYFJCCXX160204:4:1101:10510:41462	141	*
HJYFJCCXX160204:4:1101:10541:19399	77	*
HJYFJCCXX160204:4:1101:10541:19399	141	*

Which looks like it's working, right? We get a few records for the last reference name because of block overlaps before getting the '*' records.

@jmarshall
Copy link
Member

I guess if you're typing https://example.org/reads/foo.bam?referenceName=chr1&end=1000 on the command line you're already quoting it to escape the ? and &

@jeromekelleher
Copy link
Member

LGTM, I vote to merge. I don't think we need to state that '*' must be URL encoded as %2A --- clients should do that automatically before sending off the request anyway.

@jmarshall
Copy link
Member

That question of URL-encoding is exactly the sort of thing I was concerned about.

Do the HTTP standards require * data to be URL-encoded in the query-part of an URL? What happens if you don't? Does it behoove htsget to assign such a character special meaning in its specification?

I see * is in RFC 3986 as a sub-delims character, which should be precent-encoded “unless these characters are specifically allowed by the URI scheme to represent data in that component”. It's unclear whether they are specifically allowed in HTTP(S) query parts. Hmmm…

@jeromekelleher
Copy link
Member

I think the client should be expected to automatically URL encode any referenceName string, regardless. There's nothing preventing a referenceName of "a:b+c" (probably; I'm not sure what the SAM spec allows and doesn't), which would have to be URL encoded to be transferred correctly. There's no reason to treat '*' any differently. Clients should be expected to URL encode referenceName strings as a matter of course, this is just part of using HTTP.

@mlin
Copy link
Member Author

mlin commented Jul 13, 2018

@jmarshall I understand your doubts about overloading referenceName and the URL encoding; equally I hope it's obvious the approach has an expedient rationale (reflecting SAM). Could I press you to please declare if we should consider these doubts as blockers? (which I'm sure we'd respect of course)

@jmarshall
Copy link
Member

jmarshall commented Jul 19, 2018

Jerome is correct that clients will need to percent-encode caller-supplied referenceNames as a matter of course. I was thinking also of the convenience of doing things manually, e.g. when hand-typing a request for unplaced unmapped reads. However I have now tested a basic FORM GET and note that both Safari and Firefox in fact don't percent-encode * in HTTP query parts. So I withdraw any percent-encoding-related concerns about giving the * character a special meaning in our protocol.

There remain design considerations.

Samtools et al use a chr[:start-end] notation because it is succinct, human-friendly, and quick and easy to type. HTSlib (samtools/htslib@c8bdd76) rather obscurely shoehorned * into that as a way to ask for the unplaced reads at the end of the file, because it could be fitted into the syntax (as in-band signalling, but happily * is reserved as a reference name).

We don't have to be succinct in our URL query parameters, so we have split that out into referenceName/start/end. So we have a choice as to whether to follow the HTSlib practice with referenceName=* (which the uninitiated may think is asking for all references!) or to further split this out as something similarly textual, e.g., referenceUnmapped=only or class=unplaced, thus avoiding in-band signalling.

I am ambivalent on this — there are pros and cons on both sides. It would be good to hear other group members' opinions.

(I have further editorial comments on the proposed text, should the consensus be to go with *.)

@jmarshall
Copy link
Member

@lbergelson has suggested extending this in the future to be able to request a shard of the unplaced reads (see #320 (comment)). For example, the protocol could pretend that there is always a scaled total of 10,000 unplaced entries, so that start=1000/end=2000 would represent asking for the second 10% shard out of the pool of unplaced reads.

Then, under this PR's proposal, ?referenceName=*&start=1000&end=2000 would request this second 10% shard. Servers determining how to interpret start/end (whether as real coordinates or scaled counts) would need to look at the value of referenceName and see whether it was * or not.

Or under this other idea, ?class=unplaced&start=1000&end=2000 would request the same thing. Servers determining how to interpret start/end would look at which one of referenceName or class is present.

IMHO the in-band signalling inherent in servers needing to look at the value of one parameter to figure out how to interpret other parameters is a bad URI design and will make it harder to extend the protocol in future.

(I'm happy to be outvoted on this, but that would require other group members to actually declare an opinion!)

@jmarshall
Copy link
Member

jmarshall commented Sep 4, 2018

Editorial comments on the text as proposed:

  • Better as “For the reads endpoint” or similar, rather than “Reads formats only” — this is about the type of data, not the particular formats.

  • The proposal is for the spec to use the concept of unplaced unmapped reads, so I really think it behooves the spec to define the concept itself rather than merely point to the SAM spec. Something like

    Unplaced unmapped reads are the records completely without location information, i.e., with RNAME and POS field values of "*" and 0. (See the SAM specification for further details of placed and unplaced unmapped reads.)

  • The server SHOULD respond with an InvalidInput error if start is specified and a reference is unspecified or "*" (see referenceName).

    This is no longer grammatical. Maybe something like
    if start is specified and either no reference is specified or referenceName is "*".

    (Recall that using generic reference rather than specific referenceName in this piece of text leaves the door open for alternative ways of specifying references; cf htsget referenceMD5 query parameter #210.)

@mlin
Copy link
Member Author

mlin commented Sep 16, 2018

@jmarshall I've incorporated these most recent suggestions. Thanks!

@jmarshall
Copy link
Member

Thanks. Some notes on how the suggestions have been incorporated:

  1. The term defined in the SAM spec is “unplaced unmapped read”. I think it's worth using that explicit term with both adjectives.

  2. IMHO being specific in saying that the SAM spec has details of “placed and unplaced unmapped reads” is better than details of some “concept”. (Also missing punctuation.)

  3. The comma added to if start is specified and either no reference is specified, or referenceName is "*" adds ambiguity and confusion about which of and and or has precedence.

  4. The previous “no longer grammatical” comment of course applies to the end paragraph as well.

@mlin
Copy link
Member Author

mlin commented Oct 2, 2018

LAST CALL on #320 please!

htsget.md Outdated
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b]
The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all records are returned regardless of position.

For the reads endpoint: if "\*", unplaced unmapped reads are returned. If unspecified, all reads (mapped and unmapped) are returned. (*Unplaced unmapped* reads are the subset of unmapped reads completely without location information, i.e., with RNAME and POS field values of "*" and 0 respectively. See the [SAM specification](http://samtools.github.io/hts-specs/SAMv1.pdf) for details of placed and unplaced unmapped reads.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I missed this in the discussion, but why the backslash in "*"?

Copy link
Member

@jeromekelleher jeromekelleher Oct 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha. Markdown, that's why. My bad; I should get some more coffee I think...

Copy link
Member

@jmarshall jmarshall Oct 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back in the first round of review, @magicDGS figured * needed escaping to avoid it being interpreted as markdown emphasis/bold/whatever. I've just tested this in GH Pages's kramdown and GFM and both appear happy either way in this context. ¯\_(ツ)_/¯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting @jmarshall; I just realized in my markdown editor. By the way, there is still an unescaped *, which also renders in pure markdown as emphasis...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is indeed, later in that line. I've now tried commenting out let html_no_rendering=1 in my .vimrc, and see that vim also highlights that wrongly without the backslash. So I'm now in @magicDGS's camp and agree that the lowest common denominator is that we should escape all of these literal "\*" characters 😄

@jeromekelleher
Copy link
Member

I've already weighed in here, but for wrapping up purposes I'm in favour of the change. I agree with the points that @jmarshall makes, but since reference=* is already embedded in the SAM/BAM/CRAM workflow it seems sensible to keep compatibility with that.

@dkj
Copy link
Member

dkj commented Oct 2, 2018

I don't think we should demand an error from the server for specifying start or end with unplaced referenceName=* requests #320 (comment) , although the server could decide to return an error. Prefer to stick with error response requirement if no referenceName and start or end specified. (But would rather this were merged as is than blocked on this!)

@jmarshall
Copy link
Member

@dkj: The purpose of requiring an error in this case now is to enable the protocol to define a meaning for it in the future. (This is the same argument as here, about MAPQ for unmapped reads.)

If particular implementations wanted to experiment with defining it to mean e.g. such sharding, they should do so in a way that's obviously experimental — cf GitHub's preview APIs being enabled by particular Accept keywords.

@dkj
Copy link
Member

dkj commented Nov 28, 2018

@jmarshall, thanks: the use of Accept headers would address my concern about not being able to move the protocol on, and so I'm content with the error from the server for specifying start or end with unplaced referenceName=*

@jmarshall
Copy link
Member

I hate to advocate for complexity, but in the interests of being a grown-up protocol… 😄

@mlin
Copy link
Member Author

mlin commented Nov 28, 2018

I escaped the last *, thank you for pointing it out! @magicDGS @jmarshall

Any further comments anyone?

@jeromekelleher
Copy link
Member

LGTM: squash and merge?

@jmarshall
Copy link
Member

There's whitespace on the blank line before “For the reads endpoint” that should be tidied up while squash/rebasing. All previously noted editorial trivia has been fixed — thanks.

@yfarjoun
Copy link
Contributor

yfarjoun commented Dec 4, 2018

so...this thread has grown large...and I'm not sure this is an appropriate place for the following idea....but it might be:

in GATK we use "unmapped" to specific the reads that have no sequence location (so neither they nor their mates are mapped). We can run tools only on the unmapped reads by specifying -L unmapped on the command-line.

This will break is someone has a sequence named "unmapped"....so I propose adding unmapped to the list of un-allowed sequence names, and then this API can also use unmapped to access the unmapped reads.....

@jeromekelleher
Copy link
Member

Final agreement for this PR to be merged on the call today. Squash and merge?

@mlin
Copy link
Member Author

mlin commented Jan 23, 2019

@yfarjoun the idea is relevant to this htsget-specific PR in the way you mention, but it also pertains to the whole SAM spec so I think it should be discussed at top level. The use of the * symbol here (despite the unfortunate potential for confusion in URL encoding discussed above!) is justified by its existing use for that purpose in SAM.

@mlin mlin merged commit c773172 into master Jan 23, 2019
@jmarshall
Copy link
Member

@mlin: Please delete the branch when it is no longer relevant.

@jmarshall
Copy link
Member

@mlin: Please delete this branch if it is no longer relevant.

@mlin mlin deleted the mlin-htsget-unmapped-reads branch March 11, 2019 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants