Skip to content

Commit

Permalink
split parse_genomic_range
Browse files Browse the repository at this point in the history
  • Loading branch information
mlin committed Dec 7, 2020
1 parent 1fa5d98 commit f8b8e71
Show file tree
Hide file tree
Showing 3 changed files with 62 additions and 35 deletions.
28 changes: 20 additions & 8 deletions docs/guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -1028,31 +1028,43 @@ But this plan strongly depends on the contiguity assumption.

#### Parse genomic range string

The SQL function `parse_genomic_range(txt, part)` processes a string such as "chr1:2,345-6,789" into any of its three parts (chromosome name, begin position, and end position).
These SQL functions process a string like "chr1:2,345-6,789" into its three parts (sequence/chromosome name, begin position, and end position).

=== "SQL"
``` sql
SELECT parse_genomic_range('chr1:2,345-6,789', 1) -- 'chr1'
SELECT parse_genomic_range('chr1:2,345-6,789', 2) -- 2344
SELECT parse_genomic_range('chr1:2,345-6,789', 3) -- 6789
SELECT parse_genomic_range_sequence('chr1:2,345-6,789', 1) -- 'chr1'
SELECT parse_genomic_range_begin('chr1:2,345-6,789', 2) -- 2344 (!)
SELECT parse_genomic_range_end('chr1:2,345-6,789', 3) -- 6789
```

Important: [the begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
<small>
[The begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
</small>

#### Two-bit encoding for nucleotide sequences

The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.

The encoding is case-insensitive and considers `T` and `U` equivalent.

*Encoding:*

=== "SQL"
``` sql
SELECT nucleotides_twobit('TCAG')
```

Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value (including empty), pass it through unchanged as TEXT. Given NULL, return NULL. Any other input is an error.

Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions, and by convention, the column should be named *_twobit.
Typically used to populate a BLOB column `C` with e.g.

```sql
INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))
```

This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions, and by convention, the column should be named *_twobit.

*Decoding:*

=== "SQL"
``` sql
Expand All @@ -1064,11 +1076,11 @@ Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C)

Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.

The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.
The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. <small>Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.</small>

Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to spurious nucleotide sequences. If you `SELECT twobit_dna(C) FROM some_table` on a column with mixed BLOB and TEXT values as suggested above, note that the results actually stored as TEXT preserve their case and T/U letters, unlike decoded BLOBs.

**↪ Two-bit sequence length**
*Length:*

=== "SQL"
``` sql
Expand Down
50 changes: 31 additions & 19 deletions src/genomicsqlite.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1433,7 +1433,7 @@ static void sqlfn_twobit_rna(sqlite3_context *ctx, int argc, sqlite3_value **arg
}

/**************************************************************************************************
* parse_genomic_range()
* parse_genomic_range_*()
**************************************************************************************************/

static uint64_t parse_genomic_range_pos(const string &txt, size_t ofs1, size_t ofs2) {
Expand All @@ -1444,15 +1444,15 @@ static uint64_t parse_genomic_range_pos(const string &txt, size_t ofs1, size_t o
auto c = txt[i];
if (c >= '0' && c <= '9') {
if (ans > 922337203685477579ULL) { // (2**63-10)//10
throw std::runtime_error("parse_genomic_range() position overflow in `" + txt +
throw std::runtime_error("parse_genomic_range(): position overflow in `" + txt +
"`");
}
ans *= 10;
ans += c - '0';
} else if (c == ',') {
continue;
} else {
throw std::runtime_error("parse_genomic_range() can't read `" + txt + "`");
throw std::runtime_error("parse_genomic_range(): can't read `" + txt + "`");
}
}
return ans;
Expand Down Expand Up @@ -1480,26 +1480,36 @@ static std::tuple<string, uint64_t, uint64_t> parse_genomic_range(const string &
return std::make_tuple(chrom, begin_pos - 1, end_pos);
}

static void sqlfn_parse_genomic_range(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
static void sqlfn_parse_genomic_range_sequence(sqlite3_context *ctx, int argc,
sqlite3_value **argv) {
string txt;
sqlite3_int64 which_part;
ARG_TEXT(txt, 0);
ARG(which_part, 1, SQLITE_INTEGER, int64);

try {
auto t = parse_genomic_range(txt);
auto &chrom = get<0>(t);
switch (which_part) {
case 1:
return sqlite3_result_text(ctx, chrom.c_str(), chrom.size(), SQLITE_TRANSIENT);
case 2:
return sqlite3_result_int64(ctx, get<1>(t));
case 3:
return sqlite3_result_int64(ctx, get<2>(t));
default:
throw std::runtime_error(
"parse_genomic_range(): expected part 1, 2, or 3 (parameter 2)");
}
return sqlite3_result_text(ctx, chrom.c_str(), chrom.size(), SQLITE_TRANSIENT);
} catch (std::exception &exn) {
sqlite3_result_error(ctx, exn.what(), -1);
}
}

static void sqlfn_parse_genomic_range_begin(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
string txt;
ARG_TEXT(txt, 0);
try {
auto t = parse_genomic_range(txt);
return sqlite3_result_int64(ctx, get<1>(t));
} catch (std::exception &exn) {
sqlite3_result_error(ctx, exn.what(), -1);
}
}

static void sqlfn_parse_genomic_range_end(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
string txt;
ARG_TEXT(txt, 0);
try {
auto t = parse_genomic_range(txt);
return sqlite3_result_int64(ctx, get<2>(t));
} catch (std::exception &exn) {
sqlite3_result_error(ctx, exn.what(), -1);
}
Expand Down Expand Up @@ -1553,7 +1563,9 @@ static int register_genomicsqlite_functions(sqlite3 *db, const char **pzErrMsg,
{FPNM(twobit_rna), 1, SQLITE_DETERMINISTIC},
{FPNM(twobit_rna), 2, SQLITE_DETERMINISTIC},
{FPNM(twobit_rna), 3, SQLITE_DETERMINISTIC},
{FPNM(parse_genomic_range), 2, SQLITE_DETERMINISTIC}};
{FPNM(parse_genomic_range_sequence), 1, SQLITE_DETERMINISTIC},
{FPNM(parse_genomic_range_begin), 1, SQLITE_DETERMINISTIC},
{FPNM(parse_genomic_range_end), 1, SQLITE_DETERMINISTIC}};

int rc;
for (int i = 0; i < sizeof(fntab) / sizeof(fntab[0]); ++i) {
Expand Down
19 changes: 11 additions & 8 deletions test/test_parse_genomic_range.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,13 @@

def test_parse_genomic_range():
con = genomicsqlite.connect(":memory:")
query = "SELECT parse_genomic_range(?,?)"
for (txt, chrom, begin_pos, end_pos) in [
("chr1:2,345-06,789", "chr1", 2344, 6789),
("π:1-9,223,372,036,854,775,799", "π", 0, 9223372036854775799),
]:
assert next(con.execute(query, (txt, 1)))[0] == chrom
assert next(con.execute(query, (txt, 2)))[0] == begin_pos
assert next(con.execute(query, (txt, 3)))[0] == end_pos
assert next(con.execute("SELECT parse_genomic_range_sequence(?)", (txt,)))[0] == chrom
assert next(con.execute("SELECT parse_genomic_range_begin(?)", (txt,)))[0] == begin_pos
assert next(con.execute("SELECT parse_genomic_range_end(?)", (txt,)))[0] == end_pos

for txt in [
"",
Expand All @@ -30,8 +29,12 @@ def test_parse_genomic_range():
"chr1:2345-deadbeef",
"chr1:1-9,223,372,036,854,775,800",
]:
with pytest.raises(sqlite3.OperationalError) as exc:
con.execute("SELECT parse_genomic_range_sequence(?)", (txt,))
assert "parse_genomic_range():" in str(exc.value)
with pytest.raises(sqlite3.OperationalError):
con.execute(query, (txt, 1))

with pytest.raises(sqlite3.OperationalError):
con.execute(query, ("chr1:2-3", 0))
con.execute("SELECT parse_genomic_range_begin(?)", (txt,))
assert "parse_genomic_range():" in str(exc.value)
with pytest.raises(sqlite3.OperationalError):
con.execute("SELECT parse_genomic_range_end(?)", (txt,))
assert "parse_genomic_range():" in str(exc.value)

0 comments on commit f8b8e71

Please sign in to comment.