Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should we store signatures in SQLite databases? #1930

Closed
ctb opened this issue Apr 5, 2022 · 11 comments · Fixed by #1808
Closed

should we store signatures in SQLite databases? #1930

ctb opened this issue Apr 5, 2022 · 11 comments · Fixed by #1808

Comments

@ctb
Copy link
Contributor

ctb commented Apr 5, 2022

In PR #1808, I'm nearing a decision point.

This PR is an experimental PR that adds SqliteCollectionManifest and SqliteIndex, for storing manifests and signatures in a SQLite database. SqliteCollectionManifest is indepenently useful and can be wrapped in a StandaloneManifestIndex, while SqliteIndex builds on SqliteCollectionManifest.

My conundrum is that the manifest class is almost certainly worth merging, but I'm not sure about the index class. This issue is to track this discussion independently of the PR.

Questions:

Two big questions arise:

should we merge SqliteIndex?

SqliteCollectionManifest is really useful, but it's not clear to me that SqliteIndex should be merged.

arguments for SqliteIndex:

  • it works! working code is good!
  • low startup overhead (benchmark!)
  • low memory (benchmark!)
  • multithread/multiprocess for read-only db
  • another example of on-disk index
  • efficient (?) on-the-fly rescaling - benchmark!

arguments against SqliteIndex:

  • more code to support
  • code is reasonably complicated

should we support scaled=1 in SqliteIndex?

It adds a dependency, makes the code more complicated, and is not that fast. On the other hand it's kind of neat and it seems to work ok :).

@ctb
Copy link
Contributor Author

ctb commented Apr 5, 2022

some early benchmarks - a 5k random subset of gtdb

numbers

file sizes: about a factor of 8 larger.

-rw-r--r--  1 t  staff   1.1G Apr  4 06:22 gtdb-k31-5k-random.sqldb
-rw-r--r--  1 t  staff   136M Apr  4 06:17 gtdb-k31-5k-random.zip

again zip:

% time sourmash search gtdb-k31-5k-random.query.sig.gz gtdb-k31-5k-random.zip --threshold=0.1 

...

real    0m13.572s
user    0m13.729s
sys     0m0.176s

against sqldb:

% time sourmash search gtdb-k31-5k-random.query.sig.gz gtdb-k31-5k-random.sqldb --threshold=0.1 

...

real    0m3.072s
user    0m3.158s
sys     0m0.289s

against sbt.zip:

time sourmash search gtdb-k31-5k-random.query.sig.gz gtdb-k31-5k-random.sbt.zip --threshold=0.1 

...

real    0m0.723s
user    0m1.023s
sys     0m0.087s

@ctb
Copy link
Contributor Author

ctb commented Apr 5, 2022

Now that I've written all of the above out...

...a big selling point for SqliteIndex would be if it supported LCA-style analysis of individual hashes, which has been consistently popular with users. And this is actually pretty straightforward to do! 🎉

Once implemented, SqliteIndex would support low memory, fast-loading, and on-disk LCA database functionality.

@ctb
Copy link
Contributor Author

ctb commented Apr 6, 2022

preliminary implementation that actually seems to work just fine is here: #1933

🎉

@phiweger
Copy link

phiweger commented Apr 6, 2022

If you care for my 2 c, I think having an sqlite index is a great idea. Besides performance, users can query it easily using SQL for unthought-of use cases. Would it be possible/ desirable to replace .zip indices, thereby removing the need to support both?

@ctb
Copy link
Contributor Author

ctb commented Apr 6, 2022

If you care for my 2 c, I think having an sqlite index is a great idea. Besides performance, users can query it easily using SQL for unthought-of use cases.

good point in re flexibility of raw SQL queries! Although I hope that users don't try to figure out the scaled stuff themselves; I think I have a good handle on it but it's taken 5+ years :)

Would it be possible/ desirable to replace .zip indices, thereby removing the need to support both?

zip indices are parallelizable and also flexible in ways that SQLite indices are not; they're probably going to remain our default on-disk storage format. Plus they're a legacy format that we have to support anyway for a while.

In general I'm just wary of introducing big new complicated pieces of code without clear justification of some sort :). I think the LCA database version of SqliteIndex clearly meets that justification!

@ctb
Copy link
Contributor Author

ctb commented Apr 7, 2022

while this is something we should test more robustly, based on this stackoverflow post, "SQLite Concurrent Access" we should be fine for multithreaded/multiprocess reads from SQLite databases, and we could even be good for multithreaded writes if we configure things carefully, maybe using a write-ahead log.

@luizirber
Copy link
Member

I'm OK with merging SqliteIndex and SqliteCollectionManifest, but I also think we should be trying to use more Index and CollectionManifest (with should probably be an abstract class) across the codebase to avoid hardcoding functionality to specific backends.

I don't think we should remove ZipStorage just yet, because it is useful for sticking other data inside. Tho you can have binary blobs in sqlite too, hmm...

@ctb
Copy link
Contributor Author

ctb commented Apr 8, 2022

Agree on pulling CollectionManifest back to an abstract base class. I also have an opportunity to refactor LCA_Database in a similar way, since the SQLite version is pretty cool and is now equally well tested to the JSON/in-memory version (uses the same tests, with a test fixture; done in #1933). We're facing an interesting nesting/mixin design challenge for the three main SQLite classes at this point...

With Index I've tried to use the base class as much as possible, but of course changes pile up as we're working on specific features. Probably could use some holistic refactoring.

I think I also need to invest in better/more uniform tests. We have an awful lot of copy-pasta and it might be useful to build a uniform set of functional tests to apply to Index.

I'm mostly just struggling with where to cut off the SQLite PRs. If I do all the refactoring in one PR it's going to be a monster.

@ctb
Copy link
Contributor Author

ctb commented Apr 10, 2022

Current strategy: define/redefine/clarify API expectations for Index, CollectionManifest, and LCA_Database here - #1936.

Then plod through the various issues in the SQLite PRs, refactoring expectations into #1936.

Seems to be working so far 🤔

@ctb
Copy link
Contributor Author

ctb commented Apr 13, 2022

Only somewhat random additional thought: the additional tests, as well as the detailed SqliteIndex etc implementation, are really nice for people thinking about how to write their own Index etc classes for internal purposes. For example, if organizations have their own internal data storage/data management system, it should be increasingly straightforward for them to write specialized classes that make use of it.

@ctb
Copy link
Contributor Author

ctb commented Apr 17, 2022

lca summarize is really nice and fast with the sqlite database implementation! #1958

@ctb ctb closed this as completed in #1808 Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants