Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ongoing transition to DB backend (SQLite) #184

Open
wants to merge 51 commits into
base: new_data_store
Choose a base branch
from

Conversation

hendrikweisser
Copy link

Changes to Whitelist.py coming up next.

@hendrikweisser
Copy link
Author

Whitelist is updated and seems to work, but the read references are now database keys, so SampComp needs updating as well.

Copy link
Owner

@tleonardi tleonardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hendrikweisser,
thanks a lot for all these updates, great job!
I only have one change to suggest: would it be possible to move the DB logic from Whitelist to DataStore and then from Whitelist only call DataStore methods?
This way everything DB related is in DataStore, and Whitlist becomes agnostic of the storage backend. In the future if we were to change storage backend or add alternative ones we would only have to modify/rewrite the DataStore without having to touch Whitlist, SampComp etc.
What do you think?

@hendrikweisser
Copy link
Author

Hi @tleonardi!
Totally agree about moving the DB logic. In fact I've already done that, after I realised that much of the same functionality would be needed in SampComp. I just wasn't sure about whether it should go into DataStore; for the moment I've moved it to a new class DatabaseWrapper. Is DataStore intended to specifically handle storing of data? It currently creates the database if that doesn't exist and adds tables, which isn't ideal for just read access.

Do we want one class that handles general DB functionality? Or one lower-level/read access class plus DataStore for writing? (My intention was to adapt DataStore to use DatabaseWrapper, but perhaps combining them into one class would indeed be cleaner.)

I'll push my updates from last night so you can have a look.

@hendrikweisser
Copy link
Author

One more point: I've implemented the read-level filtering in Whitelist via constraints in the database query. So the database and filtering are very closely linked. I think it makes sense to keep this code in Whitelist, but it's not agnostic to the storage backend. We could move the code to the DB class, but then that class will contain quite a bit of the filtering logic. (I assume that it's more efficient to filter during the query than after it, but haven't tested that.)

@hendrikweisser
Copy link
Author

I went ahead and put all the database code in DataStore. I still need to test it and probably fix some bugs, but I wanted to avoid you looking at an outdated version.

@hendrikweisser hendrikweisser changed the title Add read-level kmer stats to database (for whitelisting) Ongoing transition to DB backend (SQLite) Mar 23, 2021
@hendrikweisser
Copy link
Author

I've made some more changes. SampComp now works with data from the SQLite DB, including the statistical tests in txCompare. It fails when SampCompDB gets involved to handle the results, but the plan was anyway to replace SampCompDB and shelve with more SQLite. That I still need to do.

nanocompore/SampComp.py Outdated Show resolved Hide resolved
@hendrikweisser
Copy link
Author

I just pushed more udpates. In principle, the whole pipeline (Eventalign_collapse -> Whitelist -> SampComp -> save_report) should work now. Combining adjacent p-values in a sequence context and multiple testing correction are supported and (lightly) tested, but a lot of the other options aren't tested and may still need updating. In the export class (now called PostProcess - should probably rename to DataExport), save_to_bed isn't updated yet. __main__.py isn't updated yet, either.

@hendrikweisser
Copy link
Author

Oh, SampCompDB.py should be obsolete now and can probably be removed.

@hendrikweisser
Copy link
Author

@tleonardi: As discussed I've updated the CLI options. Input files (e.g. "-i") must be specified as full paths, but for output files there's the option to specify a directory ("-d") and use the default filename. I've commented out the YAML input option for SampComp, but if you think it's useful just put it back in. There are some other changes (e.g. due to the simplification of tests performed by SampComp) but they should be straightforward.

I've done some light testing and it works on my example data. (Unfortunately the problem that TxComp spawns way too many threads persists for me.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants