upgrade search to display more information? #2002

Open
ctb opened this issue Apr 27, 2022 · 5 comments
Labels
5.0 issues to address for a 5.0 release
Comments

@ctb
Contributor

ctb commented Apr 27, 2022

As I was thinking about the ANI stuff (#1967, #2001), I came up with an idea. 💡

right now, search outputs largely useless CSV files, with minimal information. (see #1390 and #1555 for relevant issues.) As long as sourmash supports num MinHashes in search (which will be forever, probably, per #1354), we are stuck with some command that does command-line comparison with Jaccard.
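
For readers unfamiliar with the distinction, here is a minimal sketch of how Jaccard similarity can be estimated from two num (bottom-k) MinHash sketches. It is plain Python and does not use sourmash's API; `hash_kmer`, `num_sketch`, and `jaccard_num` are hypothetical names, and the BLAKE2 hash is a stand-in assumption rather than the hash function sourmash uses.

```python
import hashlib


def hash_kmer(kmer: str) -> int:
    # Stand-in 64-bit hash for illustration only (an assumption, not sourmash's hash).
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def num_sketch(seq: str, ksize: int, num: int) -> list:
    """Bottom-`num` MinHash sketch: the `num` smallest distinct k-mer hashes."""
    kmers = {seq[i:i + ksize] for i in range(len(seq) - ksize + 1)}
    return sorted({hash_kmer(k) for k in kmers})[:num]


def jaccard_num(a: list, b: list, num: int) -> float:
    """Estimate Jaccard: fraction of the `num` smallest hashes of the union found in both."""
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because a num sketch has a fixed size regardless of genome size, it supports Jaccard estimates like this one but not reliable containment, which is why a Jaccard-only command has to stay around for num sketches.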

since search is useless, I've found myself using prefetch a lot more, because it outputs so much more information in the CSV. however, prefetch does not give good human-readable output.

so, back to search: the problem is that search is the first thing people are going to try out, because it's so ...obviously the command you want to use! 'search'! you're not going to use prefetch to do a search!

SO.

BUT.

what if we:

  1. renamed the current search to jaccard (and upgraded it with ANI output, as per "display ANI in search results?" #2001);
  2. renamed prefetch to search and upgraded its output to display ANI by default (and then kept prefetch as an alias for it);
  3. won, profited?
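
As background for the ANI output mentioned in the list above, here is a rough sketch of two common point estimates. It assumes a simple model in which each base mutates independently, so a k-mer survives intact with probability ANI^k and containment is approximately ANI^k. The Jaccard variant uses the Mash distance formula. Function names are hypothetical, and this is not necessarily how sourmash computes ANI (it omits confidence intervals and any sketch-size corrections).

```python
import math


def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate under the assumed model: containment ~= ANI ** ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def mash_style_ani(jaccard: float, ksize: int) -> float:
    """1 - Mash distance, where D = -(1/k) * ln(2J / (1 + J))."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if jaccard == 0.0:
        return 0.0  # no shared k-mers: the formula diverges, so report 0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / ksize
    return max(0.0, 1.0 - distance)
```

For example, a containment of 0.95 ** 31 at k=31 maps back to an ANI of about 0.95 under this model.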

I think we could add jaccard and do the prefetch upgrade (without the renames) as part of this next release, and then do the prefetch -> search rename as of sourmash 5.0 with a deprecation warning for search now.

this is in line with our increasingly solid belief that FracMinHash/scaled sketches are the way to go, and it also makes ANI nice and visible in prefetch, which I like (again, #2001). note that after compute is removed in #1286, you will have to work hard to build num sketches anyway, as sourmash sketch builds scaled sketches by default.
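
To illustrate why scaled sketches make containment-style output possible, here is a minimal FracMinHash sketch in plain Python. It keeps every k-mer hash below `2**64 // scaled`, so the sketch grows with the input rather than being capped at a fixed size. The names and the BLAKE2 stand-in hash are assumptions for illustration, not sourmash internals.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space


def hash_kmer(kmer: str) -> int:
    # Stand-in 64-bit hash for illustration only (an assumption, not sourmash's hash).
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def scaled_sketch(seq: str, ksize: int, scaled: int) -> set:
    """FracMinHash: keep roughly 1/scaled of all distinct k-mer hashes."""
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(seq[i:i + ksize]) for i in range(len(seq) - ksize + 1))
    return {h for h in hashes if h < threshold}


def containment(query: set, subject: set) -> float:
    """Fraction of the query sketch that is also present in the subject sketch."""
    return len(query & subject) / len(query) if query else 0.0
```

Since both sketches sample the same slice of hash space, plain set operations give Jaccard and containment estimates directly.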

@phiweger @luizirber @bluegenes @taylorreiter any thoughts, hot takes, etc?

@ctb ctb added the 5.0 issues to address for a 5.0 release label Apr 27, 2022
@bluegenes
Contributor

👍🏻 I definitely want prefetch-style output, and while it would now be pretty easy to add the columns to search output, this way would prevent us from basically having a duplicate command.

> renamed the current search to jaccard

My only issue with using jaccard is that search currently also enables abundance searches (cosine/angular similarity). I suppose we could also have cosine to do abund searches? Or use jaccard to mean either?

I also think we need to be a bit clearer about how prefetch (-->search) uses abundances. For gather, we have both abundance-weighted and flat values -- I would propose standardizing prefetch (search) output columns to names that explicitly state whether or not abundance information was used (and ideally, report both for abund comparisons).
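
As a rough sketch of the flat versus abundance-weighted distinction, the code below computes a flat Jaccard (abundances ignored) and an angular similarity (abundances used) from two hash-to-abundance mappings, then reports both under explicit column names. The angular similarity here is the common `1 - 2*acos(cos)/pi` definition, which is an assumption about what search reports. The column names `jaccard_flat` and `angular_similarity_abund` are hypothetical, not existing sourmash columns.

```python
import math


def flat_jaccard(a: dict, b: dict) -> float:
    """Jaccard on hash presence only; abundances are ignored."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between the two abundance vectors."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return min(1.0, dot / (norm_a * norm_b))  # clamp float rounding above 1


def angular_similarity(a: dict, b: dict) -> float:
    """Assumed definition: 1 - 2 * acos(cosine) / pi (valid for non-negative abundances)."""
    return 1.0 - 2.0 * math.acos(cosine_similarity(a, b)) / math.pi


def comparison_row(a: dict, b: dict) -> dict:
    # Hypothetical column names that state whether abundance was used.
    return {
        "jaccard_flat": flat_jaccard(a, b),
        "angular_similarity_abund": angular_similarity(a, b),
    }
```

Two sketches with identical hashes but very different abundances would get a flat Jaccard of 1.0 and a lower angular similarity, which is the kind of difference explicit column names would make visible.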

@ctb
Contributor Author

ctb commented Apr 27, 2022

> My only issue with using jaccard is that search currently also enables abundance searches (cosine/angular similarity). I suppose we could also have cosine to do abund searches? Or use jaccard to mean either?

I, uhh, have no idea :). I kind of like the idea of angular or something, but then we'd have a proliferation of such things. Sigh.

Hmm, do we even allow cos/angular similarity on num sketches? I'm not sure we should.

@bluegenes
Contributor

> Hmm, do we even allow cos/angular similarity on num sketches? I'm not sure we should.

as far as I can tell, we do, so I kept it enabled for search...

@ctb
Contributor Author

ctb commented Feb 5, 2024

Note new plugin mgsearch in #2970 that at least starts to get to the new information we want displayed.

@ctb
Contributor Author

ctb commented Aug 16, 2024

multisearch in the branchwater plugin does a nice job of providing the relevant information, and it's a lot faster, too!

Note that cos similarity can be accurately estimated by FracMinHash, per https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11160586/

Two specific thoughts:

  • remove cos similarity estimation from MinHash - that's just a bug
  • describe problems with current search more clearly in the documentation


3 participants