Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

duplication of signatures seen in large SBT databases #1171

Closed
ctb opened this issue Aug 14, 2020 · 24 comments · Fixed by #1568
Closed

duplication of signatures seen in large SBT databases #1171

ctb opened this issue Aug 14, 2020 · 24 comments · Fixed by #1568

Comments

@ctb
Copy link
Contributor

ctb commented Aug 14, 2020

start with #849 (comment) and go from there :)

@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

note, all that --from-file does is append the list of filenames in from_file to the inp_files list used to construct the SBT.

I do note that there is no deduplication of filenames... we could maybe use a set of names or md5sums to track and remove duplicates.

@nmb85
Copy link

nmb85 commented Aug 14, 2020

Good point. Just to make sure that there weren't duplicated filenames in the text file with which I observed the original behavior, I ran sort | uniq -c | sort -nr and verified that there were no duplicated filenames in the text file fed to --from-file

@luizirber
Copy link
Member

I added rules to work with the assembly_summary.txt that caused the problem, but it ended up generating no duplicates.

Next: I will short-circuit the --from-file logic in sourmash index, and see if duplicates are being inserted in the index...

@nmb85
Copy link

nmb85 commented Aug 15, 2020

@luizirber It's fascinating watching you troubleshoot in the snakefile. I am learning lots about the Python API from working through what you've laid out; thanks for the awesome side-effect! I have one quick question (not a big deal at all, please take your time or ignore completely): how does the cli sourmash index command set the default number of internal nodes and what does changing the number of internal nodes effectively do to search, gather, and compare commands? If it's easiest to just point to a reference on SBTs that you recommend, then that's fine, too! Muito obrigado!

@luizirber
Copy link
Member

@luizirber It's fascinating watching you troubleshoot in the snakefile. I am learning lots about the Python API from working through what you've laid out; thanks for the awesome side-effect!

I think this sort of "public debugging" fits at least a bit in each of research/teaching/service/scicomm, so seems like a good use of time to write it down =]

I have one quick question (not a big deal at all, please take your time or ignore completely): how does the cli sourmash index command set the default number of internal nodes and what does changing the number of internal nodes effectively do to search, gather, and compare commands? If it's easiest to just point to a reference on SBTs that you recommend, then that's fine, too! Muito obrigado!

you mean the -d flag? When we started exploring SBTs I did this to check how it would behave with a layout closer to a B-Tree than a binary tree, so I started testing with d=5 and d=10. I think the best place to point is this notebook, although it is quite old (it was the exploration that led to the SBT impl in sourmash) it still has many valid points.

Main issue with d != 2 is "correcting" the search: the internal nodes end up saturating faster, and search/gather end up being slower (because they don't truncate the search so early). It might be interesting to explore d=5 or d=10 again as the SBTs get larger, because a lot of the new genbank-bacteria size ends up being too many internal nodes...

@luizirber
Copy link
Member

Tried using the subset of signatures that are duplicated in my SBT in luizirber/2020-08-14-debug-sbt@d7f0147, but that didn't generate duplicated signatures in the new SBT. So, doesn't seem to be an issue with the signatures...

@nmb85
Copy link

nmb85 commented Aug 15, 2020

Zounds! I didn't expect to see you posting on Saturday!

I think this sort of "public debugging" fits at least a bit in each of research/teaching/service/scicomm, so seems like a good use of time to write it down =]

Thank you 😄

you mean the -d flag?

Yes, it's a mystery parameter to me, and I was wondering if it is something that I need to fiddle with, but it sounds like I should leave it alone for now. In the meantime, I'll check out your notebook link: thank you!

I'm going to just try to follow what you're doing here if you don't mind, cheers!

P.S. - Are you a fan of maté 🧉?

@nmb85
Copy link

nmb85 commented Aug 31, 2020

After running off to learn snakemake and take care of some pressing tasks, I tried building an sbt using the Python API:

import glob, sourmash
from sourmash.sbtmh import SigLeaf

input_filenames = glob.glob("[0-9]*/[0-9]*/[0-9]*/GCA*/*_genomic.sig")
tree = sourmash.create_sbt_index()

for filename in input_filenames:
    sig = sourmash.load_one_signature(filename, ksize = 31)
    leaf = SigLeaf(sig.md5sum(), sig)
    tree.add_node(leaf)

#save the sbt to file as usual
output_filename = tree.save("./updated_python_api_gca_genbank.sbt.zip")

#circumvent the "tree.save" function and count the number of unique/duplicated signatures in the "tree" sbt in memory
with open("descriptions", "w") as desc_outfile:
    for sig in tree.signatures():
        desc_outfile.write(str(sig) + "\n")

When I count the number of signatures in the "tree" object before saving the sbt, there aren't any duplicated signatures, but when I use the "tree.save" function to write the sbt to a file, then use sourmash sig describe on that file, I can see the exact same duplication as I did in the original issue. Is the "save" function that is writing the signatures to disk duplicating some of the signatures?

@luizirber
Copy link
Member

When I count the number of signatures in the "tree" object before saving the sbt, there aren't any duplicated signatures, but when I use the "tree.save" function to write the sbt to a file, then use sourmash sig describe on that file, I can see the exact same duplication as I did in the original issue. Is the "save" function that is writing the signatures to disk duplicating some of the signatures?

Oh, great finding! I'm scratching my head to think what is happening on the .save() code, but I will take a closer look (next week). Finding a reduced subset that triggers the problem would be a great unit test, but I'm starting to think it only triggers with a very large number of sigs...

P.S. - Are you a fan of maté 🧉?

Yup! It's just hard to find good Erva Mate around here, but I still have some that I brought from Brazil =]

@nmb85
Copy link

nmb85 commented Sep 3, 2020

Oh, great finding! I'm scratching my head to think what is happening on the .save() code, but I will take a closer look (next week).

Okay, thanks for your help!

Yup! It's just hard to find good Erva Mate around here, but I still have some that I brought from Brazil =]

Is it possible to order any good erva mate (why the "erva" transliteration instead of "yerba"?) online?

Finding a reduced subset that triggers the problem would be a great unit test

I will script this up by subsetting the input file list in Python using a binary approach (cut number of input signature files in half, test, then up or down by half, etc) and try running tree.save() to give you an estimate about the effect of number of signatures by next week.

@nmb85
Copy link

nmb85 commented Sep 5, 2020

** See next comment below for a reduced unit test with just 46 signatures that give the duplication behavior in the sbt index **

Hi, @luizirber, I found that the number of signatures alone might not be the trigger for the duplication behavior. I've uploaded 1,000 signatures and their paired fasta files into a folder on my Google Drive. Eleven of these signatures are duplicated in the sbt index, two are triplicated, one is replicated 5 times, and one is replicated 6 times. You can download them from here. They are about 2.6 Gb total. Do you mind letting me know if you get duplication when building an sbt index from this set of signatures? Thank you!

@nmb85
Copy link

nmb85 commented Sep 5, 2020

Here is one more observation about this duplication behavior that might be helpful. I scripted up a recursive function to use a binary search-like approach to recursively increase or decrease the size of the signature input list ultimately given to tree.add_node(), then saved with tree.save(), then reloaded with sourmash.sbtmh.load_sbt_index() to check for duplication. If duplication was observed, the function decreased the size of the input signature list by 50%. If no duplication was observed, then the function increased the size of the input signature list by 50%. I ran the function for 10 iterations to start, which takes 30 seconds on the unit test set of 1,000 signatures. On the 6th iteration, the dataset could be reduced to 31 files that would not produce duplicates in the sbt index. On the 7th iteration, the dataset was increased to 46 files that duplicated one signature in the saved sbt index.

Code for recursive function:

#load modules
import glob, sourmash, time

# get a list of signature files
input_filenames = glog.glob(unix_regex_for_unit_test_signatures)

#define recursive binary search function
def binary_search(input_filenames, length_input, max_iterations = 10, size = 1, iterations = 0, lengths = list()):
    
    # stop recursion at maximum number of iterations, default max_iterations = 10
    if iterations == max_iterations:
        return(lengths)
    
    #calculate length of signature list according to last list size scaling factor (either +50% or -50% depending on duplication)
    length_list = int(size*length_input)
    
    #save the list length for this iteration, then return at the end of recursion
    lengths.append(length_list)
    
    start_time = time.time()
    print("Iteration number " + str(iterations + 1) + ". Testing " + str(length_list) + " files.")
    
    # adjust the original signature file list according to the size variable (which was increased/decreased 50%) 
    cut_to_size = input_filenames[:length_list]
    
    
    tree_build_time = str(int(time.time() - start_time))
    print("Building tree. Seconds elapsed: " + tree_build_time)
    
    # make the sequence bloom tree
    tree = sourmash.create_sbt_index()
    
    # add the signatures to the tree
    for filename in cut_to_size:
        sig = sourmash.load_one_signature(filename, ksize = 31)
        leaf = SigLeaf(sig.md5sum(), sig)
        tree.add_node(leaf)

    out_filename = "./updated_python_api_unit_test_" + str(iterations) + ".sbt.zip"

    tree_save_time = str(int(time.time() - start_time))
    print("Saving tree. Seconds elapsed: " + tree_save_time)
    
    # save the tree to file
    output_filename = tree.save(out_filename)

    # reload the tree from file to check for duplication
    tree_reload_time = str(int(time.time() - start_time))
    print("Saving tree. Seconds elapsed: " + tree_reload_time)
    
    reloaded_tree = sourmash.sbtmh.load_sbt_index(out_filename)

    tree_check_time = str(int(time.time() - start_time))
    print("Checking reloaded tree for duplicates. Seconds elapsed: " + tree_check_time)
    
    # check the reloaded tree for duplication using standard dictionary count method
    sig_counts = {}
    duplication = 0 # This is a flag to indicate if any duplication has occurred

    for sig in reloaded_tree.leaves():
        sig_str = sig.data.name()
        if sig_str in sig_counts.keys():
            sig_counts[sig_str] = sig_counts[sig_str] + 1
            print("DUPLICATED: " + sig_str)
            duplication = 1 # Flip the flag because duplication is observed
        else:
            sig_counts[sig_str] = 1
    
    end_time = str(int(time.time() - start_time))
    print("Done with iteration number " + str(iterations + 1) + ". Seconds elapsed: " + end_time)
    
    # check the duplication flag and if there has been duplication decrease the size of the input signature list by 50%
    if duplication == 1:
        size = size / 2 # size is set to 50%
        iterations = iterations + 1
        binary_search(input_filenames = input_filenames, length_input = length_input, \
                                max_iterations = max_iterations, size = size, iterations = iterations, lengths = lengths)
    
    # if there has not been duplication, then increase the size of the input signature list by 50%
    else:
        size = size + size / 2 # size is set to 150%
        iterations = iterations + 1
        binary_search(input_filenames = input_filenames, length_input = length_input, \
                                max_iterations = max_iterations, size = size, iterations = iterations, lengths = lengths)

#run the function
length_input = len(input_filenames)
lengths = binary_search(input_filenames, length_input, max_iterations = 10, size = 1)

So there is a smallest unit test collection of 46 signatures that give duplication of a single signature, named /data/ncbi/genomes/all/GCA/013/183/375/GCA_013183375.1_ASM1318337v1/GCA_013183375.1_ASM1318337v1_genomic.fna.gz, that you can download here. The size of that dataset is just 129Mb. I'm afraid that at this point, I'll have to leave the debugging to you, @luizirber, when you can find a moment in your busy schedule. Thanks for any help you can offer!

@taylorreiter
Copy link
Contributor

This caused some problems over in charcoal: dib-lab/charcoal#175. Just to record the info here,

Using

sourmash sig describe --csv ~/gtdb-rs202.genomic.k31.sbt.zip.sig_describe.csv gtdb-rs202.genomic.k31.sbt.zip

There are 258,406 signtuares in the sbt of which 250,886 are distinct based on identical rows in sourmash sig describe output.

@ctb
Copy link
Contributor Author

ctb commented Jun 3, 2021

So there is a smallest unit test collection of 46 signatures that give duplication of a single signature, named /data/ncbi/genomes/all/GCA/013/183/375/GCA_013183375.1_ASM1318337v1/GCA_013183375.1_ASM1318337v1_genomic.fna.gz, that you can download here. The size of that dataset is just 129Mb. I'm afraid that at this point, I'll have to leave the debugging to you, @luizirber, when you can find a moment in your busy schedule. Thanks for any help you can offer!

thanks! I note that for k=31 the duplicate md5 is ea8e0babc70f61011cbc15b453bb61ce, which is in GCA_013183235.1_ASM1318323v1_genomic.sig and GCA_013183375.1_ASM1318337v1_genomic.sig both, so I'm not sure this is a true duplicate?

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

catalog generation

I started with gtdb-rs202.genomic.k31.zip and gtdb-rs202.genomic.k31.sbt.zip, which contain 258,406 signatures from the GTDB genomic collection. The .sbt.zip was constructed via sourmash index.

I then created a new .sbt.zip by running:

sourmash index out.zip gtdb-rs202.genomic.k31.zip

and then used sourmash sig describe to create .catalog.txt files, which I then turned into .sorted files with:

for i in *.txt; do grep ^signature: $i | sort > $i.sorted; done

This yielded three .sorted files with the same number of signature names. Yay so far!

  258406  1723081 20771891 gtdb-rs202.genomic.k31.sbt.zip.catalog.txt.sorted
  258406  1723238 20772573 gtdb-rs202.genomic.k31.zip.catalog.txt.sorted
  258406  1723081 20771891 out.zip.sbt.zip.catalog.txt.sorted

Also, the .sbt.zip indexed files matched in content --

diff out.zip.sbt.zip.catalog.txt.sorted gtdb-rs202.genomic.k31.sbt.zip.catalog.txt.sorted

so I will ignore the out.zip file.

differences

Unfortunately, there are differences between the .zip and the .sbt.zip file: they contain 7520 differences in signature names.

diff gtdb-rs202.genomic.k31.zip.catalog.txt.sorted \
    gtdb-rs202.genomic.k31.sbt.zip.catalog.txt.sorted \
    | grep ^'>' | wc -l

yielded 7520.

When I then counted the duplicates like so,

uniq -c gtdb-rs202.genomic.k31.sbt.zip.catalog.txt.sorted | \
    grep -v ' 1 signature' > \
    gtdb-rs202.genomic.k31.sbt.zip.catalog.txt.sorted.counts

I discovered that there were exactly 7520 counts of duplicated signatures.

So what appears to be happening in the SBT code is that signatures are being replaced by duplicates.

This also explains why some GTDB identifiers are not being found in the file - see #1511 (comment), where I used the .sbt.zip to construct an LCA database.

tl;dr

This puts a very different complexion on things - we're not just adding duplicates, we're actually replacing signatures with duplicates. It also starts to narrow down places in the SBT code where it could be happening...

@ctb ctb mentioned this issue Jun 4, 2021
2 tasks
@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

OK, new theory 😓 . On a quick scan, it looks like all of the duplicates have the same md5sum! So, for example:

  • there is a signature GCA_009816935.1 Francisella tularensis strain=06-2412, ASM981693v1 that shows up 157 times in the indexed SBT
  • it has the md5sum fb2c4c8861753dbc497d72d0e465465a
  • there are 157 entries with md5sum fb2c4c8861753dbc497d72d0e465465a in the GDTB zipfile collection

So it looks like this is may only be losing information about duplicate md5sum signatures.

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

ok, this is interesting. In the indexed SBT zip file, there are many different versions of the same saved sig file;

% unzip -v gtdb-rs202.genomic.k31.sbt.zip | grep fb2c4c8861753dbc497d72d0e465465a

produces

   15130  Stored    15130   0% 2021-05-17 19:58 5fb73760  .sbt.gtdb-rs202.genomic.k31/fb2c4c8861753dbc497d72d0e465465a_151
   15128  Stored    15128   0% 2021-05-17 19:58 4e145220  .sbt.gtdb-rs202.genomic.k31/fb2c4c8861753dbc497d72d0e465465a_152
   15128  Stored    15128   0% 2021-05-17 19:58 775b041c  .sbt.gtdb-rs202.genomic.k31/fb2c4c8861753dbc497d72d0e465465a_153
   15128  Stored    15128   0% 2021-05-17 19:58 c05ac4f5  .sbt.gtdb-rs202.genomic.k31/fb2c4c8861753dbc497d72d0e465465a_154
   15127  Stored    15127   0% 2021-05-17 19:58 3a8ff97a  .sbt.gtdb-rs202.genomic.k31/fb2c4c8861753dbc497d72d0e465465a_155

but in the .sbt.json directory file, these aren't referenced; so the command

% jq . < gtdb-rs202.genomic.k31.sbt.json | grep -B 1 -A 1 fb2c4c8861753dbc497d72d0e465465a

produces

    "462033": {
      "filename": "fb2c4c8861753dbc497d72d0e465465a",
      "name": "fb2c4c8861753dbc497d72d0e465465a",
      "metadata": "fb2c4c8861753dbc497d72d0e465465a"
    },
--
    "462061": {
      "filename": "fb2c4c8861753dbc497d72d0e465465a",
      "name": "fb2c4c8861753dbc497d72d0e465465a",
      "metadata": "fb2c4c8861753dbc497d72d0e465465a"
    },
--
    "462081": {
      "filename": "fb2c4c8861753dbc497d72d0e465465a",
      "name": "fb2c4c8861753dbc497d72d0e465465a",
      "metadata": "fb2c4c8861753dbc497d72d0e465465a"
    },

So it looks very much like there is some miscommunication between the storage and the SBT itself.

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

...aaaaaaand this is now straightforward to reproduce 😓

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

ok I think I figured it out over in #1568 - the JSON filename wasn't being updated with the correct actually-saved filename for duplicate md5 signatures.

Will rework that into a proper PR with tests and everything.

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

ok, I rebuilt the GTDB SBT index with the fix in #1568, and the catalog of the SBT index now matches that of the input zipfile collection. 🎉

I provisionally declare this bug slain!

@ctb ctb closed this as completed in #1568 Jun 7, 2021
@moorembioinfo
Copy link

Hi,

I believe this is the same problem. Using sourmash signature intersect and then comparing the signatures with sourmash compare *sig outputs, in the matrix, shortened md5 values instead of the file names. As such there are far fewer comparisons than (n)sig*(n)sig and I would have to map back to the signatures to work out which genomes have which distances!

Is there a way to force the usual compare behaviour with these signatures?

Thanks in advance for any help with this!!!!

@ctb
Copy link
Contributor Author

ctb commented Jan 15, 2022

hi @moorembioinfo, could you start a new issue, please? This particular problem was (we hope) fixed a while back and I'm guessing you're finding something new!

I'd copy your question over to a new one myself, but I thought I'd take the opportunity to ask for more info -

  • what version of sourmash are you using?
  • could you paste in the command lines you are using?
  • if you can replicate it with just a few signatures, could you also send us the output that you're seeing? (no worries if it's too big)

It sounds like when you run sourmash sig intersect, the signature names are being removed, so sourmash is displaying only the unique identifier. (I checked the code, sourmash sig intersect does indeed clear the name!) So that's one problem for us to fix.

  • Do you have too many sigs to run sourmash sig rename on the output of intersect?
  • Do you have thoughts on what a good default name would be for the output of intersect? It might make sense for the output signature to keep the name of the first signature you provide to intersect, or maybe the first sigs name with some kind of modification 🤔 . Any ideas?

Then there's another problem, which is that you're not getting a comparison of all of your intersected sigs. That should be a different issue from the signature naming problem - the names are only used for display, nothing else. I'm not actually sure how to debug that, so any more info you can give on what you're trying to do would be helpful here!

@moorembioinfo
Copy link

Hi @ctb, thanks for getting back to me! I've looked into this and haven't been able to reproduce the sourmash compare output issue. So it seems there's no bug there! Re-running on real and toy datasets the intersect signatures are named after a part of the md5sum (and have the same names) but the comparison matrix is definitely (n)sig*(n)sig. It must've been an issue of mine downstream.

As for the naming in the first place, sourmash rename worked perfectly. Perhaps this could be combined with the required output (-o) flag of sourmash intersect? Or a second optional flag to initiate this behaviour?

All the best!

Matt

@ctb
Copy link
Contributor Author

ctb commented Jan 18, 2022

excellent - very glad to hear it, if the bug crops up again don't worry too hard about replicating it before posting it here, it's always valuable to know where there are wibbly UX problems cropping up!

rename stuff punted to issue here, #1801

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants