[feature request] dumping color file into parsable format #50

Closed
damiankao opened this issue Jun 14, 2021 · 4 comments


damiankao commented Jun 14, 2021

It would be great if we can dump the color file into a text format or something easier to parse so we can analyze the graph.

@damiankao changed the title from "[feature request]" to "[feature request] dumping color file into parsable format" on Jun 14, 2021

rhysnewell commented Nov 30, 2022

I would also appreciate this, alongside the index file if possible. Or perhaps some instruction on how to interpret the byte strings in the colours file?


cgroza commented May 12, 2023

This would be useful for creating phylogenies based on k-mer sharing between colours using metrics like Jaccard distance.


cgroza commented May 24, 2023

The right solution to this is to use the Bifrost API.
I have made an attempt here, maybe it is useful to others:
https://github.com/cgroza/bifrost_jaccard

Collaborator

GuillaumeHolley commented Sep 14, 2023

Hi everyone,

@cgroza Thank you for your implementation!

There is now another fairly simple solution to do this:

  1. Make a FASTA file containing every k-mer in the segments of the GFA file. Assuming k=31:

      zcat mygraph.gfa.gz | awk 'BEGIN {K=31} {if ($1=="S"){LEN_KM_UNITIG=length($3)-K+1; for (i=1; i<=LEN_KM_UNITIG; i+=1){print ">" $2 "_" i "\n" substr($3,i,K)}}}' > mygraph.kmers.fasta

Every record in the generated FASTA file has a name of the form >x_y, where x is the ID (in the GFA) of the unitig the k-mer comes from and y is the 1-based position of the k-mer within that unitig.
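For anyone who prefers to see the naming scheme spelled out, here is a minimal Python sketch of what the awk one-liner does to a single segment, plus a helper that splits a record name back into its parts. The function names are hypothetical, and the sketch assumes unitig IDs are taken verbatim from the second column of the GFA "S" lines.

```python
def kmers_from_segment(unitig_id, sequence, k=31):
    """Yield (record_name, kmer) pairs for one GFA segment.

    Mirrors the awk command above: one record per k-mer, named x_y.
    """
    for i in range(len(sequence) - k + 1):
        # The position in the record name is 1-based, as in the awk command.
        yield f"{unitig_id}_{i + 1}", sequence[i:i + k]


def split_record_name(name):
    """Split a record name 'x_y' (with or without '>') into (x, y)."""
    # rsplit keeps any underscore inside the unitig ID intact.
    unitig_id, position = name.lstrip(">").rsplit("_", 1)
    return unitig_id, int(position)
```

A segment shorter than k yields no records, which matches the awk loop never executing when length($3)-K+1 is below 1.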

  2. Query the colored graph with the FASTA file generated in step 1:

      Bifrost query -v -t 16 -e 1.0 -g mygraph.gfa.gz -C mygraph.color.bfg -q mygraph.kmers.fasta -o mygraph.colors

The output file will be mygraph.colors.tsv, a matrix (k-mers x colors). The cell at the intersection of a row (k-mer) and a column (color) contains a binary value indicating whether that k-mer is present (1) or absent (0) in the sample matching that color.
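Since Jaccard distance between colours came up earlier in the thread, here is a rough Python sketch of how the matrix could be turned into pairwise distances. It is not part of Bifrost. It assumes the TSV has a header row whose first column is the query name followed by one column per colour, and that every other cell is 0 or 1; please check this against your own output file. The function names are hypothetical.

```python
import csv
from itertools import combinations


def parse_color_matrix(lines):
    """Parse the TSV lines into (color_names, rows of 0/1 ints)."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    colors = header[1:]  # assumption: the first column holds the query name
    rows = [[int(value) for value in record[1:]] for record in reader if record]
    return colors, rows


def jaccard_distances(colors, rows):
    """Pairwise Jaccard distance between colours, computed over k-mer rows."""
    distances = {}
    for a, b in combinations(range(len(colors)), 2):
        shared = sum(1 for row in rows if row[a] and row[b])
        union = sum(1 for row in rows if row[a] or row[b])
        # Two colours with no k-mers at all are given distance 0.0 here.
        distances[(colors[a], colors[b])] = 1 - shared / union if union else 0.0
    return distances
```

Usage would look like `with open("mygraph.colors.tsv") as handle: colors, rows = parse_color_matrix(handle)`. This keeps the whole matrix in memory, so for large graphs it may be better to stream the rows and update the shared/union counts as you go.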

Given that Bifrost graphs are now fully indexed (a .bfi file is output alongside the .gfa) and are very fast to load into memory, this solution should run quickly.
