No output with the actual genome/contigs clusters sequence #2

Open
VadimDu opened this issue Sep 19, 2020 · 2 comments
Comments

VadimDu commented Sep 19, 2020

Dear developers,

Thank you for this useful tool. I followed the instruction manual, but running gclust exactly as you did produces no output file with the actual nucleotide sequences of the clusters, only a list of the clusters with the genomes/contigs in each.
The tool would be much better and easier to use if it produced output similar to cd-hit's, with a FASTA file of the cluster representatives.

Thank you and best regards
Vadim

sclirl (Contributor) commented Dec 29, 2020

Hi!
Thank you for your great suggestion. Because large input genomes affect the I/O performance of the software, we wrote a separate shell script that can be run when needed. For a specific example, please refer to Example step 3 in the README.md file.

Best regards
Ruilin

zhixue commented Apr 12, 2021

I had the same question before.
To solve it, I wrote a script similar to Example step 3, named "gclust2fa.py".

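# Get the clusters' representative sequences from a FASTA file and the gclust cluster output.
# usage: python3 gclust2fa.py raw.fa gclust.out cluster.fa
# The cluster file looks like:
'''
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%

'''

import sys

# input: the FASTA file that was clustered
fasta_file = sys.argv[1]
# input: gclust.out
clust_file = sys.argv[2]
# output: FASTA file with the representative sequences
outfa = sys.argv[3]
if fasta_file == outfa:
    # Refuse to overwrite the input instead of exiting silently
    sys.exit("Error: the output file must differ from the input FASTA file")

representative_ctgs = set()

n_clusters = 0
with open(clust_file) as f:
    for line in f:
        if line.startswith('>'):
            # Each '>Cluster N' header starts a new cluster
            n_clusters += 1
            continue
        temp = line.split()
        # Skip blank lines; the representative of a cluster is marked with '*'
        if temp and temp[-1] == '*':
            # temp[2] looks like '>seq1...'; drop the '>' and the trailing dots
            representative_ctgs.add(temp[2].lstrip('>').rstrip('.'))

print("Representative number: " + str(n_clusters))

out_flag = False
with open(outfa, 'w') as fout, open(fasta_file) as f:
    for line in f:
        if line.startswith('>'):
            # Compare only the sequence ID (first word of the header),
            # so headers that carry a description still match
            fields = line[1:].split()
            out_flag = bool(fields) and fields[0] in representative_ctgs
        if out_flag:
            fout.write(line)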
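
For anyone adapting the script, the cluster-parsing step can also be sketched as a standalone function. This is only a minimal sketch, assuming the cluster file has the cd-hit-style layout shown in the example input above; `parse_clusters` is a hypothetical helper name, not part of gclust.

```python
import io


def parse_clusters(handle):
    """Map each cluster name to a list of (seq_id, is_representative) pairs.

    Assumes the layout shown in the thread: a '>Cluster N' header followed by
    member lines such as '0       5888230nt, >seq1... *'.
    """
    clusters = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            current = line[1:]
            clusters[current] = []
            continue
        fields = line.split()
        if current is None or len(fields) < 3:
            continue  # not a member line we recognise
        # The third field looks like '>seq1...'; strip '>' and trailing dots
        seq_id = fields[2].lstrip('>').rstrip('.')
        clusters[current].append((seq_id, fields[-1] == '*'))
    return clusters


example = """>Cluster 0
0       5888230nt, >seq1... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%
"""

clusters = parse_clusters(io.StringIO(example))
for name, members in clusters.items():
    representatives = [seq_id for seq_id, is_rep in members if is_rep]
    print(name, representatives, len(members))
```

Keeping the non-representative members in the result makes it easy to also report cluster sizes or list the sequences that were collapsed into each representative.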
