No output with the actual genome/contigs clusters sequence #2

VadimDu · 2020-09-19T09:32:28Z

Dear developers,

Thank you for this useful tool. I followed your instruction manual and ran gclust exactly as shown there, but no output file containing the actual nucleotide sequences of the clusters is produced, only a list of the clusters with the genomes/contigs in each one.
The tool would be much better and easier to use if it produced output similar to cd-hit's, including a FASTA file of the cluster representatives.

Thank you and best regards
Vadim

sclirl · 2020-12-29T10:01:27Z

Hi!
Thank you for the great suggestion. Because writing out large input genomes would affect the I/O performance of the software, we wrote a separate shell script that can be run when this output is needed. For a specific example, please refer to Example step 3 in the README.md file.

Best regards
Ruilin

zhixue · 2021-04-12T14:05:28Z

I had the same question before.
To solve it, I wrote a script similar to Example step 3, named "gclust2fa.py".

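# Extract each cluster's representative sequence from a FASTA file,
# using the cluster list written by gclust.
# usage: python3 gclust2fa.py raw.fa gclust.out cluster.fa
# The gclust output is expected to look like this
# (representatives are the lines ending in '*'):
'''
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%

'''

import sys

# input: FASTA file that was clustered
fasta_file = sys.argv[1]
# input: gclust cluster list (gclust.out)
clust_file = sys.argv[2]
# output: FASTA file with the representative sequences
outfa = sys.argv[3]
if fasta_file == outfa:
    # refuse to overwrite the input FASTA file
    sys.exit("Error: the output file must differ from the input FASTA file")

# headers of the representative sequences, stored with the leading '>'
representative_ctgs = set()

with open(clust_file) as f:
    for line in f:
        if line.startswith('>'):
            continue  # cluster header, e.g. ">Cluster 0"
        temp = line.split()
        # representative lines end with '*'; the third field is ">name..."
        if len(temp) >= 3 and temp[-1] == '*':
            representative_ctgs.add(temp[2].rstrip('.'))

print("Representative number: " + str(len(representative_ctgs)))

outFlag = False
with open(outfa, 'w') as fout, open(fasta_file) as f:
    for line in f:
        if line.startswith('>'):
            # compare only the sequence ID (the text before the first
            # whitespace), so headers with a description still match
            outFlag = line.split()[0] in representative_ctgs
        if outFlag:
            fout.write(line)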
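
To check which sequences ended up in which cluster before extracting anything, the cluster list itself can be parsed. This is only a minimal sketch, assuming the layout of the sample listing above (a ">Cluster N" header followed by member lines, with the representative marked by a trailing '*'). The function name `parse_gclust_clusters` and the regular expression are my own and are not part of gclust.

```python
import re

# Matches member lines such as "0       5888230nt, >seq1... *"
# (layout assumed from the sample listing, not from a gclust specification).
MEMBER_RE = re.compile(r"^\d+\s+(\d+)nt,\s+>(\S+?)\.\.\.\s+(.*)$")


def parse_gclust_clusters(lines):
    """Return {cluster_name: [(seq_id, length, is_representative), ...]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # cluster header, e.g. ">Cluster 0"
            current = line[1:]
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        length, seq_id, status = match.groups()
        clusters[current].append((seq_id, int(length), status == "*"))
    return clusters


sample = """\
>Cluster 0
0       5888230nt, >seq1... *
>Cluster 1
0       4800869nt, >seq2... *
>Cluster 2
0       3906592nt, >seq3... *
1       20nt, >seq4... at -/100.00%
"""

if __name__ == "__main__":
    for name, members in parse_gclust_clusters(sample.splitlines()).items():
        representatives = [m[0] for m in members if m[2]]
        print(name, "representative:", representatives, "members:", len(members))
```

For a real run, pass the open gclust.out file handle instead of `sample.splitlines()`.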
