primary_transcripts.py Not working #363

pmomadeira · 2020-03-24T16:50:39Z

Greetings,
I'm currently trying to find orthologs/orthogroups in a group of algae species of the same genus from a set of transcriptomes. I found OrthoFinder while looking for ways to do this and it seemed to be one of the most complete tools out there, so I decided to try it out.

While following the tutorial to learn how to use it, I came across the following error when using the primary_transcript.py script:

Traceback (most recent call last):
File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 147, in <module>
main(args)
File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 143, in main
CreatePrimaryTranscriptsFile(fn, dout)
File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 63, in CreatePrimaryTranscriptsFile
if not line.startswith(">"): continue

I'm new to computational work and a beginner when it comes to Python in particular, so I have no clue what the problem might be.

Thanks in advance.


davidemms · 2020-03-24T18:11:57Z

Hi Pedro

Sorry, that tool is for proteomes downloaded from Ensembl, which labels isoforms according to which gene locus they come from. For transcriptomes which aren't labelled like that you won't be able to use the primary_transcript.py script.

If you can provide OrthoFinder with input files that have just one transcript per gene, that would help--for example, if you assembled the transcriptomes using Trinity, it will have the information you need to do that. Otherwise, just provide your transcriptome fasta files as input to OrthoFinder.

All the best
David
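
A minimal sketch of the "one transcript per gene" filtering described above, assuming Trinity-style identifiers of the form TRINITY_DN<n>_c<n>_g<n>_i<n>, where everything before the trailing _i<n> names the gene. The function names (`gene_id`, `read_fasta`, `longest_per_gene`) are hypothetical and not part of OrthoFinder or Trinity. If the protein FASTA came from a downstream translation step, the identifiers may carry extra suffixes and the regular expression would need adjusting.

```python
import re

# Assumed Trinity naming: TRINITY_DN1000_c115_g5_i1 -> gene TRINITY_DN1000_c115_g5
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(seq_id):
    """Strip a trailing _i<n> isoform suffix; other IDs are returned unchanged."""
    return _ISOFORM_SUFFIX.sub("", seq_id)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Keep the longest sequence for each gene (the first one wins on ties)."""
    best = {}
    for header, seq in records:
        parts = header.split()
        gene = gene_id(parts[0] if parts else "")
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())
```

Keeping the longest isoform is only one possible choice of representative; the thread does not say which criterion is preferred for transcriptome data.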

pmomadeira · 2020-03-24T22:08:07Z

Hi David,

Thank you for the quick answer. I didn't notice that the primary_transcript.py script was meant to work with Ensembl data. I do have some assembled transcriptomes and was able to use the amino acid fasta files (.faa) in a test run, which seemed to work fine. When you say to use the transcriptome fasta files, do you mean the nucleotide fasta? It would be interesting to test it, since the assembly is the most time-consuming step.

Best regards,
Pedro

davidemms · 2020-03-25T10:50:12Z

Hi Pedro

No, it's just amino acid sequences at the moment. We are looking into using assembled nucleotide sequences but if any such feature were added, it would be quite a way in the future.

All the best
David

EasyPiPi · 2020-03-30T00:31:40Z

Hi David,
I am trying to use primary_transcript.py and ran into the same error.

Traceback (most recent call last):
File "primary_transcript.py", line 147, in <module>
main(args)
File "primary_transcript.py", line 143, in main
CreatePrimaryTranscriptsFile(fn, dout)
File "primary_transcript.py", line 63, in CreatePrimaryTranscriptsFile
if not line.startswith(">"): continue
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

My input file is downloaded from ensembl, and it looks like:

>ENSP00000451515.1 pep chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 transcript:ENST00000434970.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
PSY
>ENSP00000451042.1 pep chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 transcript:ENST00000415118.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
EI
>ENSP00000452494.1 pep chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 transcript:ENST00000448914.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
TGGY
>ENSP00000488240.1 pep chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 transcript:ENST00000631435.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GTGG
>ENSP00000487941.1 pep chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 transcript:ENST00000632684.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GTGG

Any suggestions will be highly appreciated!

Yixin

EasyPiPi · 2020-03-30T00:37:43Z

For anyone who runs into the same issue:
I think I found the reason. The script fails when I run it with Python 3.6, but it works fine when I run it with Python 2.7.
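
The TypeError in the traceback above says that `line` was a `bytes` object when `line.startswith(">")` ran. That is what Python 3 yields when iterating over a file opened in binary mode, whereas Python 2 made no such distinction, which would explain why the same script works there. A minimal sketch of the behaviour, assuming binary-mode reading (how primary_transcript.py actually opened the file is not shown in this thread, and `count_headers` is a hypothetical helper):

```python
import io


def count_headers(handle):
    """Count FASTA header lines from either a text-mode or binary-mode handle."""
    n = 0
    for line in handle:
        # Decode bytes so the comparison below is always str against str.
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if line.startswith(">"):
            n += 1
    return n


fasta = ">seq1\nPSY\n>seq2\nEI\n"

# On Python 3, testing a bytes line against a str prefix raises TypeError.
try:
    b">seq1\n".startswith(">")
    raised = False
except TypeError:
    raised = True

text_count = count_headers(io.StringIO(fasta))
binary_count = count_headers(io.BytesIO(fasta.encode("utf-8")))
```

Opening the file in text mode, or comparing against `b">"`, are the two usual ways to avoid the mismatch; which one the fixed script uses is not stated here.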

davidemms · 2020-03-30T09:26:51Z

Hi

This issue has been resolved, see here: #345

You can just download the latest version of the script from the master branch on GitHub and use that. I'll create a new release in the coming days that contains all the latest changes.

All the best
David

shrhops · 2022-02-25T09:24:05Z

Hi @davidemms, I'm having a similar issue and can't fix it with any of the solutions you've mentioned. When I try to run primary_transcript.py on my files, it looks like it's running but nothing actually happens. Running htop shows no sign that primary_transcript.py is actually running. A couple of salient points:

One of my files is not from Ensembl, but it is the amino acid sequences with only one transcript per gene, downloaded from here
I don't get an error or anything, it just doesn't appear to run at all. I've tried running primary_transcript.py on the file individually, and get the same result.

This is what the file looks like:

001620F.g33.t1
METXXXXXXXXXXXLKFEASEIEYVSYGGEHHLPLIMGLVDSELSEPYSIFTYRYFVYLW
PQLSFLAFHRGRCVGTVVCKMGEHRNTFRGYIAMLVVIKPYRGKGIATELVTRSIQVMME
SGCEEVTLEAEVTNKGALALYGRLGFVRAKRLFRYYLNGVDAFRLKLLFPSPLLHPSLSM
MADKDDSHWHNNDQIPIEECSEIH
004365F.g5.t1

Any advice on how I can run it?
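
As noted earlier in the thread, the script is meant for Ensembl proteomes, whose headers label each isoform with its gene. A quick sanity check, sketched under the assumption that such headers carry a `gene:` field as in the Ensembl example above, is to count how many header lines a file has and how many of them name a gene. The helper `summarise_headers` is hypothetical, and this is not a description of how primary_transcript.py itself inspects files.

```python
def summarise_headers(lines):
    """Return (number of FASTA headers, number of headers with a gene: field)."""
    n_headers = 0
    n_with_gene = 0
    for line in lines:
        if not line.startswith(">"):
            continue
        n_headers += 1
        # Ensembl peptide headers include a token such as gene:ENSG00000237235.2
        if any(tok.startswith("gene:") for tok in line.split()):
            n_with_gene += 1
    return n_headers, n_with_gene
```

For example, `summarise_headers(open("proteome.fa"))` on a hypothetical file path returns both counts. If no header carries a gene label and the file already holds one transcript per gene, it can presumably be given to OrthoFinder directly, in line with the earlier advice in this thread.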

davidemms closed this as completed Apr 27, 2020
