primary_transcripts.py Not working #363

Closed
pmomadeira opened this issue Mar 24, 2020 · 7 comments

@pmomadeira

Greetings,
I'm currently trying to find orthologs/orthogroups in a group of algae species of the same genus from a set of transcriptomes. I found OrthoFinder while looking for ways to do this and it seemed to be one of the most complete tools out there, so I decided to try it out.

While following the tutorial to learn how to use it, I came across the following error when using the primary_transcript.py script:

Traceback (most recent call last):
  File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 147, in <module>
    main(args)
  File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 143, in main
    CreatePrimaryTranscriptsFile(fn, dout)
  File "/home/pedro/OrthoFinder/tools/primary_transcript.py", line 63, in CreatePrimaryTranscriptsFile
    if not line.startswith(">"): continue

I'm new to computational work and a beginner when it comes to Python in particular, so I have no clue what the problem might be.

Thanks in advance.

@davidemms
Owner

Hi Pedro

Sorry, that tool is for proteomes downloaded from Ensembl, which labels isoforms according to which gene locus they come from. For transcriptomes which aren't labelled like that you won't be able to use the primary_transcript.py script.

If you are able to just provide input files to OrthoFinder that have just one transcript per gene that would help--for example, if you assembled the transcriptomes using Trinity then it will have the information for you to do that. Otherwise, just provide your transcriptome fasta files as input to OrthoFinder.

All the best
David
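
To illustrate what "labels isoforms according to which gene locus they come from" means in practice, here is a minimal Python sketch that keeps one protein per `gene:` field from Ensembl-style FASTA headers. It is not the code in primary_transcript.py: the names `read_fasta` and `longest_per_gene` are hypothetical, the example records are made up, and picking the longest isoform is an assumption about what "primary" should mean.

```python
import re
from collections import OrderedDict

# Matches the "gene:<id>" field of an Ensembl-style header (not gene_biotype: etc.).
GENE_RE = re.compile(r"\bgene:(\S+)")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def longest_per_gene(records):
    """Keep the longest sequence per gene; records without a gene: field are skipped."""
    best = OrderedDict()
    for header, seq in records:
        match = GENE_RE.search(header)
        if match is None:
            continue  # e.g. transcriptome headers with no gene label
        gene = match.group(1)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())

# Made-up records: two isoforms of gene G1.1 and one of gene G2.1.
example = """>P1 pep gene:G1.1 transcript:T1
MKV
>P2 pep gene:G1.1 transcript:T2
MKVLLA
>P3 pep gene:G2.1 transcript:T3
MA
""".splitlines()

primary = longest_per_gene(read_fasta(example))
for header, seq in primary:
    print(">" + header)
    print(seq)
```

Headers that carry no `gene:` field are skipped entirely here, which is why this kind of approach does not transfer to transcriptome files that are not labelled by gene.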

@pmomadeira
Author

Hi David,

Thank you for the quick answer. I didn't notice that the primary_transcript.py script was meant to work with Ensembl data. I do have some assembled transcriptomes and was able to use the amino acid fasta files (.faa) in a test run, which seemed to work fine. When you say to use the transcriptome fasta files, do you mean the nucleotide fasta? It would be interesting to test it, since the assembly is the most time-consuming step.

Best regards,
Pedro

@davidemms
Owner

Hi Pedro

No, it's just amino acid sequences at the moment. We are looking into using assembled nucleotide sequences but if any such feature were added, it would be quite a way in the future.

All the best
David

@EasyPiPi

Hi David,
I am trying to use primary_transcript.py and ran into the same error.

Traceback (most recent call last):
  File "primary_transcript.py", line 147, in <module>
    main(args)
  File "primary_transcript.py", line 143, in main
    CreatePrimaryTranscriptsFile(fn, dout)
  File "primary_transcript.py", line 63, in CreatePrimaryTranscriptsFile
    if not line.startswith(">"): continue
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

My input file was downloaded from Ensembl, and it looks like this:

>ENSP00000451515.1 pep chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 transcript:ENST00000434970.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
PSY
>ENSP00000451042.1 pep chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 transcript:ENST00000415118.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
EI
>ENSP00000452494.1 pep chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 transcript:ENST00000448914.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
TGGY
>ENSP00000488240.1 pep chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 transcript:ENST00000631435.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GTGG
>ENSP00000487941.1 pep chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 transcript:ENST00000632684.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GTGG

Any suggestions will be highly appreciated!

Yixin

@EasyPiPi

For anyone who runs into the same issue:
I think I found the reason. The script fails when I run it with Python 3.6, but it works fine when I run it with Python 2.7.
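
The TypeError in the traceback above is what Python 3 raises when a `bytes` object is tested against a `str` prefix, for example when lines come from a stream read in binary mode. Python 2 does not separate the two types in this way, which is consistent with the script working under 2.7. A minimal sketch that reproduces the error and shows two ways around it; it is not taken from primary_transcript.py, and the sample record is made up:

```python
import io

# A made-up FASTA record held in a binary stream, standing in for a file
# opened in binary mode.
raw = io.BytesIO(b">ENSP0001 pep gene:G1\nPSY\n")

# Reading from a binary stream yields bytes, so a str prefix is rejected in Python 3.
first = raw.readline()
try:
    first.startswith(">")
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)

# Option 1: compare against a bytes prefix.
is_header_bytes = first.startswith(b">")
# Option 2: decode to text first, then compare against a str prefix.
is_header_text = first.decode("utf-8").startswith(">")
print(is_header_bytes, is_header_text)
```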

@davidemms
Owner

Hi

This issue has been resolved, see here: #345

You can just download the latest version of the script from the master branch on GitHub and use that. I'll create a new release in the coming days that contains all the latest changes.

All the best
David

@shrhops

shrhops commented Feb 25, 2022

Hi @davidemms, I'm having a similar issue and can't fix it with any of the solutions you've mentioned. When I try to run primary_transcript.py on my files, it looks like it's running but nothing actually happens. Running htop shows no sign that primary_transcript.py is actually running. A couple of salient points:

  1. One of my files is not from Ensembl, but it is the amino acid sequences with only one transcript per gene, downloaded from here
  2. I don't get an error or anything, it just doesn't appear to run at all. I've tried running primary_transcript.py on the file individually, and get the same result.

This is what the file looks like:

001620F.g33.t1
METXXXXXXXXXXXLKFEASEIEYVSYGGEHHLPLIMGLVDSELSEPYSIFTYRYFVYLW
PQLSFLAFHRGRCVGTVVCKMGEHRNTFRGYIAMLVVIKPYRGKGIATELVTRSIQVMME
SGCEEVTLEAEVTNKGALALYGRLGFVRAKRLFRYYLNGVDAFRLKLLFPSPLLHPSLSM
MADKDDSHWHNNDQIPIEECSEIH
004365F.g5.t1

Any advice on how I can run it?
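
One way to check the "one transcript per gene" claim in point 1 is to group record IDs by gene and look for genes that occur more than once. The sketch below rests on two assumptions: that header lines begin with `>`, and that IDs follow the `<contig>.g<N>.t<M>` pattern seen in the excerpt, with everything before `.t<M>` identifying the gene. `transcripts_per_gene` is a hypothetical helper, not part of OrthoFinder, and the sample records are made up.

```python
import re
from collections import Counter

# Assumed ID layout: ">" + gene ID + ".t" + transcript number.
ID_RE = re.compile(r"^>(\S+?)\.t(\d+)\b")

def transcripts_per_gene(lines):
    """Count how many header lines map to each gene ID."""
    counts = Counter()
    for line in lines:
        match = ID_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up records; the .t2 entry is added only to show what a duplicate looks like.
sample = [
    ">001620F.g33.t1",
    "MGLVDSELSEPYSIF",
    ">004365F.g5.t1",
    "MADKDDSHWH",
    ">004365F.g5.t2",
    "MADKDD",
]

counts = transcripts_per_gene(sample)
multi = {gene: n for gene, n in counts.items() if n > 1}
print(multi)
```

An empty result would mean every gene has a single transcript, in which case the file can be passed to OrthoFinder directly without running primary_transcript.py.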
