Update genome-preprocess #26

Open
wants to merge 3 commits into
base: master

@haxiomic haxiomic commented Mar 8, 2021

@yunhailuo
Collaborator

Thank you for the PR.

I got the following errors when processing GFF3: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M26/gencode.vM26.annotation.gff3.gz

> Files queued for conversion:
	gencode.vM26.GRCm39.annotation.gff3
> Reading gencode.vM26.GRCm39.annotation.gff3
> description: evidence-based annotation of the mouse genome (GRCm39), version M26 (Ensembl 103)
> provider: GENCODE
> contact: gencode-help@ebi.ac.uk
> format: gff3
> date: 2021-01-29
> Parsing features of sequence chr1 6%
> Completed sequence chr1
> Saved 186 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr1
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr1-macro
> Parsing features of sequence chr2 12%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr2 15%
> Completed sequence chr2
> Saved 174 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr2
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr2-macro
> Parsing features of sequence chr3 19%
> Completed sequence chr3
> Saved 153 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr3
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr3-macro
> Parsing features of sequence chr4 25%
> Completed sequence chr4
> Saved 150 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr4
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr4-macro
> Parsing features of sequence chr5 28%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr5 30%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr5 32%
> Completed sequence chr5
> Saved 145 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr5
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr5-macro
> Parsing features of sequence chr6 37%
> Completed sequence chr6
> Saved 143 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr6
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr6-macro
> Parsing features of sequence chr7 45%
> Completed sequence chr7
> Saved 139 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr7
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr7-macro
> Parsing features of sequence chr8 46%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr8 50%
> Completed sequence chr8
> Saved 124 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr8
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr8-macro
> Parsing features of sequence chr9 56%
> Completed sequence chr9
> Saved 119 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr9
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr9-macro
> Parsing features of sequence chr10 60%
> Completed sequence chr10
> Saved 125 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr10
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr10-macro
> Parsing features of sequence chr11 67%
> Completed sequence chr11
> Saved 117 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr11
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr11-macro
> Parsing features of sequence chr12 71%
> Completed sequence chr12
> Saved 115 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr12
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr12-macro
> Parsing features of sequence chr13 74%
> Completed sequence chr13
> Saved 116 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr13
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr13-macro
> Parsing features of sequence chr14 78%
> Completed sequence chr14
> Saved 120 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr14
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr14-macro
> Parsing features of sequence chr15 82%
> Completed sequence chr15
> Saved 100 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr15
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr15-macro
> Parsing features of sequence chr16 83%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr16 85%
> Completed sequence chr16
> Saved 94 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr16
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr16-macro
> Parsing features of sequence chr17 90%
> Completed sequence chr17
> Saved 91 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr17
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr17-macro
> Parsing features of sequence chr18 93%
> Completed sequence chr18
> Saved 87 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr18
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr18-macro
> Parsing features of sequence chr19 93%
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Error: Invalid attribute: "", Assignment must contain a '=' character
> Parsing features of sequence chr19 96%
> Completed sequence chr19
> Saved 59 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr19
> Saved 2 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr19-macro
> Parsing features of sequence chrX 99%
> Completed sequence chrX
> Saved 162 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrX
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrX-macro
> Parsing features of sequence chrY 100%
> Completed sequence chrY
> Saved 87 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrY
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrY-macro
> Parsing features of sequence chrM 100%
> Completed sequence chrM
> Saved 1 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrM
> Saved 1 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrM-macro
> Saved hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/manifest.json
> Warning: Unknown features: { stop_codon: 59236,
  start_codon: 63625,
  stop_codon_redefined_as_selenocysteine: 65 }

I got no errors when trying to process corresponding GTF: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M26/gencode.vM26.annotation.gtf.gz

> Files queued for conversion:
	gencode.vM26.GRCm39.annotation.gtf
> Reading gencode.vM26.GRCm39.annotation.gtf
> Description: evidence-based annotation of the mouse genome (GRCm39), version M26 (Ensembl 103)
> Provider: GENCODE
> Contact: gencode-help@ebi.ac.uk
> Date: 2021-01-29
> Parsing features of sequence chr1 6%
> Completed sequence chr1
> Saved 186 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr1
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr1-macro
> Parsing features of sequence chr2 15%
> Completed sequence chr2
> Saved 174 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr2
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr2-macro
> Parsing features of sequence chr3 19%
> Completed sequence chr3
> Saved 153 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr3
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr3-macro
> Parsing features of sequence chr4 25%
> Completed sequence chr4
> Saved 150 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr4
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr4-macro
> Parsing features of sequence chr5 32%
> Completed sequence chr5
> Saved 145 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr5
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr5-macro
> Parsing features of sequence chr6 37%
> Completed sequence chr6
> Saved 143 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr6
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr6-macro
> Parsing features of sequence chr7 45%
> Completed sequence chr7
> Saved 139 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr7
> Saved 5 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr7-macro
> Parsing features of sequence chr8 50%
> Completed sequence chr8
> Saved 124 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr8
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr8-macro
> Parsing features of sequence chr9 56%
> Completed sequence chr9
> Saved 119 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr9
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr9-macro
> Parsing features of sequence chr10 60%
> Completed sequence chr10
> Saved 125 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr10
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr10-macro
> Parsing features of sequence chr11 67%
> Completed sequence chr11
> Saved 117 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr11
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr11-macro
> Parsing features of sequence chr12 71%
> Completed sequence chr12
> Saved 115 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr12
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr12-macro
> Parsing features of sequence chr13 74%
> Completed sequence chr13
> Saved 116 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr13
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr13-macro
> Parsing features of sequence chr14 78%
> Completed sequence chr14
> Saved 120 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr14
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr14-macro
> Parsing features of sequence chr15 82%
> Completed sequence chr15
> Saved 100 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr15
> Saved 4 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr15-macro
> Parsing features of sequence chr16 85%
> Completed sequence chr16
> Saved 94 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr16
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr16-macro
> Parsing features of sequence chr17 90%
> Completed sequence chr17
> Saved 91 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr17
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr17-macro
> Parsing features of sequence chr18 93%
> Completed sequence chr18
> Saved 87 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr18
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr18-macro
> Parsing features of sequence chr19 96%
> Completed sequence chr19
> Saved 59 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr19
> Saved 2 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chr19-macro
> Parsing features of sequence chrX 99%
> Completed sequence chrX
> Saved 162 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrX
> Saved 6 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrX-macro
> Parsing features of sequence chrY 100%
> Completed sequence chrY
> Saved 87 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrY
> Saved 3 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrY-macro
> Parsing features of sequence chrM 100%
> Completed sequence chrM
> Saved 1 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrM
> Saved 1 files into hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/chrM-macro
> Saved hpgv-files/gencode.vM26.GRCm39.annotation.vgenes-dir/manifest.json
> Warning: Unknown features: { start_codon: 119850,
  stop_codon: 111336,
  UTR: 371984,
  Selenocysteine: 130 }

My questions are: 1) Is the GFF3 file corrupted, or is this a converter problem? 2) Is the GTF conversion actually fine, or are some problems hidden and slipping through, @haxiomic?

@haxiomic
Author

haxiomic commented Mar 10, 2021

Hey @yunhailuo, I did some investigation and found that a few rows in that GFF3 file have trailing semicolons. I've improved the handling of this; if you reinstall the node modules it should no longer produce errors (you will need to delete package-lock before doing so).

See VALIS-software/Genomics-Formats#1

The good news is that the error doesn't affect the output, so it was still generating the correct Valis files.
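
To illustrate the failure mode described above, here is a minimal sketch of how a trailing semicolon in the GFF3 attributes column can produce an empty attribute. The function `parseGff3Attributes`, its `lenient` flag, and the sample attribute strings are hypothetical and are not taken from Genomics-Formats; only the error wording is copied from the log above.

```typescript
// Hypothetical helper for illustration only; not the actual Genomics-Formats code.
function parseGff3Attributes(column: string, lenient: boolean): Record<string, string> {
  const attributes: Record<string, string> = {};
  const segments = column.split(";");
  for (let i = 0; i < segments.length; i++) {
    const segment = segments[i].trim();
    if (segment === "") {
      // "ID=gene1;" splits into ["ID=gene1", ""]; the empty string is the
      // attribute named in the error message.
      if (lenient) {
        continue; // tolerate trailing or doubled semicolons
      }
      throw new Error("Invalid attribute: \"\", Assignment must contain a '=' character");
    }
    const eq = segment.indexOf("=");
    if (eq < 0) {
      throw new Error("Invalid attribute: \"" + segment + "\", Assignment must contain a '=' character");
    }
    attributes[segment.slice(0, eq)] = segment.slice(eq + 1);
  }
  return attributes;
}
```

With `lenient` set to `true`, a made-up row such as `ID=gene1;gene_name=Example;` parses into two attributes; with `lenient` set to `false`, the same row throws. In this sketch the well-formed attributes are identical either way, which fits the observation that the generated files were still correct.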

@yunhailuo
Collaborator

Thank you very much for the quick fix. Tested, and there are no errors now. I'm going to move forward using GTF. Do you have any further comments on the difference between GFF3 and GTF (in terms of how they are visualized in Valis)?

@haxiomic
Author

haxiomic commented Mar 10, 2021

GTF is essentially the same format as GFF3, but it corresponds to GFF version 2, whereas GFF3 is version 3.

GFF3 adds a few extra features and can contain more information, so I'd recommend using GFF3 where you can. Both formats use the same parser in this case. The error you were getting occurred because I had only applied the looser semicolon parsing to GTF, so the fix was simply to use it for GFF3 too.
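
As a rough sketch of what a shared attribute parser with loose semicolon handling might look like: GTF attributes are written as `key "value";` with a semicolon after every pair, while GFF3 attributes are written as `key=value` with semicolons between pairs. The names `AnnotationFormat` and `splitAttributePairs` and the sample strings are assumptions made for illustration, not the real Genomics-Formats API, and the sketch ignores semicolons inside quoted values.

```typescript
// Illustrative only; names and behaviour are assumptions, not the real parser.
type AnnotationFormat = "gtf" | "gff3";

function splitAttributePairs(column: string, format: AnnotationFormat): string[][] {
  const pairs: string[][] = [];
  const segments = column.split(";");
  for (let i = 0; i < segments.length; i++) {
    const segment = segments[i].trim();
    if (segment === "") {
      continue; // loose handling: skip empty segments in both formats
    }
    // GTF separates key and value with whitespace; GFF3 uses '='.
    const sep = format === "gtf" ? segment.search(/\s/) : segment.indexOf("=");
    if (sep < 0) {
      throw new Error("Malformed attribute: " + segment);
    }
    const key = segment.slice(0, sep);
    let value = segment.slice(sep + 1).trim();
    // GTF text values are wrapped in double quotes; strip them.
    if (format === "gtf" && value.length >= 2 &&
        value.charAt(0) === "\"" && value.charAt(value.length - 1) === "\"") {
      value = value.slice(1, -1);
    }
    pairs.push([key, value]);
  }
  return pairs;
}
```

Because every GTF pair ends in a semicolon, a GTF parser has to tolerate the empty segment after the last one; applying the same tolerance to GFF3 is what makes rows with a stray trailing semicolon parse cleanly.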

@yunhailuo
Collaborator

Thanks, George. Will use GFF3 whenever possible then.

@yunhailuo yunhailuo changed the title Update genome-preprocess VALIS-40-update-genome-preprocess Mar 12, 2021
@yunhailuo yunhailuo changed the title VALIS-40-update-genome-preprocess Update-genome-preprocess Mar 13, 2021
@yunhailuo yunhailuo changed the title Update-genome-preprocess Update genome-preprocess Mar 13, 2021
@zoldello zoldello force-pushed the master branch 2 times, most recently from efdae02 to 2f50742 Compare April 28, 2021 17:16