-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to output ClinVar XML #365
Add ability to output ClinVar XML #365
Conversation
Pull Request Test Coverage Report for Build 4245543983
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks good.
I understand that this allows to write the XML not to make addition/changes to it yet.
The consequence changes are strange and I will report them to Ensembl. They affect large deletions that overlap with large portion of genes so splice_polypyrimidine_tract_variant
was wrong but coding_sequence_variant
is not any better. I guess they can't be saying transcript_ablation
since it's only a partial ablations.
I was curious so I looked into one of the consequence in more detail:
Maybe we need to revise how we're reporting consequences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I took a look at the slim clinvar_dataset.py
, for which I left a quick question (since I don't know the insides of the classes).
Re. the consequence changes, similar to @tcezard, I took a look at one of the examples: ENSG00000044524 (EPHA3) (89335416_89368021del
) that changes, like most, from splice_polypyrimidine_tract_variant
to coding_sequence_variant
. In this case I think, without taking into account the insides of VEP, the change makes sense: the deletion spans 32Kb and engulfs 2 exons, so I assume that the most severe consequence is coding_sequence_variant
out of the ones VEP reports (and above splice_polypyrimidine_tract_variant
)
Summary of VEP's:
{
"transcript_consequences": [
"coding_sequence_variant",
"intron_variant",
"feature_truncation"
]
}
output_file = os.path.join(resources_dir, 'test_output.xml.gz') | ||
|
||
input_dataset = ClinVarDataset(input_file) | ||
input_dataset.write(output_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume test_output.xml.gz
is the annotated version of ClinVar's that we would also distribute some how in case people want to use it, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are we missing the annotation chunk in between?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right, there's no annotation yet and we're just outputting the raw RCV XML for each record. The annotation will be coming in a subsequent PR.
Most of this diff is just splitting the single
clinvar_xml_io.py
into multiple files, one for each class. The actual functionality change is inclinvar_dataset.py
and its test.I also had to update the consequences for SNP and structural variants in the end-to-end test to match updates to VEP, please double-check those as well (can make a separate PR if you prefer).