Add ability to output ClinVar XML #365

Merged
merged 6 commits into from
Feb 24, 2023

Conversation

apriltuesday
Contributor

@apriltuesday apriltuesday commented Feb 22, 2023

Most of this diff is just splitting the single clinvar_xml_io.py into multiple files, one for each class. The actual functionality change is in clinvar_dataset.py and its test.

I also had to update the consequences for SNP and structural variants in the end-to-end test to match updates to VEP. Please double-check those as well (I can make a separate PR if you prefer).

@coveralls

Pull Request Test Coverage Report for Build 4245543983

  • 297 of 324 (91.67%) changed or added relevant lines in 7 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.02%) to 83.869%

Changes Missing Coverage                                              Covered Lines   Changed/Added Lines   %
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/xml_parsing.py        22              23                    95.65%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_record.py     64              68                    94.12%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_trait.py      45              50                    90.0%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_measure.py    124             141                   87.94%

Files with Coverage Reduction             New Missed Lines   %
eva_cttv_pipeline/trait_mapping/ols.py    1                  87.8%

Totals
Change from base Build 4026072309: 1.02%
Covered Lines: 1305
Relevant Lines: 1556

💛 - Coveralls

@apriltuesday apriltuesday marked this pull request as ready for review February 23, 2023 08:15
@apriltuesday apriltuesday self-assigned this Feb 23, 2023
Member

@tcezard tcezard left a comment

That looks good.
I understand that this lets us write the XML but not yet make additions or changes to it.
The consequence changes are strange and I will report them to Ensembl. They affect large deletions that overlap large portions of genes, so splice_polypyrimidine_tract_variant was wrong, but coding_sequence_variant is not any better. I guess they can't say transcript_ablation since it's only a partial ablation.

@tcezard
Member

tcezard commented Feb 23, 2023

I was curious, so I looked into one of the consequences in more detail:
nsv1197494 overlaps the second half of IRAK1, deleting several exons and introns.
We report it as a coding_sequence_variant, but VEP reports it as:

        "consequence_terms": [
          "coding_sequence_variant",
          "5_prime_UTR_variant",
          "intron_variant",
          "feature_truncation"
        ]

Maybe we need to revise how we're reporting consequences.

@apriltuesday
Contributor Author

apriltuesday commented Feb 24, 2023

@tcezard I've created #366 for this so we can come back to it, thanks for raising. Feel free to edit the issue with any additional details.

Collaborator

@M-casado M-casado left a comment

Overall, I took a look at the slim clinvar_dataset.py and left a quick question on it (since I don't know the internals of the classes).

Re. the consequence changes: like @tcezard, I took a look at one of the examples, ENSG00000044524 (EPHA3) (89335416_89368021del), which changes, like most, from splice_polypyrimidine_tract_variant to coding_sequence_variant. In this case I think the change makes sense, without taking VEP's internals into account: the deletion spans about 32 kb and engulfs 2 exons, so I assume that coding_sequence_variant is the most severe of the consequences VEP reports (and ranks above splice_polypyrimidine_tract_variant).
Summary of VEP's output:

{
  "transcript_consequences": [
    "coding_sequence_variant",
    "intron_variant",
    "feature_truncation"
  ]
}
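
As a rough illustration of the reasoning above, picking a single "most severe" term from a list comes down to ranking the terms against an ordered list. The sketch below is not the pipeline's logic: the helper `most_severe` and the ordering `ASSUMED_ORDER` are hypothetical, and the ordering simply mirrors the assumption stated in this comment rather than Ensembl's published severity ranking.

```python
def most_severe(consequence_terms, severity_order):
    """Return the term that appears earliest in severity_order.

    Terms missing from severity_order are ignored; returns None if no
    term is ranked at all.
    """
    rank = {term: position for position, term in enumerate(severity_order)}
    ranked_terms = [term for term in consequence_terms if term in rank]
    if not ranked_terms:
        return None
    return min(ranked_terms, key=rank.__getitem__)


# Hypothetical ordering, most severe first. It encodes only the assumption
# made in the comment above, NOT Ensembl's official ranking.
ASSUMED_ORDER = [
    "coding_sequence_variant",
    "splice_polypyrimidine_tract_variant",
    "intron_variant",
]

# Terms taken from the VEP summary above; feature_truncation is unranked
# in ASSUMED_ORDER, so it is ignored here.
vep_terms = ["coding_sequence_variant", "intron_variant", "feature_truncation"]
print(most_severe(vep_terms, ASSUMED_ORDER))
```

Which term wins depends entirely on the ordering supplied, which is why the choice of ranking matters for the discrepancies discussed in this thread.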

output_file = os.path.join(resources_dir, 'test_output.xml.gz')

input_dataset = ClinVarDataset(input_file)
input_dataset.write(output_file)
Collaborator

I assume test_output.xml.gz is the annotated version of the ClinVar XML that we would also distribute somehow in case people want to use it, right?

Collaborator

Or are we missing the annotation chunk in between?

Contributor Author

Yes that's right, there's no annotation yet and we're just outputting the raw RCV XML for each record. The annotation will be coming in a subsequent PR.
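
For readers unfamiliar with this kind of pass-through, here is a minimal sketch of writing each record's raw XML back out unchanged, using only the Python standard library. It is not the `ClinVarDataset.write` implementation from this PR: the function name `copy_records`, the `record_tag` parameter and the `Records` wrapper element are all assumptions made for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def copy_records(input_path, output_path, record_tag, root_tag="Records"):
    """Stream every <record_tag> element from a gzipped XML file into a new
    gzipped XML file, unchanged, and return how many were written.

    Hypothetical helper: the tag names are supplied by the caller and the
    wrapper element name is made up for this sketch.
    """
    written = 0
    with gzip.open(input_path, "rb") as src, \
            gzip.open(output_path, "wt", encoding="utf-8") as dst:
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        dst.write(f"<{root_tag}>\n")
        # iterparse yields each element once its closing tag has been read,
        # so the whole file never has to be held in memory at once.
        for _event, elem in ET.iterparse(src, events=("end",)):
            if elem.tag == record_tag:
                elem.tail = "\n"  # one record per line in the output
                dst.write(ET.tostring(elem, encoding="unicode"))
                written += 1
                elem.clear()  # drop the subtree that was just written
        dst.write(f"</{root_tag}>\n")
    return written
```

An annotation step, once it exists, would presumably modify each element between parsing and serialising; in this sketch that slot is simply empty.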

@apriltuesday apriltuesday merged commit 5347f8c into EBIvariation:master Feb 24, 2023
4 participants