Add ability to output ClinVar XML #365

apriltuesday · 2023-02-22T16:54:01Z

Most of this diff is just splitting the single clinvar_xml_io.py into multiple files, one for each class. The actual functionality change is in clinvar_dataset.py and its test.

I also had to update the consequences for SNP and structural variants in the end-to-end test to match updates to VEP. Please double-check those as well (I can make a separate PR if you prefer).

coveralls · 2023-02-22T18:10:45Z

Pull Request Test Coverage Report for Build 4245543983

297 of 324 (91.67%) changed or added relevant lines in 7 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+1.02%) to 83.869%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/xml_parsing.py	22	23	95.65%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_record.py	64	68	94.12%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_trait.py	45	50	90.0%
eva_cttv_pipeline/clinvar_xml_io/clinvar_xml_io/clinvar_measure.py	124	141	87.94%

Files with Coverage Reduction	New Missed Lines	%
eva_cttv_pipeline/trait_mapping/ols.py	1	87.8%

Totals
Change from base Build 4026072309:	1.02%
Covered Lines:	1305
Relevant Lines:	1556

💛 - Coveralls

tcezard

That looks good.
I understand that this allows us to write the XML, but not yet to make additions or changes to it.
The consequence changes are strange and I will report them to Ensembl. They affect large deletions that overlap large portions of genes, so splice_polypyrimidine_tract_variant was wrong, but coding_sequence_variant is not any better. I guess they can't report transcript_ablation since it's only a partial ablation.

tcezard · 2023-02-23T15:51:00Z

I was curious, so I looked into one of the consequences in more detail:
nsv1197494 overlaps the second half of IRAK1, deleting several exons and introns.
We report it as a coding_sequence_variant, but VEP reports it as

        "consequence_terms": [
          "coding_sequence_variant",
          "5_prime_UTR_variant",
          "intron_variant",
          "feature_truncation"
        ]

Maybe we need to revise how we're reporting consequences.
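To make the gap concrete, here is a minimal sketch of collecting every consequence term per gene rather than a single one. It assumes a VEP-style JSON payload in which each variant carries a `transcript_consequences` list whose entries have `consequence_terms`; the payload below is hypothetical and trimmed down, with only the four terms taken from the excerpt above. The function name `terms_by_gene` is made up for illustration and is not part of the pipeline.

```python
import json

# Hypothetical, trimmed-down VEP-style payload. Only the consequence terms
# come from the comment above; the surrounding structure is an assumption.
vep_response = json.loads("""
[
  {
    "transcript_consequences": [
      {
        "gene_symbol": "IRAK1",
        "consequence_terms": [
          "coding_sequence_variant",
          "5_prime_UTR_variant",
          "intron_variant",
          "feature_truncation"
        ]
      }
    ]
  }
]
""")


def terms_by_gene(response):
    """Collect every consequence term per gene, preserving first-seen order."""
    result = {}
    for variant in response:
        for consequence in variant.get("transcript_consequences", []):
            terms = result.setdefault(consequence["gene_symbol"], [])
            for term in consequence["consequence_terms"]:
                if term not in terms:
                    terms.append(term)
    return result


all_terms = terms_by_gene(vep_response)
print(all_terms)
```

Reporting only one of these terms drops the other three, which is the information loss being discussed here.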

apriltuesday · 2023-02-24T10:11:07Z

@tcezard I've created #366 for this so we can come back to it, thanks for raising. Feel free to edit the issue with any additional details.

M-casado

Overall I took a look at the slim clinvar_dataset.py, for which I left a quick question (since I don't know the insides of the classes).

Re. the consequence changes: like @tcezard, I took a look at one of the examples, ENSG00000044524 (EPHA3) (89335416_89368021del), which changes, like most, from splice_polypyrimidine_tract_variant to coding_sequence_variant. In this case I think the change makes sense, without taking the internals of VEP into account: the deletion spans about 32 kb and engulfs 2 exons, so I assume that coding_sequence_variant is the most severe of the consequences VEP reports (and ranks above splice_polypyrimidine_tract_variant).
Summary of VEP's output:

{
  "transcript_consequences": [
    "coding_sequence_variant",
    "intron_variant",
    "feature_truncation"
  ]
}
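The "most severe consequence" reasoning above can be sketched as a lookup against a ranked list. This is only an illustration: `ASSUMED_SEVERITY` and `most_severe` are hypothetical names, and the ranking simply encodes the assumption made in this comment (coding_sequence_variant above splice_polypyrimidine_tract_variant). It is not Ensembl's published severity order, and not necessarily the ranking the pipeline uses.

```python
# Illustrative ranking only, most severe first. It mirrors the assumption in
# the comment above; it is NOT Ensembl's published order or the pipeline's.
ASSUMED_SEVERITY = [
    "coding_sequence_variant",
    "splice_polypyrimidine_tract_variant",
    "intron_variant",
]


def most_severe(terms, ranking=ASSUMED_SEVERITY):
    """Return the highest-ranked term; terms missing from the ranking sort last."""
    rank = {term: index for index, term in enumerate(ranking)}
    return min(terms, key=lambda term: rank.get(term, len(ranking)))


reported = most_severe(
    ["coding_sequence_variant", "intron_variant", "feature_truncation"]
)
print(reported)
```

Note that feature_truncation is left unranked here, so it can never win. Where it is placed in a real ranking would change the result for deletions like this one.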

M-casado · 2023-02-24T10:33:51Z

eva_cttv_pipeline/clinvar_xml_io/tests/test_clinvar_dataset.py

+    output_file = os.path.join(resources_dir, 'test_output.xml.gz')
+
+    input_dataset = ClinVarDataset(input_file)
+    input_dataset.write(output_file)


I assume test_output.xml.gz is the annotated version of the ClinVar XML that we would also distribute somehow in case people want to use it, right?

Or are we missing the annotation chunk in between?

apriltuesday · 2023-02-24

Yes, that's right: there's no annotation yet and we're just outputting the raw RCV XML for each record. The annotation will be coming in a subsequent PR.
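For readers unfamiliar with the pattern, the read-then-write round trip described in this thread can be sketched with the standard library alone. Everything here is an assumption for illustration: `TinyDataset` is a hypothetical stand-in and not the pipeline's ClinVarDataset, the sample XML is invented, and the element names (`ReleaseSet`, `ClinVarSet`) are assumed to follow the RCV layout.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

# Invented two-record sample; real ClinVar records are far larger.
SAMPLE = (
    '<ReleaseSet>'
    '<ClinVarSet ID="1"><ReferenceClinVarAssertion ID="10"/></ClinVarSet>'
    '<ClinVarSet ID="2"><ReferenceClinVarAssertion ID="20"/></ClinVarSet>'
    '</ReleaseSet>'
)


class TinyDataset:
    """Hypothetical stand-in for a dataset class; not the pipeline's ClinVarDataset."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Stream the gzipped XML and yield one element per record.
        with gzip.open(self.path, "rb") as handle:
            for _, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag == "ClinVarSet":
                    yield elem
                    # Free the record once the consumer has moved on.
                    elem.clear()

    def write(self, output_path):
        # Re-emit each record unchanged inside a fresh root element.
        with gzip.open(output_path, "wb") as out:
            out.write(b"<ReleaseSet>\n")
            for record in self:
                out.write(ET.tostring(record))
                out.write(b"\n")
            out.write(b"</ReleaseSet>\n")


with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "input.xml.gz")
    output_file = os.path.join(tmp, "test_output.xml.gz")
    with gzip.open(input_file, "wt", encoding="utf-8") as handle:
        handle.write(SAMPLE)

    TinyDataset(input_file).write(output_file)

    # The output should contain the same records in the same order.
    ids_in = [record.get("ID") for record in TinyDataset(input_file)]
    ids_out = [record.get("ID") for record in TinyDataset(output_file)]
    print(ids_in, ids_out)
```

Since no annotation step sits between reading and writing, comparing record identifiers (or the serialized records) before and after is a reasonable first check for this kind of output.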

apriltuesday added 6 commits February 21, 2023 11:48

wip

f17c97b

add ability to output ClinVarDataset as XML

cea4e85

add test and run clinvar_xml_io tests in github

ce9c391

rename file and remove unused method

0a34057

update VEP consequences in e2e test

a9f4749

update structural consequences in e2e test

4b10260

apriltuesday marked this pull request as ready for review February 23, 2023 08:15

apriltuesday requested review from M-casado and tcezard February 23, 2023 08:15

apriltuesday self-assigned this Feb 23, 2023

tcezard approved these changes Feb 23, 2023

apriltuesday mentioned this pull request Feb 24, 2023

Improve functional consequence reporting for structural variants #366

Open

M-casado approved these changes Feb 24, 2023

apriltuesday merged commit 5347f8c into EBIvariation:master Feb 24, 2023
