-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for the conversion of Structural Variants #86
Adds support for the conversion of Structural Variants #86
Conversation
c365499
to
06a61e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level syntactical review comments. Will do a logic review in the next cycle.
PR title: "Adds support for the conversion of Structural Variants" should be enough.
vcf2fhir/common.py
Outdated
|
||
|
||
def validate_alt_sv(alt): | ||
alt_pattern_sv = re.compile(r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|^\.$|" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regex is really hard to understand. It's a good practice to split regex in multiple blocks and add comments for each block.
alt_pattern_sv = re.compile(r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|^\.$|" + | |
alt_pattern_sv = re.compile( | |
r" | |
^. // // Some comment about pattern bein matched. | |
([a-zA-Z]+). (,[a-zA-Z]+) // Matches pattern .. | |
*$|^\.$|" + // Some comment about pattern bein matched. | |
.... |
vcf2fhir/converter.py
Outdated
determine whether to assign Homoplasmic or Heteroplasmic. | ||
If allelic depth (FORMAT.AD) / read depth (FORMAT.DP) is \ | ||
greater than ratio_ad_dp then allelic state is | ||
homoplasmic; else heteroplasmic. | ||
|
||
**genomic_source_class** (optional): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check with @rhdolin for documentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srgothi92 @Rohan-cod genomic_source_class: An assertion as to whether variants in the VCF file are in the germline (i.e. inherited), somatic (e.g. arose spontaneously as cancer mutations), or mixed (i.e. may be a combination of germline and/or somatic).
vcf2fhir/fhir_helper.py
Outdated
def add_variant_obv( | ||
self, | ||
record, | ||
ref_seq, | ||
ratio_ad_dp, | ||
genomic_source_class): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer more compact formatting rather than splitting each parameter on a different line. I know there is a character restriction of 79 on a line. But according to doc here there are much better ways to address this requirement. For parameter, I would have relied on python's: "line continuation if code is contained within parentheses, brackets, or braces". Instead, of splitting each parameter on a new line. Can you please make this change to all the methods added/changed in this PR. For method untouched, we can create a separate PR to fix it.
def add_variant_obv( | |
self, | |
record, | |
ref_seq, | |
ratio_ad_dp, | |
genomic_source_class): | |
def add_variant_obv(self, record, ref_seq, ratio_ad_dp, | |
genomic_source_class): | |
vcf2fhir/test/expected_HG00403A.json
Outdated
} | ||
] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line please.
vcf2fhir/test/test_unit.py
Outdated
alt = [ | ||
['<>'], | ||
['<[[]]'], | ||
['<CN1]]'], | ||
['<CN0p', '>CN12>'], | ||
['<CN0>'], | ||
['<CN0>', '<CN2>'], | ||
['<CNV>'], | ||
['<CN0>', '<CN2>'], | ||
['<INS>'], | ||
['<CN0>', '<CN2>', '<CN3>', '<CN4>'], | ||
['.'], | ||
['<.>'], | ||
['<INS>']] | ||
valid = [ | ||
False, | ||
False, | ||
False, | ||
False, | ||
True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix formatting here as well. Aim for a more compact format.
vcf2fhir/test/test_unit.py
Outdated
['P', 'S'], | ||
['P'], | ||
['p'], | ||
['.'], | ||
['<.>'], | ||
['p', 's'], | ||
[',,'], | ||
['p', 'S'], | ||
[',t'], | ||
['..']] | ||
valid = [ | ||
True, | ||
True, | ||
True, | ||
True, | ||
False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Formatting.
vcf2fhir/json_generator.py
Outdated
( | ||
( | ||
( | ||
(not record.is_sv or | ||
(record.is_sv and | ||
record.INFO['SVTYPE'] in [ | ||
'INV', | ||
'DEL', | ||
'INS' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit formatting.
( | |
( | |
( | |
(not record.is_sv or | |
(record.is_sv and | |
record.INFO['SVTYPE'] in [ | |
'INV', | |
'DEL', | |
'INS' | |
((((not record.is_sv or | |
(record.is_sv and | |
record.INFO['SVTYPE'] in ['INV', 'DEL', 'INS' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more code review comments.
vcf2fhir/converter.py
Outdated
determine whether to assign Homoplasmic or Heteroplasmic. | ||
If allelic depth (FORMAT.AD) / read depth (FORMAT.DP) is \ | ||
greater than ratio_ad_dp then allelic state is | ||
homoplasmic; else heteroplasmic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to fix the documentation issue as visible here? if yes, can you please raise a separate PR for it?
vcf2fhir/converter.py
Outdated
@@ -167,12 +170,18 @@ def __init__( | |||
if not validate_ratio_ad_dp(ratio_ad_dp): | |||
raise Exception("Please provide a valid 'ratio_ad_dp'") | |||
|
|||
if genomic_source_class not in ["germline", "somatic", "mixed"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"germline", "somatic", "mixed" three are going to be used very frequently. It will be a good idea to declare them as enum in common.py. https://docs.python.org/3/library/enum.html
vcf2fhir/json_generator.py
Outdated
if record.is_sv and len(record.samples) > 1: | ||
invalid_record_logger.debug( | ||
("Reason: Multi-sample VCFs for SVs are not allowed, " + | ||
"Record: %s, considered sample: %s"), | ||
record, | ||
record.samples[0].data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to check this case with Bob. We anyways consider 1st sample. Why the distinction for structural variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be hard to differentiate present vs. absent for some types of SVs in a multi-sample VCF
vcf2fhir/json_generator.py
Outdated
if(record.is_sv): | ||
if( | ||
record.is_sv and | ||
record.INFO['SVTYPE'] not in ['INS', 'DEL', 'DUP', 'CNV', 'INV']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a good idea to declare this list as constant. That way we can easily add or remove fields as an when required.
vcf2fhir/json_generator.py
Outdated
record, | ||
record.samples[0].data) | ||
return False | ||
if not record.is_sv and not validate_alt_simple(record.ALT): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validate_alt_simple(record.ALT)
Can you check PyVCF, if it has something that handles our validation scenario instead of us adding RegEx to do the validation? For us, PyVcf
does the vcf file parsing, so we should first check if we can leverage PyVcf for all our vcf related requirements. The best way to check is, create a vcf file with records containing all the different types of ALT fields and check how pyVcf
parses each one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Shailesh, I have updated the code to use pyvcf
for alt validation
and I have commented
out the code for validation using regular expression just in case if we need it when we test the code for structural variants. If the alt validation
using pyvcf works fine I'll remove the code for validation using regular expression.
vcf2fhir/json_generator.py
Outdated
def _add_record_variants( | ||
record, | ||
ref_seq, | ||
patientID, | ||
fhir_helper, | ||
ratio_ad_dp, | ||
genomic_source_class): | ||
if(_valid_record(record, genomic_source_class)): | ||
fhir_helper.add_variant_obv( | ||
record, | ||
ref_seq, | ||
ratio_ad_dp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: look for compact formatting.
vcf2fhir/json_generator.py
Outdated
return False | ||
if( | ||
record.is_sv and | ||
record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL'] and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL']
It's repeated multiple times, it will be a good idea to add a utility for it something like is_<what you are checking>(record)
. However, I am not able to able to understand why this condition is required. can you point me to a location in our design document form where you pick this condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, prior to to the changes in the design document, we were supposed to raise an exception if CN
is not present when SVTYPE
is in ['CNV', 'DUP', 'DEL']
but now we have to populate the copyNumber
component if possible. I added this condition to ignore the complete record and add it to the invalid record log
if the condition is true
. I just want to confirm with @rhdolin if we have to ignore just the copyNumber
component or the complete record If the condition evaluates to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just ignore the component, and still translate the record.
vcf2fhir/json_generator.py
Outdated
record, | ||
record.samples[0].data) | ||
return False | ||
if record.is_sv and not validate_alt_sv(record.ALT): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all validation methods check if we can use PyVCF.
vcf2fhir/fhir_helper.py
Outdated
if( | ||
(alleles['CODE'] != "" or alleles['ALLELE'] != "") and | ||
genomic_source_class == 'germline' and | ||
not (record.is_sv and record.INFO['SVTYPE'] not in [ | ||
'INS', | ||
'INV', | ||
'DUP', | ||
'DEL' | ||
])): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If conditions in if case become such huge it's a good practice to add a comment before it. It helps the reader to understand what if
is trying to check.
vcf2fhir/fhir_helper.py
Outdated
|
||
observation_dv_component5 = observation.ObservationComponent() | ||
observation_dv_component5.code = concept.CodeableConcept( | ||
if record.is_sv and record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the each if has record.is_sv
or not
record.is_svwhy not create a nested if and remove the repeated condition from each
if`.
if record.is_sv:
// explain contion 1. Provide some example If logic is complicated.(Repeat same thing for all conditions.)
if condition_1:
if condition_2:
else:
if condition_3:
if condition_4:
Conditional codes are complicated to understand and very error-prone. You should always try to simplify the conditions as much as possible and try to make them more readable. Unit tests are also very important to make sure all the conditions work as expected.
b55d343
to
e4d238c
Compare
2cf2fb9
to
34cd4ab
Compare
vcf2fhir/fhir_helper.py
Outdated
if record.is_sv: | ||
dna_chg = get_dna_chg(record.INFO['SVTYPE']) | ||
observation_dv_component1 = observation.ObservationComponent() | ||
observation_dv_component1.code = concept.CodeableConcept( | ||
{ | ||
"coding": [ | ||
{ | ||
"system": "http://loinc.org", | ||
"code": "48019-4", | ||
"display": "DNA Change Type" | ||
} | ||
] | ||
} | ||
) | ||
observation_dv_component1\ | ||
.valueCodeableConcept = concept.CodeableConcept( | ||
{ | ||
"coding": [ | ||
{ | ||
"system": ("http://sequenc" + | ||
"eontology.org"), | ||
"code": dna_chg['CODE'], | ||
"display": dna_chg['DISPLAY'] | ||
} | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def appendComponents:
append(getCompoentID)
append(getCompoentMeta)
append(getCompoentStatus)
if record.is_sv:
append(getGenomicSourceClass)
...
append(getSVComponet12)
else:
append(getStanardComponet2)
append(getGenomicSourceClass)
....
append(getStanardComponet7)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srgothi92, I'm not sure if this would ensure the order of the variant components.
What I did to reduce the complexity of the code was:
-
First I changed the variable names for all the variant components from
observation_dv_component{component_number}
to a name that's related to it likedna_change_type_component
. -
Second I created a list of the variant components(in order) here.
-
Third, I used that list to add components that are declared(checked using the
locals()
dictionary) to theobservation_dv.component
list.
This would also make it easier to add a new variant component in the future. We just have to update the variant components list here by adding the name of the component at the correct position and add the code to create the component in the add_variant_obv
function.
This made the logic and maintainability, a lot simpler.
I have also added a comment just above the code to append variant components.
if hasattr(record.INFO, 'CIPOS') and hasattr(record.INFO, 'CIEND'): | ||
inner_start = record.POS + record.INFO['CIPOS'][1] | ||
inner_end = record.INFO['END'] + abs(record.INFO['CIEND'][0]) | ||
outer_start = record.POS - abs(record.INFO['CIPOS'][0]) | ||
outer_end = record.INFO['END'] + record.INFO['CIEND'][1] | ||
outer_start_end_component = observation.ObservationComponent() | ||
outer_start_end_component.code = get_codeable_concept( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we doing here? It's hard to understand the logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we check, if there is a CIPOS
and a CIEND
field in record.INFO
and calculate inner and outer
start and end based on it, else we ignore the outer-start-end
component and populate the inner-start-end
component based on only the INFO.END
value. I have also added a comment just above this code.
vcf2fhir/common.py
Outdated
# def validate_alt_simple(alt): | ||
# alt_pattern_simple =\ | ||
# re.compile( | ||
# r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|" + | ||
# # Match character string and comma separated | ||
# # list of character strings | ||
# r"^\.$" | ||
# # Match '.' | ||
# ) | ||
# alt = ",".join(list(map(str, list(alt)))) | ||
# if alt_pattern_simple.match(alt): | ||
# return True | ||
# else: | ||
# return False | ||
|
||
|
||
# def validate_alt_sv(alt): | ||
# alt_pattern_sv =\ | ||
# re.compile( | ||
# r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|" + | ||
# # Match character string and comma separated | ||
# # list of character strings | ||
# r"^\.$|" + | ||
# # Match '.' | ||
# r"^(<[A-Z*\d*]+>)(,<[A-Z*\d*]+>)*$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(multiple occurrences)
Do we need commented code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @srgothi92, I have removed all the commented code from this PR. I have stored a copy of that code on my local computer if it is required anytime in the future.
34cd4ab
to
722e3fa
Compare
* Updated the code to add support for the conversion of Structural Variants. * Exposed a new parameter 'genomic_source_class' to configure Genomic Source Class.
722e3fa
to
de26367
Compare
Description
Structural Variants
.genomic_source_class
' to configure Genomic Source Class.Fixes #11
How Has This Been Tested?
The code has been tested by running all the unit tests using the command
python -m unittest
and validated the PEP 8 formatting usingpycodestyle
.