Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for the conversion of Structural Variants #86

Merged
merged 2 commits into from
Jul 28, 2021

Conversation

Rohan-cod
Copy link
Contributor

Description

  • Updated the code to add support for the conversion of Structural Variants.
  • Exposed a new parameter 'genomic_source_class' to configure Genomic Source Class.

Fixes #11

How Has This Been Tested?

The code has been tested by running all the unit tests using the command python -m unittest and validated the PEP 8 formatting using pycodestyle.

  • Unit Tests
  • PEP 8 Formatting Validation

@Rohan-cod Rohan-cod force-pushed the structural-variants branch 2 times, most recently from c365499 to 06a61e3 Compare July 9, 2021 10:06
Copy link
Contributor

@srgothi92 srgothi92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level syntactical review comments. Will do a logic review in the next cycle.
PR title: "Adds support for the conversion of Structural Variants" should be enough.



def validate_alt_sv(alt):
alt_pattern_sv = re.compile(r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|^\.$|" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex is really hard to understand. It's a good practice to split regex in multiple blocks and add comments for each block.

Suggested change
alt_pattern_sv = re.compile(r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|^\.$|" +
alt_pattern_sv = re.compile(
r"
^. // // Some comment about pattern bein matched.
([a-zA-Z]+). (,[a-zA-Z]+) // Matches pattern ..
*$|^\.$|" + // Some comment about pattern bein matched.
....

Comment on lines 80 to 85
determine whether to assign Homoplasmic or Heteroplasmic.
If allelic depth (FORMAT.AD) / read depth (FORMAT.DP) is \
greater than ratio_ad_dp then allelic state is
homoplasmic; else heteroplasmic.

**genomic_source_class** (optional):
Copy link
Contributor

@srgothi92 srgothi92 Jul 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check with @rhdolin for documentation here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srgothi92 @Rohan-cod genomic_source_class: An assertion as to whether variants in the VCF file are in the germline (i.e. inherited), somatic (e.g. arose spontaneously as cancer mutations), or mixed (i.e. may be a combination of germline and/or somatic).

Comment on lines 246 to 251
def add_variant_obv(
self,
record,
ref_seq,
ratio_ad_dp,
genomic_source_class):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer more compact formatting rather than splitting each parameter on a different line. I know there is a character restriction of 79 on a line. But according to doc here there are much better ways to address this requirement. For parameter, I would have relied on python's: "line continuation if code is contained within parentheses, brackets, or braces". Instead, of splitting each parameter on a new line. Can you please make this change to all the methods added/changed in this PR. For method untouched, we can create a separate PR to fix it.

Suggested change
def add_variant_obv(
self,
record,
ref_seq,
ratio_ad_dp,
genomic_source_class):
def add_variant_obv(self, record, ref_seq, ratio_ad_dp,
genomic_source_class):

}
]
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line please.

Comment on lines 133 to 152
alt = [
['<>'],
['<[[]]'],
['<CN1]]'],
['<CN0p', '>CN12>'],
['<CN0>'],
['<CN0>', '<CN2>'],
['<CNV>'],
['<CN0>', '<CN2>'],
['<INS>'],
['<CN0>', '<CN2>', '<CN3>', '<CN4>'],
['.'],
['<.>'],
['<INS>']]
valid = [
False,
False,
False,
False,
True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix formatting here as well. Aim for a more compact format.

Comment on lines 105 to 120
['P', 'S'],
['P'],
['p'],
['.'],
['<.>'],
['p', 's'],
[',,'],
['p', 'S'],
[',t'],
['..']]
valid = [
True,
True,
True,
True,
False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Formatting.

Comment on lines 54 to 62
(
(
(
(not record.is_sv or
(record.is_sv and
record.INFO['SVTYPE'] in [
'INV',
'DEL',
'INS'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit formatting.

Suggested change
(
(
(
(not record.is_sv or
(record.is_sv and
record.INFO['SVTYPE'] in [
'INV',
'DEL',
'INS'
((((not record.is_sv or
(record.is_sv and
record.INFO['SVTYPE'] in ['INV', 'DEL', 'INS'

Copy link
Contributor

@srgothi92 srgothi92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more code review comments.

Comment on lines 80 to 83
determine whether to assign Homoplasmic or Heteroplasmic.
If allelic depth (FORMAT.AD) / read depth (FORMAT.DP) is \
greater than ratio_ad_dp then allelic state is
homoplasmic; else heteroplasmic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you trying to fix the documentation issue as visible here? if yes, can you please raise a separate PR for it?

@@ -167,12 +170,18 @@ def __init__(
if not validate_ratio_ad_dp(ratio_ad_dp):
raise Exception("Please provide a valid 'ratio_ad_dp'")

if genomic_source_class not in ["germline", "somatic", "mixed"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"germline", "somatic", "mixed" three are going to be used very frequently. It will be a good idea to declare them as enum in common.py. https://docs.python.org/3/library/enum.html

Comment on lines 11 to 16
if record.is_sv and len(record.samples) > 1:
invalid_record_logger.debug(
("Reason: Multi-sample VCFs for SVs are not allowed, " +
"Record: %s, considered sample: %s"),
record,
record.samples[0].data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to check this case with Bob. We anyways consider 1st sample. Why the distinction for structural variants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be hard to differentiate present vs. absent for some types of SVs in a multi-sample VCF

if(record.is_sv):
if(
record.is_sv and
record.INFO['SVTYPE'] not in ['INS', 'DEL', 'DUP', 'CNV', 'INV']):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good idea to declare this list as constant. That way we can easily add or remove fields as an when required.

record,
record.samples[0].data)
return False
if not record.is_sv and not validate_alt_simple(record.ALT):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_alt_simple(record.ALT)

Can you check PyVCF, if it has something that handles our validation scenario instead of us adding RegEx to do the validation? For us, PyVcf does the vcf file parsing, so we should first check if we can leverage PyVcf for all our vcf related requirements. The best way to check is, create a vcf file with records containing all the different types of ALT fields and check how pyVcf parses each one of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Shailesh, I have updated the code to use pyvcf for alt validation and I have commented out the code for validation using regular expression just in case if we need it when we test the code for structural variants. If the alt validation using pyvcf works fine I'll remove the code for validation using regular expression.

Comment on lines 124 to 143
def _add_record_variants(
record,
ref_seq,
patientID,
fhir_helper,
ratio_ad_dp,
genomic_source_class):
if(_valid_record(record, genomic_source_class)):
fhir_helper.add_variant_obv(
record,
ref_seq,
ratio_ad_dp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: look for compact formatting.

return False
if(
record.is_sv and
record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL'] and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL']

It's repeated multiple times, it will be a good idea to add a utility for it something like is_<what you are checking>(record) . However, I am not able to able to understand why this condition is required. can you point me to a location in our design document form where you pick this condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, prior to to the changes in the design document, we were supposed to raise an exception if CN is not present when SVTYPE is in ['CNV', 'DUP', 'DEL'] but now we have to populate the copyNumber component if possible. I added this condition to ignore the complete record and add it to the invalid record log if the condition is true. I just want to confirm with @rhdolin if we have to ignore just the copyNumber component or the complete record If the condition evaluates to true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just ignore the component, and still translate the record.

record,
record.samples[0].data)
return False
if record.is_sv and not validate_alt_sv(record.ALT):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all validation methods check if we can use PyVCF.

Comment on lines 391 to 399
if(
(alleles['CODE'] != "" or alleles['ALLELE'] != "") and
genomic_source_class == 'germline' and
not (record.is_sv and record.INFO['SVTYPE'] not in [
'INS',
'INV',
'DUP',
'DEL'
])):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If conditions in if case become such huge it's a good practice to add a comment before it. It helps the reader to understand what if is trying to check.


observation_dv_component5 = observation.ObservationComponent()
observation_dv_component5.code = concept.CodeableConcept(
if record.is_sv and record.INFO['SVTYPE'] in ['CNV', 'DUP', 'DEL']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the each if has record.is_sv or not record.is_svwhy not create a nested if and remove the repeated condition from eachif`.

if  record.is_sv:
   // explain contion 1. Provide some example If logic is complicated.(Repeat same thing for all conditions.)
    if condition_1:
    if condition_2:
else:
    if condition_3:
    if condition_4:

Conditional codes are complicated to understand and very error-prone. You should always try to simplify the conditions as much as possible and try to make them more readable. Unit tests are also very important to make sure all the conditions work as expected.

@Rohan-cod Rohan-cod changed the title Updated the code to add support for the conversion of Structural Variants Adds support for the conversion of Structural Variants Jul 12, 2021
@Rohan-cod Rohan-cod force-pushed the structural-variants branch 10 times, most recently from b55d343 to e4d238c Compare July 18, 2021 11:16
@Rohan-cod Rohan-cod force-pushed the structural-variants branch 2 times, most recently from 2cf2fb9 to 34cd4ab Compare July 21, 2021 16:44
Comment on lines 306 to 331
if record.is_sv:
dna_chg = get_dna_chg(record.INFO['SVTYPE'])
observation_dv_component1 = observation.ObservationComponent()
observation_dv_component1.code = concept.CodeableConcept(
{
"coding": [
{
"system": "http://loinc.org",
"code": "48019-4",
"display": "DNA Change Type"
}
]
}
)
observation_dv_component1\
.valueCodeableConcept = concept.CodeableConcept(
{
"coding": [
{
"system": ("http://sequenc" +
"eontology.org"),
"code": dna_chg['CODE'],
"display": dna_chg['DISPLAY']
}
]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def appendComponents:
   append(getCompoentID)
   append(getCompoentMeta)
   append(getCompoentStatus)
   if record.is_sv:
      append(getGenomicSourceClass)
      ...
       append(getSVComponet12)
  else:
      append(getStanardComponet2)   
     append(getGenomicSourceClass)   
      ....
      append(getStanardComponet7)      

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srgothi92, I'm not sure if this would ensure the order of the variant components.

What I did to reduce the complexity of the code was:

  • First I changed the variable names for all the variant components from observation_dv_component{component_number} to a name that's related to it like dna_change_type_component.

  • Second I created a list of the variant components(in order) here.

  • Third, I used that list to add components that are declared(checked using the locals() dictionary) to the observation_dv.component list.

This would also make it easier to add a new variant component in the future. We just have to update the variant components list here by adding the name of the component at the correct position and add the code to create the component in the add_variant_obv function.

This made the logic and maintainability, a lot simpler.

I have also added a comment just above the code to append variant components.

Comment on lines +221 to +232
if hasattr(record.INFO, 'CIPOS') and hasattr(record.INFO, 'CIEND'):
inner_start = record.POS + record.INFO['CIPOS'][1]
inner_end = record.INFO['END'] + abs(record.INFO['CIEND'][0])
outer_start = record.POS - abs(record.INFO['CIPOS'][0])
outer_end = record.INFO['END'] + record.INFO['CIEND'][1]
outer_start_end_component = observation.ObservationComponent()
outer_start_end_component.code = get_codeable_concept(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we doing here? It's hard to understand the logic here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we check, if there is a CIPOS and a CIEND field in record.INFO and calculate inner and outer
start and end
based on it, else we ignore the outer-start-end component and populate the inner-start-end component based on only the INFO.END value. I have also added a comment just above this code.

Comment on lines 187 to 211
# def validate_alt_simple(alt):
# alt_pattern_simple =\
# re.compile(
# r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|" +
# # Match character string and comma separated
# # list of character strings
# r"^\.$"
# # Match '.'
# )
# alt = ",".join(list(map(str, list(alt))))
# if alt_pattern_simple.match(alt):
# return True
# else:
# return False


# def validate_alt_sv(alt):
# alt_pattern_sv =\
# re.compile(
# r"^([a-zA-Z]+)(,[a-zA-Z]+)*$|" +
# # Match character string and comma separated
# # list of character strings
# r"^\.$|" +
# # Match '.'
# r"^(<[A-Z*\d*]+>)(,<[A-Z*\d*]+>)*$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(multiple occurrences)
Do we need commented code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srgothi92, I have removed all the commented code from this PR. I have stored a copy of that code on my local computer if it is required anytime in the future.

@Rohan-cod Rohan-cod force-pushed the structural-variants branch from 34cd4ab to 722e3fa Compare July 28, 2021 12:07
* Updated the code to add support for the conversion of Structural Variants.

* Exposed a new parameter 'genomic_source_class' to configure Genomic Source Class.
@Rohan-cod Rohan-cod force-pushed the structural-variants branch from 722e3fa to de26367 Compare July 28, 2021 12:18
@rhdolin rhdolin merged commit 0331975 into elimuinformatics:master Jul 28, 2021
@Rohan-cod Rohan-cod deleted the structural-variants branch July 28, 2021 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vcf2fhir for structural variants
3 participants