-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding the second VcfParser class: NucleusParser #346
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Saman. I added some comments. Please also add unit tests just for the proto->class conversion for now. We can add more once the from_string method is available.
import vcf | ||
|
||
from apache_beam.coders import coders | ||
from apache_beam.io.filesystems import FileSystems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please only import the filesystems module (this is required by the styleguide [1], but not followed by Beam, which is why there is sometimes a discrepancy in our code).
[1] https://github.com/google/styleguide/blob/gh-pages/pyguide.md#22-imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
break | ||
return temp_file_name | ||
|
||
def _store_to_temp_local_file(self, header_lines): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this for our final design that we will have from_string(), I though I will just leave it here until we drop the current hack and will have our final implementation.
I will remove it now and add later ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the from_string()
I added this back.
# This member will be properly initiated in _init_with_header(). | ||
self._vcf_reader = None | ||
# These members will be properly initiated in _extract_header_fields(). | ||
self._is_info_repeated = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a dict of info and format that point to VcfInfo
and VcfFormatInfo
instead; that is more future-proof (e.g. if we want to also check the description or type for another purpose, we could just reuse those dicts later on without introducing more specialized dicts). You can keep the is_repeated arg in _convert_list_value
as it's a nice abstraction, but you can make a helper function that given a number string would return true/false if the field is repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I also added two other methods so finding is_repeated would be more clear.
|
||
def _init_with_header(self, header_lines): | ||
# This header line is needed by Nucleus. | ||
header_lines = ['##fileformat=VCFv4.2'] + header_lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please only add this if the first like is not ##fileformat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
|
||
def _init_with_header(self, header_lines): | ||
# This header line is needed by Nucleus. | ||
header_lines = ['##fileformat=VCFv4.2'] + header_lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the question is, how about the const string ##fileformat
which we use in the if condition? It seems a bit too redundant if we define that as another constant?
Let me know what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of this suggestion: define FILE_FORMAT_HEADER_TEMPLATE='##fileformat=VCF{VERSION}'
.
You can then do startswith(FILE_FORMAT_HEADER_TEMPLATE.format(VERSION='')
...if it doesn't match, you can add FILE_FORMAT_HEADER_TEMPLATE.format('v4.0')
. I think it's ok to hard-code the 'v.4.0' for simplicity.
self._file_name, data_line, str(e)) | ||
return MalformedVcfRecord(self._file_name, data_line, str(e)) | ||
|
||
def _convert_to_variant(self, record): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/record/variant_proto (we used 'record' before b/c that was the PyVCF naming).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
return MalformedVcfRecord(self._file_name, data_line, str(e)) | ||
|
||
def _convert_to_variant(self, record): | ||
# type: (nucleus_proto.Variant) -> Variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type looks wrong as there is no "nucleus_proto" module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my 'proper' implementation (using the from_string() method), we need:
from nucleus.protos import variants_pb2 as nucleus_proto
to initialize the Nucleus object (their signature has changed from the current version).
Here we don't need to import that module.
info=self._get_variant_info(record), | ||
calls=self._get_variant_calls(record)) | ||
|
||
def _get_variant_info(self, record): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/record/variant_proto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
if not sample.genotype: | ||
call.genotype.append(MISSING_GENOTYPE_VALUE) | ||
else: | ||
for v in sample.genotype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do call.genotype = sample.genotype
(or if it's really not a list, just cast to a list like list(sample.genotype)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to convert it to list otherwise it fails.
@@ -186,7 +186,8 @@ def __init__(self, | |||
compression_type=CompressionTypes.AUTO, # type: str | |||
buffer_size=DEFAULT_VCF_READ_BUFFER_SIZE, # type: int | |||
validate=True, # type: bool | |||
allow_malformed_records=False # type: bool | |||
allow_malformed_records=False, # type: bool | |||
use_nucleus=False # type: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's more future proof to make an enum for this like VcfParserType
and use that here (in case we decide to add a third parser). There is even a 'simpler' way of just using the class string and let python actually construct it (so you dont even need an if statement) but that may be less readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we need a more complicated approach given that we are pretty sure there is not going to be another parser anytime soon?
If you think still we better use the enum let me know, I will do it in the next round of comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should. It's very reasonable to have other types of parsers in the future (e.g. we're thinking about a bgzip parser; which may end up being a layer on top of Nuclues). In general, it's a good software engineering practice to not use bools for things that can possibly have more than two values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
67a6340
to
13640a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except for a strange pylint message (which I am looking into it) this PR is ready for review.
Also I have started working on unit tests in a separate PR, if you prefer I can add them to this one.
Thanks.
import vcf | ||
|
||
from apache_beam.coders import coders | ||
from apache_beam.io.filesystems import FileSystems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
self._temp_local_file = self._write_to_local_file(file_name) | ||
|
||
def _write_to_local_file(self, remote_file_name): | ||
(temp_file, temp_file_name) = tempfile.mkstemp(text=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
break | ||
return temp_file_name | ||
|
||
def _store_to_temp_local_file(self, header_lines): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this for our final design that we will have from_string(), I though I will just leave it here until we drop the current hack and will have our final implementation.
I will remove it now and add later ...
|
||
def _init_with_header(self, header_lines): | ||
# This header line is needed by Nucleus. | ||
header_lines = ['##fileformat=VCFv4.2'] + header_lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
|
||
def _init_with_header(self, header_lines): | ||
# This header line is needed by Nucleus. | ||
header_lines = ['##fileformat=VCFv4.2'] + header_lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the question is, how about the const string ##fileformat
which we use in the if condition? It seems a bit too redundant if we define that as another constant?
Let me know what do you think?
return None | ||
if len(output_list) == 1: | ||
return output_list[0] | ||
raise ValueError('a not repeated field has more than 1 value') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe, regardless of what the Nucleus does, we need to do the right thing and fail (or at least report this problem) instead of just dropping unexpected values.
If you think failing is not a good idea, we could change this to a warning. Let me know what you think.
# This member will be properly initiated in _init_with_header(). | ||
self._vcf_reader = None | ||
# These members will be properly initiated in _extract_header_fields(). | ||
self._is_info_repeated = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I also added two other methods so finding is_repeated would be more clear.
break | ||
return temp_file_name | ||
|
||
def _store_to_temp_local_file(self, header_lines): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the from_string()
I added this back.
alternate_bases=variant_proto.alternate_bases, | ||
names=map(str, variant_proto.names), | ||
alternate_bases=list(variant_proto.alternate_bases), | ||
names=variant_proto.names[0].split(';') if variant_proto.names else [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should report this as a bug to Nucleus team.
if phaseset_from_format or call_proto.is_phased: | ||
call.phaseset = (phaseset_from_format if phaseset_from_format | ||
else DEFAULT_PHASESET_VALUE) | ||
if call_proto.is_phased: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified this if statement, I think this is more straightforward. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Saman. This looks much better! Please also submit unit tests together with this PR as we want to avoid having untested code in our repo at any point; may also be easier to catch/fix strange errors.
P.S. let's sync offline about how to do setup.py properly so that the unit tests actually pass.
import os | ||
import tempfile | ||
|
||
from nucleus.io.python import vcf_reader as nucleus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/nucleus/nucleus_vcf_reader so that it doesn't conflict with import nucleus
itself (in case someone wants to do that in the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
import tempfile | ||
|
||
from nucleus.io.python import vcf_reader as nucleus | ||
from nucleus.protos import variants_pb2 as nucleus_proto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use variants_pb2
here. The one above is fine as you like to distinguish between pyvcf and nucleus parsers, but you should leave this one as is (e.g. it's easier for searching).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -41,7 +46,7 @@ | |||
DEFAULT_PHASESET_VALUE = '*' # Default phaseset value if call is phased, but | |||
# no 'PS' is present. | |||
MISSING_GENOTYPE_VALUE = -1 # Genotype to use when '.' is used in GT field. | |||
|
|||
FIRST_HEADER_LINE = '##fileformat=VCFv4.2\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: do all header_lines
passed via _init_with_header
include a \n
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, originally I did not have \n
and I noticed header was missing AC
info field. When I further looked at the problem I found out the top two lines are concatenated as one line (and Nucleus actually did not show any warning or error message).
@@ -41,7 +46,7 @@ | |||
DEFAULT_PHASESET_VALUE = '*' # Default phaseset value if call is phased, but | |||
# no 'PS' is present. | |||
MISSING_GENOTYPE_VALUE = -1 # Genotype to use when '.' is used in GT field. | |||
|
|||
FIRST_HEADER_LINE = '##fileformat=VCFv4.2\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use 4.0 as the 'least common denominator'. Not sure if it matters for the parser though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
return MalformedVcfRecord(self._file_name, data_line, str(e)) | ||
|
||
def _convert_to_variant(self, variant_proto): | ||
# type: (nucleus.protos.variants_pb2.Variant) -> Variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: just variants_pb2.Variant
is fine here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
# This is a temporary solution until from_string will be fixed. | ||
record = next(self._vcf_reader) | ||
return self._convert_to_variant(record) | ||
except (LookupError, ValueError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't see LookupError
in their code. Looks like they only throw ValueError
in their from_string signature.
if record.reference_bases != MISSING_FIELD_VALUE | ||
else None), | ||
alternate_bases=( | ||
map(str, record.alternate_bases) if record.alternate_bases else []), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange. We should investigate this further later. OK for this PR.
alternate_bases=( | ||
map(str, record.alternate_bases) if record.alternate_bases else []), | ||
names=map(str, record.names) if record.names else [], | ||
quality=record.quality, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; it seems like they use -1
instead of null for quality field. We should eventually revisit this and special case on that. OK to leave as TODO for this PR.
return None | ||
if len(output_list) == 1: | ||
return output_list[0] | ||
raise ValueError('a not repeated field has more than 1 value') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only throw an error if allow_malformed_records is true. But let's first see what Nucleus does. OK to just add a TODO for this PR though.
|
||
def _get_variant_calls(self, record): | ||
calls = [] | ||
for sample in record.calls: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this context, it's better to use consistent names within each class rather than across classes. There are some terminologies that are PyVCF specific that we shouldn't leak into Nucleus ;) (e.g. when/if we remove PyVCF).
5282426
to
b845c24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added all modifications that you asked for. I also duplicated two of the unit tests so we have a minimum level of tests for Nucleus. Please let me know if you think this is enough.
import os | ||
import tempfile | ||
|
||
from nucleus.io.python import vcf_reader as nucleus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
import tempfile | ||
|
||
from nucleus.io.python import vcf_reader as nucleus | ||
from nucleus.protos import variants_pb2 as nucleus_proto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -41,7 +46,7 @@ | |||
DEFAULT_PHASESET_VALUE = '*' # Default phaseset value if call is phased, but | |||
# no 'PS' is present. | |||
MISSING_GENOTYPE_VALUE = -1 # Genotype to use when '.' is used in GT field. | |||
|
|||
FIRST_HEADER_LINE = '##fileformat=VCFv4.2\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, originally I did not have \n
and I noticed header was missing AC
info field. When I further looked at the problem I found out the top two lines are concatenated as one line (and Nucleus actually did not show any warning or error message).
@@ -41,7 +46,7 @@ | |||
DEFAULT_PHASESET_VALUE = '*' # Default phaseset value if call is phased, but | |||
# no 'PS' is present. | |||
MISSING_GENOTYPE_VALUE = -1 # Genotype to use when '.' is used in GT field. | |||
|
|||
FIRST_HEADER_LINE = '##fileformat=VCFv4.2\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
return MalformedVcfRecord(self._file_name, data_line, str(e)) | ||
|
||
def _convert_to_variant(self, variant_proto): | ||
# type: (nucleus.protos.variants_pb2.Variant) -> Variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
else: | ||
call.genotype = list(call_proto.genotype) | ||
|
||
if call_proto.is_phased: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your reasoning for changing it?
Nucleus comment on is_phased
field implies that the value of PS field is only relevant when is_phased == True
.
I changed this to the old logic.
@@ -186,7 +186,8 @@ def __init__(self, | |||
compression_type=CompressionTypes.AUTO, # type: str | |||
buffer_size=DEFAULT_VCF_READ_BUFFER_SIZE, # type: int | |||
validate=True, # type: bool | |||
allow_malformed_records=False # type: bool | |||
allow_malformed_records=False, # type: bool | |||
use_nucleus=False # type: bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
# This is a temporary solution until from_string will be fixed. | ||
record = next(self._vcf_reader) | ||
return self._convert_to_variant(record) | ||
except (LookupError, ValueError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, last time I quickly checked this test file and I saw they are checking both of those exception. Now that I checked the test_vcf_from_string_raises_on_bad_input
I saw they only check the ValueError
for from_string()
method.
df52894
to
d8d17f3
Compare
Pull Request Test Coverage Report for Build 1264
💛 - Coveralls |
d8d17f3
to
89d9266
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Saman! Looks great! Just two nits (one is optional).
try: | ||
variant_proto = self._vcf_reader.from_string(data_line) | ||
return self._convert_to_variant(variant_proto) | ||
except (ValueError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please remove the surrounding brackets around ValueError
self._representative_header_lines, | ||
buffer_size=self._buffer_size, | ||
skip_header_lines=0) | ||
if self._vcf_parser_type == VcfParserType.PYVCF: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since the signature of the classes is exactly the same, you can do:
vcf_parser_class = None
if self._vcf_parser_type == VcfParserType.PYVCF:
vcf_parser_class = vcf_parser.PyVcfParser
elif self._vcf_parser_type == VcfParserType.NUCLEUS:
vcf_parser_class = vcf_parser.NucleusParser
else:
raise Error
record_iterator = vcf_parser_class(
file_name,
range_tracker,
....)
This has the advantage of not duplicating the code for the arguments, but we may need to change it back if at any point the signature of the subclasses is not the same. So, up to you if you want to change it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Please double check that integration tests pass prior to submitting.
as the second class that implements the VcfParser interface. Currently we are trying to solve a few issues related to the nucleus Wheel file.
After defining the VCFParser interface and PyVcfParser in the PR we add NucleusParser. Following points must be considered: * This new parser is hidden behind a flag (use_nucleus=False). * Since we are not still able to create a Wheel file that have the from_string() method, we use a hack to make Nucleus work: write the whole file to a local temp file and send it as input to parser. * Currently we don't have any unit tests or integration test. We manually cheked the sanity of Nucleus output by comparing its output table with PyVcf output. This must be done in the next step.
as proposed by Asha.
After we managed to build a Wheel file based on the latest code base. We also make a few minor changes to make NucleusParser work with Dataflow.
and also move unit tests of Variant and VariantCall classes to vcf_parser_test.py
In order to use NucleusParser with a DataflowRuner the following flag is still needed: --extra_package /dirctory/containing/Nucleus-0.1.0-py2-none-any.whl
11bf049
to
5c2fc6f
Compare
In PR googlegenomics#346 we added a temporary solution to install Nucleus using its wheel file. Now that Nucleus has pip install we can remove that hack.
In PR googlegenomics#346 we added a temporary solution to install Nucleus using its wheel file. Now that Nucleus has pip install we can remove that hack.
In PR googlegenomics#346 we added a temporary solution to install Nucleus using its wheel file. Nucleus now has pip install so we can remove that hack.
In PR googlegenomics#346 we added a temporary solution to install Nucleus using its wheel file. Nucleus now has pip install so we can remove that hack.
In PR googlegenomics#346 we added a temporary solution to install Nucleus using its wheel file. Nucleus now has pip install so we can remove that hack.
* Adding google-nucleus to the REQUIRED_PACKAGES list In PR #346 we added a temporary solution to install Nucleus using its wheel file. Nucleus now has pip install so we can remove that hack. * Review comments
After defining the VCFParser interface and PyVcfParser in a previous PR, here we add
NucleusParser. Following points must be highlighted:
from_string() method, we use a hack to make Nucleus work: write the whole
file to a local temp file and send it as input to parser.
manually cheked the sanity of Nucleus output by comparing its output
table with PyVcf output. This must be done in the next step.