-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[issue/273] dbSNP 2 VCF reformat util #300
[issue/273] dbSNP 2 VCF reformat util #300
Conversation
…e/273_dbSNP_vcf_reformat_utils
…tlar/bystro into feature/273_dbSNP_vcf_reformat_utils
perl/t/utils/dbsnp2FormatInfo.t
Outdated
ok( | ||
<$fh> | ||
== "NC_000001.11 10001 rs1570391677 T A,C . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;KOREAN=0.0109,.;SGDP_PRJ=0,.;dbGaP_PopFreq=.,0", | ||
'1st data row wiht KOREAN, SGDP_PRJ, dbGap freqs are correctly processed' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wiht -> with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bah :) need CI spellcheck, clearly
thanks
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
$self->{_localFiles} = $localFilesAref; | ||
} | ||
|
||
# TODO: error check opening of file handles, write tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment still in force? (looks like we have tests now for this module)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it isn't, thanks
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
|
||
my $base_name = basename($input_vcf); | ||
$base_name =~ s/\.[^.]+$//; # Remove last file extension (if present) | ||
$base_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how many different kinds of file extensions are we likely to have to handle here, and could we match explicitly on those instead of sedding off two extensions? The worry is someone uploading a file named something like foo.bar.vcf, which would get pared down to foo [?]
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
} | ||
|
||
# If it's an INFO line | ||
if (/FREQ=/) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing context here but this comment / condition pair is a little surprising because we're not matching against INFO
explicitly-- do all INFO lines in this file match /FREQ=/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I found this curious too. This is actually a clever solution, unless I'm missing something. ChatGPT came up with this, and I ended up keeping it because it was cute and quite elegant.
It checks rows for the FREQ=. If it doesn't find that, there's nothing to do but add the row as is. FREQ isn't guaranteed to be in the INFO field. It is also not found any other field, and never will be.
Then we split on ";". That will result in 1 field that contains everything up until the first INFO value, and the rest of the info values. We find the field containing FREQ=, extract the FREQ=(.*) value, expand the individual population POP=VAL as new INFO fields, join by ";" to the first field, which results in a complete file with correct delimiters.
I would have written this by splitting each field by "\t", then search INFO, but didn't see any downsides with this approach. FREQ is guaranteed to never appear 2x, though it could potentially appear 0 times, and this handles that case :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.
My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
} | ||
|
||
# TODO: error check opening of file handles, write tests | ||
sub go { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems like it's doing a lot of different things at a few different levels of abstraction. One way to address that might be to break it up into helpers roughly along the lines of:
- input validation
- setting up filepaths
- the core nested loop where we process $in_fh into $output_data_fh
- updating the VCF header
- writing everything out and cleaning up
But I defer to your judgment as to whether the juice is worth the squeeze there?
perl/lib/Seq/Role/IO.pm
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about adding or switching to checking magic strings to deduce the compression type rather than matching extensions exclusively? File::LibMagic is a Perl package that binds to the c library that seems like it would be a good fit and probably much easier to use than our own implementation for checking magic strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, though beyond the scope of this PR, as the file extension solution is used everywhere (the change here was to remove accepting $innerFile, which was not handled if provided). I have a tracking ticket #312 to evaluate the switch to File::LibMagic, scheduled for Sprint 4.
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
use 5.10.0; | ||
use strict; | ||
use warnings; | ||
use DDP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need DDP after development is done?
|
||
my $output_data_fh = $self->getWriteFh($output_data_path); | ||
|
||
$self->log( 'info', "Writing to $output_data_path" ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything to this point is getting setup to do actual work. This seems like a logical place to split apart the function into two sections, which will also aid testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
my $freq_data = $1; | ||
my @pops = split( /\|/, $freq_data ); | ||
|
||
foreach my $pop (@pops) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the heart of what you're doing - how about making it a func that you'd test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sent you my suggestion for sharding the functions and testing via teams.
perl/t/utils/dbsnp2FormatInfo.t
Outdated
# Read the processed file to check the INFO field | ||
$fh = path($expected_output_vcf_path)->openr; | ||
|
||
ok( <$fh> == "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're comparing a string using ==
the interpreter gave warnings with prove -lv
, which makes me concerned that the ok
wasn't really evaluated or something along those lines since you should expected '\n' characters if you're using <>. My suggestion is to write these as arrays of expected lines and cycle through them. Alternatively, I think we should use the eq
operator and chomp before comparing the string.
e.g.,
$fh = path($expected_output_vcf_path)->openr;
my $str = <$fh>;
chomp $str;
ok( $str eq "##fileformat=VCFv4.1", 'VCF fileformat is correctly processed' );
$str = <$fh>;
chomp $str;
ok( $str eq "##INFO=<ID=RS,Number=1,Type=String,Description=\"dbSNP ID\">",
'RS population is correctly processed' );
perl/lib/Utils/DbSnp2FormatInfo.pm
Outdated
} | ||
|
||
# If it's an INFO line | ||
if (/FREQ=/) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INFO is a positional field so what Alex is searching for is some value within that field. I see the logic. I would use index to find "FREQ=" rather than a regular expression but that's just my preference.
My reading of the regexps is that they would catch some non-numeric characters and your tests suggest it catches trailing stuff. I'm not sure how important that is for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added commit to include the spec URL and a note on processing in the module. Testing could be more transparent but that could be addressed later with some minor refactoring.
This is necessary because the dbSNP 2 vcf does not make good use of the VCF spec; the Freq field is the combination of multiple fields, each of which is an Allelic type, but where the first allele is the reference, which is not the standard use.
This will enable us to reproducibly fetch, transform, build dbSNP files, from 1 yaml config, once we add 1 more utility, which translates the RefSeq NC_* chromosome identifiers to chr1-22,X,Y,M.
Will remove [wip] once test added