Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporating changes from GVS to existing files #8256

Merged
merged 2 commits into from
Apr 11, 2023

Conversation

lbergelson
Copy link
Member

  • Absorbing the changes to core GATK files from the long running GVS work.
  • Based on the vs_834_deletions branch where code was factored to separate the gvs
    specific classes from the shared gatk code
  • There were fairly minimal conflicts between the two except for in the new VQSR package.
    In this case we've taken the master version of those files since the GVS version is out
    of date.
  • The dockstore file was also not included because I don't understand it and someone familiar with it will have to take a look.

@lbergelson
Copy link
Member Author

@droazen @mcovarr
This incorporates the code from https://github.com/broadinstitute/gatk/pull/8231/files into gatk. There are two caveats that I've described above which may need input from the GVS team.

If tests are passing we should probably do a light review to make sure nothing is obviously bonkers in unintentional ways from the merge.

@droazen droazen self-requested a review March 22, 2023 17:17
@droazen droazen self-assigned this Mar 22, 2023
@lbergelson
Copy link
Member Author

We're failing with:
stderr: write /opt/miniconda/envs/gatk/libexec/gcc/x86_64-conda-linux-gnu/7.5.0/lto1: no space left on device

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

@lbergelson we're not sure why we'd be running out of space, we'll have a look.

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

Per team discussion:

  • Taking the new VQSR package from master is a good thing, so yay to that. 🙂
  • .dockstore.yml is used for making the GVS WDLs available in Terra from Dockstore, so this is something that can / should live in a GVS-specific repo and doesn't need to be in GATK.

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

@lbergelson The ~370 MB of files added under src/test/resources/large/filteringJointVcf appear to be test data from an older version of VQSR Lite that are no longer referenced. I deleted these files and that seems to have been enough to bring disk usage back under the limit, with no tests upset by their absence.

@lbergelson
Copy link
Member Author

@mcovarr Happy to remove these then! We're dangerously close to the vm disk limit it seems.

* Absorbing the changes to core GATK files from the long running GVS work.
* Based on the vs_834_deletions branch where code was factored to separate the gvs
  specific classes from the shared gatk code
* There were fairly minimal conflicts between the two except for in the new VQSR package.
  In this case we've taken the master version of those files since the GVS version is out
  of date.

remove some large files that don't seem to be referenced
@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

Yeah disk space definitely seems pretty tight in these builds right now...

I'm not sure why those files are still in our branch. I did a cleanup pass on our branch a couple of weeks ago that was intended to identify files like this; I'll see if I can figure out how they were missed.

@lbergelson
Copy link
Member Author

Maybe they were referenced in the vqsr stuff that I took the newer version of?

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

OK so the script I wrote earlier was really about figuring out which files in ah_var_store were candidates to be moved to a new variantstore repo and not explicitly about identifying cruft in our ah_var_store branch (though I did incidentally identify some cruft in the process of writing the script). However a couple of functions from this can be repurposed to find cruft:

% comm -1 -2 <(files_deleted_from_master) <(files_added_on_ah_var_store)              
scripts/vcf_site_level_filtering_cromwell_tests/vcf_site_level_filtering_travis.json
src/test/resources/large/filteringJointVcf/test_10_samples.22.avg.vcf.gz
src/test/resources/large/filteringJointVcf/test_10_samples.22.avg.vcf.gz.tbi
src/test/resources/large/filteringJointVcf/test_10_samples.23.avg.vcf.gz
src/test/resources/large/filteringJointVcf/test_10_samples.23.avg.vcf.gz.tbi
src/test/resources/large/filteringJointVcf/test_10_samples.sites_only.vcf.gz
src/test/resources/large/filteringJointVcf/test_10_samples.sites_only.vcf.gz.tbi
%

Which are the 6 files we already knew about plus a bonus JSON. I'll make a PR against ah_var_store to clean this out and hopefully prevent anyone else from having to think about these files again. 🙂

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 22, 2023

Actually it turns out all of these files are still being used in the version of VQSR Lite code on ah_var_store, even the JSON file with 'Travis' in its name. So at least from the ah_var_store perspective this wasn't cruft, even if it is on master.

@lbergelson
Copy link
Member Author

@mcovarr Interesting. Maybe they were additional test files added when you incorporated the vqsr lite changes? It was very unclear to me if there were intentional changes / additions in the variant store branch or if things were just out of date / slightly different than master. Maybe we should have someone who was involved in that take a look so we can be sure to incorporate any positive changes that were made.

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 23, 2023

Hi Louis, we talked this over and the Variants folks who have done more VQSR Lite than myself say this looks fine as they expected the version of VQSR Lite on master to be more up to date than what's on ah_var_store.

So basically this looks good to us. Thanks again for taking this on! 🙏

@mcovarr
Copy link
Collaborator

mcovarr commented Mar 23, 2023

So I tried adding in the GVS files to this branch and found two compilation issues:

  • One can apparently be resolved by adding this constant.
  • The other appears to be related to the absence of this commit. Unfortunately I get a merge conflict when I try to cherry pick this onto my branch, and I don't understand the code well enough to resolve it.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson Just a few minor issues / questions

Can you also comment on what happened to the changes in 95b1548 during conflict resolution?

@@ -145,7 +145,7 @@ public abstract class GATKTool extends CommandLineProgram {
/**
* Our source of reference data (null if no reference was provided)
*/
ReferenceDataSource reference;
protected ReferenceDataSource reference;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would keep this package-protected. It's meant to be encapsulated for walker traversals (descendants of WalkerBase), and exposed for direct descendants of GATKTool (ie., non-walkers). Non-walker tools can access this field via the inherited directlyAccessEngineReferenceDataSource() method. @mcovarr / @lbergelson , would it be a significant hardship for the GVS branch to migrate to using that method?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be okay if we do the following in ExtractTool.java? Only the three subclasses of ExtractTool.java reference reference.

protected ReferenceDataSource reference = directlyAccessEngineReferenceDataSource();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcovarr Of course, that's totally fine! If those are the only usages in the GVS branch, then, would you be ok with @lbergelson changing this field back to package-accessible?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that would be fine, thanks!

public GnarlyGenotyperEngine(final boolean keepAllSites, final int maxAltAllelesToOutput, final boolean stripASAnnotations) {
this(keepAllSites, maxAltAllelesToOutput, true, stripASAnnotations);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good for @ldgauthier to have a quick look at the GnarlyGenotyper changes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second look these GnarlyGenotyper changes are trivial (just a new emitPls flag), and so don't require extra scrutiny

@@ -200,7 +200,7 @@ public LineBuilder fill(final String filling) {
* Constructs the line and writes it out to the output
*/
public void write() {
Utils.validate(!Arrays.stream(lineToBuild).anyMatch(Objects::isNull), "Attempted to construct an incomplete line, make sure all columns are filled");
Utils.validate(!Arrays.stream(lineToBuild).anyMatch(Objects::isNull), "Attempted to construct an incomplete line, make sure all columns are filled: " + Arrays.toString(lineToBuild));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this String should be lazily constructed (ie., only when the validation check fails) instead of unconditionally on every call to write()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made lazy

@@ -38,28 +36,42 @@ public void testExecuteQueryWithWhereClause() {
expectedNamesAndAges.put("Fred", "35");

final String query = String.format("SELECT * FROM `%s` WHERE name = 'Fred'", BIGQUERY_FULLY_QUALIFIED_TABLE);
Map<String, String> labels = new HashMap<String, String>();
labels.put("gatktestquery", "testwhereclause" + runUuid);
System.out.print("testwhereclause" + runUuid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the logger, not System.out.print()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're more commonly called lumberjacks?
image

Copy link
Collaborator

@mcovarr mcovarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM delta David's requested changes. Looking forward to having GVS synced up with master! 😄

@lbergelson
Copy link
Member Author

@droazen I've responded to your comments.

@droazen droazen merged commit 0374937 into master Apr 11, 2023
@droazen droazen deleted the lb_merge_gvs_conflicts branch April 11, 2023 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants