Implement PathSeq taxon hit scoring in Spark #3406
Codecov Report
@@             Coverage Diff              @@
##             master     #3406     +/-   ##
============================================
+ Coverage    80.497%   80.589%   +0.093%
- Complexity    17553     17668      +115
============================================
  Files          1175      1175
  Lines         63487     63836      +349
  Branches       9895      9963       +68
============================================
+ Hits          51105     51445      +340
- Misses         8433      8434        +1
- Partials       3949      3957        +8
On vacation--won't have a chance to look at this until next week. If Chris or others approve, just go for it.
This looks fine to me.
final Double score = SCORE_GENOME_LENGTH_UNITS * hit.numMates / (numHits * tree.getLengthOf(taxId));
sum += score;
//Git list containing this node and its ancestors
Git -> Get?
Done
Upgrades PathSeqScoreSpark to perform abundance score calculations on the executors rather than the driver. The previous driver-side calculation was crashing on inputs with a lot of pathogen reads.
This also required some minor changes to the PSPathogenTaxonScore class so that it can keep track of the abundance score contributions that come directly from hits to a taxon and those that come from the taxon's descendants.
As a result, some of the test output changed, which broke the tests that used bitwise, exact checks on the output. The tests now check for output equivalence instead: they parse the scores table, check that all the taxa are the same, and check that the scores are equal to within some defined epsilon.
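For illustration, an equivalence check of this kind could look roughly like the sketch below. It is not the test code from this PR: the class name `ScoreTableEquivalence`, the helper names, and the two-column `tax_id<TAB>score` table layout are all assumptions made for the example, and the real scores table has more columns.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of GATK: compares two score tables for
// equivalence rather than byte-for-byte equality.
public class ScoreTableEquivalence {

    // Parses an assumed tab-separated layout: one header line, then
    // "tax_id<TAB>score" rows. Blank lines are ignored.
    static Map<String, Double> parse(String table) {
        Map<String, Double> scores = new HashMap<>();
        String[] lines = table.split("\n");
        for (int i = 1; i < lines.length; i++) { // start at 1 to skip the header
            if (lines[i].trim().isEmpty()) {
                continue;
            }
            String[] fields = lines[i].split("\t");
            scores.put(fields[0], Double.parseDouble(fields[1]));
        }
        return scores;
    }

    // Two tables are equivalent if they list the same taxa (in any order)
    // and each pair of scores differs by at most epsilon.
    static boolean equivalent(Map<String, Double> expected,
                              Map<String, Double> actual,
                              double epsilon) {
        if (!expected.keySet().equals(actual.keySet())) {
            return false;
        }
        for (Map.Entry<String, Double> e : expected.entrySet()) {
            if (Math.abs(e.getValue() - actual.get(e.getKey())) > epsilon) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same taxa in a different row order, scores differing by ~1e-10.
        String expected = "tax_id\tscore\n562\t12.5\n1280\t3.25\n";
        String actual = "tax_id\tscore\n1280\t3.2500000001\n562\t12.4999999999\n";
        System.out.println(equivalent(parse(expected), parse(actual), 1e-6));
    }
}
```

Keying the parsed rows by taxon ID makes the comparison insensitive to row order, which is one reason a check like this tolerates changes that a bitwise comparison would reject.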