[query] fix bad bug in IndexedRVDSpec2 #14420

Merged: hail-ci-robot merged 7 commits into hail-is:main from patrick-schultz:IndexedRVDSpec-bugfix on Mar 28, 2024
Conversation
patrick-schultz force-pushed the IndexedRVDSpec-bugfix branch from 5d534f0 to 2012b96 on March 19, 2024 at 18:51
@ehigham reassigned to you since Chris is out
ehigham previously requested changes on Mar 25, 2024

Comment on lines 532 to 533:

    val nestedContexts = extendedNP.rangeBounds.indices.map { newPartIdx =>
      val newInterval = extendedNP.rangeBounds(newPartIdx)
zipWithIndex?
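For context, here is a minimal standalone sketch (not code from the PR) of the idiom the reviewer is suggesting: zipWithIndex pairs each element with its index directly, instead of mapping over .indices and indexing back into the collection. The rangeBounds value below is a hypothetical stand-in for extendedNP.rangeBounds.

    object ZipWithIndexSketch extends App {
      // Hypothetical stand-in for extendedNP.rangeBounds.
      val rangeBounds = Vector("[1,4)", "[4,5]", "[5,8]")

      // Pattern under review: map over the indices, then index back into the collection.
      val viaIndices = rangeBounds.indices.map { newPartIdx =>
        val newInterval = rangeBounds(newPartIdx)
        (newPartIdx, newInterval)
      }

      // Reviewer's suggestion: zipWithIndex yields (element, index) pairs directly.
      val viaZipWithIndex = rangeBounds.zipWithIndex.map { case (newInterval, newPartIdx) =>
        (newPartIdx, newInterval)
      }

      // Both forms produce the same result; the second avoids the extra lookup.
      assert(viaIndices == viaZipWithIndex)
      println(viaZipWithIndex)
    }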
patrick-schultz force-pushed the IndexedRVDSpec-bugfix branch from 358adb0 to 1020a42 on March 27, 2024 at 17:22
patrick-schultz force-pushed the IndexedRVDSpec-bugfix branch from 1020a42 to c0ca3f6 on March 28, 2024 at 14:48
ehigham approved these changes on Mar 28, 2024
Thank you for adding the clarifying comments and the test. Much clearer now.
chrisvittal pushed a commit to chrisvittal/hail that referenced this pull request on Jul 10, 2024:
CHANGELOG: Fixes a serious, but likely rare, bug in the Table/MatrixTable reader, which has been present since Sep 2020. It manifests as many (around half or more) of the rows being dropped. This could only happen when 1) reading a (matrix)table whose partitioning metadata allows rows with the same key to be split across neighboring partitions, and 2) reading it with a different partitioning than it was written with. 1) would likely only happen by reading data keyed by locus and alleles, and rekeying it to only locus before writing. 2) would likely only happen by using the `_intervals` or `_n_partitions` arguments to `read_(matrix)_table`, or possibly `repartition`. Please reach out to us if you're concerned you may have been affected by this.

This fixes a serious and longstanding bug in `IndexedRVDSpec2`, which appears to have been around since this code was first added in hail-is#9522 almost four years ago. It was reported in this [zulip thread](https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Number.20of.20rows.20changing.20with.20partitioning). I want to do further work to better characterize exactly what it takes to be affected by this bug, but I think you must have a table or matrixtable on disk which has duplicate keys, and moreover keys which span neighboring partitions, and then you must read the data with a different partitioner.

The root of the issue is an invalid assumption made in the code. To read data written with partitioner `p1` using new partitioner `p2`, it first computes the "intersection", or common refinement, of the two. It then assumes that each partition in the refinement overlaps exactly one partition of `p1`. But this is only true if the partitions of `p1` are themselves mutually disjoint, which is usually but not necessarily true.

For example, suppose `p1 = [ [1, 5], [5, 8] ]` is the old partitioner, and `p2 = [ [1, 4), [4, 8] ]` is the new. Note that the two input partitions are not disjoint, as the key `5` is allowed in both. The common refinement would then be `[ [1, 4), [4, 5], [5, 8] ]`. For each partition in the refinement, we want to read in the corresponding range from the appropriate input partition, then group the partitions in the refinement to match the new partitioner. The code finds "the appropriate input partition" by taking the first input partition which overlaps the refinement partition, using `lowerBoundInterval`. That works if there is only one overlapping input partition, but here it fails, since the refinement partition `[5, 8]` overlaps both input partitions. So the code mistakenly reads from the input partition `[1, 5]` to produce the refinement partition `[5, 8]`, and so completely drops all rows in the input `[5, 8]`.

In practice, I think the most likely way to run into this (and the way it was found by a user) is to have a dataset keyed by `["locus", "alleles"]`, which has split multi-allelics, so there are multiple rows with the same locus. Then shorten the key to `["locus"]`, write the dataset to disk, and read it back with a different partitioning, e.g. by passing a `_n_partitions` argument to `read_table` or `read_matrix_table`. For instance, if the partitioning was originally `[ [{1:1, ["A"]}, {1:500, ["G"]}), [{1:500, ["G"]}, {1:1000, ["C"]}] ]`, then after shortening the key it would be `[ [1:1, 1:500], [1:500, 1:1000] ]`. Notice that even though the original partitioning had no overlap, it does after shortening the key, because rows with locus `1:500` with alleles less than `["G"]` are allowed in the first partition, so we have to make the right endpoint inclusive after shortening. You would then need to write this rekeyed dataset to disk and read it back with different partitioning (note that `ds.repartition` is enough to do this in the batch backend).

I still need to think through what holes in our testing allowed this to remain undetected for so long, and attempt to plug them. We should also plan for what to tell a user who is concerned they may have been affected by this in the past.
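To make the failure mode concrete, here is a small self-contained sketch of the `p1`/`p2` example above, using toy integer-keyed intervals rather than Hail's `Interval` or `lowerBoundInterval` implementation: the refinement partition `[5, 8]` overlaps both input partitions, so "take the first overlapping input partition" reads it from the input `[1, 5]` and every row stored in the input partition `[5, 8]` is silently dropped.

    // Toy model of intervals with inclusive/exclusive endpoints over integer keys.
    // This is NOT Hail's Interval class; it only illustrates the overlap logic.
    case class Iv(start: Int, end: Int, includesStart: Boolean = true, includesEnd: Boolean = true) {
      def contains(k: Int): Boolean =
        (k > start || (k == start && includesStart)) &&
          (k < end || (k == end && includesEnd))
      // Fine for small toy ranges: two intervals overlap if they share some integer key.
      def overlaps(that: Iv): Boolean =
        (start to end).exists(k => contains(k) && that.contains(k))
      override def toString =
        s"${if (includesStart) "[" else "("}$start, $end${if (includesEnd) "]" else ")"}"
    }

    object RefinementBugSketch extends App {
      // Old partitioner p1: the partitions are not disjoint, since key 5 is allowed in both.
      val p1 = Vector(Iv(1, 5), Iv(5, 8))
      // Common refinement of p1 with the new partitioner p2 = [ [1,4), [4,8] ],
      // written out by hand as in the description above.
      val refinement = Vector(Iv(1, 4, includesEnd = false), Iv(4, 5), Iv(5, 8))

      // Buggy assumption: each refinement partition overlaps exactly one input partition,
      // so "the" input partition to read from is the first one that overlaps it.
      for (r <- refinement) {
        val overlapping = p1.filter(_.overlaps(r))
        val chosen = overlapping.head
        println(s"refinement $r overlaps ${overlapping.mkString(", ")}; reads only from $chosen")
      }
      // The last line of output shows refinement [5, 8] overlapping both [1, 5] and [5, 8]
      // but reading only from [1, 5]: the rows stored in input partition [5, 8] are dropped.
    }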
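The key-shortening point can be illustrated the same way, again with a toy model (tuples standing in for (locus, alleles) keys, not Hail's key or interval types): an interval whose exclusive right endpoint is `{1:500, ["G"]}` still admits rows at locus 1:500 whose alleles sort before `["G"]`, so projecting that bound onto locus alone must make it inclusive at 1:500, and the two projected partitions then overlap.

    object KeyShorteningSketch extends App {
      // Toy keys: (locus, alleles) modeled as (Int, String); NOT Hail's key types.
      type Key = (Int, String)

      // Lexicographic comparison on the full (locus, alleles) key.
      def lt(a: Key, b: Key): Boolean =
        a._1 < b._1 || (a._1 == b._1 && a._2 < b._2)
      def lteq(a: Key, b: Key): Boolean = lt(a, b) || a == b

      // Original first partition, right-exclusive at (500, "G"): [ (1, "A"), (500, "G") ).
      def inOriginalPartition1(k: Key): Boolean = lteq((1, "A"), k) && lt(k, (500, "G"))

      // A row at locus 500 whose alleles sort before "G" is still allowed in that partition...
      val row: Key = (500, "C")
      assert(inOriginalPartition1(row))

      // ...so after shortening the key to locus only, the projected bound must become the
      // inclusive interval [1, 500], which now overlaps the second partition's projection
      // [500, 1000] at locus 500.
      println(s"row $row is in the first partition, so its locus-only right bound must include 500")
    }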
chrisvittal pushed a commit that referenced this pull request on Jul 11, 2024, with the same commit message as above.