[query] fix bad bug in IndexedRVDSpec2 #14420

Merged
merged 7 commits into hail-is:main from IndexedRVDSpec-bugfix on Mar 28, 2024

Conversation


patrick-schultz (Collaborator) commented on Mar 19, 2024

CHANGELOG: Fixes a serious, but likely rare, bug in the Table/MatrixTable reader, present since Sep 2020. It manifests as many (around half or more) of the rows being dropped. This could only happen when 1) reading a (matrix)table whose partitioning metadata allows rows with the same key to be split across neighboring partitions, and 2) reading it with a different partitioning than the one it was written with. 1) would likely only happen by reading data keyed by locus and alleles and rekeying it to only locus before writing. 2) would likely only happen by using the `_intervals` or `_n_partitions` arguments to `read_(matrix)_table`, or possibly `repartition`. Please reach out to us if you're concerned you may have been affected by this.

This fixes a serious and longstanding bug in `IndexedRVDSpec2`, which appears to have been present since the code was first added in #9522, almost four years ago. It was reported in this [zulip thread](https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Number.20of.20rows.20changing.20with.20partitioning). I want to do further work to characterize exactly what it takes to be affected by this bug, but I believe you must have a Table or MatrixTable on disk which has duplicate keys, keys which moreover span neighboring partitions, and you must then read the data with a different partitioner.

The root of the issue is an invalid assumption made in the code. To read data written with partitioner `p1` using a new partitioner `p2`, it first computes the "intersection", or common refinement, of the two. It then assumes that each partition in the refinement overlaps exactly one partition of `p1`. But this is only true if the partitions of `p1` are themselves mutually disjoint, which is usually, but not necessarily, the case.
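To make the refinement concrete, here is a minimal Python sketch of computing the common refinement of two partitioners whose endpoints may be inclusive or exclusive. This is not Hail's actual Scala implementation; `Interval`, `bound_lo`, `bound_hi`, and `refine` are illustrative names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    start: int
    end: int
    includes_start: bool = True
    includes_end: bool = True

# Encode endpoints as (key, side) pairs so that, at the same key, an open
# start sorts after a closed start and an open end sorts before a closed
# end. Overlap and intersection then reduce to plain tuple comparisons.
def bound_lo(iv):
    return (iv.start, 0 if iv.includes_start else 1)

def bound_hi(iv):
    return (iv.end, 1 if iv.includes_end else 0)

def refine(p1, p2):
    """Common refinement: every nonempty intersection of an old (p1)
    partition with a new (p2) partition, in the order of p2."""
    out = []
    for new in p2:
        for old in p1:
            lo = max(bound_lo(old), bound_lo(new))
            hi = min(bound_hi(old), bound_hi(new))
            if lo < hi:  # the intersection is nonempty
                out.append(Interval(lo[0], hi[0], lo[1] == 0, hi[1] == 1))
    return out
```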

For example, suppose `p1 = [ [1, 5], [5, 8] ]` is the old partitioner, and `p2 = [ [1, 4), [4, 8] ]` is the new. Note that the two input partitions are not disjoint, as the key `5` is allowed in both. The common refinement would then be `[ [1, 4), [4, 5], [5, 8] ]`. For each partition in the refinement, we want to read in the corresponding range from the appropriate input partition, then group the partitions in the refinement to match the new partitioner. The code finds "the appropriate input partition" by taking the first input partition which overlaps the refinement partition, using `lowerBoundInterval`. That works if there is only one overlapping input partition, but fails here, since the refinement partition `[5, 8]` overlaps both input partitions. So the code mistakenly reads from the input partition `[1, 5]` to produce the refinement partition `[5, 8]`, and thus completely drops all rows in the input `[5, 8]`.
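Continuing the sketch above, the failure is easy to demonstrate: resolving each refinement partition to the first input partition that overlaps it (which is what the `lowerBoundInterval` lookup effectively does) picks the wrong source for `[5, 8]`.

```python
def overlaps(a, b):
    return max(bound_lo(a), bound_lo(b)) < min(bound_hi(a), bound_hi(b))

p1 = [Interval(1, 5), Interval(5, 8)]               # old; both admit key 5
p2 = [Interval(1, 4, True, False), Interval(4, 8)]  # new

for part in refine(p1, p2):
    srcs = [i for i, old in enumerate(p1) if overlaps(old, part)]
    print(part, "overlaps", srcs, "-> first-overlap reads p1[%d]" % srcs[0])
# [1, 4) overlaps [0]    -> reads p1[0], correct
# [4, 5] overlaps [0, 1] -> reads p1[0], correct (this piece came from p1[0])
# [5, 8] overlaps [0, 1] -> reads p1[0], WRONG: this piece came from p1[1],
#                           so every row stored in p1[1] is dropped
```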

In practice, I think the most likely way to run into this (and the way it was found by a user) is to have a dataset keyed by ["locus", "alleles"], which has split multi-allelics, so there are multiple rows with the same locus. Then shorten the key to ["locus"], write the dataset to disk, and read it back with a different partitioning, e.g. by passing a _n_partitions argument to read_table or read_matrix_table. For instance, if the partitioning was originally [ [{1:1, ["A"]}, {1:500, ["G"]}), [{1:500, ["G"]}, {1:1000, ["C"]}] ], then after shortening the key it would be [ [1:1, 1:500], [1:500, 1:1000] ]. Notice that even though the original partitioning had no overlap, it does after shortening the key, because rows with locus 1:500 with alleles less than ["G"] are allowed in the first partition, so we have to make the right endpoint inclusive after shortening. You would then need to write this rekeyed dataset to disk and read it back with different partitioning (note that ds.repartition is enough to do this in the batch backend).
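That scenario translates to roughly the following Python sketch. The paths are hypothetical, and on an affected version the final count would come back smaller than the original.

```python
import hail as hl

# A table keyed by ['locus', 'alleles'] with split multi-allelics, so
# several rows can share the same locus ('gs://bucket/split.ht' is a
# hypothetical path).
ht = hl.read_table('gs://bucket/split.ht')
n_before = ht.count()

# Shortening the key to ['locus'] forces right partition endpoints to
# become inclusive, so neighboring partitions can now admit the same key.
ht = ht.key_by('locus')
ht.write('gs://bucket/rekeyed.ht')

# Reading back with a different partitioning than was written is the
# second ingredient; on affected versions, rows silently disappear.
ht2 = hl.read_table('gs://bucket/rekeyed.ht', _n_partitions=10)
assert ht2.count() == n_before  # fails on affected versions
```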

I still need to think through what holes in our testing allowed this to remain undetected for so long, and attempt to plug them. We should also plan for what to tell a user who is concerned they may have been affected by this in the past.

patrick-schultz (Collaborator, Author) commented:

@ehigham reassigned to you since Chris is out

patrick-schultz marked this pull request as ready for review on March 21, 2024, 19:46
Comment on lines 532 to 533:

```scala
val nestedContexts = extendedNP.rangeBounds.indices.map { newPartIdx =>
  val newInterval = extendedNP.rangeBounds(newPartIdx)
```

A Member commented: zipWithIndex?

ehigham (Member) left a comment:


Thank you for adding the clarifying comments and the test. Much clearer now.

hail-ci-robot merged commit ad1fcee into hail-is:main on Mar 28, 2024
2 checks passed
patrick-schultz deleted the IndexedRVDSpec-bugfix branch on April 11, 2024, 10:51
chrisvittal pushed a commit to chrisvittal/hail that referenced this pull request Jul 10, 2024
chrisvittal pushed a commit that referenced this pull request Jul 11, 2024