
[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features #38966

Closed
wants to merge 2 commits into from

Conversation

ahmed-mahran
Contributor

What changes were proposed in this pull request?

This PR adds a pre-processing step to isotonic regression in mllib to handle duplicate features, matching the sklearn implementation. Input points with duplicate feature values are aggregated into a single point whose label is the weighted average of the labels of those points. All points sharing a unique feature value are aggregated as follows:

  • Aggregated label is the weighted average of all labels
  • Aggregated feature is the weighted average of all equal features. Feature values may be equal only up to some resolution due to representation errors; since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values are exactly equal and the weighted average is simply the value at any point.
  • Aggregated weight is the sum of all weights
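The aggregation rule above can be sketched in Python. This is a minimal, hypothetical illustration (the actual change is Scala code in mllib); `make_unique` is an illustrative name, and grouping here is by exact equality rather than equality up to a resolution:

```python
from itertools import groupby

def make_unique(points):
    """Aggregate points sharing a feature value into a single point.

    points: list of (label, feature, weight) tuples, pre-sorted by feature.
    Returns one point per unique feature value: weighted-average label,
    weighted-average feature, and summed weight.
    """
    aggregated = []
    for _, group in groupby(points, key=lambda p: p[1]):
        group = list(group)
        total_weight = sum(w for _, _, w in group)
        label = sum(l * w for l, _, w in group) / total_weight
        feature = sum(f * w for _, f, w in group) / total_weight
        aggregated.append((label, feature, total_weight))
    return aggregated
```

For example, two points at feature 1.0 with labels 2.0 and 1.0 (unit weights) collapse into a single point with label 1.5 and weight 2.0, while a lone point passes through unchanged.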

Why are the changes needed?

As per the discussion on ticket [SPARK-41008], it is a bug and results should match sklearn.

Does this PR introduce any user-facing change?

There are no changes to the API, documentation or error messages. However, the user should expect results to change.

How was this patch tested?

Existing test cases for duplicate features failed and were adjusted accordingly. Also, new tests were added.

Here is a python snippet that can be used to verify the results:

```python
from sklearn.isotonic import IsotonicRegression

def test(x, y, x_test, isotonic=True):
    ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
    y_test = ir.predict(x_test)

    def print_array(label, a):
        print(f"{label}: [{', '.join([str(i) for i in a])}]")

    print_array("boundaries", ir.X_thresholds_)
    print_array("predictions", ir.y_thresholds_)
    print_array("y_test", y_test)

test(
    x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    y = [1, 0, 0, 1, 0, 1, 0, 0, 0],
    x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
)
```

@srowen @zapletal-martin

@github-actions github-actions bot added the MLLIB label Dec 7, 2022
Member

@srowen srowen left a comment


Looking good, if tests pass. Is it fair to say this keeps existing tests and adds new ones? My only concern is about introducing unintended behavior changes, but if it's passing all the previous tests, that's good enough for purposes here

// Utility object, holds a buffer of all points with unique features so far, and performs
// weighted sum accumulation of points. Hides these details for better readability of the
// main algorithm.
object PointsAccumulator {
Member


Maybe make this a top-level object in this file at least, if not a separate file

weight > 0
}

if (cleanInput.isEmpty) Array.empty
Member


Expand the if-else so bodies are on new lines

@@ -194,40 +241,46 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
assert(model.predict(0.5) === 1.5)
assert(model.predict(0.75) === 1.75)
assert(model.predict(1) === 2)
assert(model.predict(2) === 10d/3)
assert(model.predict(9) === 10d/3)
assert(model.predict(2) === 10d / 3)
Member


You're welcome to just write 10.0 / 3 here too

@ahmed-mahran
Contributor Author

Is it fair to say this keeps existing tests and adds new ones? my only concern is about introducing new unintended behavior changes, but if it's passing all the previous tests, that's good enough for purposes here

Well, the behavior has changed for duplicate-feature cases. Specifically, I had to change only these two tests to conform to the new behavior. On master, these tests produce different results than sklearn:

Added new tests:

@srowen
Member

srowen commented Dec 7, 2022

Of course yeah this is going to change behavior where a bug is fixed. Other than that, do the tests now return different answers from sklearn where they agreed before? I'm only concerned about other regressions here

@ahmed-mahran
Contributor Author

I don't remember whether I've checked all tests against sklearn. I'll double-check anyway.

@srowen
Member

srowen commented Dec 7, 2022

Double-checking is great, though I'm asking whether you meant in your comment that you know a test currently disagrees with sklearn already

@ahmed-mahran
Contributor Author

Double-checking on this branch, all tests produce the same results as sklearn.

On master, these two (already existing) tests produce different results than sklearn:

@srowen
Member

srowen commented Dec 7, 2022

Ok if it was already varying from sklearn, and isn't the same issue as you're fixing, leave it for now

@ahmed-mahran
Contributor Author

Aha, I see your point! It was varying because of the issue I'm fixing, which is about handling duplicate features.

@ahmed-mahran
Contributor Author

... So, if we run this new code on existing (old) master's suite, those two tests will fail.

@srowen
Member

srowen commented Dec 7, 2022

Ah OK, hm, that sounds like a regression related to this change then and probably needs to be fixed (or decide the old test was wrong)

@ahmed-mahran
Contributor Author

Regardless of this change, these tests' assumptions contradict what this ticket is about. See the tests' names at least: the tests are about handling duplicate features. My guess is that these were introduced to cover corner cases of lex sorting and repartitioning. However, we know that the original implementation doesn't actually handle consolidation of duplicate features. Hence, these tests are, by definition, not valid.

@srowen
Member

srowen commented Dec 7, 2022

OK so the test result is different, but maybe wasn't correct in the first place. But it doesn't match sklearn now, and did before (?) I'm still a little concerned that this changes behavior in unintended ways, but maybe you have a better judgment about whether this is still mostly an improvement in the result

@ahmed-mahran
Contributor Author

OK so the test result is different, but maybe wasn't correct in the first place. But it doesn't match sklearn now, and did before (?)

I'll try to trace history of sklearn, maybe it has changed behavior.

I'm still a little concerned that this changes behavior in unintended ways, but maybe you have a better judgment about whether this is still mostly an improvement in the result

All I can say here is that if sklearn should be the reference, then these tests are definitely wrong, and the issue is due to not consolidating duplicate features.

@srowen srowen closed this in 3d05c7e Dec 8, 2022
@srowen
Member

srowen commented Dec 8, 2022

Merged to master

@ahmed-mahran
Contributor Author

So, it is true: scikit-learn changed its behavior just after Spark introduced isotonic regression. I've tested different scikit-learn versions (namely: 0.13.1, 0.14.1, 0.15.0, 0.15.2) from when isotonic regression was first introduced until the duplicates PR was applied, and interestingly none matched master's Spark.

```python
from sklearn.isotonic import IsotonicRegression

IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # >= 0.15
IsotonicRegression(increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # = 0.14
IsotonicRegression().fit(x, y, weight=weights).predict(x_test)  # = 0.13

# test("isotonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [2, 1, 4, 2, 6, 5], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], True

# spark  : array([ 1 ,  2,  4.5,  6 ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([ nan,  2. ,  4.5,  5. ])
# v0.15.2: array([ 2. ,  2. ,  4.5,  6. ])
# v0.16.0: array([ 1.5 ,  2.25,  4.25,  5.5 ])

# test("antitonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [5, 6, 2, 4, 1, 2], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], False

# spark  : array([ 6 ,  4.5,  2,  1 ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([  nan,  4.25,  2.25,  1.5 ])
# v0.15.2: array([ 5.5 ,  4.25,  2.25,  1.5 ])
# v0.16.0: array([ 5.5 ,  4.25,  2.25,  1.5 ])
```

This R package suggests that there are three methods for breaking ties: primary, secondary, and tertiary. Spark and scikit-learn originally implemented "primary"; scikit-learn then replaced it with "secondary", which is what this PR implements.
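To see how "secondary" tie-breaking reproduces the scikit-learn >= 0.16 numbers above, here is a minimal pure-Python sketch run on the first test case. The helper names (`dedup`, `pava`, `predict`) are illustrative, not Spark or sklearn APIs:

```python
import bisect
from itertools import groupby

def dedup(x, y, w):
    """'Secondary' tie-breaking: average labels of points with equal features."""
    pts = sorted(zip(x, y, w))
    xs, ys, ws = [], [], []
    for xi, group in groupby(pts, key=lambda p: p[0]):
        group = list(group)
        wt = sum(g[2] for g in group)
        xs.append(xi)
        ys.append(sum(g[1] * g[2] for g in group) / wt)
        ws.append(wt)
    return xs, ys, ws

def pava(y, w):
    """Weighted pool-adjacent-violators for an increasing fit."""
    blocks = []  # each block: [value, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / wt, wt, c1 + c2])
    fitted = []
    for v, _, c in blocks:
        fitted.extend([v] * c)
    return fitted

def predict(xs, ys, x):
    """Linear interpolation between thresholds, clipping out-of-bounds queries."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x)
    return ys[i - 1] + (ys[i] - ys[i - 1]) * (x - xs[i - 1]) / (xs[i] - xs[i - 1])

x, y, w = [1, 1, 2, 2, 3, 3], [2, 1, 4, 2, 6, 5], [1, 1, 1, 1, 1, 1]
xs, ys, ws = dedup(x, y, w)
fitted = pava(ys, ws)
print([predict(xs, fitted, t) for t in [0, 1.5, 2.5, 4]])
```

Averaging the duplicates first collapses the input to thresholds (1, 1.5), (2, 3.0), (3, 5.5), which is already monotone, and interpolating at [0, 1.5, 2.5, 4] gives the v0.16.0 row from the isotonic test case above.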

@ahmed-mahran
Contributor Author

@srowen
Copy link
Member

srowen commented Dec 8, 2022

OK, as long as it's logical and consistent, and more logical/consistent than before, I think it's OK. Feel free to update docs as you see fit

Comment on lines 492 to +498
.mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
// Aggregate points with equal features into a single point.
.map(makeUnique)
.flatMap(poolAdjacentViolators)
.collect()
.sortBy(x => (x._2, x._1)) // Sort again because collect() doesn't promise ordering.
// Sort again because collect() doesn't promise ordering.
.sortBy(x => (x._2, x._1))
Contributor Author


I think it is now redundant to sort by labels, since we are already making features unique.
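A tiny illustration of this point with hypothetical data: once duplicate features have been aggregated away, each feature value appears exactly once, so a secondary label key can never influence the sort order:

```python
# (label, feature, weight) triples with unique features, as produced after
# the aggregation step. Sorting by (feature, label) and by feature alone
# must yield identical orderings, since the label key is only consulted
# when two feature values compare equal.
points = [(5.5, 3.0, 2.0), (1.5, 1.0, 2.0), (3.0, 2.0, 2.0)]
by_feature = sorted(points, key=lambda p: p[1])
by_feature_then_label = sorted(points, key=lambda p: (p[1], p[0]))
assert by_feature == by_feature_then_label
```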

Member


You can open up a follow-up PR on the same JIRA if there are minor additional improvements or docs to suggest

@ahmed-mahran
Contributor Author

I'll follow up with a PR for the doc update and for removing the redundant sort key

@AmplabJenkins

Can one of the admins verify this patch?

srowen pushed a commit that referenced this pull request Dec 11, 2022
### What changes were proposed in this pull request?

A follow-up on #38966 to update relevant documentation and remove redundant sort key.

### Why are the changes needed?

For isotonic regression, another method for breaking ties of repeated features was introduced in #38966: points having the same feature value are aggregated by computing the weighted average of their labels.
- This only requires points to be sorted by features instead of features and labels. So, we should remove label as a secondary sorting key.
- Isotonic regression documentation needs to be updated to reflect the new behavior.

### Does this PR introduce _any_ user-facing change?

Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, documentation needs to describe the new behavior.

### How was this patch tested?

Existing tests passed. No need to add new tests since existing tests are already comprehensive.

srowen

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022