[MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm #3519

zapletal-martin · 2014-11-30T23:24:52Z

This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.

The isotonic regression problem is described well in Floudas, Pardalos, Encyclopedia of Optimization, as well as on Wikipedia and Stat Wiki.

Pool adjacent violators was introduced by M. Ayer et al. in 1955. The history and development of isotonic regression algorithms are covered in Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods, and a list of available algorithms with their complexity is given in Stout, Fastest Isotonic Regression Algorithms.

An approach to parallelize the computation of PAV was presented in Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression.
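One simple way to picture a parallel PAV is to pool violators inside each partition of the sorted data and then run the same pooling pass once more over the concatenated partial results. The sketch below illustrates that partition-then-merge idea only; it is not claimed to be the scheme from the cited paper or the code in this PR, the names `pool_violators` and `partitioned_pava` are hypothetical, and the partitions are processed in a plain loop rather than on a cluster. Unit weights and labels already sorted by feature are assumed.

```python
from typing import List, Sequence, Tuple

# A pool is (weighted mean, total weight, number of points it covers).
Pool = Tuple[float, float, int]


def pool_violators(pools: Sequence[Pool]) -> List[Pool]:
    """One PAV pass: merge adjacent pools until the means are non-decreasing."""
    out: List[Pool] = []
    for pool in pools:
        out.append(pool)
        # Merge backwards while the last two pools are out of order.
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2, n2 = out.pop()
            m1, w1, n1 = out.pop()
            w = w1 + w2
            out.append(((m1 * w1 + m2 * w2) / w, w, n1 + n2))
    return out


def partitioned_pava(y: Sequence[float], num_partitions: int) -> List[float]:
    """Hypothetical partition-then-merge PAV over labels sorted by feature."""
    size = max(1, -(-len(y) // num_partitions))  # ceiling division
    local: List[Pool] = []
    for start in range(0, len(y), size):
        chunk = [(float(v), 1.0, 1) for v in y[start:start + size]]
        # In a distributed setting each chunk would be pooled independently.
        local.extend(pool_violators(chunk))
    # A final pass over the compressed pools repairs violations across chunks.
    merged = pool_violators(local)
    return [mean for mean, _, count in merged for _ in range(count)]
```

Because each partial result keeps its total weight, the final pass sees the same weighted means a single sequential pass would, so the two should agree up to floating-point rounding.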

The implemented pool adjacent violators algorithm is based on Floudas, Pardalos, Encyclopedia of Optimization (Chapter Isotonic regression problems, p. 86) and Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods; it is also nicely formulated in Tibshirani, Hoefling, Tibshirani, Nearly-Isotonic Regression.

The implementation itself is inspired by the R implementations Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism and R Development Core Team, stats, 2009. I ran tests with both these libraries and confirmed they yield the same results. More R implementations are referenced in the aforementioned Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods. The implementation is also inspired by and cross-checked against other implementations: Ted Harding, 2007; scikit-learn; Andrew Tulloch, 2014, Julia; Andrew Tulloch, 2014, C++, described in Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x; Fabian Pedregosa, 2012; Sreangsu Acharyya, libpav; and Gustav Larsson.
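For readers unfamiliar with the algorithm, here is a minimal sequential sketch of weighted pool adjacent violators in Python. It is an illustration of the textbook procedure, not the code in this PR; the function name `pava` is hypothetical, and it assumes the labels are already sorted by feature and that all weights are strictly positive.

```python
from typing import List, Optional, Sequence


def pava(y: Sequence[float], w: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted least-squares non-decreasing fit of y (sorted by feature)."""
    if w is None:
        w = [1.0] * len(y)  # unweighted case: every point counts once
    # Each block is [weighted mean, total weight, number of points pooled].
    blocks: List[list] = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input point.
    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

For example, `pava([1, 3, 2, 4])` pools the violating pair `3, 2` into its mean and returns `[1.0, 2.5, 2.5, 4.0]`. An antitonic (non-increasing) fit can be obtained from the same routine by negating the labels before the call and negating the result afterwards.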

AmplabJenkins · 2014-11-30T23:27:10Z

Can one of the admins verify this patch?

mengxr · 2014-12-01T08:00:47Z

add to whitelist

mengxr · 2014-12-01T08:00:51Z

ok to test

mengxr · 2014-12-01T08:00:57Z

test this please

SparkQA · 2014-12-01T08:05:10Z

Test build #23979 has started for PR 3519 at commit 8f5daf9.

This patch merges cleanly.

SparkQA · 2014-12-01T08:06:16Z

Test build #23979 has finished for PR 3519 at commit 8f5daf9.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed trait MonotonicityConstraint
- class IsotonicRegressionModel(
- case class WeightedLabeledPoint(label: Double, features: Vector, weight: Double = 1)

AmplabJenkins · 2014-12-01T08:06:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23979/
Test FAILed.

SparkQA · 2014-12-01T14:10:07Z

Test build #23990 has started for PR 3519 at commit 6046550.

This patch merges cleanly.

SparkQA · 2014-12-01T15:35:45Z

Test build #23990 has finished for PR 3519 at commit 6046550.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sealed trait MonotonicityConstraint
- class IsotonicRegressionModel(
- case class WeightedLabeledPoint(label: Double, features: Vector, weight: Double = 1)

AmplabJenkins · 2014-12-01T15:35:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23990/
Test PASSed.

mengxr · 2014-12-15T23:48:46Z

@zapletal-martin Some high-level comments:

1. The implementation introduces APIs that are not necessary for this PR, for example WeightedLabeledPoint and MonotonicityConstraint. It should be sufficient to have only IsotonicRegression and IsotonicRegressionModel. Please try to make the public APIs minimal.
2. I think it would be better if IsotonicRegression takes an RDD[(Double, Double)] instead of an RDD[WeightedLabeledPoint] and uses the natural ordering. This fits the new pipeline API easily, where IsotonicRegression expects 2 or 3 (with weight) columns and the model applies to a single column.

zapletal-martin · 2014-12-29T17:41:58Z

@mengxr , thank you very much for your feedback.

Sure, I will do it.
a) Can you please clarify whether you are suggesting RDD[(Double, Double, Double)], i.e. label, feature, weight, or RDD[(Double, Double)], i.e. just label and weight, with the data expected to be already ordered? Also, I assume there should be an API with the weight defaulting to 1 (so the user does not have to specify it).

b) IsotonicRegressionModel extends RegressionModel. It implements the methods predict(testData: RDD[Vector]) and predict(testData: Vector). Are these still relevant if we implement the changes in 1)? There would never be a Vector, just a Double. Also, we would need the feature in 1) to be able to predict the label.

c) What do you expect the Java API to look like? Unfortunately the Java/Scala interop is not very helpful here. When the train method expects a tuple of scala.Double, calling it from Java gives:

[error] IsotonicRegressionModel model = IsotonicRegression.train(testRDD.rdd(), true);
[error] ^
[error] required: RDD<Tuple3<Object,Object,Object>>,boolean
[error] found: RDD<Tuple3<Double,Double,Double>>,boolean
[error] reason: actual argument RDD<Tuple3<Double,Double,Double>> cannot be converted to RDD<Tuple3<Object,Object,Object>> by method invocation conversion

There are solutions to this problem, but most of them are quite ugly. See for example http://stackoverflow.com/questions/17071061/scala-java-interoperability-how-to-deal-with-options-containing-int-long-primi or http://www.scala-notes.org/2011/04/specializing-for-primitive-types/.

Is there another public Java API that uses a primitive type in a generic that I could use as a reference?

mengxr · 2014-12-29T19:04:38Z

2a) (label: Double, feature: Double, weight: Double) sounds good to me. We may add weight support to LabeledPoint as part of SPARK-3702, which should be orthogonal to this PR. We can update the API here (before 1.3) once that gets merged.

2b) Isotonic regression is a univariate regression algorithm. It is not necessary to have its model extend RegressionModel. It should have predict(RDD[Double]) and predict(Double).

2c) Try train(JavaPairRDD<java.lang.Double, java.lang.Double>)
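To make the univariate model from 2b) concrete, the sketch below shows one way a model with only a single-value predict and a batch predict could look. Everything here is an assumption for illustration: the class name `UnivariateIsotonicModel`, the `boundaries`/`predictions` fields, and the choice to interpolate linearly between fitted points and clamp outside their range are not taken from this thread. Boundaries are assumed to be sorted in increasing order.

```python
from bisect import bisect_right
from typing import Iterable, List, Sequence


class UnivariateIsotonicModel:
    """Hypothetical model: sorted feature boundaries and their fitted labels."""

    def __init__(self, boundaries: Sequence[float], predictions: Sequence[float]):
        if not boundaries or len(boundaries) != len(predictions):
            raise ValueError("boundaries and predictions must be non-empty and equal length")
        self.boundaries = list(boundaries)
        self.predictions = list(predictions)

    def predict(self, x: float) -> float:
        """Predict a label for one feature value."""
        b, p = self.boundaries, self.predictions
        if x <= b[0]:
            return p[0]  # clamp below the fitted range
        if x >= b[-1]:
            return p[-1]  # clamp above the fitted range
        i = bisect_right(b, x)  # now b[i - 1] <= x < b[i]
        x0, x1, y0, y1 = b[i - 1], b[i], p[i - 1], p[i]
        # Linear interpolation between the two neighbouring fitted points.
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    def predict_all(self, xs: Iterable[float]) -> List[float]:
        """Batch counterpart of predict, standing in for predict(RDD[Double])."""
        return [self.predict(x) for x in xs]
```

Since the input is a single Double rather than a Vector, nothing in this shape depends on a general regression-model interface, which is the point made in 2b).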

SparkQA · 2015-01-30T20:03:31Z

Test build #26416 has finished for PR 3519 at commit 75eac55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-30T20:03:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26416/
Test PASSed.

SparkQA · 2015-01-30T20:17:33Z

Test build #26417 has finished for PR 3519 at commit 3da56e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IsotonicRegressionModel (

AmplabJenkins · 2015-01-30T20:17:37Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26417/
Test PASSed.

SparkQA · 2015-01-30T22:42:35Z

Test build #26433 has started for PR 3519 at commit ded071c.

This patch merges cleanly.

SparkQA · 2015-01-30T22:47:52Z

Test build #26434 has started for PR 3519 at commit e3c0e44.

This patch merges cleanly.

SparkQA · 2015-01-30T23:36:22Z

Test build #26433 has finished for PR 3519 at commit ded071c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-30T23:36:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26433/
Test FAILed.

SparkQA · 2015-01-30T23:40:07Z

Test build #26434 has finished for PR 3519 at commit e3c0e44.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IsotonicRegressionModel (

AmplabJenkins · 2015-01-30T23:40:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26434/
Test FAILed.

SparkQA · 2015-01-31T07:02:53Z

Test build #26458 has started for PR 3519 at commit 5a54ea4.

This patch merges cleanly.

SparkQA · 2015-01-31T08:08:06Z

Test build #26458 has finished for PR 3519 at commit 5a54ea4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IsotonicRegressionModel (

AmplabJenkins · 2015-01-31T08:08:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26458/
Test PASSed.

mengxr · 2015-01-31T08:46:26Z

LGTM. Merged into master. Thanks!!

zapletal-martin added 5 commits November 19, 2014 00:06

SPARK-3278 added initial version of Isotonic regression algorithm including proposed API

3de71d0

Merge remote-tracking branch 'upstream/master' into SPARK-3278

961aa05

SPARK-3278 isotonic regression refactoring and api changes

05d9048

SPARK-3278 added isotonic regression for weighted data. Added tests for Java api

629a1ce

SPARK-3278 added comments and cleaned up api to consistently handle weights

8f5daf9

SPARK-3278 scalastyle errors resolved

6046550

zapletal-martin added 4 commits December 27, 2014 11:51

Merge remote-tracking branch 'upstream/master' into SPARK-3278

c06f88c

Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replaced by simple boolean

089bf86

Removed WeightedLabeledPoint. Replaced by tuple of doubles

34760d5

Removed WeightedLabeledPoint. Replaced by tuple of doubles

b8b1620

zapletal-martin added 7 commits December 30, 2014 11:19

SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments

cab5a46

Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint

8cefd18

SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api

deb0f17

SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api

a24e29f

SPARK-3278 Isotonic regression java api

941fd1f

Merge remote-tracking branch 'upstream/master' into SPARK-3278

823d803

Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278

e9b3323

mengxr added 3 commits January 30, 2015 11:43

Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278

5925113

add unit test for model construction

05422a8

minor

077606b

update paraPAVA

35d044e

mengxr and others added 4 commits January 30, 2015 13:22

compress pools and update tests

0b35c15

add cache back

4dfe136

Merge pull request #1 from mengxr/SPARK-3278

ded071c

Update isotonic regression

Merge remote-tracking branch 'upstream/master' into SPARK-3278

d8feb82

Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278

e3c0e44

mengxr and others added 2 commits January 30, 2015 15:59

fix java tests

37ba24e

Merge pull request #2 from mengxr/isotonic-fix-java

5a54ea4

fix java tests

asfgit closed this in 34250a6 Jan 31, 2015

ahmed-mahran mentioned this pull request Dec 8, 2022

[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features #38966

Closed
