[MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm #3519

Closed
wants to merge 44 commits into from

zapletal-martin
Contributor

This PR introduces an API for isotonic regression and one algorithm that implements it, pool adjacent violators (PAV).

The isotonic regression problem is described in Floudas, Pardalos, Encyclopedia of Optimization, on Wikipedia, and on Stat Wiki.

Pool adjacent violators was introduced by M. Ayer et al. in 1955. The history and development of isotonic regression algorithms is covered in Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods, and a list of available algorithms, including their complexity, is given in Stout, Fastest Isotonic Regression Algorithms.

An approach to parallelize the computation of PAV was presented in Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression.

The implemented pool adjacent violators algorithm is based on Floudas, Pardalos, Encyclopedia of Optimization (chapter Isotonic regression problems, p. 86) and Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods. It is also nicely formulated in Tibshirani, Hoefling, Tibshirani, Nearly-Isotonic Regression. The implementation itself is inspired by the R implementations Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism and R Development Core Team, stats, 2009. I ran tests with both of these libraries and confirmed they yield the same results. More R implementations are referenced in the aforementioned Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods. The implementation is also inspired by and cross-checked against other implementations: Ted Harding, 2007, scikit-learn, Andrew Tulloch, 2014, Julia, Andrew Tulloch, 2014, C++, described in Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x, Fabian Pedregosa, 2012, Sreangsu Acharyya, libpav and Gustav Larsson.
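
For readers unfamiliar with the algorithm, the following is a minimal sketch of weighted pool adjacent violators in plain Java, together with the two-stage "fit each partition, then fit the concatenated result" idea behind the parallel variant. It is not the code from this PR: the class name `PavaSketch`, the method names, and the array-based representation are all hypothetical, and the input is assumed to be already ordered by feature with positive weights.

```java
import java.util.Arrays;

class PavaSketch {

    /** Non-decreasing fit minimizing sum of w[i] * (y[i] - fit[i])^2 (input ordered by feature). */
    static double[] pava(double[] y, double[] w) {
        int n = y.length;
        double[] mean = new double[n];   // weighted mean of each block
        double[] weight = new double[n]; // total weight of each block
        int[] size = new int[n];         // number of points in each block
        int blocks = 0;

        for (int i = 0; i < n; i++) {
            // Start a new block holding just point i.
            mean[blocks] = y[i];
            weight[blocks] = w[i];
            size[blocks] = 1;
            blocks++;
            // Pool backwards while the last two blocks violate monotonicity.
            while (blocks > 1 && mean[blocks - 2] > mean[blocks - 1]) {
                double total = weight[blocks - 2] + weight[blocks - 1];
                mean[blocks - 2] = (mean[blocks - 2] * weight[blocks - 2]
                        + mean[blocks - 1] * weight[blocks - 1]) / total;
                weight[blocks - 2] = total;
                size[blocks - 2] += size[blocks - 1];
                blocks--;
            }
        }

        // Expand the blocks back to one fitted value per input point.
        double[] fit = new double[n];
        int pos = 0;
        for (int b = 0; b < blocks; b++) {
            for (int k = 0; k < size[b]; k++) {
                fit[pos++] = mean[b];
            }
        }
        return fit;
    }

    /** Two-stage variant: fit each chunk on its own, then fit the concatenated partial fits. */
    static double[] pavaPartitioned(double[] y, double[] w, int chunkSize) {
        int n = y.length;
        double[] partial = new double[n];
        for (int start = 0; start < n; start += chunkSize) {
            int end = Math.min(n, start + chunkSize);
            double[] chunkFit = pava(Arrays.copyOfRange(y, start, end),
                                     Arrays.copyOfRange(w, start, end));
            System.arraycopy(chunkFit, 0, partial, start, chunkFit.length);
        }
        // The final pass runs over all points again, using the original weights.
        return pava(partial, w);
    }

    public static void main(String[] args) {
        double[] y = {1, 3, 2, 4, 3, 5};
        double[] w = {1, 1, 1, 1, 1, 1};
        System.out.println(Arrays.toString(pava(y, w)));               // [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
        System.out.println(Arrays.toString(pavaPartitioned(y, w, 3))); // [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
    }
}
```

In this sketch the chunks are processed in a sequential loop; in a distributed setting each chunk would presumably be fitted on its own partition before the final pass.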

@AmplabJenkins

Can one of the admins verify this patch?

@mengxr
Contributor

mengxr commented Dec 1, 2014

add to whitelist

@mengxr
Contributor

mengxr commented Dec 1, 2014

ok to test

@mengxr
Contributor

mengxr commented Dec 1, 2014

test this please

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23979 has started for PR 3519 at commit 8f5daf9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23979 has finished for PR 3519 at commit 8f5daf9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait MonotonicityConstraint
    • class IsotonicRegressionModel(
    • case class WeightedLabeledPoint(label: Double, features: Vector, weight: Double = 1)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23979/
Test FAILed.

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23990 has started for PR 3519 at commit 6046550.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 1, 2014

Test build #23990 has finished for PR 3519 at commit 6046550.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sealed trait MonotonicityConstraint
    • class IsotonicRegressionModel(
    • case class WeightedLabeledPoint(label: Double, features: Vector, weight: Double = 1)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23990/
Test PASSed.

@mengxr
Contributor

mengxr commented Dec 15, 2014

@zapletal-martin Some high-level comments:

  1. The implementation introduces APIs that are not necessary for this PR. For example, WeightedLabeledPoint and MonotonicityConstraint. It should be sufficient to have only IsotonicRegression and IsotonicRegressionModel. Please try to make the public APIs minimal.
  2. I think it would be better if IsotonicRegression took an RDD[(Double, Double)] instead of an RDD[WeightedLabeledPoint] and used the natural ordering. This fits the new pipeline API well, where IsotonicRegression expects 2 or 3 (with weight) columns and the model applies to a single column.

@zapletal-martin
Contributor Author

@mengxr, thank you very much for your feedback.

  1. Sure, I will do it.
  2. a) Can you please clarify whether you are suggesting RDD[(Double, Double, Double)], i.e. label, feature, weight, or RDD[(Double, Double)], i.e. just label, weight, with the data expected to be already ordered? I also assume there should be an API where the weight defaults to 1 (so the user does not have to specify it).

b) IsotonicRegressionModel extends RegressionModel. It implements methods predict(testData: RDD[Vector]) and predict(testData: Vector). Are these still relevant if we implement the changes in 1)? There would never be a Vector, just Double. Also we would need feature in 1) to be able to predict label.

c) What do you expect the Java API to look like? Unfortunately the Java/Scala interop is not very helpful here. When the train method expects a tuple of scala.Double, calling it from Java gives:

[error] IsotonicRegressionModel model = IsotonicRegression.train(testRDD.rdd(), true);
[error] ^
[error] required: RDD<Tuple3<Object,Object,Object>>,boolean
[error] found: RDD<Tuple3<Double,Double,Double>>,boolean
[error] reason: actual argument RDD<Tuple3<Double,Double,Double>> cannot be converted to RDD<Tuple3<Object,Object,Object>> by method invocation conversion

There are solutions to this problem, but most of them are quite ugly. See for example http://stackoverflow.com/questions/17071061/scala-java-interoperability-how-to-deal-with-options-containing-int-long-primi or http://www.scala-notes.org/2011/04/specializing-for-primitive-types/.

Is there another public Java API that uses a primitive type in a generic that I could use as a reference?
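
The compiler error quoted above comes down to Java generics being invariant: a container of `Double` is not a container of `Object`, even though `Double` is a subtype of `Object`. As the error message shows, javac sees the Scala element types as `Object`. The sketch below reproduces the same situation in plain Java, using `List` as a stand-in for `RDD`; the class `InvarianceSketch` and its `train*` methods are hypothetical and only illustrate the type rule, not any Spark API.

```java
import java.util.ArrayList;
import java.util.List;

class InvarianceSketch {

    // Stand-in for a method whose element type appears as Object to Java callers.
    static int trainObjects(List<Object> data) {
        return data.size();
    }

    // Stand-in for a Java-friendly entry point declared with boxed java.lang.Double.
    static int trainDoubles(List<Double> data) {
        return data.size();
    }

    // Wildcard variant: accepts a list with any element type.
    static int trainAny(List<?> data) {
        return data.size();
    }

    public static void main(String[] args) {
        List<Double> labels = new ArrayList<>();
        labels.add(1.0);
        labels.add(2.0);

        // trainObjects(labels); // does not compile: List<Double> is not a List<Object>

        System.out.println(trainDoubles(labels)); // 2
        System.out.println(trainAny(labels));     // 2

        // An unchecked cast also works, but it is the kind of workaround one would rather avoid.
        @SuppressWarnings("unchecked")
        List<Object> erased = (List<Object>) (List<?>) labels;
        System.out.println(trainObjects(erased)); // 2
    }
}
```

A separate entry point declared with boxed Java types avoids the cast on the caller's side.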

@mengxr
Contributor

mengxr commented Dec 29, 2014

2a) (label: Double, feature: Double, weight: Double) sounds good to me. We may add weight support to LabeledPoint as part of SPARK-3702, which should be orthogonal to this PR. We can update the API here (before 1.3) once that gets merged.

2b) Isotonic regression is a univariate regression algorithm. It is not necessary to have its model extend RegressionModel. It should have predict(RDD[Double]) and predict(Double).

2c) Try train(JavaPairRDD<java.lang.Double, java.lang.Double>)
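
To illustrate point 2b, here is a rough sketch of what a univariate model with `predict(double)` and a batch `predict` could look like in plain Java. The thread only names the method signatures; the class name `UnivariateModelSketch`, the `boundaries`/`predictions` fields, and in particular the prediction rule (linear interpolation between neighbouring fitted points, clamping outside the fitted range) are assumptions made for this example, not something the discussion above specifies.

```java
import java.util.Arrays;

class UnivariateModelSketch {
    private final double[] boundaries;  // feature values, assumed sorted and distinct
    private final double[] predictions; // fitted labels, assumed non-decreasing

    UnivariateModelSketch(double[] boundaries, double[] predictions) {
        this.boundaries = boundaries.clone();
        this.predictions = predictions.clone();
    }

    /** Predicts the label for a single feature value. */
    double predict(double feature) {
        int idx = Arrays.binarySearch(boundaries, feature);
        if (idx >= 0) {
            return predictions[idx]; // exact match with a fitted point
        }
        int insert = -idx - 1; // index of the first boundary greater than feature
        if (insert == 0) {
            return predictions[0]; // below the fitted range: clamp
        }
        if (insert == boundaries.length) {
            return predictions[predictions.length - 1]; // above the fitted range: clamp
        }
        // Linear interpolation between the two neighbouring fitted points (assumed rule).
        double x0 = boundaries[insert - 1], x1 = boundaries[insert];
        double y0 = predictions[insert - 1], y1 = predictions[insert];
        return y0 + (y1 - y0) * (feature - x0) / (x1 - x0);
    }

    /** Batch variant, standing in for predict(RDD[Double]). */
    double[] predict(double[] features) {
        double[] out = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            out[i] = predict(features[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        UnivariateModelSketch model = new UnivariateModelSketch(
                new double[]{1, 2, 4}, new double[]{1, 3, 3});
        System.out.println(model.predict(1.5)); // 2.0
        System.out.println(Arrays.toString(model.predict(new double[]{0, 3, 5}))); // [1.0, 3.0, 3.0]
    }
}
```

Because the model works on a single `double` feature, nothing in it needs a `Vector`, which is the point made in 2b.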

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26416 has finished for PR 3519 at commit 75eac55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26416/
Test PASSed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26417 has finished for PR 3519 at commit 3da56e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IsotonicRegressionModel (

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26417/
Test PASSed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26433 has started for PR 3519 at commit ded071c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26434 has started for PR 3519 at commit e3c0e44.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26433 has finished for PR 3519 at commit ded071c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26433/
Test FAILed.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26434 has finished for PR 3519 at commit e3c0e44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IsotonicRegressionModel (

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26434/
Test FAILed.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26458 has started for PR 3519 at commit 5a54ea4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26458 has finished for PR 3519 at commit 5a54ea4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IsotonicRegressionModel (

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26458/
Test PASSed.

@mengxr
Contributor

mengxr commented Jan 31, 2015

LGTM. Merged into master. Thanks!!
