Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5995] [ML] Make Prediction dev API public #5913

Closed
wants to merge 4 commits into from

Conversation

jkbradley
Copy link
Member

Changes:

  • Update protected prediction methods, following design doc. <--most interesting change
  • Changed abstract classes for Estimator and Model to be public. Added DeveloperApi tag. (I kept the traits for Estimator/Model Params private.)
  • Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: @mengxr @shivaram @etrain

@SparkQA
Copy link

SparkQA commented May 5, 2015

Test build #31890 has finished for PR 5913 at commit 15b9957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

shivaram commented May 5, 2015

Thanks @jkbradley - Change mostly looks good to me. I also notice you've changed the LogisticRegression to not have the custom transform ? Is that not required anymore ?

@jkbradley
Copy link
Member Author

@shivaram The updated ClassificationModel and ProbabilisticClassificationModel protected prediction methods should be more efficient now, so I don't think a specialized one for LogisticRegression is warranted.
Thanks for checking the PR!

* Find the index of a maximal element. Returns the first maximal element in case of a tie.
* Returns -1 if vector has length 0.
*/
private[spark] def findMax: Int = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findMax -> argmax? I think we can put this in a separate PR or move it to DenseVector only. The problem is with sparse vectors, where we need to return an index with value zero if nonzero elements are negative.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. I'll move it to DenseVector. We've often added helper methods in random places, rather than to linear algebra, utils, or the place they really should belong. I prefer we add these methods directly to the relevant place but make them private.

@jkbradley
Copy link
Member Author

Fixed! Any other comments?

@SparkQA
Copy link

SparkQA commented May 6, 2015

Test build #31945 has finished for PR 5913 at commit e9aa0ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented May 6, 2015

This looks good to me. There are two issues we may need to deal with in the future:

  1. Where to put multi-label classifiers in the class hierarchy?
  2. How to support distributively stored models? Those models could make predictions on individual records if the method call is triggered on the driver node. However, the predict method cannot be used in an RDD closure.

asfgit pushed a commit that referenced this pull request May 6, 2015
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: mengxr shivaram etrain

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5913 from jkbradley/public-dev-api and squashes the following commits:

e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.

(cherry picked from commit 1ad04da)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor

mengxr commented May 6, 2015

Merged into master and branch-1.4. Thanks!

@asfgit asfgit closed this in 1ad04da May 6, 2015
@jkbradley
Copy link
Member Author

Made JIRAs for those 2 items:

  • multilabel abstractions: [https://issues.apache.org/jira/browse/SPARK-7409]
  • distributed models: [https://issues.apache.org/jira/browse/SPARK-7412]

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: mengxr shivaram etrain

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5913 from jkbradley/public-dev-api and squashes the following commits:

e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: mengxr shivaram etrain

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5913 from jkbradley/public-dev-api and squashes the following commits:

e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Changes:
* Update protected prediction methods, following design doc. **<--most interesting change**
* Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
* Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.

CC: mengxr shivaram etrain

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#5913 from jkbradley/public-dev-api and squashes the following commits:

e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
5cda84d [Joseph K. Bradley] regenerated sharedParams
7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.
@jkbradley jkbradley deleted the public-dev-api branch July 25, 2016 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants