[SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree #2785

Closed
wants to merge 7 commits into from

jkbradley
Member

SPARK-3934: RandomForest fails on multiclass classification when run with a mix of unordered categorical and continuous features. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
Fix: Remove the sanity checks. They are not really needed, and keeping them would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).
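
To make the index mismatch concrete, here is a minimal sketch in plain Scala. It is not the DTStatsAggregator code. It assumes that features are addressed by their position within a node's feature subset while the set of unordered features is keyed by global feature index; the names unorderedFeatures, featureSubset, isUnorderedBuggy and isUnorderedCorrect are hypothetical.

```scala
object FeatureIndexSketch {
  // Hypothetical metadata: global indices of unordered categorical features.
  val unorderedFeatures: Set[Int] = Set(3)

  // Hypothetical feature subset sampled for one node (global indices).
  val featureSubset: Array[Int] = Array(1, 3, 4)

  // Buggy check: treats a position within the subset as a global index.
  def isUnorderedBuggy(indexInSubset: Int): Boolean =
    unorderedFeatures.contains(indexInSubset)

  // Correct check: map the subset position back to the global index first.
  // This is the extra set of indices the check would need access to.
  def isUnorderedCorrect(indexInSubset: Int): Boolean =
    unorderedFeatures.contains(featureSubset(indexInSubset))

  def main(args: Array[String]): Unit = {
    // Position 1 in the subset is global feature 3, which is unordered.
    println(isUnorderedBuggy(1))   // false
    println(isUnorderedCorrect(1)) // true
  }
}
```

A check of the first kind can reject a valid request whenever the subset position and the global index disagree, which is the kind of failure a sanity check on the wrong indices would produce.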

Added a test to RandomForestSuite that failed with the old version but now passes.

SPARK-3918: Added baggedInput.unpersist at end of training.

Also:

  • I removed DTStatsAggregator.isUnordered since it is no longer used.
  • DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
  • Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data. This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
  • RandomForestModel: Updated toString to print total number of nodes in forest.
  • Changed Predict class to be public DeveloperApi. This was necessary to allow users to create their own trees by hand (for testing).
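
The DecisionTreeRunner item above can be illustrated with a small sketch in plain Scala with no Spark dependency. It assumes LibSVM-style input in which the feature count is inferred from the largest index seen, so a test set that happens to lack the highest-indexed feature ends up with fewer dimensions than the training set. The names parse, inferNumFeatures and toDense are hypothetical; this is not the DecisionTreeRunner code.

```scala
object FeatureCountSketch {
  type Sparse = Seq[(Int, Double)]

  // Parse one line of the form "label idx:val idx:val ..." (1-based indices).
  def parse(line: String): (Double, Sparse) = {
    val tokens = line.trim.split("\\s+")
    val features = tokens.tail.toSeq.map { t =>
      val parts = t.split(":")
      (parts(0).toInt - 1, parts(1).toDouble) // convert to 0-based index
    }
    (tokens.head.toDouble, features)
  }

  // Infer dimensionality as (largest 0-based index seen) + 1.
  def inferNumFeatures(data: Seq[(Double, Sparse)]): Int =
    data.flatMap(_._2.map(_._1)).foldLeft(-1)((a, b) => math.max(a, b)) + 1

  // Densify to an explicitly given dimension instead of an inferred one.
  def toDense(features: Sparse, numFeatures: Int): Array[Double] = {
    val dense = new Array[Double](numFeatures)
    features.foreach { case (i, v) =>
      require(i < numFeatures, s"feature index $i out of range for $numFeatures features")
      dense(i) = v
    }
    dense
  }

  def main(args: Array[String]): Unit = {
    val training = Seq("1 1:0.5 3:2.0", "0 2:1.5").map(parse)
    val test = Seq("0 2:1.0").map(parse)

    val numFeatures = inferNumFeatures(training)
    println(numFeatures)            // 3
    println(inferNumFeatures(test)) // 2: inferred separately, the sizes disagree

    // Fix the test data to the training dimensionality.
    val fixed = test.map { case (label, f) => (label, toDense(f, numFeatures)) }
    println(fixed.head._2.length)   // 3
  }
}
```

Passing the training feature count explicitly only aligns the vector sizes. It does not reconcile categorical values that appear in one dataset but not the other, which is why pre-indexing both datasets is the longer-term fix.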

CC: @mengxr @manishamde @chouqin @codedeft Just notifying you of these small bug fixes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21695/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21700/
Test FAILed.

@chouqin
Contributor

chouqin commented Oct 14, 2014

@jkbradley Thanks for the PR! It looks good to me.

@manishamde
Contributor

@jkbradley Thanks. LGTM!

@mengxr
Contributor

mengxr commented Oct 14, 2014

test this please

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2785 at commit e116473.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2785 at commit e116473.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Predict(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21715/
Test PASSed.

@@ -175,6 +175,7 @@ private class RandomForest (
treeToNodeToIndexInfo, splits, bins, nodeQueue, timer)
timer.stop("findBestSplits")
}
baggedInput.unpersist()
Contributor

Could you try to merge master? This is already covered in #2775.

@jkbradley
Member Author

@chouqin @manishamde @mengxr Thanks for taking a look! I think stuff is fixed.

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have started for PR 2785 at commit 9132321.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 17, 2014

QA tests have finished for PR 2785 at commit 9132321.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Predict(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21864/
Test PASSed.

@mengxr
Contributor

mengxr commented Oct 17, 2014

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 477c648 Oct 17, 2014
@jkbradley jkbradley deleted the dtrunner-update branch December 4, 2014 20:30
6 participants