Training fails if we have too many features (400+) #3

Open
nruemmele opened this issue Feb 2, 2017 · 1 comment

nruemmele commented Feb 2, 2017

When using char-dist-features + header features for the domain "dbpedia", we get many features (400+). Training a RandomForestClassifier with Spark then fails with the error:

```
Cause: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
```

Apparently, there's a bug in Spark, but it's not clear if there is an easy fix for this problem:
https://issues.apache.org/jira/browse/SPARK-16845
http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe
https://issues.apache.org/jira/browse/SPARK-17092

SparkTestSpec currently reproduces this error.
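One workaround that may sidestep the 64 KB generated-`compare` limit is to collapse the 400+ numeric columns into a single vector column with Spark ML's `VectorAssembler` as early as possible, so that downstream plan stages (sorts, the classifier itself) see one column instead of hundreds. This is only a sketch under assumptions: `df`, the column name `"label"`, and the column-selection logic are hypothetical stand-ins for whatever the feature-extraction pipeline actually produces.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical: `df` is the training DataFrame produced by the
// char-dist + header feature extraction, with one numeric column per
// feature and a "label" column.
val featureCols: Array[String] = df.columns.filter(_ != "label")

// Pack the 400+ feature columns into a single Vector column.
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

// Dropping the original columns keeps generated row-comparison code small.
val assembled = assembler.transform(df).select("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = rf.fit(assembled)
```

Whether this avoids the Janino error depends on where in the plan the oversized `SpecificOrdering` is generated; if a wide sort happens before the assembly step, the assembly would need to be moved ahead of it.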
