[SPARK-7160][SQL] Support converting DataFrames to typed RDDs. #5713
Conversation
Is there a way to make the code paths for Scala types (ScalaReflection) and Java types (JavaTypeInference) more consistent?
Hi @punya. Re: Scala and Java types, are you talking about: a) the code similarities between ScalaReflection and JavaTypeInference, or b) the general approach for Scala/Java-Catalyst conversions? I can see that ScalaReflection and JavaTypeInference do similar things, but I don't know enough about the design to know whether this is ripe for refactoring. If it hasn't been done or scheduled, maybe someone should do a survey of the current state of things.
Now I see a thread on spark-dev (which I just joined) about availability/discoverability of design docs in general. :D If there isn't one for this conversion stuff, it is probably a candidate for a design doc.
ok to test
Test build #31174 has finished for PR 5713 at commit
Thanks for doing this. We are fairly busy with the 1.4 release deadline. I'd like to revisit this pull request in Spark 1.5, in the context of making user-defined types and closures more usable in DataFrames.
Thanks for the update, Reynold! I'm just glad it's in the queue. :D Looking forward to the 1.4 release and will keep an eye out for any feedback on this PR...
Just a heads up: I have a WIP patch which performs a significant refactoring of CatalystTypeConverters (#6222). I think that my patch should make the changes here easier to implement / understand, so I'd love to get your feedback on my proposed refactorings.
@JoshRosen: As you saw from my comment on #6222, I think it looks good. As for this PR, yeah, it should be re-implemented on top of your patch. I think the conversion would still use the type hints, given by:

def toScala(universeType: universe.Type, @Nullable catalystValue: CatalystType): ScalaType

or, given a default param, e.g.:

def toScala(
    @Nullable catalystValue: CatalystType,
    universeType: Option[universe.Type] = None
): ScalaType

In any case, looks doable and should be cleaner. If you like, I can update this PR after you merge your patch.
Force-pushed from add51b6 to 3711a3e.
Jenkins, this is ok to test. |
Test build #34456 has finished for PR 5713 at commit
Force-pushed from 3711a3e to 0ee742d.
Test build #41415 has finished for PR 5713 at commit
Hi @marmbrus, just repeating what I wrote at https://issues.apache.org/jira/browse/SPARK-7160: I updated this PR to bring it in sync with master. I rebased yesterday and integrated with the refactored CatalystTypeConverters. Thanks!
val converter =
  CatalystTypeConverters.createToProductConverter[G](dataType.asInstanceOf[StructType])
converter(row.asInstanceOf[InternalRow])
}
This is also kind of a nit, but could you limit the scope of the intercept? I think this is right, but with the extra casts I'm worried the wrong thing might be throwing the ClassCastException.
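To make the suggestion concrete, here is a minimal ScalaTest sketch (hypothetical code, not the PR's actual test) of narrowing the scope of intercept so that only the cast expected to fail sits inside it:

import org.scalatest.FunSuite

class InterceptScopeSuite extends FunSuite {
  test("only the failing cast lives inside intercept") {
    val prepared: Any = "not an Int"              // setup stays outside intercept
    intercept[ClassCastException] {
      prepared.asInstanceOf[java.lang.Integer]    // only the cast expected to fail
    }
  }
}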
Sorry for the unreasonable delay reviewing this. Overall it looks pretty good to me. I have a few small comments, and there is a conflict (which is fortunately easy to resolve this time).
Actually, I have one other concern. It doesn't seem that this works in the spark shell.
Unfortunately, I'm not sure if there is an easy way around this.
@@ -887,4 +887,98 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    .select(struct($"b"))
    .collect()
}
test("SPARK-7160: toTypedRDD[T] works with simple case class") { |
We should probably move this into its own suite.
After spending some more time trying to fix the above issue, I wonder if it wouldn't be better to make this a whole separate code path, instead of trying to handle both cases inside of the catalyst converters. That class is already pretty hard to follow, and adding a whole set of branches everywhere makes it even worse. Another concern is that the extra branches might affect the performance of the other case. Thoughts?
First off, thanks for the code review, Michael. Re: spark-shell, I thought I'd bumped into a known limitation of the REPL (e.g. https://groups.google.com/forum/#!msg/spark-users/bwAmbUgxWrA/HwP4Nv4adfEJ) but I should've raised it, so sorry. The workaround (in my head) was that people should just not use case classes defined within the REPL. Not sure what to do here; I'll have to think about it. Re: a separate code path, I actually had it that way when I submitted the first version of this PR. But then Josh refactored the converters, and it seemed reasonable to follow his refactored code paths, since it was easier to track (and probably maintain) the conversions for each type, so that was the second (previous) version of this PR. I can see arguments either way. If we don't expect to add more converters/conversions, then maybe a separate code path would be fine.
Regarding the REPL issues, I'm wondering if there isn't some way to handle the common case, where the outer pointer that is getting added to the class isn't actually needed for anything. I was hoping there would be some way to just pass null for it. For the separate code path issue: my biggest motivation here is that I see a lot of things that aren't great from a performance perspective (the old code is the same). I think if this becomes popular we are going to have to code-generate the conversion functions. Having it separate would make this transition easier.
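To illustrate the outer-pointer issue under discussion, here is a standalone sketch (not code from this PR): a case class nested inside an enclosing class, which is roughly what the Spark REPL does to classes defined at the prompt, gets a synthetic reference to its enclosing instance, so its constructor takes a hidden extra argument that reflection-based construction has to supply.

class Enclosing {
  // Nested the way REPL-defined case classes effectively are.
  case class Point(x: Int, y: Int)
}

object OuterPointerDemo {
  def main(args: Array[String]): Unit = {
    val ctor = classOf[Enclosing#Point].getConstructors.head
    // Prints the hidden outer parameter first, e.g.: Enclosing, int, int
    println(ctor.getParameterTypes.map(_.getSimpleName).mkString(", "))
  }
}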
Got it. I'll refactor back to a separate code path then (and address the other issues you identified). After that I'll try to figure out the REPL stuff. If you figure it out, then great. :D Hope that works.
Force-pushed from 0ee742d to 6d4bec2.
Hi @marmbrus, I updated this PR to use a separate code path as you requested. I don't think I can solve the REPL thing, and I'm at the limits of my understanding of Scala reflection. fwiw, json4s seems to recommend just compiling case classes out-of-band and then importing them into the REPL, for functionality similar to what this PR provides.
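For concreteness, a sketch of that out-of-band workaround run inside spark-shell; the jar name, package, case class, and input file are hypothetical, and toTypedRDD[T] is the method this PR proposes on DataFrame:

// Started as:  spark-shell --jars my-records.jar   (jar with case classes compiled out-of-band)
import com.example.records.MyRecord                 // hypothetical precompiled case class

val df = sqlContext.read.json("records.json")       // any DataFrame whose schema matches MyRecord
val typed = df.toTypedRDD[MyRecord]                 // the conversion proposed in this PR
typed.take(5).foreach(println)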
btw I force-pushed when updating this PR, since the diff would've otherwise looked weird. I rebased on a commit from 9/25.
val message =
  s"""|Error constructing ${classTag.runtimeClass.getName}: ${e.getMessage};
      |paramTypes: ${paramTypes}, dataTypes: ${dataTypes},
      |convertedArgs: ${convertedArgs}""".stripMargin.replace("\n", " ")
Addressing feedback, this error message provides more details when things go wrong calling the constructor.
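For context on where such a message fits, here is a self-contained sketch of the general pattern (reflective construction with a detailed failure message); the helper, the example case class, and the exception type are illustrative assumptions, not the PR's actual code:

import scala.reflect.ClassTag

object ConstructViaReflection {
  // Build an instance of T from already-converted argument values, and surface
  // the class name and arguments if construction fails.
  def construct[T](args: Seq[AnyRef])(implicit classTag: ClassTag[T]): T = {
    val ctor = classTag.runtimeClass.getConstructors.head
    try {
      ctor.newInstance(args: _*).asInstanceOf[T]
    } catch {
      case e: Exception =>
        val message =
          s"""|Error constructing ${classTag.runtimeClass.getName}: ${e.getMessage};
              |args: ${args.mkString(", ")}""".stripMargin.replace("\n", " ")
        throw new IllegalArgumentException(message, e)
    }
  }

  case class Person(name: String, age: java.lang.Integer)

  def main(args: Array[String]): Unit = {
    println(construct[Person](Seq("ann", Int.box(30))))      // Person(ann,30)
    // construct[Person](Seq("bob", "oops")) would throw with the detailed message
  }
}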
Test build #43055 has finished for PR 5713 at commit
Hi @rayortigas, I have been working on a similar feature. So far I have something that works similarly but is a little less easy to use:

case class DemoCC(int: Int, boolean: Boolean, string: Option[String]) extends Serializable

object Demo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new SQLContext(sc)
    val inputData: Seq[Row] = Seq(
      Row(true, 1, "A", 2.0),
      Row(true, 2, null, 4.0),
      Row(false, 3, "C", 9.0)
    )
    val schema = StructType(Seq(
      StructField("boolean", BooleanType),
      StructField("int", IntegerType),
      StructField("string", StringType),
      StructField("double", DoubleType)
    ))
    val rdd: RDD[Row] = sc.parallelize(inputData, 3)
    val df: DataFrame = sqlContext.createDataFrame(rdd, schema)
    /* The permutationPlan can be serialized, so we generate it once and for all on the driver. This will perform a preliminary check as well. */
    val permutationPlan = PermutationPlan[DemoCC](df)
    /* The transformer cannot be serialized because TypeTag is not (really) serializable in Scala 2.10. */
    @transient lazy val transformer = new RowToCaseClassTransformer[DemoCC](permutationPlan)
    /* Using "df.map(transformer)" instead would not work... */
    val res = df.map { r => transformer(r) }
    res.collect.foreach(println)
  }
}

I tried implementing … On the other hand, I tried to be as generic as possible, and my implementation supports … I've just found this pull request and would love to contribute, but I am not sure how to proceed. I would love to discuss this further with you, though I am not sure this is the most suitable place to do so.

Regards,
Furcy

PS: thanks for the …
Hey, I'm really sorry for letting this sit so long. I got wrapped up trying to get ready for Spark 1.6. It would be great if you could look at SPARK-9999, which adds a new method for doing this kind of typed conversion.
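For readers arriving later, a hedged sketch of the Spark 1.6-era Dataset route that comment points to (names assume the 1.6 API, not this PR): df.as[T] gives a typed Dataset, which plays roughly the role toTypedRDD[T] plays here.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object DatasetDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DatasetDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Person("ann", 30), Person("bob", 25))).toDF()
    val people = df.as[Person]           // typed Dataset[Person]
    people.collect().foreach(println)    // Person(ann,30), Person(bob,25)
    sc.stop()
  }
}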
case class A(x: String, y: Int)
case class B(a: A, z: Double)
case class C(x: String, a: Seq[A])
case class D(x: String, a: Map[Int, A])
If there are any test cases here that aren't covered by ExpressionEncoderSuite it would be awesome to add them.
Now that Spark 1.6 is almost released, I think we can close this issue. Thanks again for working on it.
https://issues.apache.org/jira/browse/SPARK-7160
databricks/spark-csv#52
cc:
@rxin (who made the original suggestion)
@vlyubin #5279
@punya #5578
@davies #5350
@marmbrus (ScalaReflection and more)