
[SPARK-2179][SQL] Public API for DataTypes and Schema #1346

Closed
wants to merge 47 commits

Conversation

yhuai
Contributor

@yhuai yhuai commented Jul 9, 2014

The current PR contains the following changes:

  • Expose DataTypes in the sql package (internal details are private to sql).
  • Users can create Rows.
  • Introduce applySchema to create a SchemaRDD by applying a schema: StructType to an RDD[Row].
  • Add a function simpleString to every DataType. Also, the schema represented by a StructType can be visualized by printSchema.
  • ScalaReflection.typeOfObject provides a way to infer the Catalyst data type based on an object. Also, we can compose typeOfObject with custom logic to form a new function to infer the data type (for different use cases).
  • JsonRDD has been refactored to use changes introduced by this PR.
  • Add a field containsNull to ArrayType. So, we can explicitly mark if an ArrayType can contain null values. The default value of containsNull is false.
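
The `typeOfObject` hook above lends itself to composition. A minimal sketch, assuming `ScalaReflection.typeOfObject` is exposed as a `PartialFunction[Any, DataType]` as in this PR (the `BigInteger` case is a purely illustrative custom rule, not part of the PR):

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.ScalaReflection

// Compose the built-in inference with custom logic: fall back to an
// extra rule for a type the default partial function does not handle.
val inferType: PartialFunction[Any, DataType] =
  ScalaReflection.typeOfObject orElse {
    case _: java.math.BigInteger => DecimalType  // hypothetical mapping
  }
```

Because both pieces are partial functions, `orElse` tries the built-in cases first and only consults the custom rule when they do not apply.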

New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at the
[sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).

An example of using applySchema is shown below.

```scala
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val peopleSchemaRDD = sqlContext.applySchema(people, schema)
peopleSchemaRDD.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)

peopleSchemaRDD.registerAsTable("people")
sqlContext.sql("select name from people").collect.foreach(println)
```

I will add new contents to the SQL programming guide later.

JIRA: https://issues.apache.org/jira/browse/SPARK-2179

* Expose `DataType`s in the sql package (internal details are private to sql).
* Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`,
* Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

*
* @group userf
*/
def createSchemaRDD[A](rdd: RDD[A], schema: StructType, constructRow: A => Row) = {
Contributor

Naming nit: functional languages usually use "make"; moreover, SparkContext already has a public makeRDD.
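
For context, a usage sketch of this `createSchemaRDD` variant as reviewed here; the `Person` case class and sample data are illustrative, and this three-argument form was later reworked into `applySchema` within this PR:

```scala
// Hypothetical usage: the caller supplies the schema and a function
// that converts each record of the RDD into a Row.
case class Person(name: String, age: Int)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
val schemaRDD =
  sqlContext.createSchemaRDD(people, schema, (p: Person) => Row(p.name, p.age))
```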

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16474/

@ueshin
Member

ueshin commented Jul 10, 2014

Hi, I'm wondering if MapType will have something like containsNull for ArrayType.

@yhuai
Contributor Author

yhuai commented Jul 10, 2014

Hi @ueshin, @marmbrus and I discussed it. We think it is not semantically clear what a null means when it appears in the key or value field (considering that a null is used to indicate a missing data value). So, we decided that the key and value in a MapType should not contain any null values, and we will not introduce containsNull to MapType. Does that make sense?

@ueshin
Member

ueshin commented Jul 10, 2014

@yhuai, I understand. Thank you for your reply.

@rxin
Contributor

rxin commented Jul 10, 2014

@yhuai I haven't looked at the changes yet, but can you make sure the end API is usable in Java?

private[sql] type JvmType = String
@transient private[sql] lazy val tag = typeTag[JvmType]
private[sql] val ordering = implicitly[Ordering[JvmType]]
def simpleString: String = "string"
}
Contributor

While you're at it, add a blank line to separate each class.

@yhuai
Contributor Author

yhuai commented Jul 10, 2014

Yeah, I will make sure new APIs are usable in Java and Python.

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16528/consoleFull

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16527/

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class ArrayType(elementType: DataType) extends DataType {
case class StructField(name: String, dataType: DataType, nullable: Boolean) {
case class MapType(keyType: DataType, valueType: DataType) extends DataType {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16528/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16547/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class ArrayType(elementType: DataType) extends DataType {
case class StructField(name: String, dataType: DataType, nullable: Boolean) {
case class MapType(keyType: DataType, valueType: DataType) extends DataType {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16547/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16553/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class ArrayType(elementType: DataType) extends DataType {
case class StructField(name: String, dataType: DataType, nullable: Boolean) {
case class MapType(keyType: DataType, valueType: DataType) extends DataType {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16553/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16572/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17344/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1346:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17372/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1346. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1346:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17374/consoleFull

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala
@yhuai
Contributor Author

yhuai commented Jul 30, 2014

@chenghao-intel containsNull and valueContainsNull can be used for further optimization. For example, let's say we have an ArrayType column and the element type is IntegerType. If the elements of those arrays do not have null values, we can use a primitive array internally. Since we will expose data types to users, we need to introduce these two booleans with this PR. It can be hard to add them once users start to use these APIs.
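
A short sketch of how `containsNull` carries this optimization hint, using the `ArrayType` constructors from this PR (default `containsNull = false`):

```scala
// Elements are guaranteed non-null: an engine could back this column
// with a primitive Array[Int] rather than boxed, nullable values.
val denseInts = ArrayType(IntegerType, containsNull = false)

// The single-argument constructor defaults containsNull to false,
// so this is equivalent to the declaration above.
val sameAsAbove = ArrayType(IntegerType)

// Elements may be null, forcing a nullable (boxed) representation.
val sparseInts = ArrayType(IntegerType, containsNull = true)
```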

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1346. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull

@marmbrus
Contributor

Thanks for working on this! Merged to master.

@asfgit asfgit closed this in 7003c16 Jul 30, 2014
@chenghao-intel
Contributor

Thank you @yhuai for the explanation.

@yhuai
Contributor Author

yhuai commented Jul 30, 2014

@yhuai yhuai mentioned this pull request Jul 30, 2014
@yhuai yhuai deleted the dataTypeAndSchema branch July 31, 2014 21:11
* StructType(
* StructField("name", StringType, false) ::
* StructField("age", IntegerType, true) :: Nil)
*
Contributor

Hi @yhuai, why do we need to define the schema as a StructType, and not directly as a Seq[StructField]? I tried to build a Seq[StructField] from JDBC metadata in #1612 https://github.com/apache/spark/pull/1612/files#diff-3 (it followed the code of your JsonRDD :)

It seems we do not need this StructType anywhere.

Contributor Author

For the completeness of our data types, we need StructType (Seq[StructField] is not a data type). For example, if the type of a field is a struct, we need a way to describe that the type of this field is a struct. Also, because a row is basically a struct value, it is natural to use StructType to represent a schema.

Contributor

Oh, yep, StructType is needed. I mean
def applySchema(rowRDD: RDD[Row], schema: StructType): SchemaRDD
could be
def applySchema(rowRDD: RDD[Row], schema: Seq[StructField]): SchemaRDD

Then we do not need to always use schema.fields.map(f => AttributeReference...)

we can directly use schema.map(f => AttributeReference...)

Contributor

This might be crazy... but if StructType <: Seq[StructField], then we could pass in either a StructType or a Seq[StructField]. It should be possible to do this fairly easily.
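
A minimal sketch of that idea, assuming StructType implements `Seq[StructField]` by delegating to its `fields` (simplified; it ignores the rest of the `DataType` hierarchy):

```scala
// If StructType itself is a Seq[StructField], every API that accepts
// Seq[StructField] also accepts a StructType with no conversion.
case class StructType(fields: Seq[StructField]) extends Seq[StructField] {
  def apply(idx: Int): StructField = fields(idx)
  def length: Int = fields.length
  def iterator: Iterator[StructField] = fields.iterator
}
```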

Contributor

Good, I merged the change and used this API, applySchema(rowRDD, appliedSchema), in #1612.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes apache#1346 from yhuai/dataTypeAndSchema and squashes the following commits:

1d45977 [Yin Huai] Clean up.
a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c712fbf [Yin Huai] Converts types of values based on defined schema.
4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e5f8df5 [Yin Huai] Scaladoc.
122d1e7 [Yin Huai] Address comments.
03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2476ed0 [Yin Huai] Minor updates.
ab71f21 [Yin Huai] Format.
fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
bd40a33 [Yin Huai] Address comments.
991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
3edb3ae [Yin Huai] Python doc.
692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
1d93395 [Yin Huai] Python APIs.
246da96 [Yin Huai] Add java data type APIs to javadoc index.
1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
d48fc7b [Yin Huai] Minor updates.
33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b9f3071 [Yin Huai] Java API for applySchema.
1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
624765c [Yin Huai] Tests for applySchema.
aa92e84 [Yin Huai] Update data type tests.
8da1a17 [Yin Huai] Add Row.fromSeq.
9c99bc0 [Yin Huai] Several minor updates.
1d9c13a [Yin Huai] Update applySchema API.
85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e495e4e [Yin Huai] More comments.
42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
68525a2 [Yin Huai] Update JSON unit test.
3209108 [Yin Huai] Add unit tests.
dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
9168b83 [Yin Huai] Update comments.
fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
0266761 [Yin Huai] Format
03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.