This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Specifying a read schema with spark-avro #96

Closed

mwho opened this issue Oct 30, 2015 · 12 comments
@mwho

mwho commented Oct 30, 2015

It would be nice to have an option to supply a read schema (in lieu of the embedded schema) when reading avro files via spark-avro.

For example, the Python Avro API allows the following:
reader = DataFileReader(data, DatumReader(readers_schema=schema))

The scenario is this: I have many .avro files, possibly with different schemas (due to schema evolution), and I would like to use a single "master" schema to ingest all of those avro files into a single Spark DataFrame.
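For illustration, the kind of spark-avro call being asked for might look like the sketch below. This is hypothetical, not an existing option; the field names, path, and the `added_later` field (standing in for a field that older files do not yet contain) are made up.

```
// Hypothetical sketch of the requested behaviour: read many avro files with one
// user-supplied "master" read schema instead of each file's embedded schema.
import org.apache.spark.sql.types._

val masterSchema = StructType(
  StructField("id", LongType, true) ::
  StructField("name", StringType, true) ::
  StructField("added_later", StringType, true) :: Nil)   // missing from older files

val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .schema(masterSchema)                                   // the requested read-schema hook
  .load("/data/events/*.avro")                            // hypothetical path
```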

@theouteredge

This would be a really good idea.
I've just been looking at Avro schema evolution and how to manage it in Spark.

@jongyonkim

I'm waiting for this feature as well. Otherwise there is no way of ingesting many avro files (each with its own schema) with an up-to-date master schema.

@cleaton

cleaton commented Nov 23, 2015

I'm also waiting for this feature, as Spark seems very slow at generating the DataFrame schema when a large number of sequence files are selected. It also seems to create thousands of broadcast variables and use a lot of memory on the driver node.

yanxiaole pushed a commit to yanxiaole/spark-avro that referenced this issue Dec 26, 2015
yanxiaole pushed a commit to yanxiaole/spark-avro that referenced this issue Jan 18, 2016
yanxiaole pushed a commit to yanxiaole/spark-avro that referenced this issue Feb 3, 2016
@tomseddon

Hi, is there any progress on this feature? The only reason I'm not using spark-avro (I'm using hadoopRDD instead) is that I need support for schemas that can evolve. I sometimes re-process historical data that is missing attributes added more recently, and ideally the whole lot would be processed in one go.

@clockfly
Contributor

clockfly commented Jul 6, 2016

In Spark 2.0, you can specify a schema using spark.read.schema(user_defined_schema).format(...).

```
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> df.write.format("com.databricks.spark.avro").save("/tmp/output")

scala> spark.read.format("com.databricks.spark.avro").load("/tmp/output").show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

// Prepare a customized schema
scala> val struct = StructType(StructField("a", IntegerType, true) :: Nil)
struct: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

scala> spark.read.format("com.databricks.spark.avro").schema(struct).load("/tmp/output").show()
+---+
|  a|
+---+
|  1|
+---+

// Save the schema to a Json string, you can later save this to a file.
scala> struct.json
res6: String = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}

// Load the Json string back
scala> spark.read.format("com.databricks.spark.avro").schema(DataType.fromJson(res6).asInstanceOf[StructType]).load("/tmp/output").show()
+---+
|  a|
+---+
|  1|
+---+
```
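For completeness, a minimal sketch of the "save this to a file" step and loading it back later (the /tmp/schema.json path is just an example):

```
// Minimal sketch: persist the schema JSON to a file and read it back later.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

Files.write(Paths.get("/tmp/schema.json"), struct.json.getBytes(StandardCharsets.UTF_8))

val loaded = DataType.fromJson(
  new String(Files.readAllBytes(Paths.get("/tmp/schema.json")), StandardCharsets.UTF_8)
).asInstanceOf[StructType]

spark.read.format("com.databricks.spark.avro").schema(loaded).load("/tmp/output").show()
```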

@tomseddon

That's great, thanks for the reply.

@tomseddon

Sorry, another question on this @clockfly. Shouldn't I be able to specify an avsc file? I tried this and I'm getting scala.MatchError.

I've already had to specify my input schema in case classes for conversion to a Dataset, and re-specifying the exact same thing in StructType seems unwieldy. The most convenient scenario would be to just plug in my latest avsc generated from avro-tools.
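One route that might avoid the duplication, untested and assuming SchemaConverters.toSqlType is accessible in the spark-avro version in use: parse the .avsc with the Avro API and convert it to a StructType.

```
// Untested sketch: derive the Spark read schema from an .avsc file instead of
// re-declaring it by hand. The paths are hypothetical.
import java.io.File
import org.apache.avro.Schema
import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

val avroSchema = new Schema.Parser().parse(new File("/path/to/latest.avsc"))
val sqlSchema  = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

val df = spark.read.format("com.databricks.spark.avro").schema(sqlSchema).load("/path/to/data")
```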

@panda2727

panda2727 commented Jul 18, 2016

Thanks @clockfly.
Do you guys have any trouble loading spark-avro 3.0.0-preview (for Spark 2.0)?

The parameter used in spark shell: --packages com.databricks:spark-avro_2.10:3.0.0-preview
(I also tried --packages com.databricks:spark-avro_2.10:2.0.1)

I get an error: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro

Thanks.

@tomseddon

tomseddon commented Jul 27, 2016

I've been playing around with this.

Shouldn't it be able to support missing fields if they are nullable? I can't really use this feature unless I can specify an all-encompassing master schema that works for all versions, including earlier files that are missing the newly added fields.

In @clockfly's example above, I should be able to specify a schema like this

```
val struct = StructType(StructField("a", IntegerType, true) :: StructField("b", IntegerType, true) :: StructField("c", IntegerType, true) :: StructField("d", IntegerType, true) :: Nil)
```

and have the input file read as follows:

```
+---+---+---+----+
|  a|  b|  c|   d|
+---+---+---+----+
|  1|  2|  3|null|
+---+---+---+----+
```

I believe the problem is that the StructField nullable property isn't enough. It needs a default setting of null as well, which I can't see a way around :(

@clockfly
Contributor

@tomseddon I have created a PR to solve this problem #155

@sambit19

I'm wondering how to make this work in Spark Streaming.
spark.read.format("com.databricks.spark.avro").schema(DataType.fromJson(res6).asInstanceOf[StructType]).load("/tmp/output").show()

Would microbatching and writing the output to the file system be the way to go?
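A possible direction, sketched and untested, assuming the avro source can act as a Structured Streaming file source in the spark-avro version in use. Streaming file sources require an explicit schema anyway, so the user-defined schema fits naturally here; the paths are hypothetical.

```
// Untested sketch: watch a directory of avro files with an explicit schema.
import org.apache.spark.sql.types._

val struct = StructType(StructField("a", IntegerType, true) :: Nil)

val stream = spark.readStream
  .schema(struct)                                   // streaming file sources need a schema up front
  .format("com.databricks.spark.avro")
  .load("/tmp/input")                               // directory watched for new avro files

val query = stream.writeStream
  .format("parquet")                                // any supported streaming sink would do
  .option("checkpointLocation", "/tmp/checkpoint")
  .start("/tmp/streamed_output")
```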

@JoshRosen modified the milestones: 3.0.1, 3.1.0 on Sep 15, 2016
@Pradeepnitw

Can you please let me know if the fix for #96 is included in the 3.0.1 build or scheduled for a future release? I tried it using 3.0.1, but it does not seem to be working.

bdrillard pushed a commit to bdrillard/spark-avro that referenced this issue Nov 29, 2016
With this PR, we can specify a user-provided custom schema when reading avro files. The custom schema can contain non-existing fields.

1. If the custom schema contains a non-existing field and the field is nullable, then we will fill the value with null. The non-existing fields can exist as top-level columns or nested columns.
2. If the custom schema contains a non-existing field and the field is NOT nullable, then we will throw an exception.
3. If the custom schema is a subset of the avro file schema, then we will only retrieve the fields defined in the custom schema.

**Example:**
```
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
scala> df.write.format("com.databricks.spark.avro").save("/tmp/output")
scala> import org.apache.spark.sql.types._

// Prepare a customized schema
scala> val struct = StructType(StructField("a", IntegerType, true) :: StructField("non_exist", IntegerType, true) :: Nil)
scala> spark.read.format("com.databricks.spark.avro").schema(struct).load("/tmp/output").show()
+---+---------+
|  a|non_exist|
+---+---------+
|  1|     null|
+---+---------+

// Save the schema to a Json string, you can later save this to a file.
scala> val jsonSchema = struct.json
jsonSchema: String = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"non_exist","type":"integer","nullable":true,"metadata":{}}]}

// Load the Json string back
scala> spark.read.format("com.databricks.spark.avro").schema(DataType.fromJson(jsonSchema).asInstanceOf[StructType]).load("/tmp/output").show()
+---+---------+
|  a|non_exist|
+---+---------+
|  1|     null|
+---+---------+
```

Fix databricks#96

Author: Sean Zhong <seanzhong@databricks.com>

Closes databricks#155 from clockfly/support_schema_evolution.