Specifying a read schema with spark-avro #96
Comments
This would be a really good idea.
I'm waiting for this feature as well. Otherwise there is no way of ingesting many avro files (each with its own schema) using an up-to-date master schema.
I'm also waiting for this feature, as Spark seems very slow at generating the dataframe schema when a large number of sequence files are selected. It also seems to create thousands of broadcast variables and uses a lot of memory on the driver node.
Hi, is there any progress on this feature? The only reason I'm not using spark-avro (I'm using hadoopRDD instead) is that I need support for schemas that can evolve. I sometimes re-process historical data which may not have some recently added attributes, but ideally the whole lot would be processed in one go.
In Spark 2.0, you can specify a schema using the DataFrameReader's schema() method (see the sketch below).
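A minimal sketch of that usage, based on the example in the PR description further down (the path and field names are illustrative):

```
import org.apache.spark.sql.types._

// Schema to read with, instead of the schema embedded in the avro files
val readSchema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) :: Nil)

val df = spark.read
  .format("com.databricks.spark.avro")
  .schema(readSchema)
  .load("/tmp/output")
```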
That's great, thanks for the reply.
Sorry, another question on this @clockfly. Shouldn't I be able to specify an avsc file? I tried this and I'm getting a scala.MatchError. I've already had to specify my input schema in case classes for conversion to a Dataset, and re-specifying the exact same thing as a StructType seems unwieldy. The most convenient scenario would be to just plug in my latest avsc generated from avro-tools.
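One possible workaround, not confirmed in this thread, is to parse the .avsc with the Avro Java API and convert it to a StructType with spark-avro's SchemaConverters (assuming toSqlType is accessible in your spark-avro version; the paths below are illustrative):

```
import java.io.File
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

// Parse the .avsc produced by avro-tools
val avroSchema = new Schema.Parser().parse(new File("/path/to/master.avsc"))

// Convert the Avro schema into a Spark SQL StructType
val readSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

val df = spark.read
  .format("com.databricks.spark.avro")
  .schema(readSchema)
  .load("/data/events")
```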
Thanks @clockfly. The parameter I used in spark-shell was --packages com.databricks:spark-avro_2.10:3.0.0-preview, but I get an error: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro. Thanks.
I've been playing around with this. Should it not be able to support missing fields if they are nullable? I can't really use this feature unless I can specify an all-encompassing master schema which will work for all versions, including earlier files that have the newer fields missing. In @clockfly's example above, I should be able to specify a master schema with the extra fields marked as nullable and have an input file that lacks them still read cleanly (see the sketch below).
I believe the problem is that the StructField nullable property isn't enough; it needs a default setting of null as well, which I can't see a way around :(
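A sketch of the scenario being described (field names and path are illustrative): an older file was written without field c, and is read with a master schema that adds c as nullable, in the hope that c comes back as null.

```
import org.apache.spark.sql.types._

// Master schema: "c" was added in a later schema version and is nullable
val masterSchema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) ::
  StructField("c", StringType, true) :: Nil)

// Older files were written without "c"; the desired behaviour is that they
// still load, with "c" filled in as null
val df = spark.read
  .format("com.databricks.spark.avro")
  .schema(masterSchema)
  .load("/data/old-events")
```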
@tomseddon I have created PR #155 to solve this problem.
Wondering how to make this work in Spark Streaming. Would micro-batching and writing to the file system be the way to go?
Can you please let me know if the fix for #96 is included in the 3.0.1 build or scheduled for a future release? I tried it with 3.0.1, but it does not seem to be working.
With this PR, we can specify a user-provided custom schema when reading avro files. The custom schema can contain non-existing fields.

1. If the custom schema contains a non-existing field and the field is nullable, then we will fill the value as null. The non-existing fields can exist as top-level columns or nested columns.
2. If the custom schema contains a non-existing field and the field is NOT nullable, then we will throw an exception.
3. If the custom schema is a subset of the avro file schema, then we will only retrieve the fields defined in the custom schema.

**Example:**

```
scala> val df = Seq((1,2,3)).toDF("a", "b", "c")
scala> df.write.format("com.databricks.spark.avro").save("/tmp/output")

scala> import org.apache.spark.sql.types._

// Prepare a customized schema
scala> val struct = StructType(StructField("a", IntegerType, true) :: StructField("non_exist", IntegerType, true) :: Nil)

scala> spark.read.format("com.databricks.spark.avro").schema(struct).load("/tmp/output").show()
+---+---------+
|  a|non_exist|
+---+---------+
|  1|     null|
+---+---------+

// Save the schema to a JSON string; you can later save this to a file.
scala> val jsonSchema = struct.json
jsonSchema: String = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"non_exist","type":"integer","nullable":true,"metadata":{}}]}

// Load the JSON string back
scala> spark.read.format("com.databricks.spark.avro").schema(DataType.fromJson(jsonSchema).asInstanceOf[StructType]).load("/tmp/output").show()
+---+---------+
|  a|non_exist|
+---+---------+
|  1|     null|
+---+---------+
```

Fix databricks#96

Author: Sean Zhong <seanzhong@databricks.com>

Closes databricks#155 from clockfly/support_schema_evolution.
It would be nice to have an option to supply a read schema (in lieu of the embedded schema) when reading avro files via spark-avro.
For example, the Python Avro API allows the following:
```
reader = DataFileReader(data, DatumReader(readers_schema=schema))
```
The scenario is this: I have many .avro files, possibly with different schemas (due to schema evolution), and I would like to use a single "master" schema to ingest all of those avro files into a single Spark DataFrame.
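A sketch of the desired usage in spark-avro terms, mirroring what the merged PR above enables (paths and field names are illustrative): every file under the input path is read with the same master schema, regardless of the writer schema embedded in each file.

```
import org.apache.spark.sql.types._

// Master schema covering all historical schema versions; later additions are nullable
val masterSchema = StructType(
  StructField("id", LongType, false) ::
  StructField("name", StringType, true) ::
  StructField("added_later", StringType, true) :: Nil)

// Each .avro file under the directory may embed a different writer schema,
// but all rows are projected onto the master schema
val all = spark.read
  .format("com.databricks.spark.avro")
  .schema(masterSchema)
  .load("/data/avro/*.avro")
```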