Namespace name is set when undefined #255

jung-kim · 2017-11-15T19:29:23Z

Since 4.0.0, within the nested structure we are seeing that namespace is defined despite us never explicitly setting them. Is this defined in spec?

If we set Map("recordName" -> "usageData", "recordNamespace" -> "abc") than namespace becomes "abc.usageData".

root
 |-- usageData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- app: string (nullable = true)

{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "usageData",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "usageData",
              "namespace": ".usageData",  <<---- ??????
              "fields": [
                {
                  "name": "app",
                  "type": [
                    "string",
                    "null"
                  ]
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ]
    }
  ]
}

The text was updated successfully, but these errors were encountered:

gengliangwang · 2017-11-19T17:13:49Z

Hi, the behavior you mentioned is from this commit:
25cab2a

From avro spec:

A namespace is a dot-separated sequence of such names. The empty string may also be used as a namespace to indicate the null namespace.

So namespace with leading dot should be OK. Can you specify where the problem is?

jung-kim · 2018-01-08T21:31:27Z

Problem is that when namespace is null namespace value should be empty or null.

when namespace is not specified, we are expecting namespace to be "" or null but right now it's ".usageData" in above case

matthew-fishkin · 2018-01-09T18:06:07Z

I am actually having the same issue as above. And it is coming back to bite us because we are trying to load the data into a Big Query table.

We don't set the namespace, and one is automatically generated that begins with a dot. We then get the following error:

The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .topic_scores

tovbinm · 2018-07-03T05:29:23Z

After this change was merged: loading a dataset, then saving and loading it again with the same schema, since every nested record is prefixed with invalid namespace (example below).
I think this change has to be reverted since it fixes one thing, but breaks a ton of others.

val schema = Schema.Parser().parse("""{
  "type": "record",
  "name": "TestRecord",
  "namespace": "a.b.c",
  "fields": [
    {
      "name": "key",
      "type": [
        {
          "type": "record",
          "name": "key",
          "fields": [
            {
              "name": "email",
              "type": [ "string", "null"]
            }
          ]
        },
        "null"
      ]
    }
  ]
}"""

val df = sql.read.format("com.databricks.spark.avro")
   .option("avroSchema", schema)
   .load("/tmp/random.avro")

df.show(false) // so far so good

df.write.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema)
  .save("/tmp/random.out")

val loaded = sql.read.format("com.databricks.spark.avro")
  .option("recordName", "TestRecord")
  .option("recordNamespace", "a.b.c")
  .option("avroSchema", schema)
  .load("/tmp/random.out")

loaded.show(false) // Failure! AvroTypeException is thrown

Caused by: org.apache.avro.AvroTypeException: Found a.b.c.key.key, expecting union
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
  at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:228)
  at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.next(DefaultSource.scala:205)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

tovbinm · 2018-07-03T20:04:38Z

Even worse - each time you save/load/save/ your dataset it prepends a field name into the namespace.

tovbinm · 2018-07-03T20:05:55Z

I made a patched release by undoing the #249 PR - https://jitpack.io/#relateiq/spark-avro

gengliangwang · 2018-08-03T07:04:10Z

Fix it here: apache/spark#21974
Spark 2.4 release will have built in Avro package.

tovbinm · 2018-08-03T16:19:58Z

@gengliangwang try adding a test case I suggested above ^

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Namespace name is set when undefined #255

Namespace name is set when undefined #255

jung-kim commented Nov 15, 2017

gengliangwang commented Nov 19, 2017 •

edited

Loading

jung-kim commented Jan 8, 2018

matthew-fishkin commented Jan 9, 2018

tovbinm commented Jul 3, 2018

tovbinm commented Jul 3, 2018

tovbinm commented Jul 3, 2018

gengliangwang commented Aug 3, 2018

tovbinm commented Aug 3, 2018

Namespace name is set when undefined #255

Namespace name is set when undefined #255

Comments

jung-kim commented Nov 15, 2017

gengliangwang commented Nov 19, 2017 • edited Loading

jung-kim commented Jan 8, 2018

matthew-fishkin commented Jan 9, 2018

tovbinm commented Jul 3, 2018

tovbinm commented Jul 3, 2018

tovbinm commented Jul 3, 2018

gengliangwang commented Aug 3, 2018

tovbinm commented Aug 3, 2018

gengliangwang commented Nov 19, 2017 •

edited

Loading