Only validate name, not all components #96

idubinskiy · 2017-11-15T03:07:49Z

Despite the spec stating that each component is a name, the reference Java implementation only validates the actual name field. Match the [Java implementation](https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L467) to be more compatible with OCF files written by implementations in other languages.

Specifically, this has bitten us with a namespace field written by Spark that starts with a leading dot.

jung-kim · 2017-11-15T03:34:01Z

FYI, the library that writes this dot-prefixed namespace is https://github.com/databricks/spark-avro.

I'm not sure which behavior is right or wrong: reading the documentation, this library seems to actually follow the schema specification, while others have relaxed their enforcement of the naming convention.

Below is the DataFrame schema that produced the dot-prefixed namespace, along with the Avro schema generated by spark-avro.

```
root
 |-- usageData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- app: string (nullable = true)
```

```json
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "usageData",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "usageData",
              "namespace": ".usageData",
              "fields": [
                {
                  "name": "app",
                  "type": [
                    "string",
                    "null"
                  ]
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ]
    }
  ]
}
```

karrick · 2018-01-31T18:05:23Z

I'm actually going to pull this in, and then put a feature flag on it.

I think while it's imperative to stick to the schema definition, we would be creating a useless library if it did not properly work with schemas created by other programs.

karrick · 2018-01-31T18:28:49Z

Release v2.1.0 has the exported variable RelaxedNameValidation, which when true, allows the first component of a schema name to be the empty string. The default value for this variable is false, and your program ought to set it before building any codecs when this behavior is desired.
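To make the relaxed rule concrete, here is a minimal sketch of validating every dot-separated component while optionally tolerating an empty first component. It is not goavro's actual code: `checkFullName`, `componentPattern`, and the `relaxed` parameter are hypothetical names chosen for illustration, and the pattern encodes the Avro rule that a name starts with a letter or underscore and continues with letters, digits, or underscores.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// componentPattern is the Avro rule for one name component: a letter or
// underscore, followed by letters, digits, or underscores.
var componentPattern = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// checkFullName is a hypothetical helper, not part of goavro. It validates
// each dot-separated component of fullName. When relaxed is true, an empty
// first component (that is, a leading dot) is tolerated.
func checkFullName(fullName string, relaxed bool) error {
	components := strings.Split(fullName, ".")
	for i, c := range components {
		// Skip only the empty first component, and only in relaxed mode.
		if relaxed && i == 0 && c == "" && len(components) > 1 {
			continue
		}
		if !componentPattern.MatchString(c) {
			return fmt.Errorf("invalid component %q in name %q", c, fullName)
		}
	}
	return nil
}

func main() {
	// Namespace ".usageData" joined with name "usageData", as in the
	// spark-avro schema quoted earlier in this thread.
	fullName := ".usageData" + "." + "usageData"

	fmt.Println("strict: ", checkFullName(fullName, false))
	fmt.Println("relaxed:", checkFullName(fullName, true))
}
```

In strict mode the empty component before the leading dot fails the pattern, so the name is rejected; in relaxed mode only that first empty component is skipped, and empty components elsewhere (for example `a..b`) are still errors. With the library itself, the equivalent would presumably be setting `goavro.RelaxedNameValidation = true` once at startup, before any call that builds a codec.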

Currently the output namespace starts with ".", e.g. `.topLevelRecord`.

Although this is valid according to the Avro spec, we should remove the leading dot to avoid failures when the output Avro file is read by other libraries: linkedin/goavro#96

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes apache#21974 from gengliangwang/avro_namespace.

(cherry picked from commit f45d60a)

jung-kim mentioned this pull request Nov 15, 2017

Namespace name is set when undefined databricks/spark-avro#255

Open

karrick merged commit e381dd7 into linkedin:master Jan 31, 2018

gengliangwang mentioned this pull request Aug 2, 2018

[SPARK-25002][SQL] Avro: revise the output record namespace apache/spark#21974

Closed
