Only validate name, not all components #96

Merged (1 commit) on Jan 31, 2018

idubinskiy
Contributor

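Although the spec states that each component of a name must itself be a valid name, the reference Java implementation only validates the actual name field. Match the [Java implementation](https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L467) to be more compatible with OCF files written by libraries in other languages.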

Specifically, this bit us with a namespace field written by Spark that starts with a dot.
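
To make the difference concrete, here is a minimal sketch of the two validation strategies, assuming the usual Avro name rule (a letter or underscore followed by letters, digits, or underscores). The function names `validateAllComponents` and `validateNameOnly` and the full name `.usageData.usageData` are illustrative assumptions, not code taken from goavro or from the Java implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// avroName matches a single name component: a letter or underscore,
// followed by any number of letters, digits, or underscores.
var avroName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// validateAllComponents (hypothetical) applies the name rule to every
// dot-separated component of a full name.
func validateAllComponents(fullName string) error {
	for _, part := range strings.Split(fullName, ".") {
		if !avroName.MatchString(part) {
			return fmt.Errorf("invalid component %q in %q", part, fullName)
		}
	}
	return nil
}

// validateNameOnly (hypothetical) checks only the final component, that is,
// the name itself, and leaves the namespace part unchecked.
func validateNameOnly(fullName string) error {
	name := fullName[strings.LastIndex(fullName, ".")+1:]
	if !avroName.MatchString(name) {
		return fmt.Errorf("invalid name %q in %q", name, fullName)
	}
	return nil
}

func main() {
	// Assumed full name for a record whose namespace starts with a dot.
	const fullName = ".usageData.usageData"
	fmt.Println("all components:", validateAllComponents(fullName))
	fmt.Println("name only:     ", validateNameOnly(fullName))
}
```

Under these assumptions the strict check rejects the empty leading component, while the name-only check accepts the same input.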

@jung-kim

FYI, the library that writes this dot-prefixed namespace is https://github.com/databricks/spark-avro.

I'm not sure which side is right or wrong: reading the documentation, this library seems to actually follow the schema specification, while others enforce the naming convention more loosely.

Below is the DataFrame schema that caused the dot-prefixed namespace, along with the Avro schema generated by spark-avro.

```
root
 |-- usageData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- app: string (nullable = true)
```

```json
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "usageData",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "usageData",
              "namespace": ".usageData",
              "fields": [
                {
                  "name": "app",
                  "type": [
                    "string",
                    "null"
                  ]
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ]
    }
  ]
}
```

@karrick
Contributor

karrick commented Jan 31, 2018

I'm actually going to pull this in, and then put a feature flag on it.

I think while it's imperative to stick to the schema definition, we would be creating a useless library if it did not properly work with schemas created by other programs.

@karrick karrick merged commit e381dd7 into linkedin:master Jan 31, 2018
@karrick
Contributor

karrick commented Jan 31, 2018

Release v2.1.0 has the exported variable `RelaxedNameValidation`, which, when true, allows the first component of a schema name to be the empty string. The default value for this variable is false, and your program ought to set it before building any codecs when this behavior is desired.
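
The behavior described above can be sketched with a package-level flag. This is a standalone illustration, not goavro's source: `relaxedNameValidation` is a local stand-in for the exported variable, and `checkFullName` is a hypothetical validator that tolerates an empty first component only when the flag is set.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relaxedNameValidation is a local stand-in for the exported variable
// described above. It defaults to false, meaning strict validation.
var relaxedNameValidation = false

// component matches one name component of a full name.
var component = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// checkFullName (hypothetical) validates each dot-separated component.
// In relaxed mode an empty first component is tolerated, as long as at
// least one more component follows it.
func checkFullName(fullName string) error {
	parts := strings.Split(fullName, ".")
	for i, part := range parts {
		if i == 0 && part == "" && relaxedNameValidation && len(parts) > 1 {
			continue
		}
		if !component.MatchString(part) {
			return fmt.Errorf("invalid component %q in %q", part, fullName)
		}
	}
	return nil
}

func main() {
	const fullName = ".usageData.usageData"
	fmt.Println("strict: ", checkFullName(fullName))

	// Flip the flag once, before any schema is validated.
	relaxedNameValidation = true
	fmt.Println("relaxed:", checkFullName(fullName))
}
```

With the library itself, going by the comment above, a program would presumably set `goavro.RelaxedNameValidation = true` once at startup, before its first call to `goavro.NewCodec`. A global flag keeps existing callers on strict validation by default, at the cost of the setting applying process-wide.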

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Currently the output namespace starts with ".", e.g. `.topLevelRecord`

Although this is valid according to the Avro spec, we should remove the leading dot to avoid failures when the output Avro file is read by other libraries:

linkedin/goavro#96

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes apache#21974 from gengliangwang/avro_namespace.

(cherry picked from commit f45d60a)