Schema Registry Support #1010
Replies: 6 comments 6 replies
-
This honestly doesn't feel like much of a con to me. I'd say having this handling individually configurable and usable adds a degree of flexibility. I agree with the recommendation of doing Option 1 as the first pass.
-
IIUC, the idea is to have a single processor, which would be configured with one of the supported schema registries (versus one processor per supported schema registry)?
-
There are a couple of decisions we need to make regarding the implementation:

- the schema registry client,
- the serialization/deserialization (serde) layer,
- the Avro library.

The first two are kind of related since most libraries provide both, although in some cases they are decoupled, so we could choose to take only one piece out of a library. I'll break down each part into the available options and at the end explain my recommendation.

Schema Registry Client

In this part we just need a client that makes it easy to interact with a schema registry. What we need from the client:
There are a couple of options:
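As a rough, purely illustrative sketch of what such a client needs to be able to do (the interface name and signatures below are assumptions, not the API of any particular library):

```go
package registry // hypothetical package name, for illustration only

import "context"

// SchemaRegistryClient is a sketch of the operations we need from a schema
// registry client; it is not the API of any specific library.
type SchemaRegistryClient interface {
	// RegisterSchema uploads a schema under a subject and returns the ID
	// assigned by the registry.
	RegisterSchema(ctx context.Context, subject, schema string) (int, error)
	// SchemaByID fetches a previously registered schema by its ID.
	SchemaByID(ctx context.Context, id int) (string, error)
}
```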
Serialization/deserialization (serde) layer

Here we need a layer that allows us to serialize/deserialize payloads. Requirements for this layer:
Here we have similar options to before:
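Again purely as an illustration, and assuming hypothetical signatures rather than those of a concrete library, the layer roughly needs to offer something like:

```go
package serde // hypothetical package name, for illustration only

// Serde is a sketch of the serialization/deserialization layer: it turns
// structured data into bytes using a schema and back again.
type Serde interface {
	// Marshal encodes structured data into its binary representation using
	// the given schema.
	Marshal(schema string, value any) ([]byte, error)
	// Unmarshal decodes raw bytes into structured data using the given schema.
	Unmarshal(schema string, raw []byte) (any, error)
}
```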
Avro library

Given that we are always working with dynamic data and have no specific Go struct, we will need to write the logic to extract a schema from a payload (raw or structured). In other words, we can't use a function that derives the schema from a predefined Go type. The requirements for the Avro library:
Options:
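To make the schema extraction requirement concrete, here is a minimal sketch that derives field definitions from a structured payload; the package, type and function names are made up, and it deliberately ignores nested values:

```go
package avroutil // hypothetical package name, for illustration only

import "fmt"

// schemaField mirrors the shape of a field entry in an Avro record schema.
type schemaField struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// extractFields derives Avro field definitions from a dynamic (structured)
// payload. It only handles a few primitive types, which is enough to show why
// custom extraction logic is needed when there is no predefined Go struct.
func extractFields(payload map[string]any) ([]schemaField, error) {
	var fields []schemaField
	for name, value := range payload {
		var typ string
		switch value.(type) {
		case string:
			typ = "string"
		case int, int32, int64:
			typ = "long"
		case float32, float64:
			typ = "double"
		case bool:
			typ = "boolean"
		default:
			return nil, fmt.Errorf("unsupported type %T for field %q", value, name)
		}
		fields = append(fields, schemaField{Name: name, Type: typ})
	}
	return fields, nil
}
```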
Recommendation

I recommend choosing the following libraries:
-
Regarding the processors themselves, I propose to have 4 processors: one pair for decoding and encoding the payload, another pair for the key. Keep in mind that I'm choosing the names of the processors based on the naming we are currently using for processors (all lower case letters without separators); this will possibly be reworked in #997.
-
Adding a reference to this other design document as it's being finalized: #1532
-
Done in #984.
-
Introduction
This document describes the design for adding Schema Registry support to Conduit.
Background
A Schema Registry is a tool that helps Conduit understand the format of the data it receives. Often, data is sent in binary formats that can make it difficult to interpret without knowing the schema beforehand. By adding support for a Schema Registry, Conduit can understand how to parse the binary data and turn it into structured data that can be manipulated by processors.
At the same time, uploading the schema of a structured payload to a registry can also be helpful. This allows Conduit to take structured data and encode it into a compact binary format for transmission. By doing so, Conduit can reduce the amount of data being sent, making transmission faster and more efficient.
Goals
This design should allow Conduit to do the following:
Note: outputting a binary format in connectors is not in the scope of this design. Here we only target decoding and encoding the key and payload fields of an OpenCDC record, but not the OpenCDC record itself.
Implementation options
Option 1 - 2 processors
Implement 2 separate processors.
The first processor knows how to fetch a schema from a schema registry and decode raw data using the schema. It needs to be able to process the fields Record.Key, Record.Payload.Before and/or Record.Payload.After. If the schema is not found, the processor fails.

The second processor is able to encode structured data into a binary format (e.g. Avro, Protobuf) and produce a schema that can be used to decode it back into structured data. It can also upload the schema to a schema registry. If the schema can't be uploaded to the schema registry, the processor fails.
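As an illustration of the decoding flow described above, here is a minimal sketch; the helper types and function names are assumptions, not Conduit's actual processor API:

```go
package example // illustrative sketch only, not Conduit's actual processor API

import (
	"context"
	"fmt"
)

// fetchSchemaFunc fetches a schema from the schema registry by ID (assumed helper).
type fetchSchemaFunc func(ctx context.Context, id int) (string, error)

// decodeFunc decodes raw bytes into structured data using a schema (assumed helper).
type decodeFunc func(schema string, raw []byte) (any, error)

// decodeField shows the intended flow of the first processor for a single field:
// look up the schema, decode the raw data with it, and fail if the schema
// cannot be found.
func decodeField(ctx context.Context, fetchSchema fetchSchemaFunc, decode decodeFunc, schemaID int, raw []byte) (any, error) {
	schema, err := fetchSchema(ctx, schemaID)
	if err != nil {
		// Schema not found (or registry unreachable), so the processor fails.
		return nil, fmt.Errorf("fetch schema %d: %w", schemaID, err)
	}
	structured, err := decode(schema, raw)
	if err != nil {
		return nil, fmt.Errorf("decode with schema %d: %w", schemaID, err)
	}
	return structured, nil
}
```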
By default, these processors use a predefined metadata field as the schema ID (e.g. conduit.key.schemaId, conduit.payload.before.schemaId or conduit.payload.after.schemaId). The user is able to override this behavior and configure a custom schema ID using static data and/or data taken from the record (e.g. metadata).

When other processors manipulate the record, we do not need to track those changes and update the schema, because the schema is extracted on demand by the processor.
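A minimal sketch of the default schema ID lookup, assuming the metadata keys listed above (everything else, including the names, is illustrative):

```go
package example // illustrative sketch only

import (
	"fmt"
	"strconv"
)

// Default metadata keys holding the schema ID, as described above.
const (
	metaKeySchemaID           = "conduit.key.schemaId"
	metaPayloadBeforeSchemaID = "conduit.payload.before.schemaId"
	metaPayloadAfterSchemaID  = "conduit.payload.after.schemaId"
)

// schemaIDFromMetadata reads the schema ID for a field from the record metadata.
// A user-configured override (a static value or another metadata key) would take
// precedence over this default lookup.
func schemaIDFromMetadata(metadata map[string]string, key string) (int, error) {
	raw, ok := metadata[key]
	if !ok {
		return 0, fmt.Errorf("metadata field %q not found", key)
	}
	return strconv.Atoi(raw)
}
```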
Each processor needs a schema registry URL in its configuration. To make this simpler we can later add the option to configure a pipeline schema registry URL and/or a global schema registry URL. The processor would then pick the first URL it can find in the hierarchy (processor > pipeline > global).
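The hierarchy lookup itself could be as simple as this sketch (the configuration fields are hypothetical):

```go
package example // illustrative sketch only

// registryURL resolves the schema registry URL using the hierarchy
// processor > pipeline > global: the first non-empty URL wins.
func registryURL(processorURL, pipelineURL, globalURL string) string {
	for _, url := range []string{processorURL, pipelineURL, globalURL} {
		if url != "" {
			return url
		}
	}
	return ""
}
```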
Pros

- The Record type does not need to change.

Cons

- User needs to understand and configure 2 processors.
- A schema can't be extracted from structured data in a lossless way.
For example, consider an Avro schema where a field can contain multiple types (a union) and has a default value.
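A schema along these lines illustrates the problem (the record and field names are made up):

```json
{
  "type": "record",
  "name": "Example",
  "fields": [
    {
      "name": "myField",
      "type": ["null", "string", "long"],
      "default": null
    }
  ]
}
```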
It's impossible to extract this schema from a concrete value without knowing the original schema.
Option 2 - Attach Schema to Record
This option builds on top of option 1 and tries to address the last con, lossless schema handling.
Conduit provides processors to decode and encode data using a schema, but additionally attaches the schema to the record. This means Conduit needs to somehow track changes done to the key and payload of a record and update the schema accordingly. Every time a processor changes a field (i.e. creates, updates, or deletes it), Conduit needs to detect it and update the schema attached to the record.
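A minimal sketch of the idea, assuming made-up types rather than Conduit's actual Record type:

```go
package example // illustrative sketch only, not Conduit's actual Record type

// recordWithSchemas shows the idea of carrying schemas alongside the record data.
// Any processor that creates, updates or deletes a field in the key or payload
// would also need the corresponding schema to be updated, either by the processor
// itself or by Conduit detecting the change.
type recordWithSchemas struct {
	Key           map[string]any
	PayloadBefore map[string]any
	PayloadAfter  map[string]any

	KeySchema           string
	PayloadBeforeSchema string
	PayloadAfterSchema  string
}
```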
Pros
Cons
Questions
Recommendation
We propose to start with option 1 and add option 2 if/when we see that there's a real need to produce the same schema as we get from the schema registry.