
v2.0.0

github-actions released this 17 Jul 08:23
· 10 commits to refs/heads/main since this release
6e1d82a

Introduction of v2

Avro4k was created back in 2019. Over the past 5 years, a lot of work has gone into Avro generic records and schema generation.

Recently, kotlinx-serialization and Kotlin shipped major releases with many improvements (features, performance, better APIs). The JSON API of kotlinx-serialization is well designed, so we tried to replicate its simplicity.

A major focus of this release was making Avro4k more lenient, to simplify developers' lives and improve adoption.

I hope this major release will make Avro easier to use, especially in pure Kotlin 🚀

As a side note, we may implement our own plugins to generate data classes and schemas, stay tuned!

Highlights and Breaking changes


Performances & benchmark

Long story: building a comparable benchmark is complicated, as v2 adds a lot of features and fixes compared to v1.

The following benchmark is not fully representative as it is not comparing all the features.

We will compare an easy use case: encoding and decoding a simple data class with all the primitive types, a String and a ByteArray:

@Serializable
data class SimpleDataClass(
    val bool: Boolean,
    val byte: Byte,
    val short: Short,
    val int: Int,
    val long: Long,
    val float: Float,
    val double: Double,
    val string: String,
    val bytes: ByteArray,
)

The benchmark was executed on a MacBook Air M2 in a single-threaded environment.

Avro4k v2 (binary) is MUCH faster than v1 (generic records), and is now also faster than Jackson and the standard Apache Avro library (using reflection). SpecificRecord has not been benchmarked yet.

Encoding Performance

| Version | Encoding (ops/s) | Relative difference (%) |
| --- | --- | --- |
| Avro4k v1 (generic records) | 109 327 | 0% |
| Jackson | 134 774 | +23% |
| Avro4k v2 (generic records) | 190 365 | +74% |
| Apache avro ReflectData (direct binary) | 332 438 | +204% |
| Avro4k v2 (direct binary) | 459 751 | +321% 🚀 |

Decoding Performance

| Version | Decoding (ops/s) | Relative difference (%) |
| --- | --- | --- |
| Avro4k v1 (generic records) | 67 825 | 0% |
| Jackson | 71 146 | +5% |
| Avro4k v2 (generic records) | 114 511 | +69% |
| Apache avro ReflectData (direct binary) | 151 287 | +123% |
| Avro4k v2 (direct binary) | 174 063 | +157% 🚀 |
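The relative difference columns appear to be computed against the Avro4k v1 (generic records) row of each table. As a quick check of that reading, here is a small sketch; the function name is ours and is not part of the benchmark code.

```kotlin
import kotlin.math.roundToInt

// Relative difference of a throughput against a baseline, as a rounded percentage.
fun relativeDifferencePercent(opsPerSecond: Int, baselineOpsPerSecond: Int): Int =
    ((opsPerSecond.toDouble() / baselineOpsPerSecond - 1.0) * 100).roundToInt()

fun main() {
    val encodingBaseline = 109_327 // Avro4k v1 (generic records), encoding
    val decodingBaseline = 67_825  // Avro4k v1 (generic records), decoding

    println(relativeDifferencePercent(459_751, encodingBaseline)) // 321
    println(relativeDifferencePercent(174_063, decodingBaseline)) // 157
}
```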

Migration guide

As many APIs, classes, packages, and more have changed, here is the migration guide. Don't hesitate to file an issue if something is missing!

Needs Kotlin 2.0.0 and kotlinx.serialization 1.7.0

You need at least Kotlin 2.0.0 and kotlinx.serialization 1.7.0 to use Avro4k v2.0.0+ (the version matrix is in the README), as there are breaking changes in the kotlinx-serialization plugin and library (released in tandem with the Kotlin version).

More information here: kotlinx-serialization v1.7.0
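As a rough sketch, a Gradle Kotlin DSL build meeting these requirements could look like the following. The artifact coordinates are an assumption based on how avro4k is usually published; check the README for the exact dependency line.

```kotlin
// build.gradle.kts (sketch; artifact coordinates are assumed, see the README)
plugins {
    kotlin("jvm") version "2.0.0"
    // The serialization compiler plugin is released in tandem with Kotlin.
    kotlin("plugin.serialization") version "2.0.0"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("com.github.avro-kotlin.avro4k:avro4k-core:2.0.0")
}
```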

ExperimentalSerializationApi

Since the API has changed deeply, all the new functions, properties, classes and annotations that are annotated with ExperimentalSerializationApi will show a warning, as they could change at any moment. Those members will be un-annotated after a few releases, once they have proven their stability 🪨

You may see a lot of ExperimentalSerializationApi warnings, as everything has been reworked. The common APIs may stabilize more quickly, so they could be un-annotated in the next minor release. The more complex or less used APIs could be un-annotated later.

To suppress this warning, you may opt in to the experimental serialization API. It is advised not to opt in globally through the compiler arguments, to avoid surprises when using experimental stuff 😅
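A local opt-in could look like the sketch below. The Project class and the projectSchema function are illustrative, and whether Avro.schema itself is annotated as experimental in your version is an assumption; the point is only where the @OptIn goes.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import org.apache.avro.Schema

@Serializable
data class Project(val name: String, val language: String)

// Opt in only where the experimental API is actually used,
// instead of enabling it for the whole module via compiler arguments.
@OptIn(ExperimentalSerializationApi::class)
fun projectSchema(): Schema = Avro.schema<Project>()
```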

Warning

Removing an API annotated with ExperimentalSerializationApi won't be considered a breaking change under the semver standard, so given a version A.B.C, only the minor number B will be incremented, not the major number A.

Direct binary serialization

Before, serializing Avro with Avro4k went through a generic step: data classes were first converted to generic maps, and this generic data was then passed to the Apache Avro library.

Now, encoding to and decoding from binary is done directly, which improves performance a lot (see the Performances & benchmark section).

Note

We will keep supporting generic data serialization until there is a solution for Kafka schema registry serialization (a future avro4k module to be created), but it may be removed afterwards to simplify the avro4k library, as it is more a conversion than a real serialization.

Support anything to encode and decode at root level

Before, we were only able to encode and decode GenericRecord. No primitive, no arrays, no value class, just generic records.

Now, no need to wrap your value in a record, you can serialize nearly everything and generate the corresponding schema!

This includes any data class, enum, sealed interface or class, value class, primitive values or contextual serializers 🚀
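A minimal sketch of root-level values, assuming the reified Avro.schema, encodeToByteArray and decodeFromByteArray helpers shown in the README. UserId is a made-up value class.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*

@JvmInline
@Serializable
value class UserId(val value: String)

fun main() {
    // A primitive at the root: no wrapping record needed.
    val intBytes = Avro.encodeToByteArray(42)
    println(Avro.decodeFromByteArray<Int>(intBytes))

    // A list at the root.
    val listBytes = Avro.encodeToByteArray(listOf("a", "b"))
    println(Avro.decodeFromByteArray<List<String>>(listBytes))

    // A value class at the root: print the schema generated for it.
    println(Avro.schema<UserId>())
}
```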

Totally new API

The previous API was hard to use without a good understanding of it, especially when working with InputStream and OutputStream.

There are now different entrypoints for different purposes:

  • Avro: the main entrypoint to generate schemas, encode and decode in the avro format. This is the pure raw avro format without anything else around it.
  • AvroObjectContainer: the entrypoint to encode avro data files, following the official spec, and using Avro for each value serialization.
  • AvroSingleObject: the entrypoint for encoding a single object prefixed with the schema fingerprint, following the official spec, and also using Avro for value serialization.

Warning

Avro.encodeToByteArray now encodes in pure binary Avro. If you still need to encode in the object container format as in v1 (the DATA format), you have to use AvroObjectContainer.
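To make the difference concrete, here is a hedged sketch of the two entrypoints side by side. The Avro calls follow the README; the AvroObjectContainer.encodeToStream and decodeFromStream signatures are written as we understand the v2.0.0 API and should be treated as assumptions to verify against the README.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream

@Serializable
data class Project(val name: String, val language: String)

fun main() {
    val project = Project("avro4k", "Kotlin")

    // Avro: raw binary payload, no header and no embedded schema.
    val rawBytes = Avro.encodeToByteArray(project)
    println(Avro.decodeFromByteArray<Project>(rawBytes))

    // AvroObjectContainer: data file format, with a header carrying the writer schema.
    // (Assumed signature: a sequence of values plus an output stream.)
    val out = ByteArrayOutputStream()
    AvroObjectContainer.encodeToStream(sequenceOf(project), out)

    // Decoding returns a sequence of values, materialized here with toList().
    val decoded = AvroObjectContainer
        .decodeFromStream<Project>(ByteArrayInputStream(out.toByteArray()))
        .toList()
    println(decoded)
}
```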

Implicit nulls by default

Previously, decoding failed when a nullable field was missing from the writer schema.

Now, every nullable field missing from the writer schema is decoded as null instead of failing. To opt out of this feature, configure your Avro instance with implicitNulls = false.

It is enabled by default to simplify the use of Avro4k and make it more lenient, for better adoption.

Implicit empty maps, collections and arrays by default

Previously, decoding failed when a map or collection-like field was missing from the writer schema.

Now, such a field is decoded as an empty collection instead of failing (an empty map, list, array or set depending on the field type). To opt out of this feature, configure your Avro instance with implicitEmptyCollections = false.

It is enabled by default to simplify the use of Avro4k and make it more lenient, for better adoption.
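The sketch below illustrates both implicit behaviours and their opt-out. UserV1 and UserV2 are made-up classes sharing the same serial name, and the decodeFromByteArray overload taking a writer schema is assumed from the v2 API; treat it as an illustration rather than a reference.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*

// Older writer-side shape: only a name.
@Serializable
@SerialName("User")
data class UserV1(val name: String)

// Newer reader-side shape: a nullable field and a list were added, without defaults.
@Serializable
@SerialName("User")
data class UserV2(val name: String, val nickname: String?, val tags: List<String>)

fun main() {
    val writerSchema = Avro.schema<UserV1>()
    val bytes = Avro.encodeToByteArray(UserV1("alice"))

    // Default instance: the missing fields are expected to decode as null and an empty list.
    val lenient = Avro.decodeFromByteArray<UserV2>(writerSchema, bytes)
    println(lenient)

    // Strict instance: opting out of both behaviours.
    val strictAvro = Avro {
        implicitNulls = false
        implicitEmptyCollections = false
    }
    val strictResult = runCatching { strictAvro.decodeFromByteArray<UserV2>(writerSchema, bytes) }
    println(strictResult.isFailure)
}
```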

Lenient

The Apache Avro library is strict regarding types and closely follows the Avro spec. As an example, a float in Kotlin is written as a float, and can be decoded as either a float or a double.

Avro4k pushes leniency further: a float can be written and read as a float, a double, a string, an int or a long in Avro.

A type matrix is available in the README.

No more reflection

Thanks to this little change, there is absolutely no more reflection, which allows you to use Android or GraalVM AOT native compilation (not tested, but should work, let us know!).

Unified & cleaned annotations

  • AvroJsonProp has been merged into AvroProp: JSON content is automatically detected, so any non-JSON content is handled as a string
  • AvroAliases has been merged into AvroAlias: it now takes a vararg, so you can pass as many aliases as you want with the same annotation
  • AvroInline has been removed in favor of Kotlin's native value class
  • AvroEnumDefault is now to be applied directly on the default enum member
  • ScalePrecision has been renamed to AvroDecimal to share the common Avro prefix. Also, the decimal's scale and precision do not have defaults anymore
  • AvroNamespace and AvroName have been replaced by the native kotlinx-serialization SerialName annotation
  • AvroStringable has been added to easily force a field type to be inferred as a string (this works for all the primitive types and the built-in logical types)
  • AvroFixed now only applies to compatible types (ByteArray, String, decimal logical type); annotating other types has no effect
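Put together, the annotation changes above could look like this on a model. Order, OrderId and Status are made-up names, and the exact annotation parameters (for example scale and precision, or the fixed size) are written as we understand them from the README, so double-check them there.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*
import java.math.BigDecimal

// A Kotlin value class replaces the removed AvroInline annotation.
@JvmInline
@Serializable
value class OrderId(val value: String)

@Serializable
enum class Status {
    @AvroEnumDefault
    UNKNOWN, // the default is declared directly on the enum member

    PENDING,
    SHIPPED,
}

@Serializable
@SerialName("com.example.Order") // replaces AvroNamespace and AvroName
data class Order(
    val id: OrderId,
    @AvroAlias("state", "orderStatus") val status: Status,   // vararg aliases
    @AvroProp("owner", "billing-team") val customer: String, // non-JSON value, kept as a string
    @Contextual @AvroDecimal(scale = 2, precision = 10) val total: BigDecimal, // no defaults anymore
    @AvroStringable val quantity: Int,                       // inferred as an Avro string
    @AvroFixed(16) val checksum: ByteArray,
)

fun main() {
    // Pretty-print the generated schema to inspect the effect of each annotation.
    println(Avro.schema<Order>().toString(true))
}
```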

Only ByteArray is now handled as BYTES

Previously, all collection-like types of bytes were handled as BYTES.

Now, only ByteArray is handled as BYTES, and the other collection-like types of bytes are handled as arrays of INT. If you still want to encode a BYTES type, just use the ByteArray type or write your own AvroSerializer to control the schema and its serialization.
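A quick way to see the difference is to inspect the generated schema. Payload is a made-up class, and the comments state what the text above leads us to expect.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*
import org.apache.avro.Schema

@Serializable
data class Payload(
    val raw: ByteArray,    // expected Avro type: bytes
    val boxed: List<Byte>, // expected Avro type: array of int
)

fun main() {
    val schema: Schema = Avro.schema<Payload>()
    println(schema.getField("raw").schema().type)
    println(schema.getField("boxed").schema().type)
    println(schema.getField("boxed").schema().elementType.type)
}
```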

Improved custom serializers API

The custom serializer API, AvroSerializer, has been improved to force custom encodings to provide their own schema (previously this was optional and covered only a subset of the use cases).

It also provides two additional methods, serializeGeneric and deserializeGeneric, so that a custom serializer can be used by other, non-Avro formats. That way, we can now use the same classes and serializers for both Avro and JSON formats 🚀

Finally, AvroEncoder now provides encodeResolving and AvroDecoder provides decodeResolving, which delegate the possible union resolution so that the custom serialization can focus on the main types.
They are public because they are heavily used internally, and they provide a clean and performant way to handle unions thanks to inlined functions. Note that they are still experimental and could change in the future.

So for any custom serialization, schema, or logical type, you must implement your own AvroSerializer.
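As an illustration, a custom serializer for a made-up SerialNumber type might look like the sketch below. The method names and the SchemaSupplierContext parameter follow our reading of the v2 README; the package names in the imports are assumptions.

```kotlin
import com.github.avrokotlin.avro4k.AvroDecoder
import com.github.avrokotlin.avro4k.AvroEncoder
import com.github.avrokotlin.avro4k.serializer.AvroSerializer
import com.github.avrokotlin.avro4k.serializer.SchemaSupplierContext
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder
import org.apache.avro.Schema

// Hypothetical domain type, used only for illustration.
data class SerialNumber(val value: Int)

object SerialNumberSerializer : AvroSerializer<SerialNumber>("SerialNumber") {
    // The serializer has to declare the schema it writes.
    override fun getSchema(context: SchemaSupplierContext): Schema = Schema.create(Schema.Type.INT)

    // Avro-specific path.
    override fun serializeAvro(encoder: AvroEncoder, value: SerialNumber) = encoder.encodeInt(value.value)

    override fun deserializeAvro(decoder: AvroDecoder): SerialNumber = SerialNumber(decoder.decodeInt())

    // Generic path, used when the same serializer runs under a non-Avro format such as JSON.
    override fun serializeGeneric(encoder: Encoder, value: SerialNumber) = encoder.encodeInt(value.value)

    override fun deserializeGeneric(decoder: Decoder): SerialNumber = SerialNumber(decoder.decodeInt())
}
```

It can then be attached with @Serializable(with = SerialNumberSerializer::class) on a property, or passed explicitly to the encode and decode functions.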

Caching

All schemas are cached using WeakIdentityHashMap to allow the GC to remove the cache entries in case of low available memory.

Also, some other internal expensive parts are cached for quicker encoding and decoding.

Normally we could use a WeakHashMap but we cannot rely on the equals/hashCode as different classes could have the same serial descriptor.
It should not happen, but let's be safe first and then iterate on it if needed 🛡
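To illustrate why identity matters here, the sketch below compares an equality-based map with java.util.IdentityHashMap, using a stand-in Descriptor class. It is not the library's WeakIdentityHashMap and it does not hold keys weakly; it only shows the key-comparison difference.

```kotlin
import java.util.IdentityHashMap

// Stand-in for a descriptor: two distinct instances can compare equal.
data class Descriptor(val serialName: String)

// Returns the sizes of an equality-keyed and an identity-keyed cache
// after inserting two equal but distinct keys.
fun cacheSizes(): Pair<Int, Int> {
    val first = Descriptor("com.example.User")
    val second = Descriptor("com.example.User") // equal by value, different instance

    val byEquality = HashMap<Descriptor, String>()
    byEquality[first] = "schema for first"
    byEquality[second] = "schema for second" // overwrites the first entry

    val byIdentity = IdentityHashMap<Descriptor, String>()
    byIdentity[first] = "schema for first"
    byIdentity[second] = "schema for second" // kept as a separate entry

    return byEquality.size to byIdentity.size
}

fun main() {
    println(cacheSizes()) // (1, 2)
}
```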

New logical type: duration

Following the latest Avro spec, the logical type duration has been added to the built-in logical types. It has been implemented for the following types:

  • kotlin.time.Duration (do not annotate it with @Serializable or @Contextual as it is a native kotlinx-serialization type)
  • A new avro4k class AvroDuration (same)
  • java.time.Duration (this time it needs to be annotated with @Serializable(with = JavaDurationSerializer::class) or @Contextual)
  • java.time.Period (this time it needs to be annotated with @Serializable(with = JavaPeriodSerializer::class) or @Contextual)
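A hedged sketch of how these types might be declared on a model follows. Job is a made-up class, and we assume the default Avro instance resolves @Contextual for the java.time types.

```kotlin
import com.github.avrokotlin.avro4k.*
import kotlinx.serialization.*
import kotlin.time.Duration

@Serializable
data class Job(
    val timeout: Duration,                          // kotlin.time.Duration: no extra annotation
    @Contextual val retryDelay: java.time.Duration, // java.time.Duration: needs @Contextual or an explicit serializer
    @Contextual val retention: java.time.Period,    // java.time.Period: same
)

fun main() {
    // Print the schema to check that each field uses the duration logical type.
    println(Avro.schema<Job>().toString(true))
}
```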

Better documentation

Last but not least, all the documentation has been reworked from scratch to fit all that new stuff! 📚

What's Changed

Full Changelog: v1.10.1...v2.0.0