Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-3187] schema registry - json schema: support allOf oneOf anyOf #21354

Merged

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Jul 12, 2024

Add preliminary support to perform compatibility checks on allOf oneOf anyOf combinators

this first support requires that

  • both schemas use the same combinator
  • the schema arrays have compatible sizes
  • that each subschema from the smaller set has a distinct compatible pattern in the other set

These requirements are stricter than necessary and will be relaxed in a future commit.

Additional work in this PR:

  • relax multipleOf requirement: to support is_superset({"type": "number", "multipleOf": 1.1}, {"type": "number", "multipleOf": 3.3}), the remainder of 3.3/1.1 is compared against 3*ulp(newer)
  • adds support for true to alias {} and false to alias {"not":{}} (from draft6 and beyond)

Fixes https://redpandadata.atlassian.net/browse/CORE-432

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool

src/v/pandaproxy/schema_registry/json.cc Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/json.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/schema_registry/test/test_json_schema.cc Outdated Show resolved Hide resolved
Comment on lines 1801 to 1255
// size are compatible, now we need to check that every schema from
// the smaller schema array has a unique compatible schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "smaller schema array" correct in this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

for allOf, newer can have more subschemas than older (each new subschema is more rules)
for oneOf/anyOf, newer can have less subschemas than older (any subschema removed reduces the degrees of freedom)
the algo tries to do a match and we are concerned to ensure that the smaller subschema array is covered

@andijcr
Copy link
Contributor Author

andijcr commented Jul 24, 2024

/dt

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jul 24, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/51953#0190e3da-bbed-46bf-b8b8-fbb4e2326391:

"rptest.tests.test_si_cache_space_leak.ShadowIndexingCacheSpaceLeakTest.test_si_cache.message_size=10000.num_messages=100000.concurrency=2"

new failures in https://buildkite.com/redpanda/redpanda/builds/51979#0190e58b-183e-4acd-94ab-f2f96a4f3122:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

new failures in https://buildkite.com/redpanda/redpanda/builds/51979#0190e5a4-8a7d-4b2a-b9c6-9591869ea282:

"rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure"

@andijcr andijcr force-pushed the feat/core-3187/schema_registry_unions branch from d6cdf66 to bb90bd6 Compare July 24, 2024 15:07
@andijcr andijcr requested review from BenPope and pgellert July 24, 2024 15:45
pgellert
pgellert previously approved these changes Jul 25, 2024
Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@andijcr andijcr added this to the v24.2.1-final-RC milestone Jul 25, 2024
@andijcr
Copy link
Contributor Author

andijcr commented Jul 25, 2024

issues are known: #21626
#15479
#16208

BenPope
BenPope previously approved these changes Jul 25, 2024
andijcr added 9 commits July 25, 2024 11:22
a valid schema should be compatible with itself, check this in
test_compatibility_check
…ions

allow a test case to throw an exception as failure mode
from draft6, true/false are valid schemas and have to be treated as
{}/{"not":{}}

this commit:

- changes the return value for
get_true_schema/get_false_schema to ConstObject, needed to uniform to
get_object_or_empty,

- adds get_schema, helper that translates
missing -> {}
boolean -> {}/{"not":{}}
object -> object

- modify is_superset to translate its inputs with get_schema

- modify get_object_or_empty to translate its input with get_schema

the change allows deleting the helpers for
additionalProperties/additionalItems
increase the threshold to accept
{"type": "number", "multipleOf": 1.1}
{"type": "number", "multipleOf": 3.3}

std::abs(std::reminder(3.3, 1.1)) is ~5e-16

this is a fairly small value, but extremely bigger than the previous
10*epsilon limit

the new threshold depends on the Unit of Least Precision (ULP) relative
to the bigger value.
This value is the distance to the next floating
point value, and RapidJson documents that it's string->double parsing
code has at most a 3*ulp error. For this reason, this is the threshold
used to check if the reminder is close enough to zero
this check is meant to cover anyOf/oneOf/allOf.

base function that just check that no combinator is used by both schemas

note: json schema allows more than one combinator to appear in a schema,
but this it's difficult to define how is_superset should behave in this
case. since this seems to be an edge case, this implementation will
simply throw an exception.
the schemas themself can still be used but it's not possible to perform
compatibility checks on them
given the current implementation, a combinator can have 4 states
not set, allOf, oneOf, anyOf.

in case of state mismatch between schema, this commit throws an
exception as not implemented (that gets converted to false)

this is overly restrictive:
some combinations, like (not set, anyOf), or (anyOf, not set) or cases
where the schema array has only one value, might be compatible.

this will be revisited in a later commit
check that if both schema use the same combinator, then the schema
arrays have a compatible size.

for allOf, newer can have more schemas, because this can be seen as more
restritive rules
for oneOf/anyOf, newer can have less schemas, because this reduces the
json that can be valid
…hing

once the function has two schema arrays of suitable cardinality, we need
to check that every schema of the smaler set has a distinct match in the
other set.

this is solved in a general way by constructing a bipartite graph, and
computing the maximum cardinality matching on the graphs.

an example:
older= {"anyOf": [A, B, C, D]}
newer= {"anyOf": [1, 2, 3]}

for each pair
o ∈ [A, B, C, D]
n ∈ [1, 2, 3]
if (is_superset(o, n)) then we add a edge between o and n

so we find out that the list of edges is
[(A, 2), (A, 3), (B,1), (C, 2)],

one maximum cardinality matching is
[(A, 3), (B, 1), (C, 2)].
every subschema from newer (them smaller set) has a distinct companion
in older, meaning that we are sure that is_superset(older, newer) every
json matched by one of the subschemas of newer will be matched by a
subschema of older
testing shows that compatibility checks don't need to take in
consideration the "if"/"then"/"else" constructs
@andijcr andijcr dismissed stale reviews from BenPope and pgellert via c5cbb15 July 25, 2024 09:26
@andijcr andijcr force-pushed the feat/core-3187/schema_registry_unions branch from bb90bd6 to c5cbb15 Compare July 25, 2024 09:26
@andijcr
Copy link
Contributor Author

andijcr commented Jul 25, 2024

force push to fix merge conflicts

@vbotbuildovich
Copy link
Collaborator

@BenPope BenPope merged commit e42b5b2 into redpanda-data:dev Jul 25, 2024
19 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-21354-v24.2.x-891 remotes/upstream/v24.2.x
git cherry-pick -x c0bcd7f32b5f38223aef4ee99f239b787054ea5a 7f94644dbf2e009c23005e0a3ee550ece9d403b7 ea280172863467880c64b61b010961f78632aa2a 14c82a5741bbb9ef1cf25b3fbeef14e344f1b082 c26348dd2a35c1f3718fc649a23e69774b1f0d5e 1fae719f2d8e834bf35946421f4f0c3ee3250b43 baa5164996c970090724b55371d1cf4be15898e0 24a720d66af8d6a08d1daaf55c76e9472bb85d85 c5cbb156cc48b0390d214343050d6047db9b576f

Workflow run logs.

@andijcr andijcr deleted the feat/core-3187/schema_registry_unions branch July 25, 2024 13:25
@andijcr
Copy link
Contributor Author

andijcr commented Jul 25, 2024

/backport v24.2.x

throw as_exception(invalid_schema(fmt::format(
"{} not implemented for different combinators. input: older: '{}', "
"newer: '{}'",
__FUNCTION__,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use fmt_with_context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't get the function name with that if i'm not mistaken

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeh but you get line number, so that's good enough?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants