Fix Union type with dataclass ambiguous error and support superset comparison #5858

Merged
merged 30 commits into flyteorg:master on Nov 20, 2024

Conversation

@mao3267 (Contributor) commented Oct 18, 2024

Summary

We only support the following JSON schema comparison cases:

  1. identical JSON schemas
  2. big -> small (superset)
  3. inheritance (superset)

Tracking issue

Related to #5489

Why are the changes needed?

When a function accepts a Union of two dataclasses as input, Flyte cannot distinguish which dataclass matches the user's input. Because Flyte compares only the simple types, both dataclasses are identified as flyte.SimpleType_STRUCT in this scenario. As a result, the input matches multiple union variants, and the ambiguity leads to an error.

union_test_dataclass.py

from typing import Union
from dataclasses import dataclass
from dataclasses_json import dataclass_json
from flytekit import task, workflow

@dataclass_json
@dataclass
class A:
    a: int


@dataclass_json
@dataclass
class B:
    b: int


@task
def bar() -> A:
    return A(a=1)

@task
def foo(inp: Union[A, B]):
    print(inp)

@workflow
def wf():
    v = bar()
    foo(inp=v)

What changes were proposed in this pull request?

  1. To distinguish between different types transported as protobuf structs (dataclass, Pydantic.BaseModel), we compare their JSON schemas, generated by marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema (draft 2020-12) for dataclasses, and by Pydantic itself for Pydantic.BaseModel. To check equivalence, we compare the marshaled bytes of the JSON schemas when they share the same draft version; for now, we only support comparing schemas of the same version.

  2. We plan to support superset matching for dataclass/Pydantic.BaseModel schemas in draft 2020-12, meaning that class A and class supersetA can be a match, as sketched after this list (a Pydantic.BaseModel example is in the screenshot section).

  3. Unit tests for different versions of json schema, including one-level, two-level, and strict subset examples in draft 2020-12 from mashumaro.
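
A minimal sketch of the pairing item 2 refers to; the actual supersetA definition lives in the linked test repo, so the extra fields and their defaults below are illustrative assumptions:

from dataclasses import dataclass
from typing import Optional

@dataclass
class A:
    a: int

@dataclass
class supersetA:
    a: int
    b: Optional[int] = None
    c: str = "hello"

# With superset matching, these two schemas are treated as compatible rather
# than requiring a byte-identical match; see the discussion below for which
# direction (upstream vs. downstream) ended up being supported.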

How was this patch tested? (my local testing repo here)

  1. Should Pass Tests
    • Run an example using union input with identical dataclass on remote (union_test_dataclass.py)
    • Run an example with inheritance on remote (union_test_inheritance.py)
    • Run an example using union input with identical BaseModel class on remote (union_test_basemodel.py)
  2. Should Fail Tests
    • upstream parent and downstream inherit child (dataclass_child_to_parent.py)
    • two level dataclass with different properties (dataclass_two_level.py)

Setup process

git clone https://github.com/flyteorg/flyte.git
gh pr checkout 5858
make compile
POD_NAMESPACE=flyte ./flyte start --config flyte-single-binary-local.yaml

Screenshots

  1. Example using union input with identical dataclass on remote (union_test_dataclass.py) [screenshot]

  2. Example with inheritance on remote (union_test_inheritance.py) [screenshot]

  3. Example using union input with identical BaseModel class on remote (union_test_basemodel.py) [screenshot]

  4. Example with upstream parent and downstream inherited child (dataclass_child_to_parent.py) [screenshot]

  5. Two-level dataclass with different properties (dataclass_two_level.py) [screenshot]

Notes

Optional values in JSON Schemas

In Python, an Optional field accepts both NoneType and the target type, but a field is only optional in the schema sense if it also declares a default value. This is why Optional properties without default values are still listed under required in the generated JSON schemas.
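
A minimal sketch of this behaviour, assuming mashumaro's build_json_schema (the draft 2020-12 generator mentioned earlier) as the schema source:

from dataclasses import dataclass
from typing import Optional

from mashumaro.jsonschema import build_json_schema

@dataclass
class Example:
    no_default: Optional[int]           # Optional, but no default value
    with_default: Optional[int] = None  # Optional with a default

schema = build_json_schema(Example).to_dict()
# Per the note above, only the field without a default is expected to be
# listed under "required"; both fields still accept null in their type union.
print(schema.get("required"))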

Ambiguity vs. Inheritance

Currently, we compare only the structure of the dataclass, excluding the title, as including it would block inheritance. This approach introduces ambiguity because it allows dataclasses with different names but the same structure to match. To resolve this, we may require additional information from Flytekit to determine whether two dataclasses have an inheritance relationship. The following example illustrates this challenge: B should match A, but C should not.

@dataclass
class A:
    a: int

@dataclass
class B(A):
    b: int

@dataclass
class C:
    a: int

JSON Schema Compare Package (LGPL License)

We are now using the json-schema-compare package for strict-subset matching. However, since the underlying JSON Schema compiler only supports drafts up to draft-07, it does not handle the prefixItems field when tuples are used in a dataclass.
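
For illustration, a dataclass with a tuple field; draft 2020-12 describes fixed-size tuples with prefixItems, which a comparator limited to draft-07 keywords will not inspect (the exact fragment emitted by mashumaro is an assumption here):

from dataclasses import dataclass
from typing import Tuple

from mashumaro.jsonschema import build_json_schema

@dataclass
class WithTuple:
    pair: Tuple[int, str]

# Assumed draft 2020-12 fragment for "pair":
#   {"type": "array", "prefixItems": [{"type": "integer"}, {"type": "string"}], ...}
print(build_json_schema(WithTuple).to_dict()["properties"]["pair"])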

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

None

Docs link

TODO

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 70.49180% with 18 lines in your changes missing coverage. Please review.

Project coverage is 37.07%. Comparing base (d2ddfe2) to head (7f28c35).
Report is 6 commits behind head on master.

Files with missing lines | Patch % | Lines
flytepropeller/pkg/compiler/validators/typing.go | 70.49% | 13 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5858      +/-   ##
==========================================
- Coverage   37.14%   37.07%   -0.07%     
==========================================
  Files        1313     1316       +3     
  Lines      131619   132176     +557     
==========================================
+ Hits        48891    49010     +119     
- Misses      78481    78908     +427     
- Partials     4247     4258      +11     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.10% <ø> (ø)
unittests-flytecopilot 22.23% <ø> (ø)
unittests-flytectl 62.46% <ø> (ø)
unittests-flyteidl 7.25% <ø> (ø)
unittests-flyteplugins 53.72% <ø> (+0.03%) ⬆️
unittests-flytepropeller 42.68% <70.49%> (-0.43%) ⬇️
unittests-flytestdlib 57.59% <ø> (ø)


@pingsutw pingsutw assigned mao3267 and pingsutw and unassigned pingsutw Oct 23, 2024
@Future-Outlier Future-Outlier self-assigned this Oct 28, 2024
@fg91 (Member) commented Nov 1, 2024

One note @mao3267, this problem affects not only dataclasses but any other type that uses protobuf struct as transport, and even combinations of different types that all use protobuf struct.
We, for instance, have an internal type transformer for pydantic base models (for historic reasons, before an official one was introduced as a plugin). It has exactly the same problem because it also uses protobuf struct.
It would be nice if the solution found for this problem were general and not only working for dataclasses. That's what I'm slightly worried about when reading

distinguish between different dataclasses, we compare their JSON schemas generated by either marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema

Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic".

TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well.

@mao3267 mao3267 changed the title [WIP] Fix Union type with dataclass ambiguous error and support superset comparison Fix Union type with dataclass ambiguous error and support superset comparison Nov 8, 2024
@Future-Outlier (Member):

One note @mao3267, this problem affects not only dataclasses but any other type that uses protobuf struct as transport, and even combinations of different types that all use protobuf struct. We, for instance, have an internal type transformer for pydantic base models (for historic reasons, before an official one was introduced as a plugin). It has exactly the same problem because it also uses protobuf struct. It would be nice if the solution found for this problem were general and not only working for dataclasses. That's what I'm slightly worried about when reading

distinguish between different dataclasses, we compare their JSON schemas generated by either marshmallow_jsonschema.JSONSchema (draft-07) or mashumaro.jsonschema.build_json_schema

Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic".

TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well.

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and pydantic BaseModel.

@Future-Outlier (Member):

Let's get this done this week @mao3267

@mao3267 mao3267 marked this pull request as ready for review November 11, 2024 02:48
@mao3267 (Contributor, Author) commented Nov 11, 2024

Just discussed with @mao3267; he will explain how it works now. In summary, this will support both dataclass and pydantic BaseModel.

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures. For dataclass_json, we only support equivalence without superset matching due to significant differences between its JSON schema (draft-07) and the newer version used by both Pydantic and dataclass (draft 2020-12).

Although the schemas from Pydantic BaseModel and dataclass adhere to the same draft version, they differ in how they handle certain fields. For instance, only the schema generated by Pydantic BaseModel records a title in properties. Additionally, the required and additionalProperties fields are omitted if no required properties exist or if additional properties are disallowed. To address these discrepancies, we preprocess the schemas before comparison by removing the title fields, and we adjust the comparison logic for the required and additionalProperties fields.
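
A rough Python sketch of the preprocessing idea; the real logic lives in flytepropeller and is written in Go, so the helper below is illustrative only (assuming mashumaro and Pydantic v2 as the schema sources):

from dataclasses import dataclass

from mashumaro.jsonschema import build_json_schema
from pydantic import BaseModel

@dataclass
class DC:
    a: int

class BM(BaseModel):
    a: int

def strip_titles(schema: dict) -> dict:
    # Drop "title" at the top level and inside each property; titles differ
    # between the two generators and would also block inheritance matching.
    # The PR additionally aligns how missing "required"/"additionalProperties"
    # entries are treated during the comparison itself; that part is omitted here.
    out = {k: v for k, v in schema.items() if k != "title"}
    if "properties" in out:
        out["properties"] = {
            name: {k: v for k, v in prop.items() if k != "title"}
            for name, prop in out["properties"].items()
        }
    return out

print(strip_titles(build_json_schema(DC).to_dict()))
print(strip_titles(BM.model_json_schema()))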

@wild-endeavor (Contributor) commented Nov 11, 2024

The exact-match part is okay and we should fix that, but I'm confused about the other part.

looking at this example

@task
def my_task(input: Union[supersetA, B]):
    print(input)

@workflow
def wf():
    a = foo()
    my_task(a)

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.

Why doesn't mypy complain in this example? Or does it?

I almost feel like if we're going to go down this route, it should be the other way around: if foo returned supersetA and my_task took in a union of A and B. The reason is that supersetA contains more fields than A.

cc @fg91 if you want to take a look as well.

Wasn't there another PR where we were discussing an LGPL library also?

@@ -19,6 +19,8 @@ type trivialChecker struct {
}

func removeTitleFieldFromProperties(schema map[string]*structpb.Value) {
// TODO: Explain why we need this
// TODO: give me an example comparing dataclass vs. Pydantic BaseModel
@mao3267 (Contributor, Author):

This is an example comparing dataclass and Pydantic BaseModel. As shown, the schema for the dataclass includes a title field that records the name of the class. Additionally, the additionalProperties field is absent from the Pydantic BaseModel schema, whereas the dataclass schema sets it to false. cc @eapolinario

[side-by-side screenshots: dataclass schema vs. Pydantic.BaseModel schema]

@mao3267 (Contributor, Author):

For the requested comment, writing out the entire schema would make it too lengthy. Would it be acceptable to use something like this instead?

class A:
	a: int

Pydantic.BaseModel: 	{"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass: 			{"properties": {"a": {"type": "integer"}}, "additionalProperties": false}

@fg91 (Member) commented Nov 12, 2024:

Are you proposing to preprocess the schemas so that one can mix and match dataclasses and base models given their schemas are aligned? I.e. task expects a dataclass with schema "A" and I pass a base model that has the same schema.

I personally feel this is not necessary and think it would be totally acceptable to consider a dataclass and a base model not a match by default. Especially if this makes things a lot more complicated in the backend otherwise because the schemas need to be aligned. What do you think about this?

If you are confident in the logic I'm of course not opposing the feature but if you feel this makes things complicated and brittle, I'd rather keep it simple and more robust.

@mao3267 (Contributor, Author):

I think it actually makes things more complicated; I will remove the related logic.

@mao3267 (Contributor, Author) commented Nov 12, 2024

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match Class A.
Mypy doesn't report any errors. I am not familiar with mypy, what kind of error do you expect mypy to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this? By the way, I'm also curious about the reason for supporting this superset matching. It would help us decide our route.

wasn't there another pr where we were discussing an LGPL library also?

The LGPL discussion applies to this PR as well; it is not mentioned because we are no longer using it.

cc @Future-Outlier @wild-endeavor

@fg91 (Member) commented Nov 12, 2024

Currently, we support both Pydantic BaseModel and dataclass, including their combinations and nested structures.

@Future-Outlier @mao3267
My question above wasn't specifically about base models but about how generalizable the solution is.
Let's consider the scenario that an org wants to build a custom internal type transformer for a json-like type similar to dataclasses or base models.
Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it.

It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective to_literal_type implementation so that the backend can automatically make use of it in the union type checker.
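
For concreteness, a rough sketch of what "providing the schema" could look like for a hypothetical custom transformer; MyJsonType, the transformer name, and the assumption that propeller reads the schema from the LiteralType metadata are all illustrative, and the exact contract is precisely what this comment asks to have documented:

from dataclasses import dataclass
from typing import Type

from flytekit.core.type_engine import TypeTransformer
from flytekit.models.types import LiteralType, SimpleType
from mashumaro.jsonschema import build_json_schema

@dataclass
class MyJsonType:  # hypothetical user-defined JSON-like type
    a: int

class MyJsonTypeTransformer(TypeTransformer[MyJsonType]):
    def __init__(self):
        super().__init__(name="my-json-type-transformer", t=MyJsonType)

    def get_literal_type(self, t: Type[MyJsonType]) -> LiteralType:
        # Attach a JSON schema to the literal type's metadata so that the
        # backend could compare it against other struct-typed union variants.
        schema = build_json_schema(t).to_dict()
        return LiteralType(simple=SimpleType.STRUCT, metadata=schema)

    # to_literal / to_python_value omitted for brevity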

@wild-endeavor (Contributor) commented Nov 12, 2024

can you explain why this should work? foo creates an A object with only one field. my_task is a task that takes in supersetA or B. B is not relevant here. supersetA takes in three fields.
Why doesn't mypy complain in this example? Or does it?

Class B is used as an example of a type that does not match Class A. Mypy doesn't report any errors. I am not familiar with mypy, what kind of error do you expect mypy to raise?

I almost feel like if we're going to go down this route, it should be the other way around. If foo returned supersetA and my_task took in union of A and B. The reason is

In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this?

Oh got it, but this only works because there are defaults, right? The original case you linked to should work because of those defaults.
So if your supersetA was

@dataclass
class A:
    a: int
    c: str                   # no default
    b: Optional[int] = None

then this should not work correct? (because c is missing).

By the way, I'm also curious about the reason for supporting this superset matching. It would help us decide our route.

I don't think of this as superset matching. I think of this as schema compatibility, which is why I was thinking we'd find some off-the-shelf library that can just do it for us. 'Is this schema compatible with this other schema?'

Wasn't there another PR where we were discussing an LGPL library also?
The LGPL discussion applies to this PR as well; it is not mentioned because we are no longer using it.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

Re @fg91's comments

Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

I don't know the answer but it should be yes to the first question. It should be possible to easily provide a json schema and have everything on the backend just work. Isn't this the case @Future-Outlier? There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

@mao3267 (Contributor, Author) commented Nov 13, 2024

Replying @wild-endeavor

then this should not work correct? (because c is missing).

Yes. This should not work.

Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it?

No. I just decided to use another package that directly compares the JSON schema. This is not documented anywhere.

Replying @fg91

My question above wasn't specifically about base models but about how generalizable the solution is.
Let's consider the scenario that an org wants to build a custom internal type transformer for a json-like type similar to dataclasses or base models.
Is there a way they can provide the schema of their type in the to_literal_type method of their type transformer so that the backend can automatically perform schema checks for Union types? Or are there implementation details that are required in the backend too that limit this to dataclasses/base models and would be required for any additional json-like type?

The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it.

It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective to_literal_type implementation so that the backend can automatically make use of it in the union type checker.

To support other custom types, could we restrict the provided schema to one from a specific package, like Mashumaro, for the compatibility check? Without such a restriction, it will be hard to cover all scenarios. Or does anyone know of possible solutions for supporting JSON schemas from different versions and packages?

@fg91 (Member) commented Nov 13, 2024

There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is.

This is exactly what I'm trying to say :)

To support other custom types, could we restrict the provided schema to one from a specific package, like Mashumaro, for the compatibility check? Without such a restriction, it will be hard to cover all scenarios. Or does anyone know of possible solutions for supporting JSON schemas from different versions and packages?

Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type e.g. to distinguish in union types, you need to provide a schema in the to_literal_type method in this specific way". And I personally feel that the dataclass and pydantic type transformers should provide the schema in this general way so that the backend doesn't have to have type specific implementations for dataclasses/base models.
What do you think about this?

(As a side note, as I described here, for cache checks we don't use the schema in metadata but the so-called type structure. Maybe it's difficult to fix this in hindsight but I kinda wished that there was a single unified way type transformers make the schema available to propeller that is used for everything, cache checks, union type checks, ...)

@wild-endeavor (Contributor):

@fg91 and @EngHabu mind chiming in on dataclass compatibility? Just thinking about it from the basics, if I have json schemas representing two dataclasses, let's say,

@dataclass
class A2:
    a: int
    b: Optional[int] = None
    c: str = "hello"

and

@dataclass
class A1:
    a: int

which of the following two are valid?
Case 1

def wf():
  a1 = create_a1()  # -> A1
  use_a2(a2=a1)  # a2: A2

Case 2

def wf():
  a2 = create_a2()  # -> A2
  use_a1(a1=a2)  # a1: A1

Just thinking about compatibility in the loosest sense, both should be valid. The reason is that in the first case, when calling the downstream task use_a2, fields b and c have defaults.

In the second case, the a field can be taken from the a2 object and b and c discarded.

The implication here though is that if you have

@task
def make_a1() -> A1: ...

@task
def use_either(a: typing.Union[A1, A2]): ...

Then this will fail

use_either(a=make_a1())

because A1 will match more than one variant. flytekit itself will not fail I think (right @Future-Outlier?) but we'll never get there because the compiler will fail.

Should we just do exact matches only? Plus, for the two earlier examples (case 1 & 2), both will fail mypy type checking.

mao3267 and others added 6 commits November 15, 2024 10:19
@Future-Outlier (Member) left a comment:

just synced with @mao3267

  • work
  1. same
  2. big -> small
  3. child -> parent
  • not work
  1. different attribute
class A:
	a: int

class B:
	a: int

A -> B

  2. parent -> child

@Future-Outlier (Member) left a comment:

Just synced with @mao3267; here is my summary.

  1. If we support the child -> parent case, we can't detect the ambiguity error for two structurally identical dataclasses.
    Example:
@dataclass
class A:
	a: int

@dataclass
class B:
	a: int
	
# A -> B can pass and work

This is due to the JSON schema's limitations; we would need to use the MRO (or find another protocol) to solve this.
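
As a plain-Python illustration of the extra information an MRO-based check would provide, which the JSON schemas alone cannot carry (class names here are illustrative):

from dataclasses import dataclass

@dataclass
class Parent:
    a: int

@dataclass
class Child(Parent):
    pass

@dataclass
class Lookalike:  # same fields as Parent, but no inheritance relationship
    a: int

print(issubclass(Child, Parent))      # True: a genuine child -> parent relationship
print(issubclass(Lookalike, Parent))  # False: schema comparison alone cannot tell these apart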

Comment on lines +9 to +11
"github.com/santhosh-tekuri/jsonschema"
"github.com/wI2L/jsondiff"
jscmp "gitlab.com/yvesf/json-schema-compare"

santhosh-tekuri/jsonschema uses the Apache License.
wI2L/jsondiff uses the MIT License.
yvesf/json-schema-compare uses the LGPL License.

@Future-Outlier (Member) left a comment:

Just tested all the examples you provided, and even did some additional tests.
Not gonna lie, this is AMAZING.

[screenshot]

@Future-Outlier (Member) left a comment:

LGTM, I really enjoyed working with you.

@Future-Outlier Future-Outlier enabled auto-merge (squash) November 20, 2024 03:57
@Future-Outlier Future-Outlier merged commit 47edd99 into flyteorg:master Nov 20, 2024
53 of 55 checks passed
@fg91 (Member) commented Nov 20, 2024

Thank you for tackling this problem @mao3267 and @Future-Outlier 🙇 🙇

Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type e.g. to distinguish in union types, you need to provide a schema in the to_literal_type method in this specific way".

Would it be possible to document as proposed here how type transformers need to provide the schema?

@mao3267 (Contributor, Author) commented Nov 21, 2024

Would it be possible to document as proposed here how type transformers need to provide the schema?

No problem! Will update the documentation later.
