Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transform from dataclasses_json to DataClassJSONMixin #129

Closed
hhcs9527 opened this issue Jul 24, 2023 · 13 comments
Closed

Transform from dataclasses_json to DataClassJSONMixin #129

hhcs9527 opened this issue Jul 24, 2023 · 13 comments
Labels
wontfix This will not be worked on

Comments

@hhcs9527
Copy link

hhcs9527 commented Jul 24, 2023

Is your feature request related to a problem? Please describe.
Currently, our project tried to migrate the serialize function in dataclasses_json to DataClassJSONMixin.
If we just replace the to_json / from_json function with DataClassJSONMixin ones, the speed remains the same, the reason I believe is the difference in how dataclasses_json implements the to_dict / from_dict.

Describe the solution you'd like
It would be great to have a function to transform the dataclasses_json to DataClassJSONMixin.

Describe alternatives you've considered
or Provide a general way to accept the init data

@dataclass
class StockPosition(DataClassJSONMixin):
   # ticker: str
   # name: str
   # balance: int
   some init functions, can auto-detect the content of the input class.

Additional context
Add any other context about the feature request here.

@Fatal1ty
Copy link
Owner

Fatal1ty commented Jul 24, 2023

Hi @hhcs9527

If we just replace the to_json / from_json function with DataClassJSONMixin ones, the speed remains the same

So that I can help you, could you provide more details about how you did it? I just replaced DataClassJsonMixin from dataclasses-json with DataClassJSONMixin from mashumaro and got significant improvement in from_json / to_json:

Source code
from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin
from timeit import timeit
from dataclasses_json import DataClassJsonMixin


@dataclass
class StockPosition1(DataClassJSONMixin):
    ticker: str
    name: str
    balance: int


@dataclass
class StockPosition2(DataClassJsonMixin):
    ticker: str
    name: str
    balance: int


t1 = timeit(
    "loader(data)",
    globals={
        "loader": StockPosition1.from_json,
        "data": '{"ticker": "AAPL", "name": "Apple", "balance": 42}',
    },
)
print('mashumaro from_json', t1)


t2 = timeit(
    "loader(data)",
    globals={
        "loader": StockPosition2.from_json,
        "data": '{"ticker": "AAPL", "name": "Apple", "balance": 42}',
    },
)
print('dataclasses_json from_json', t2)

obj1 = StockPosition1("AAPL", "Apple", 42)
obj2 = StockPosition2("AAPL", "Apple", 42)

t1 = timeit("dumper()", globals={"dumper": obj1.to_json})
print('mashumaro to_json', t1)


t2 = timeit("dumper()", globals={"dumper": obj2.to_json})
print('dataclasses_json to_json', t2)
mashumaro from_json 1.439956832997268
dataclasses_json from_json 21.087273874989478
mashumaro to_json 1.4008548330166377
dataclasses_json to_json 13.31320291600423

the reason I believe is the difference in how dataclasses_json implements the to_dict / from_dict

AFAIK, dataclasses-json doesn't compile loader / dumper for specific schema as mashumaro does, which is what makes it so slow.

@Fatal1ty Fatal1ty added the needs information Further information is requested label Jul 24, 2023
@hhcs9527
Copy link
Author

@Fatal1ty Thanks for the help.
the following example shows the concept of my problem.

I have a object from @DataClass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.

But I have no idea how to retrieve the data in DataClassJsonMixin to DataClassJSONMixin => which I believe is the main reason that DataClassJSONMixin is fast.

I've tried several ways

  1. init the object as a property of DataClassJSONMixin => fail
  2. replace the function in DataClassJsonMixin to_json / to_dict / from_json / from_dict with DataClassJSONMixin => works, which I have no idea why. However, in a real-life example, performance doesn't work like that.
  3. replace with customize serialization strategy => no idea how to serialize, all I know is to call DataClassJsonMixin.to_dict => but it's just not fast

Here is my example for way 2.
speed : my_portfolio3 (mix one) > my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)

import json
from enum import Enum
from typing import List, Any, Union, Generic, TypeVar, Tuple, cast
from dataclasses import dataclass, field
from dataclasses_json import dataclass_json, DataClassJsonMixin
from mashumaro.mixins.json import DataClassJSONMixin
from mashumaro.types import SerializationStrategy
T = TypeVar("T")


def timeDiff(obj):
    import time

    start = time.time()
    json_str = obj.to_json()
    end = time.time()
    print("to_json cost:", (end - start)*10000)

    start = time.time()
    type(obj).from_json(json_str)
    end = time.time()
    print("from_json cost:", (end - start)*10000)
    print()

class Currency(Enum):
    USD = "USD"
    EUR = "EUR"

@dataclass
class CurrencyPosition(DataClassJSONMixin):
    currency: Currency
    balance: float

@dataclass
class StockPosition(DataClassJSONMixin):
    ticker: str
    name: str
    balance: int

@dataclass
class Portfolio(DataClassJSONMixin):
    currencies: List[CurrencyPosition]
    stocks: List[StockPosition]

my_portfolio = Portfolio(
    currencies=[
        CurrencyPosition(Currency.USD, 238.67),
        CurrencyPosition(Currency.EUR, 361.84),
    ],
    stocks=[
        StockPosition("AAPL", "Apple", 10),
        StockPosition("AMZN", "Amazon", 10),
    ]
)

json_string = my_portfolio.to_json()
# print("my_portfolio og: ", json_string)
# print(json_string == Portfolio.from_json(my_portfolio.to_json()).to_json())


@dataclass_json
@dataclass
class CurrencyPosition2:
    currency: Currency
    balance: float

@dataclass_json
@dataclass
class Portfolio2():
    currencies: List[CurrencyPosition]
    stocks: List[StockPosition]

my_portfolio2 = Portfolio2(
    currencies=[
        CurrencyPosition(Currency.USD, 238.67),
        CurrencyPosition(Currency.EUR, 361.84),
    ],
    stocks=[
        StockPosition("AAPL", "Apple", 10),
        StockPosition("AMZN", "Amazon", 10),
    ]
)

timeDiff(my_portfolio)

timeDiff(my_portfolio2)

my_portfolio3 = Portfolio2(
    currencies=[
        CurrencyPosition(Currency.USD, 238.67),
        CurrencyPosition(Currency.EUR, 361.84),
    ],
    stocks=[
        StockPosition("AAPL", "Apple", 10),
        StockPosition("AMZN", "Amazon", 10),
    ]
)

Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json
timeDiff(cast(DataClassJsonMixin, my_portfolio3))

Hope this helps, thanks!

@Fatal1ty
Copy link
Owner

Hope this helps, thanks!

Yes, thank you, this example shed some light here, but you made a few mistakes. Let me explain.

Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json

You can't simply replace class attributes to the ones from DataClassJSONMixin class because the base class methods from_dict / to_dict do nothing, they need to be built for Portfolio2 specifically. If you add print statement to you timeDiff you will see that print(obj.to_json()) is {} and print(type(obj).from_json(json_str)) is <mashumaro.mixins.json.DataClassJSONMixin object at 0x...> which is not you were expected and which is the reason why my_portfolio3 (mix one) is the fastest in your inaccurate but still a benchmark.

@dataclass_json
@dataclass
class Portfolio2():
    currencies: List[CurrencyPosition]
    stocks: List[StockPosition]

The second point. You're mixing Portfolio2 with CurrencyPosition and StockPosition not with CurrencyPosition2 and StockPosition2. This works for dataclasses_json because it fortunately assumes your dataclasses have methods with the same names that mashumaro introduces. But if you're going to mix the libraries it will definitely be error prone.

my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)

I strongly believe this is a typo and you actually didn't get this result. I run your code on my laptop with Python 3.11 and got the following numbers:

to_json cost: 0.11205673217773438
from_json cost: 0.17881393432617188

to_json cost: 2.0599365234375
from_json cost: 2.582073211669922

to_json cost: 0.030994415283203125
from_json cost: 0.030994415283203125

So, mashumaro is roughly 18x faster on to_json and 14x faster on from_json.

I have a object from @DataClass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.

Based on all the above I'd like to ask you, why don't you just remove @dataclass_json decorator or DataClassJsonMixin base usage and replace it with DataClassJSONMixin (or even DataClassORJSONMixin for even better performance)?

@Fatal1ty
Copy link
Owner

Fatal1ty commented Jul 24, 2023

You can't simply replace class attributes to the ones from DataClassJSONMixin class because the base class methods from_dict / to_dict do nothing, they need to be built for Portfolio2 specifically

Just for the reference, it's not documented and subjected to changes but specific from_dict / to_dict / from_json / to_json methods can be compiled and added to Portfolio2 by using internal CodeBuilder that is used by mashumaro mixins. So, instead of this:

Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json

You're better to do this:

from mashumaro.core.meta.code.builder import CodeBuilder

cb = CodeBuilder(Portfolio2)  # if you need to call from_dict / to_dict in your code
cb.add_pack_method()
cb.add_unpack_method()

cb = CodeBuilder(
    Portfolio2, format_name="json", encoder=json.dumps, decoder=json.loads
)  # if you need to call from_json / to_json in your code
cb.add_pack_method()
cb.add_unpack_method()

It will give Portfolio2 the same (de)serialization speed as for Portfolio but in real life you will be forced to do this for all dataclasses you depend on until the related issue is solved:

@hhcs9527
Copy link
Author

Hi @Fatal1ty ,
Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)

Not sure what you mean by the last paragraph.

@Fatal1ty
Copy link
Owner

Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)

Can you share more details about your performance issues?

P.S. I found your recent changes in flytekit and I haven't delved deeply into the internals of flytekit, but as far as I understand, to_python_value / to_literal are called every time on values of the same type. If this is the case, then I strongly recommend not to do so, since compilation is expensive and must be performed once per data type. You should cache it somehow. With the upcoming release you will be able to check if the data type doesn't have internal method __mashumaro_from_json__ / __mashumaro_to_json__ and compile it if needed, but it's a dirty hack that has its own cost.

Anyway, I'm still struggling to understand why you're trying to mix dataclasses_json with mashumaro in this weird way? To help me better understand your motivation, I will quote my question:

Based on all the above I'd like to ask you, why don't you just remove @dataclass_json decorator or DataClassJsonMixin base usage and replace it with DataClassJSONMixin (or even DataClassORJSONMixin for even better performance)?

I found chapter "Using Custom Python Objects" in flytekit documentation that says that you support dataclasses with dataclass_json mixed functionality. So, why not extend dataclasses support either to dataclasses with @dataclass_json decorator (or DataClassJsonMixin since it's interchangeable) or to dataclasses with mashumaro mixins?

@hhcs9527
Copy link
Author

You're right, we should start at the bottom to replace the DataClassJsonMixin with DataClassJSONMixin.
That's what I'm going to do, thanks for the help along the way.
Thanks for the kind support and suggestion, also the amazing implementation for DataClassJSONMixin!

I think this issue is resolved, thanks.

@hhcs9527
Copy link
Author

By the way, can you point out the path where you implement the mashumaro_from_json for any type?
Does it mean the compile, also does the compile happen in the create object process?

Thanks for the clarification.

@Fatal1ty
Copy link
Owner

Fatal1ty commented Jul 26, 2023

By the way, can you point out the path where you implement the mashumaro_from_json for any type?

To be more accurate, method __mashumaro_from_json__ is assigned only to dataclass types, not to any type. CodeBuilder constructs the code of this method line by line specifically for a dataclass that it sees for the first time here. But I should emphasize that __mashumaro_*__ methods were introduced recently and are only available in master branch at this moment. I'm going to make a release in the next few days.

Does it mean the compile, also does the compile happen in the create object process?

The moment when compilation occurs depends on the use of the library. Let me elaborate this.

  1. The most common case case — a user has a dataclass with a mixin provided by mashumaro. This dataclass doesn't have forward references and lazy compilation is disabled (default behaviour). In this case methods from_* / to_* are being compiled and assigned to the dataclass at import time. So, when you have a reference to this dataclass type or dataclass instance you can use its from_* / to_* methods.
  2. The same case as above but this time a dataclass has forward references (e.g. it has a field of a type that is declared below and is not yet evaluated) or lazy compilation is explicitly enabled. In this case the real method compilation will be delayed until it's called for the first time. So, when you have a reference to this dataclass type or dataclass instance you can still use its from_* / to_* methods. To pull this off there will be one-shot method from_*/ to_* that will build the real method and replace itself with it. It can be dangerous if you keep a reference to this one-shot method because you will end up with repeated compilation on every call. To avoid this it's recommended either not to save a reference to a dataclass method, or use a lambda function just in case.
  3. Manual compilation using internal CodeBuilder. It's pretty straightforward here. When you call add_pack_method / add_unpack_method your dataclass will get new methods compiled and assigned. So, it's not recommended to do it every time in the name of performance.

Also, you can be interested in this issue:

While it's not done yet you can add support for dataclasses without any mixins as follows:

import json

from dataclasses import dataclass
from typing import Type, TypeVar

from mashumaro.core.meta.code.builder import CodeBuilder

T = TypeVar("T")


class JSONDecoder:
    def __init__(self):
        self._cache = {}

    def from_json(self, type_: Type[T], data: str) -> T:
        type_id = id(type_)
        method = self._cache.get(type_id)
        if method:
            return method(data)
        cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
        cb.add_unpack_method()
        method = getattr(type_, "__mashumaro_from_json__")
        self._cache[type_id] = method
        return method(data)

    def from_json2(self, type_: Type[T], data: str) -> T:
        method = getattr(type_, "__mashumaro_from_json__")
        if method:
            return method(data)
        cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
        cb.add_unpack_method()
        method = getattr(type_, "__mashumaro_from_json__")
        return method(data)

It has a small additional performance cost but is still faster than dataclasses-json:
Variant without mixins and with getattr turned out to be even faster for json comparing to using DataClassJSONMixin. It happens because DataClassJSONMixin doesn't compile a specific from_json method (as DataClassORJSONMixin does), instead it calls a specific compiled from_dict method underhood.

Source code
import json

from dataclasses import dataclass
from typing import Type, TypeVar
from timeit import timeit

from mashumaro.core.meta.code.builder import CodeBuilder
from dataclasses_json import DataClassJsonMixin
from mashumaro.mixins.json import DataClassJSONMixin

T = TypeVar("T")


class JSONDecoder:
    def __init__(self):
        self._cache = {}

    def from_json(self, type_: Type[T], data: str) -> T:
        type_id = id(type_)
        method = self._cache.get(type_id)
        if method:
            return method(data)
        cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
        cb.add_unpack_method()
        method = getattr(type_, "__mashumaro_from_json__")
        self._cache[type_id] = method
        return method(data)

    def from_json2(self, type_: Type[T], data: str) -> T:
        method = getattr(type_, "__mashumaro_from_json__")
        if method:
            return method(data)
        cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
        cb.add_unpack_method()
        method = getattr(type_, "__mashumaro_from_json__")
        return method(data)


@dataclass
class MyClass1:
    x: int


@dataclass
class MyClass2(DataClassJSONMixin):
    x: int


@dataclass
class MyClass3(DataClassJsonMixin):
    x: int


json_decoder = JSONDecoder()
data = '{"x": "42"}'
json_decoder.from_json(MyClass1, data)  # warmup

print(
    "without mixins (1)",
    timeit(
        "decoder(cls, data)",
        globals={
            "decoder": json_decoder.from_json,
            "data": data,
            "cls": MyClass1,
        },
    ),
)

print(
    "without mixins (2)",
    timeit(
        "decoder(cls, data)",
        globals={
            "decoder": json_decoder.from_json2,
            "data": data,
            "cls": MyClass1,
        },
    ),
)

print(
    "DataClassJSONMixin",
    timeit(
        "decoder(data)",
        globals={
            "decoder": MyClass2.from_json,
            "data": data,
        },
    ),
)

print(
    "DataClassJsonMixin",
    timeit(
        "decoder(data)",
        globals={
            "decoder": MyClass3.from_json,
            "data": data,
        },
    ),
)
without mixins (1) 1.086791625013575
without mixins (2) 1.0074725000013132
DataClassJSONMixin 1.0947887079964858
DataClassJsonMixin 13.494744666997576

I hope this will help you make the right decision.

P.S. You are free to close this issue if everything was answered.

@Fatal1ty Fatal1ty added wontfix This will not be worked on and removed needs information Further information is requested labels Jul 26, 2023
@hhcs9527
Copy link
Author

Thanks for the explanation.
Does mashumaro support generated data class from its own JSON schema?

@Fatal1ty
Copy link
Owner

Does mashumaro support generated data class from its own JSON schema?

If you mean dataclass code generation from JSON schema, then answer is no. It’s a job for another library as I see it.

@hhcs9527
Copy link
Author

Do you know which library? Thanks.

@Fatal1ty
Copy link
Owner

Fatal1ty commented Jul 27, 2023

Do you know which library? Thanks.

Take a look at datamodel-code-generator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants