
Doesn't serialize custom types in list/dict #28

Closed
purarue opened this issue Mar 28, 2021 · 7 comments


@purarue
Contributor

purarue commented Mar 28, 2021

I only recently realized that I could put Dict/List items as values on a NamedTuple cached by cachew; previously I was doing some weird stuff to work around it.

So, in the process of switching more things over to cachew, I ran into an issue where one of the items in a List being serialized by cachew couldn't be converted to JSON:

from datetime import datetime
from typing import List, NamedTuple, Iterator
from cachew import cachew

class action(NamedTuple):
    dt: datetime
    val: int


class wrapped(NamedTuple):
    actions: List[action]  # list causes this to be serialized with json.dumps
    val: str


@cachew
def values() -> Iterator[wrapped]:
    yield wrapped(actions=[action(datetime.now(), 5)], val="something")

list(values())
list(values())

generates quite the error:

cachew: error while setting up cache, falling back to non-cached version
(builtins.TypeError) Object of type datetime is not JSON serializable
[SQL: INSERT INTO "table" (actions, val) VALUES (?, ?)]
Traceback (most recent call last):
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/type_api.py", line 1232, in process
    return process_param(value, dialect)
  File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 163, in process_bind_param
    return json.dumps(value)
  File "/usr/lib/python3.9/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 1006, in cachew_wrapper
    flush()
  File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 993, in flush
    conn.execute(values_table.insert().values(chunk))
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1206, in _execute_context
    self._handle_dbapi_exception(
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/home/sean/.local/lib/python3.9/site-packages/sqlalchemy/sql/type_api.py", line 1232, in process
    return process_param(value, dialect)
  File "/home/sean/.local/lib/python3.9/site-packages/cachew/__init__.py", line 163, in process_bind_param
    return json.dumps(value)
  File "/usr/lib/python3.9/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type datetime is not JSON serializable
[SQL: INSERT INTO "table" (actions, val) VALUES (?, ?)]
(the same error and traceback are printed a second time, for the second list(values()) call)

I'm not expecting this to be fixed, since I understand the problems with serializing/deserializing from JSON. I was just stuck here for a while: the readme says lists and datetimes should work, so I wasn't sure why there was an error (it's being thrown here).

Makes sense that it's using JSON, as otherwise you would have to maintain an intersection table and map IDs from another table onto list/dict items, which sounds like a pain.

Anyway, as far as changes go: either a better error message could be raised, and/or the documentation could be updated to reflect this pitfall, so that no one else runs into this cryptic error in the future.
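
For reference, the failure is easy to reproduce with the stdlib alone; here's a minimal sketch of one possible workaround using json.dumps's default hook (this is just the underlying mechanics, not cachew's API):

import json
from datetime import datetime

def encode_default(o):
    # hypothetical hook: the stdlib json encoder calls this for objects it can't encode
    if isinstance(o, datetime):
        return o.isoformat()
    raise TypeError(f"Object of type {o.__class__.__name__} is not JSON serializable")

# without the hook this raises the exact TypeError from the traceback above
print(json.dumps({"dt": datetime.now()}, default=encode_default))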

@karlicoss
Owner

Yep, definitely agree about the error message.
I guess in principle it would be nice to support this as well -- after all, the types that can be stored natively in sqlite and in JSON are pretty much the same anyway.
In fact, in hindsight I'm not sure I should have bothered with flattening onto columns at all.

Pros are:

  • It gives some nice properties, like the ability to use sqlite queries
  • It can potentially be more compact on disk (no need to store key names)
  • It might be quicker to deserialize? Although I'd expect these benefits to be pretty marginal

Cons are:

  • You can't flatten everything anyway (like Lists/JSON)
  • It makes things a bit more complicated/inconsistent by needing to handle both sqlite columns and JSON-ish things

An alternative is to simply convert everything to a JSON-like thing and just store it in a single column (see the sketch after this list). In addition:

  • It's still possible to do queries against these JSON objects (although I haven't used this much so far: https://www.sqlite.org/json1.html#jex)
  • It might still be possible to optionally 'cache' specific columns as sqlite-native on demand
  • It might be possible to compress the database somehow as well (I don't know much about that, and haven't had the need anyway)
  • It might make it fairly trivial to support alternative backends (like postgres/redis/etc.)
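
A minimal sketch of that single-column idea, reusing the types from the original report (plain sqlite3 rather than sqlalchemy; the table/column names here are made up):

import json
import sqlite3
from datetime import datetime
from typing import List, NamedTuple

class action(NamedTuple):
    dt: datetime
    val: int

class wrapped(NamedTuple):
    actions: List[action]
    val: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (value TEXT)")  # a single JSON blob per row

def to_jsonish(w: wrapped) -> dict:
    # everything becomes JSON-friendly primitives; the type info lives in the schema
    return {"actions": [{"dt": a.dt.isoformat(), "val": a.val} for a in w.actions],
            "val": w.val}

w = wrapped(actions=[action(datetime.now(), 5)], val="something")
conn.execute("INSERT INTO cache (value) VALUES (?)", (json.dumps(to_jsonish(w)),))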

Still not sure about this, what do you think?

@purarue
Contributor Author

purarue commented Mar 30, 2021

An alternative is to simply convert everything to a JSON-like thing and just store it in a single column

Yeah, the only thing I'm unsure of here is whether recursive NTs/classes would be converted back? I've had a few cases where I have very basic namedtuples in a list, and it converts them back to a list of lists (the namedtuple is converted to a tuple, and since JSON doesn't have the concept of a tuple, I guess it's just stored as a list) instead of a list of namedtuples -- here are those definitions.
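
(A tiny illustration of that: JSON round-trips tuples, and therefore namedtuples, as lists. The type below is a made-up stand-in for those "very basic namedtuples".)

import json
from typing import NamedTuple

class point(NamedTuple):  # hypothetical "very basic namedtuple"
    x: int
    y: int

# a NamedTuple serializes as a plain JSON array, so the round trip loses the type
print(json.loads(json.dumps([point(1, 2)])))  # [[1, 2]], not [point(x=1, y=2)]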

Seems like there isn't a ton of benefit to using a sqlite database then, though? Since it's just storing JSON blobs with no type information, it's really just the sqlalchemy machinery you're using to convert to NTs and back. Though I don't know whether mapping onto the columns is slower than converting a dict back to the namedtuple; it would just have to be tested...

I do this manually (with a lot more error checking) here, by inspecting the signature of the namedtuple, but I assume the sqlalchemy bindings are better for this.

@karlicoss
Owner

the only thing I'm unsure of here is whether recursive NTs/classes would be converted back?

As long as we can rely on the schema (i.e. the fully annotated type) being passed both on encoding and decoding, this should work? It's no different from laying out over sqlite columns: e.g. datetime values are stored as strings, but are reconstructed back into datetime because the deserializer has access to the type info: https://github.com/karlicoss/cachew/blob/master/src/cachew/__init__.py#L383-L385
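
A rough sketch of what relying on the schema could look like (a hypothetical helper, not cachew's actual code): the annotations tell us which fields need to become datetimes again.

from datetime import datetime
from typing import NamedTuple, get_type_hints

class event(NamedTuple):  # hypothetical example type
    dt: datetime
    val: int

def from_jsonish(cls, d: dict):
    # consult the schema (type annotations) to reconstruct non-JSON types
    hints = get_type_hints(cls)
    return cls(**{k: datetime.fromisoformat(v) if hints[k] is datetime else v
                  for k, v in d.items()})

e = from_jsonish(event, {"dt": "2021-03-30T12:00:00", "val": 5})
assert isinstance(e.dt, datetime)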

whether mapping onto the columns is slower than converting a dict back to the namedtuple; it would just have to be tested...

Yeah, not sure either

Seems like there isn't a ton of benefit to using a sqlite database then, though?

Yeah, good question. And it's not in the FAQ, so I wonder whether I ever seriously considered it :) Let's think...

Cons of a JSONL file compared to sqlite with a single-column JSON blob:

  • It might be nice to eventually support an in-memory cache by leveraging sqlalchemy, and in that sense it makes more sense to keep sqlite
  • The hash would need to be kept somewhere. We could just put it in the filename? But then it's harder to recycle old caches... although it might still be possible to do atomically, e.g. take a snapshot of everything with mtimes, and then once the writer finishes, compare them and delete the ones that weren't changed?

Pros:

  • Atomic guarantees are easier to implement, in a sense (compared to dealing with sqlite): write to a temporary file, and then the rename call is atomic on Linux (at least on reasonable file systems?) -- see the sketch after this list. And if the file is replaced while someone is reading it, that's also OK: the readers' file descriptors still linger, so in this sense it's all good
  • Implementing retention, i.e. keeping the last N hash keys instead of just one, might be even easier -- it's just a matter of removing/keeping more files
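
A minimal sketch of that atomic-write idea (the helper and file name are hypothetical; os.replace performs the atomic rename):

import os
import tempfile

def atomic_write(path: str, lines) -> None:
    # write to a temp file in the same directory, then atomically swap it in;
    # readers that still hold the old file descriptor keep seeing the old contents
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        os.replace(tmp, path)  # the rename itself is atomic (same filesystem)
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("cache.jsonl", ['{"val": 1}', '{"val": 2}'])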

Unclear:

  • Reading might be faster in sqlite, although considering we're storing a variable-length column, it probably can't be that much faster? Although sqlite keeps an index, so it doesn't need to scan for newlines the way reading from a file would. But we'd probably need to measure

Hmm.

@karlicoss
Owner

Somewhat related: https://github.com/jsonpickle/jsonpickle
Although it's more like a regular pickle, not sure:

In [6]: 
class ObjWithDate(object):
    def __init__(self):
        ts = datetime.datetime.now()
        self.data = dict(a='a', ts=ts)
        self.data_ref = dict(b='b', ts=ts)

In [7]: jsonpickle.encode(ObjWithDate())
Out[7]: '{"py/object": "__main__.ObjWithDate", "data": {"a": "a", "ts": {"py/object": "datetime.datetime", "__reduce__": [{"py/type": "datetime.datetime"}, ["B+UEEAc6MQuycA=="]]}}, "data_ref": {"b": "b", "ts": {"py/id": 2}}}'

@karlicoss
Owner

I wanted to simplify cachew a bit by potentially switching to third-party serialization -- I tried marshmallow/pydantic/dataclasses_json/cattrs (might post some notes later).
Basically, for one reason or another, they don't work properly for this case (usually some issues with Union support, or having to inherit your classes from some sort of base model).

So I decided to optimize the one we have instead, and as an experiment tried serializing as JSON into a single column (instead of flattening over multiple columns like we do currently).

And it actually seems quite a bit faster (and the code is much easier to read!). Not sure why I settled on the flattening back then; perhaps I wasn't benchmarking properly.

    test_nosql.py::test_many_from_cachew
    building      1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.53s
    serializing   1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.00s
    json dump     1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.51s
    sqlite dump   1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.16s
    sqlite load   1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.31s
    jsonl dump    1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.15s
    jsonl load    1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.21s
    json load     1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 0.52s
    deserializing 1000000 objects of type <class 'test_nosql.test_many_from_cachew.<locals>.TE2'>: 1.29s
    
    for comparison, current cachew stats for this test:
    - reading from db: 1.21s -- corresponds to "sqlite load"; note that sqlalchemy in cachew introduces some slowdown too, which we should address
    - deserializing: 3.29s -- roughly the sum of "json load" + "deserializing", so about a 50% speedup

We won't necessarily lose individual column mappings either; it's possible to define virtual columns for accessing fields inside the JSON, as described here: https://antonz.org/json-virtual-columns/
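
For instance, something like this should work (a sketch assuming SQLite 3.31+ for generated columns; the table/column names are made up):

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (value TEXT)")
conn.execute("INSERT INTO cache VALUES (?)", (json.dumps({"dt": "2021-03-30", "val": 5}),))

# the virtual column extracts a JSON field on read, and can even be indexed
conn.execute("ALTER TABLE cache ADD COLUMN dt TEXT AS (json_extract(value, '$.dt')) VIRTUAL")
conn.execute("CREATE INDEX cache_dt ON cache (dt)")
print(conn.execute("SELECT value FROM cache WHERE dt >= '2021-01-01'").fetchall())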

Using a text file with JSON lines (jsonl) instead of sqlite also gives quite a big speedup (especially for writing), so I might think about switching to it later as well (or adding it as an extra 'backend').
An added benefit is that the cache could easily be compressed, so it might be easier to share between computers (see the sketch below).
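
(A sketch of what such a jsonl backend could look like, with gzip on top for the compression; the helper names are hypothetical.)

import gzip
import json

def dump_jsonl(path, objs):
    # one JSON object per line; gzip on top makes the cache easy to ship around
    with gzip.open(path, "wt") as f:
        for obj in objs:
            f.write(json.dumps(obj) + "\n")

def load_jsonl(path):
    with gzip.open(path, "rt") as f:
        for line in f:
            yield json.loads(line)

dump_jsonl("cache.jsonl.gz", [{"val": i} for i in range(3)])
print(list(load_jsonl("cache.jsonl.gz")))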

For now it's just an isolated experiment here: https://github.com/karlicoss/cachew/tree/nosql, but I'll transplant it to cachew soon.

@karlicoss
Owner

Wrote up a comparison here; it might be interesting for other projects as well: https://github.com/karlicoss/cachew/blob/master/doc/serialization.org

Actually, cattrs seems very close to my new implementation -- a tiny bit faster for serializing and a tiny bit slower for deserializing -- so I'll need to do a bit more testing.

@karlicoss
Owner

Should be fixed in the latest PyPI version!
