-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Natively mapping BSON data types to wrapper classes or callbacks #900
Comments
Hi! We have discussed this several times, and our conclusion was to not implement something in the TypeMap for this. You have already found #246 and #370, but there is a ticket too. There we have an example on how we think you should handle this case instead. This ticket (and your issue) has led to us writing a new tutorial, which just went through code review and will get published on https://docs.mongodb.com/php-library/master/tutorial/ . cheers, |
Your tutorial solution expects that the program logic which interacts with the driver is aware of the the stored data structure and hence knows exactly where date values are stored. Correct me if I'm wrong, but I do not see how this is useful in situations like ours, where the database abstraction layer is not schema aware. |
We have an outstanding ticket, for delegating type conversions to a (single) callable. We're not quite sure whether we will be implementing this, as we have not done any research into its performance impacts. Just like your assertion that deserialisation takes 3 times as long, we expect there to be quite a performance hit in the driver as well. Secondary, in order to prevent dataloss we need an accompanying way for serialising the converted data back into something that the database understands—for example, it needs to turn the non-native DateTime object back into a UTCDateTime datatype when storing. Henceforth, we are reluctant to commit to this for now. Our current thinking is that type conversions should be done by model classes that an application has for representing the data. This is why we have the If you would set your top-level "document" type in the TypeMap to for example
with a typeMap like:
This does mean that your DBAL needs to know what data from which collections it is loading, but it does solve the round trip issue that your callback ( Your option |
Slight correction here. The argument for requiring array, stdClass, and Unserializable is that those are the only types for which the driver can be assured it can properly initialize a PHP value with BSON document/array data. It's straightforward for array/stdClass (we set keys or public properties) and for Unserializable we can rely on the It is still possible to lose data with an Unserializable. For instance, The important distinction is that we've at least given them all the tools to handle conversion to/from PHP and BSON properly and without a risk of data loss. They can easily implement Serializable in their own model classes. If we provide some mechanism to allow a "scalar" BSON type to be converted to an arbitrary PHP type (e.g. UTCDateTime to DateTime), the user effectively has no tools available to extend DateTime and ensure it serializes properly on its own. Consider that type maps are only used in the BSON-to-PHP direction. The common code path for that conversion is |
Ah, sorry, I had not found PHPC-999 and PHPC-1000 previously, which indeed suggest effectively the same as I did in my original post.
Truth be told, this is something that was bugging my about my own proposal as well. My first thought after investigating our performance problem was actually that the driver should simply have a straight-forward option to unserialize BSON dates natively to PHP DateTime instead of UTCDateTime. If accompanied by the native ability to serialize DateTime(Interface) objects, this would likely provide the best performance and maintain the serialize/unserialize symmetry to avoid data loss. However, in order to be useful to us, this feature would have to include an option to specify the desired time zone for the DateTime instances created by the unserializer. Otherwise we would still end up having to do our recursive data walk in order to set the time zone on all date objects. Given the previous rejection of PHPC-760, I decided to propose a generic approach instead. Whatever the technical solution, I believe it is not an outlandish request to want to work conveniently with date values in the form of the DateTime objects PHP native provides (in the correct local time zone). Hence the number of issues/tickets related to this general subject. Date handling with PHP/Mongo has been a longstanding source of complaints and debates in my team, already going back to the old mongo driver. I'm sure many more developers feel the same way and would welcome any simplification you can send our way.
Herein lies the problem. I do not believe that a database driver should impose a bias towards certain programming paradigms or design patterns, if neither the underlying database nor the programming language do so. Mongo is a database for unschematized semi-structured data. Which is why it is great for our application, which deals with (partially) unschematized semi-structured data, using procedural programming and transformer / pipeline patterns. Yet you want to tell me that a schema-aware OOP representation of our database documents is the only supported way to perform primitive type conversions efficiently. Don't get me wrong. All the OOP support options built into the serialization/unserialization logic are convenient and elegant for many applications, but they are not a one-size-fits-all solution. |
I'm with @ralfstrobel it should be added to TypeMap, his proposed syntax would be a good addition. From a practical point of view, the solution proposed by @derickr adds a lot of overhead that we don't need on most apps.
That won't be a problem if you decide to implement the |
@TCB13: Thanks for bumping this. @ralfstrobel makes a good argument here and I think PHPC-999 will ultimately be the best solution as it would allow applications to specify a callable to which the driver can delegate type conversions. We have a number of higher priority tasks at the moment (cross-driver features to support new server functionality) but know that the issue is still on our radar to POC and investigate further once we can return to PHP-specific work. |
Closing, but feel free to follow the aforementioned tickets for updates. They've also been organized under an epic (PHPC-1454) to track the theme of ODM improvements. |
There has finally been some development on this issue, though sadly not in the direction I had in mind. PHPC-999 has now been closed as won't fix. The new intended direction seems to be PHPLIB-1172, based on PHPC-2242. The latter certainly opens up an interesting new approach. I will try out the new BSON/Document class, though I fear manually iterating over every value and deserializing it will also be quite slow and no advantage over conversion to array and then walking the array structure. |
Hi Ralf, there has been indeed some movement on this. In version 1.16 of the extension, we introduced two new classes to handle raw BSON: The first purpose of The The previous deserialisation mechanism (including passing However, introduction of these raw BSON classes is only half the story. In the PHP library, we're adding a codec mechanism to allow users to customise how BSON documents are deserialised to PHP objects and vice versa. One big drawback of the current mechanism is that serialisation needs to be handled by the object itself (through using the This brings me to Codecs. Essentially, Codecs are the factories that we considered including introducing to the A common use case is to want to work with PHP's date classes instead of the BSON class DateTimeCodec implements DocumentCodec
{
public function canEncode($value): bool
{
return $value instanceof DateTimeInterface;
}
public function encode($value): Document
{
// todo: throw if can't encode
return Document::fromPHP([
'utcDateTime' => new UTCDateTime($value),
'tz' => $value->getTimeZone()->getName(),
]);
}
public function canDecode($value): bool
{
// Todo: can be optimised
return $value instanceof Document && $value->has('utcDateTime') && $value->has('tz');
}
public function decode($value): DateTimeImmutable
{
// todo: throw if can't decode
$dateTime = $value->get('utcDateTime')->toDateTime();
$dateTime->setTimeZone(new DateTimeZone($value->get('tz')));
return DateTimeImmutable::createFromMutable($dateTime);
}
} We can now decouple the serialisation logic from the actual object, meaning that any POPO (plain old PHP object) can now be stored in the database without being restricted to only public properties being serialised. If we want to combine them, we can still have our model class implement the Now that we have our codec, let's use it. PHPLIB will come with a $client = new Client();
$codec = new LazyBSONDocumentCodec();
$collection = $client->selectCollection('my_db', 'my_coll', ['codec' => $codec]);
$document = $collection->findOne(['_id' => 1]);
$library = new LazyBSONCodecLibrary();
$library->attachCodec(new DateTimeCodec());
$codec->attachCodecLibrary($library); When we use this codec, any As for the performance of this, it goes without saying that not eagerly decoding big BSON structures provides a big performance boost and saves memory. I am currently working on performance benchmarks to keep an eye on these savings and eventual cost due to codecs being involved. I can share some performance numbers for the BSON structures:
The test uses a 2.7 MB file that we use for performance benchmarking our BSON implementation. It contains more than 94000 keys. You can see that converting the structure to a PHP object or array (using type maps) takes around 24 ms. Directly accessing the first element takes 1 microsecond, while accessing the last element (which involves libbson continuously advancing the internal BSON pointer) takes roughly 3 ms. However, we can also see that iteration takes relatively long due to how PHP iterators work (each iteration step in a In the Please note that all the codec APIs are currently a work in progress and may change until we release them in our 1.17 library release. The Last but not least, I'm curious to hear about your use case. Clearly you have had an interest in this functionality for a long time, so I'm interested to know whether you think this system will work for you, and if it wouldn't, what we can improve to make it more useful. Please note that all code examples have been written in this comment box without IDE support. All syntax errors potentially found in these examples have most certainly been introduced only for your reading pleasure. |
Hi Andreas, thank you for the detailed explanation. I do really like how you have finally made BSON fully natively represented in PHP and I'm sure your new approach is very useful and convenient for many applications. However, I do not see it being useful for us... Quoting myself from 2018...
We do not know the semantics of what we are storing beyond the data type of atomic values. We do not use model objects that need to be serialized and deserialized or could provide a lazy decoding mechanism. We do in fact not even use your driver library. Instead we purely interact with the driver extension to pump associative PHP arrays in and out of the database as fast as we possibly can. Our only problem being that these arrays can also contain temporal values, represented as DateTime objects, which is where this discussion originated. Due to our need for maximum performance, a PHP userland implementation of a codec that has to be asked if it can decode every single value is not going to work for us. And, as I said before, likely neither is PHP userland traversal of your BSON representation objects each time we need to load a document. But I will give that a try to see how practical it is for us. |
Hi Ralf, thank you for explaining your use case. There are multiple points in your response I want to elaborate on.
I apologise if this sounds somewhat condescending, but that is a relatively straightforward use case that can already be accomplished with little to no performance impact. I'll also note that the introduction of the
I understand the need for performance when inserting or reading data. I just cooked up a small performance benchmark using the same large document above. I'll note that we have some time set aside later this quarter to implement more performance benchmarks, but here's the result of a quick test. The
Without going into too much details, the time margin of error in these benchmarks accounts for any time differences, so there isn't too much to be gained by using the extension only. I also observed similar results when running the same bench for reading data:
Note that these tests are done on the code that already knows about codecs but without a codec being set on the collection. You're essentially giving up a lot of usability for little to no performance gain.
I'll note that any solution that goes beyond "always decode this one particular BSON type this particular way" will involve the extension asking some userland implementation "can you decode this value". If you provide a closure to the extension to invoke, this isn't going to be magically faster because it is invoked from the extension rather than from a PHP script. |
Yes, I understand that. And again, I am not saying that the work you have put into the new codec concept will not be extremely useful for many other projects. However, our use-case really is as simple as "always decode this one particular BSON type this particular way". And we wish to do this with as little overhead as possible.
Sure, but my point was never about the performance of closure invocations, but about unnecessary iteration and type checks in PHP userland... Note that any BSON type except Date is currently handled exactly the way we want it by the TypeMap "array" mode. Solely for Date values would we wish to invoke a closure. Date values make up approximately 3% of the values we store. Currently we retrieve documents as "array". So the extension already has to iterate over the entire BSON document structure, look at every value and its type, and then convert it either to a scalar zval or one of the BSON class instances. Afterwards, we have to iterate over the entire returned array again in PHP userland and look at every value again to check if it is a UTCDateTime instance we wish to convert (pointlessly 97% of the time). My original proposal was all about eliminating this second iteration by having the extension defer to a closure only when it actually encountered a BSON Date value. Using the new "bson" documents, it is true that everything could be done in only one iteration, but this would take place in PHP userland, which I imagine is slower than in C. We would also have to handle recursion for Document and PackedArray as well as conversion of Int64 by hand. Like I said, I will test this, but I am not very hopeful it is vastly superior to the array method. |
I finally found the time to do some representative synthetic benchmarking...
Basic retrieval of documents is certainly a lot faster in bson mode, as long as you do not need to work with the data. However, eventually our business logic needs them as arrays with UTCDateTime values converted to DateTime...
Our current logic which retrieves documents as array and then needs to traverse each array again to find and convert dates almost cuts the retrieval performance in half. The new alternative to retrieve objects as BSON wrapper objects and then recursively traverse this structure to retrieve a converted array is even slower. To be fair to the new BSON approach, it was not designed to improve retrieval performance of full documents, but to improve retrieval of a few sparse fields from each objects...
Here the new approach could theoretically have a slight edge over our current approach, which needs to traverse the full document array unnecessarily to convert dates which may not even be retrieved later. But this really only applies assuming the database was not already given a projection to reduce the document fields to those which will be retrieved later. @alcaeus I'm afraid there is not much news here. The new BSON option is certainly nice, just not for us. If the driver had an option to return BSON date values as PHP DateTimes natively or yield to a callback, this could still almost double our retrieval performance. |
Hi @ralfstrobel, would you be able to share the code you used for the benchmarks above? I'd like to take a look to see where we can find some room for improvements. We're currently not planning any significant changes in the BSON API, but we are thinking about how users interact with the MongoDB data structures in the future. I'll caution that any kind of callback API (e.g. being able to pass a closure in charge of decoding a specific BSON type) might not be very performance-efficient either. The main performance issue in your case is always going to be deep-traversing the entire structure looking for |
Hi @alcaeus, thanks for getting back to me so quickly! The specific test code I used is somewhat entangled with our query and testing infrastructure, but I've distilled the important parts into a standalone script that essentially does the same things and shows similar results. |
Issues #246 and #370 discussed difficulties decoding mongo dates to DateTime with specific time zones. Developer consensus favors userland conversion/wrapping as the most flexible solution. What I would like to propose in this issue is a performance optimization and generalization for this process.
Problem Description
Our application uses a self-developed database abstraction layer, which translates UTCDateTime objects to native PHP DateTime objects and vice versa, so that mongo specifics are not exposed to higher business logic. However, since this layer is unaware of document structure, this requires a naive iteration over all loaded and stored documents, effectively the following...
For large documents with sparse use of date values, this recursive search and replace leads to a very significant performance loss, with retrieval of the translated data taking up to 3x as long as native retrieval from mongo. Removing the content of the callback function reveals that this is not even caused by the actual conversion of UTCDateTime to DateTime, but mostly by the iteration and closure callback overhead.
Proposed solution
While deserializing from BSON, the driver naturally encounters all $date values and could directly convert them to a desired format (such as DateTime or a PHPC-640 wrapper), avoiding the need for subsequent traversal.
This approach would require extending the TypeMap syntax in a way that allows definitions for primitive BSON data types. Multiple variations are imaginable:
a) Mapping data types to a conversion callback:
b) Mapping data types to a custom wrapper class:
While $date is likely the most common use case, this solution could certainly also be applied for other types such as $binary or $oid.
The text was updated successfully, but these errors were encountered: