Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental support for BSON (de)serialization #1254

Closed
wants to merge 29 commits into from

Conversation

julian-becker
Copy link
Contributor

@julian-becker julian-becker commented Sep 24, 2018

This PR includes basic support for BSON (de)serialization, as proposed in #1244.

It includes support for BSON records and record-entries of the following types (c.f. bsonspec):

  • 0x01: double
  • 0x02: string
  • 0x03: document
  • 0x04: array
  • 0x08: boolean
  • 0x0A: null
  • 0x10: int32
  • 0x12: int64

It presently does not include support for the following BSON entry types:

  • 0x05: binary
  • 0x06: undefined (deprecated)
  • 0x07: ObjectId
  • 0x09: UTC Date-Time
  • 0x0B: Regular Expression
  • 0x0C: DB Pointer
  • 0x0D: JavaScript Code
  • 0x0E: Symbol (deprecated)
  • 0x0F: JavaScript code w/ scope
  • 0x11: Timestamp
  • 0x13: 128-bit decimal floating point
  • 0x7F: Max Key
  • 0xFF: Min Key

The tests for the supported conversions can be found in unit-bson.cpp.
Feedback and discussions welcome, in particular with respect to the parts of BSON which are presently not yet supported.

For the initial discussion of this change and easier review/smaller patch, I have restrained as much as possible from modifying existing code. In particular, I have restrained from refactorings in the binary_reader and binary_writer, which could be beneficial to improve separation of concerns (not mixing procedures targeting different file formats in a single class).

Let me know what you think and I will happily adjust the code where needed/desired.

@coveralls
Copy link

coveralls commented Sep 24, 2018

Coverage Status

Coverage remained the same at 100.0% when pulling 24a4142 on julian-becker:feature/bson into 4b2a006 on nlohmann:develop.

@nlohmann
Copy link
Owner

nlohmann commented Oct 2, 2018

@julian-becker Great work! Can you tell me when you think you're done so I could go through the code?

@julian-becker
Copy link
Contributor Author

@nlohmann I'm done so far with the consolidation of the code. Feedback and ideas for improvement welcome.
Also, regarding the support of the BSON-specific types not yet supported: If you want, I could extend the PR to include BSON --> JSON one-trip conversions for some of these (I'm guessing some of these could be simply mapped to a string, the timestamp could be just mapped to an integer).

include/nlohmann/detail/input/binary_reader.hpp Outdated Show resolved Hide resolved
include/nlohmann/detail/input/binary_reader.hpp Outdated Show resolved Hide resolved
switch (j.type())
{
default:
JSON_THROW(type_error::create(317, "JSON value cannot be serialized to requested format"));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the type of the value to the exception text.

include/nlohmann/json.hpp Show resolved Hide resolved
include/nlohmann/json.hpp Show resolved Hide resolved
include/nlohmann/json.hpp Show resolved Hide resolved
@julian-becker
Copy link
Contributor Author

julian-becker commented Oct 7, 2018

@nlohmann Thanks a lot for your feedback! Sorry for my initial sloppiness regarding the documentation. I have improved it somewhat in my latest commit.

EDIT:
Just a thought: Maybe it would make sense to add CI-support for the doxygen build (if not done already) and set the WARN_IF_UNDOCUMENTED=YES and WARN_AS_ERROR=YES.

include/nlohmann/detail/output/output_adapters.hpp Outdated Show resolved Hide resolved
include/nlohmann/json.hpp Show resolved Hide resolved
…of the BSON-output.

This way, the output_adapter can work on simple output iterators and no longer requires random access iterators.
Copy link
Owner

@nlohmann nlohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @julian-becker, I really appreciate your work on BSON and I am looking forward to having this in the library. I hope you are not bugged out by my constant review remarks, but I spent so much effort into CBOR, MessagePack, and UBJSON that I know about the subtleties a binary format brings. Therefore, I would like to have as may things "fixed" early.

include/nlohmann/detail/output/binary_writer.hpp Outdated Show resolved Hide resolved
include/nlohmann/detail/output/output_adapters.hpp Outdated Show resolved Hide resolved
@julian-becker
Copy link
Contributor Author

@nlohmann No worries, that's what a review is for and that's why it's useful. Your feedback is very welcome, and I will do my best to tidy things up.
Fixing things once they are broken in the field will be a lot more painful than just fixing it now ;-)

@julian-becker
Copy link
Contributor Author

@nlohmann I will start by looking into some more thorough and intensive tests over the next couple of days ... After that, I could check the documentation.

@nlohmann
Copy link
Owner

Alright - if you look at the CBOR test suite, you know what I mean. It would be great if some of the existing test files could be converted to BSON. And if a well-accepted BSON test suite exists, it would be great to have it in the project as well. I would still want to merge the branch once you have nothing more to add to the parser, because then Google's OSS Fuzz can already start searching for bugs.

@julian-becker
Copy link
Contributor Author

@nlohmann My ongoing efforts for more intense tests revealed one thing that may need addressing before merging:
The handling of large std::uint64_t values which exceed INT64_MAX. Those cannot presently be serialized to BSON and deserialized again without loss in information. While the binary representation survives the roundtrip to BSON and back to JSON, the type-information cannot be restored. Presently, the implementation silently (may be considered a bug) translates those values to std::int64_t with the result that large positive std::uint64_t values end up being negative std::int64_t when reading back from BSON to JSON.
Maybe we should throw an error instead that informs the client that the corresponding large unsigned value cannot be serialized to BSON.
What is your opinion on this issue? If you think the presently implemented behavior is acceptable, feel free to merge, otherwise I can add corresponding additional error-reporting.

@nlohmann
Copy link
Owner

So BSON is does not support integers larger than INT64MAX? This sounds similar to UBJSON where a similar situation is currently treated with an exception, see #1288 (comment).

@julian-becker
Copy link
Contributor Author

As far as I understand the spec, there is only a signed int32 (type 0x10), and a signed int64 (type 0x12). There is an uint64 (type 0x11), however, this is to be interpreted explicitly as a timestamp, so is not really an option.
I think throwing a json.exception.out_of_range.407 is reasonable here and we should try to make it consistent with the handling of such cases UBJSON. I will adjust the code accordingly.

@julian-becker
Copy link
Contributor Author

also: keys are encoded as c-strings, with the effect that code point U+0000 cannot be serialized. This is not yet caught at the moment and will result in incorrect results if U+0000 is contained. I should probably adjust the error-handling to catch & report this.

@nlohmann
Copy link
Owner

FYI: I improved the error messages for binary formats, see #1303. Please update this PR once #1303 is merged.

Copy link
Owner

@nlohmann nlohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

test/src/unit-bson.cpp Outdated Show resolved Hide resolved
test/src/unit-bson.cpp Outdated Show resolved Hide resolved
test/src/unit-bson.cpp Outdated Show resolved Hide resolved
…in the key-string as part of `out_of_range.409`-message
@@ -48,46 +48,53 @@ TEST_CASE("BSON")
SECTION("null")
{
json j = nullptr;
REQUIRE_THROWS_AS(json::to_bson(j), json::type_error);
REQUIRE_THROWS_AS(json::to_bson(j), json::type_error&);
CHECK_THROWS_WITH(json::to_bson(j), "[json.exception.type_error.317] JSON value of type 0 cannot be serialized to requested format");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "type 0" etc. is a helpful description for users of the library. Please use type_name() to explicitly name the type (null, boolean, etc.).

Furthermore, "to requested format" should be replaced by "BSON" to be explicit.

@nlohmann
Copy link
Owner

I only had another small comment to the latest commit.

Furthermore, it would be great if you could add a BSON translation to the files

  • test/data/json_testsuite/sample.json
  • test/data/json_tests/pass3.json
  • test/data/json.org/*.json

with a .json.bson extension. These files can then be used to test round-tripping (see the CBOR test suite). Furthermore, the .bson files can be used by the fuzz testers.

@nlohmann
Copy link
Owner

@julian-becker I have some time to work on this this week. I branched your branch and make the changes there myself.

@nlohmann
Copy link
Owner

I created branch https://github.com/nlohmann/json/compare/julian-becker-feature/bson and shall make more changes there.

@julian-becker
Copy link
Contributor Author

@nlohmann Sure, feel free to continue this work. Sorry that I have not found as much time as I had hoped to put into this. One thing I noticed though was that for large containers, things get rather slow... which I suspect is mostly due to the size-precomputation. Thinking about the present implementation some more, I came to realize that -- in the worst case -- it is of order O(n^2), for the case of a deeply nested object containing an object containing an object ...

@nlohmann
Copy link
Owner

No worries - I thank you for your work!

We may speed up the code by adding a cache like std::unordered_map<const json*, size_t> to avoid re-calculation of sizes.

@nlohmann
Copy link
Owner

There may be a problem in the current code:

I added the examples from http://bsonspec.org/faq.html as tests. The first example is OK, but the second example does not properly roundtrip: Parsing the BSON input, we get the correct JSON value. But serializing it again yields

  { '.', 0, 0, 0, 4, 'B', 'S', 'O', 'N', 0, '#', 0, 0, 0, 2, 0, 8, 0, 0, 0,
  'a', 'w', 'e', 's', 'o', 'm', 'e', 0, 1, 0, '3', '3', '3', '3', '3', '3', 20,
  '@', 16, 0, '?', 7, 0, 0, 0, 0 }

(46 characters)

However, the original input is

  { '1', 0, 0, 0, 4, 'B', 'S', 'O', 'N', 0, '&', 0, 0, 0, 2, '0', 0, 8, 0, 0,
  0, 'a', 'w', 'e', 's', 'o', 'm', 'e', 0, 1, '1', 0, '3', '3', '3', '3', '3',
  '3', 20, '@', 16, '2', 0, '?', 7, 0, 0, 0, 0 }

(49 characters)

Apparently, our encoding takes fewer bytes, but I have not yet understood why. Do you have an idea?

@julian-becker
Copy link
Contributor Author

julian-becker commented Oct 25, 2018

our representation assigns an empty e_name for the elements in an array, whereas the examples seemingly have names:
"0" --> "awesome",
"1" --> 5.05,
"2" --> 1986

It seems I have not read the notes on the array in the spec:

Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.

So we have to generate increasing integer names for the array elements.

@julian-becker
Copy link
Contributor Author

Just pushed a fix for the array serialization.

@nlohmann
Copy link
Owner

Thanks for the fix. As I already made a lot of changes in the https://github.com/nlohmann/json/compare/julian-becker-feature/bson branch, I see that I can take over the patch.

Once I did that, I propose to open a new PR based on that branch and close this PR.

@nlohmann nlohmann mentioned this pull request Oct 25, 2018
@nlohmann
Copy link
Owner

Closing this PR to avoid diverging too much from #1320. Future discussion should take place in #1320.

@nlohmann nlohmann closed this Oct 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants