
Support UBJSON-derived Binary JData (BJData) format #3336

Merged
merged 51 commits on Apr 29, 2022

Conversation

fangq
Contributor

@fangq fangq commented Feb 17, 2022

Dear Niels (@nlohmann) and developers, thank you for developing this great library!

I am currently working on a project funded by the US National Institutes of Health (NIH), named NeuroJSON (http://neurojson.org), with the goal of developing standardized data formats for data sharing within the broad neuroimaging community (MRI, fMRI, EEG, etc.). For this purpose, we have adopted JSON and a UBJSON-derived binary JSON as our primary formats. We noticed that several prominent neuroimaging packages, for example FreeSurfer, use your library for their JSON support.

I would like to contribute a patch to read/write our UBJSON-derived binary JSON format - Binary JData (BJData). The format specification for BJData can be found here.

The BJData format extends UBJSON to address several limitations (see ubjson/universal-binary-json#109). These extensions include:

  1. 4 new data type markers are added to support missing types from UBJSON, including
    • [u] - uint16,
    • [m] - uint32,
    • [M] - uint64 and
    • [h] - half/float16
  2. it further extends the "optimized format" to allow efficient storage of N-dimensional (ND) packed arrays - a data type that is of great importance to the scientific community. This is done by accepting a 1D integer array following the `#` marker. For example, a 2x3 2D int8 array [[1,2,3],[4,5,6]] can be stored as `[ $ i # [ $ i # i 2 2 3 1 2 3 4 5 6`, where `# [ $ i # i 2 2 3` stores the dimensional vector (2x3) as the size, or alternatively as `[ $ i # [ i 2 i 3 ] 1 2 3 4 5 6` (a byte-level sketch of the first form follows this list).
  3. BJData uses little-endian as the default byte order, as opposed to the big-endian order of UBJSON/MessagePack/CBOR, to simplify adoption.
  4. Updated - 03/02/2022: only non-zero-fixed-length data types are allowed in optimized container types ($), which means `[{SHTFTN` cannot follow `$`, but `UiuImLMLhdDC` are allowed.
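For concreteness, here is the first encoding above written out as raw bytes (a sketch derived purely from the marker sequence in item 2, not output captured from any implementation):

```cpp
#include <cstdint>
#include <vector>

// 2x3 int8 array [[1,2,3],[4,5,6]] in the BJData ND-array optimized format
const std::vector<std::uint8_t> bjdata_2x3 = {
    '[', '$', 'i', '#',          // array header: element type int8, size follows
    '[', '$', 'i', '#', 'i', 2,  // the size is itself a length-2 int8 array ...
    2, 3,                        // ... holding the dimensional vector [2, 3]
    1, 2, 3, 4, 5, 6             // row-major payload of [[1,2,3],[4,5,6]]
};
```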

In this patch, I extended json.hpp to read/write BJData-formatted binary files/streams. Two driver functions, to_bjdata and from_bjdata, were added. I've also added a unit-testing script modified from the UBJSON test file, with additional tests for the new data markers and the optimized N-D array container format (as well as for little-endianness). The test suite runs without any error.
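A minimal usage sketch, mirroring the existing to_ubjson/from_ubjson interface (the exact signatures are of course up for review):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

using json = nlohmann::json;

int main()
{
    json j = {{"name", "BJData"}, {"values", {1, 2, 3}}};

    // serialize to BJData, optionally with size/type-optimized containers
    std::vector<std::uint8_t> bytes = json::to_bjdata(j, /*use_size*/ true, /*use_type*/ true);

    // parse it back
    json j2 = json::from_bjdata(bytes);
    return j == j2 ? 0 : 1;
}
```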

The only item on which I need some help/feedback is the storage of packed ND-arrays (i.e. numerical arrays of uniform type). UBJSON only supports 1D packed arrays via the optimized format (collapsing the type marker into the header following $), but efficiently storing N-D arrays is essential in the neuroimaging and scientific communities.

My updated binary_reader can recognize the above N-D array optimized container, but I could not find a place to store the dimensional-vector information in the json data structure (and subsequently use it to pack the ND array in the writer). So my current approach is to serialize the N-D array as a 1D vector after parsing. Because of this, the tests involving optimized ND-array constructs are not yet round-trip invariant, so I temporarily disabled those round-trip tests.

Feel free to review my patch; in the meantime, I am wondering if you have any suggestions/comments on how to better parse/handle the N-D packed array construct mentioned above. Other than that, the rest of the code seems to work quite well.

thanks


Pull request checklist

Read the Contribution Guidelines for detailed information.

  • Changes are described in the pull request, or an existing issue is referenced.
  • The test suite compiles and runs without error.
  • Code coverage is 100%. Test cases can be added by editing the test suite.
  • The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.

Please don't

  • The C++11 support varies between different compilers and versions. Please note the list of supported compilers. Some compilers like GCC 4.7 (and earlier), Clang 3.3 (and earlier), or Microsoft Visual Studio 13.0 and earlier are known not to work due to missing or incomplete C++11 support. Please refrain from proposing changes that work around these compilers' limitations with #ifdefs or other means.
  • Specifically, I am aware of compilation problems with Microsoft Visual Studio (there is even an issue label for this kind of bug). I understand that even in 2016, complete C++11 support isn't there yet. But please also understand that I do not want to drop features or uglify the code just to make Microsoft's sub-standard compiler happy. The past has shown that there are ways to express the functionality such that the code compiles with the most recent MSVC - unfortunately, this is not the main objective of the project.
  • Please refrain from proposing changes that would break JSON conformance. If you propose a conformant extension of JSON to be supported by the library, please motivate this extension.
  • Please do not open pull requests that address multiple issues.

@fangq fangq requested a review from nlohmann as a code owner February 17, 2022 03:00
@coveralls

coveralls commented Feb 17, 2022

Coverage Status

Coverage remained the same at 100.0% when pulling f6ceebc on NeuroJSON:develop into 1a90c94 on nlohmann:develop.

Contributor

@gregmarr gregmarr left a comment


Looks reasonable, but as the owner @nlohmann will need to weigh in. A few suggestions on the code.

@fangq
Contributor Author

fangq commented Feb 20, 2022

After spending the past few days polishing this patch, I got all tests to turn green in the last completed build. The suggestions by @gregmarr have also been addressed and incorporated. The PR is ready for someone to take another look.

Owner

@nlohmann nlohmann left a comment


I am not sure whether it's a good idea to add this to the library. My concerns are:

  • It brings some 4K LOC to an already bloated library. The fact that it's bloated is not your fault, of course, but as long as we cannot make binary format support optional, this goes into every deployment of the library.
  • I could not find a "final" specification, but rather a proof of concept or work in progress. Changing the specification then requires an adjustment of this implementation which most likely would be a breaking change. I'd rather not link this library's public API to such an external factor.
  • I would like to see benchmarks that show the format is really superior to the existing formats in any aspect. On https://json.nlohmann.me/features/binary_formats/#sizes, you see that CBOR and MessagePack provide the best compression for several examples. Could you please check how BJData performs for these files?

What do you think? @fangq @gregmarr

@nlohmann nlohmann added the "aspect: binary formats" and "state: please discuss" labels Feb 20, 2022
@fangq
Contributor Author

fangq commented Feb 20, 2022

@nlohmann, thanks for the comments; my replies are below.

It brings some 4K LOC to an already bloated library. The fact that it's bloated is not your fault, of course, but as long as we cannot make binary format support optional, this goes into every deployment of the library.

A total of 3442 of the 4261 added lines are from the test directory. The remaining 800 additions are double-counted because of the amalgamation, so the net addition to the library itself is just ~400 lines, which is <2% of the json.hpp line count.

If the length of the test unit is your concern, I am happy to remove all tests that produce identical results to UBJSON and only keep those related to the new features. I previously thought that keeping the two test units similar would make them easier to maintain moving forward (and add additional robustness to the tests).

I could not find a "final" specification, but rather a proof of concept or work in progress. Changing the specification then requires an adjustment of this implementation which most likely would be a breaking change. I'd rather not link this library's public API to such an external factor.

The spec you see in our GitHub repository is intended to be the stable version, and we are now moving on to implementing/disseminating it for various software tools. I totally agree that a stable spec is essential; that also aligns with our goals.

I would like to see benchmarks that show the format is really superior to the existing formats in any aspect. On https://json.nlohmann.me/features/binary_formats/#sizes, you see that CBOR and MessagePack provide the best compression for several examples. Could you please check how BJData performs for these files?

BJData is designed to be an improved version of UBJSON. The new data markers (u/m/M/h) simply extend UBJSON to unambiguously map signed and unsigned integers, which UBJSON lacks. The file-size saving largely comes from the use of optimized N-D array containers. If a data file contains many N-D (2D/3D/...) arrays, as in canada.json, the space saving can be significant.

As you saw from my initial post, this PR contains only the reader for the ND-array container (ND arrays are serialized to a 1D vector, as appropriate), but not the writer part, because I have not yet identified an internal data structure to store the dimensional vector. However, my MATLAB JSON reader/writer, JSONLab, contains a full implementation of both the UBJSON and BJData formats (for MATLAB/Octave), so I was able to create the file-size comparison below using my parser:

**Updated to use minimized .json files as the baseline**

| Format | canada.json | twitter.json | citm_catalog.json | jeopardy.json |
| --- | ---: | ---: | ---: | ---: |
| JSON | 2251051 | 631515 | 1727204 | 55554625 |
| JSON (mini, reference) | 2090235 | 466907 | 500300 | 52508729 |
| UBJSON | 1112348 | 361370 | 382342 | 52044479 |
| UBJSON % | 53.2% | 77.4% | 76.4% | 99.1% |
| BJData | 894934 | 360864 | 381939 | 51880363 |
| BJData % | 42.8% | 77.3% | 76.3% | 98.8% |

Although this is a different library, I want to focus on the changes from UBJSON to BJData, which result from the format improvements. As you can see, the behaviors of BJData and UBJSON are quite similar. The size improvement of BJData is more prominent in canada.json because it contains a lot of 2D arrays, and those are serialized more efficiently with the optimized header (the headers of all nested 1D vectors are merged to the front).

Let me know what you think.

@nlohmann
Owner

My mistake - I added the test suite to the added LOC. Sorry for that.

@fangq
Contributor Author

fangq commented Feb 22, 2022

@nlohmann, I froze Draft 2 of the BJData specification so that we can refer to the implemented version using the stable document below:
https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md
or
http://neurojson.org/bjdata

@fangq
Contributor Author

fangq commented Feb 23, 2022

@nlohmann and @gregmarr, most suggested changes have been committed.

@nlohmann: @gregmarr and I wanted to hear your opinion on some of the remaining review comments; see the discussions above.

The updated library, if accepted, should be able to read/write most BJData files. I do want to pick your brain on completing the ndarray optimized header support - as I mentioned above, a major size-reduction mechanism offered by BJData is the recognition and collapsing of packed ND-array markers; you can see the nearly 20% size reduction over UBJSON in my benchmark above using canada.json.

To fully support this, on the reader side, we need to

  • recognize and parse [ following # and get the data size (already done, but only the total size is returned, not the dimensional vector), and
  • Approach (1): if possible, store the dimensional vector in the metadata of the array element while keeping the serialized data in a 1D vector; or, alternatively, Approach (2): split the serialized vector into nested vectors like a standard nested JSON array. Between the two, the first approach is more efficient and easier to implement; it also needs no computation to re-collapse the type/size when writing back, but it requires an internal data structure to store such info, which I am not sure exists.

On the writer side, we need to

  • if we take Approach (1) on the reader side, check the dimensional vector, if it exists in the metadata, and write the data payload (as a 1D vector) back in the optimized ndarray format,
  • for nested numeric arrays (iUIulmLM), regardless of whether they came from Approach (2) or are native in the data, extend this lambda function to recognize same_prefix at all recursive levels (not just the top level as in the current code) and return the total depth (Nd), plus another lambda function applied at each depth level (i = 0 to Nd-1) to test 1) whether all elements have the same_length and, if so, 2) return that size as the dimension for the next depth (i+1). Once same_prefix is confirmed and the dimensional vector is retrieved, the serialized data can be written in the ndarray format (a rough sketch of such a recursive check follows this list).
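As a rough illustration of the recursive check described in the last bullet, a hypothetical helper (names and structure are mine, not part of this patch) could walk a nested json array, verify same length and same scalar type at every level, and collect the dimensional vector:

```cpp
#include <nlohmann/json.hpp>
#include <cstddef>
#include <vector>

using nlohmann::json;

// Sketch only: returns true and fills `dims` (one entry per depth) if `j` is a
// rectangular nested array whose leaves share one numeric type. A complete
// implementation would recurse into every sibling, not just the first one.
bool collect_ndarray_dims(const json& j, std::vector<std::size_t>& dims)
{
    if (!j.is_array() || j.empty())
    {
        return false;
    }
    dims.push_back(j.size());

    if (j.front().is_array())
    {
        const std::size_t len = j.front().size();
        for (const auto& el : j)              // same_length check at this level
        {
            if (!el.is_array() || el.size() != len)
            {
                return false;
            }
        }
        return collect_ndarray_dims(j.front(), dims);
    }

    const auto type = j.front().type();       // same_prefix check at the lowest level
    for (const auto& el : j)
    {
        if (!el.is_number() || el.type() != type)
        {
            return false;
        }
    }
    return true;
}
```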

Without this feature, the current patch is still going to meet most of our needs (in combination with our standardized data-annotation method), but a full implementation of the spec would certainly be helpful for general cases.

happy to hear your thoughts on this.

@gregmarr
Contributor

gregmarr commented Feb 23, 2022

Just brainstorming here. These both seem like a bit of magic, might be very confusing for the user, and I'm not even sure that the writer has the ability to support the first one.

Could you create an object containing a vector of dimensions and then a 1D vector of values? The writer would then need to recognize this object and write it appropriately.

{ "BJDataDimensions": [], "BJDataFlattenedValues": [] }

Another possibility is that the 1D vector contains first the number of dimensions, then the dimension values, and then the actual data. The downsides here are that these could be stored as doubles when they're really ints, and that you have to account for the extra indices at the beginning of the vector when accessing them. (This seems like essentially what is happening in the actual data file.)
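For concreteness, that layout for a 2x3 example would be something like (purely illustrative):

```cpp
// [ndims, dim0, dim1, data...] -- all stored as plain JSON numbers in one array
nlohmann::json flat = {2, 2, 3, 1, 2, 3, 4, 5, 6};
```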

@gregmarr
Contributor

I suppose the choice of 1 vs 2 for the reader side depends on how you expect users to access the data. Either way, I think the data needs to be accessible to the user, unless you're just using this library to load the data and save it again, and don't expect the user to actually access the stored data. I don't know your expected users, but I think either your proposal 2 or my first proposal of an object would be the easiest for users to work with.

@fangq
Contributor Author

fangq commented Feb 23, 2022

Could you create an object containing a vector of dimensions and then a 1D vector of values? The writer would then need to recognize this object and write it appropriately.

{ "BJDataDimensions": [], "BJDataFlattenedValues": [] }

This was actually my initial plan, and the easiest to implement, because it is basically the same approach we took to serialize complex data structures at the data-annotation level (defined in our JData spec). In our JData annotation format, ND arrays can be serialized in an object form (before storage in JSON/UBJSON/BJData/Msgpack/HDF5, etc.):

   {
       "_ArrayType_": "typename",
       "_ArraySize_": [N1,N2,N3,...],
       "_ArrayData_": [1,2,3,4,5,6]
   }

where typename has a 1-to-1 mapping to all BJData markers.

This form is still portable and is JSON/UBJSON/BJData compliant. When serving this data to JData-aware programs, the above JData annotation tags can be recognized and optionally used to assemble the object back into an ND array in the native format - as I did in this Python module and my MATLAB toolbox. However, such data can also be served to non-JData-aware programs, where the packed ND-array data can be accessed via the above generic JSON object keys.
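For example, a non-JData-aware consumer could still reach the payload through ordinary JSON access (a sketch; the index arithmetic is only for illustration):

```cpp
#include <nlohmann/json.hpp>
#include <cstddef>
#include <vector>

using json = nlohmann::json;

const json j = json::parse(
    R"({"_ArrayType_":"int8","_ArraySize_":[2,3],"_ArrayData_":[1,2,3,4,5,6]})");

const auto dims = j.at("_ArraySize_").get<std::vector<std::size_t>>();
const auto data = j.at("_ArrayData_").get<std::vector<int>>();

// element (row, col) of the row-major 2x3 array, e.g. row 1, col 2 -> 6
const int value = data.at(1 * dims[1] + 2);
```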

If both of you are on board with this approach, I will be happy to add the implementation.

@fangq
Contributor Author

fangq commented Feb 28, 2022

@gregmarr and @nlohmann, I've just pushed a few commits to finally complete this patch. Now json.hpp fully supports all features of the BJData format and is capable of round-trip handling of ndarray objects, including bi-directional translation between the bjdata ndarray optimized container and the JData-annotated JSON representation {"_ArrayType_":"...", "_ArraySize_":[...], "_ArrayData_":[...]}, which provides an intuitive interface to make such data accessible to users.


Update (03/01/2022) - please ignore the quoted question below; I figured it out and added the relevant tests.

I have a quick question on completing the test coverage: there are a few lines in binary_reader that remain untested for the return values of the sax->member() calls, see

https://coveralls.io/builds/46919342/source?filename=include%2Fnlohmann%2Fdetail%2Finput%2Fbinary_reader.hpp#L2112

In particular, I am wondering if you can point me to the test in unit-bson.cpp that specifically tests the condition below:

https://coveralls.io/builds/46919342/source?filename=include%2Fnlohmann%2Fdetail%2Finput%2Fbinary_reader.hpp#L167

My calls are pretty straightforward ((!sax->start_object(3) || !sax->key(key) || !sax->start_array(dim.size()))), and I found it hard to construct a test that makes them fail. The above BSON line was the only place I found with a similar test.


Other than that, this patch is pretty much done. It adds ~600 lines of code to json.hpp, but enables reading/writing the BJData format, which inherits almost all the benefits of UBJSON (among which simplicity and quasi-human-readability are the most valuable, IMHO) yet further extends it with complete binary type representations and packed ND-array support - which can lead to significant space savings when storing/exchanging scientific/imaging data files. Overall, I think it would be a suitable addition to the already comprehensive set of binary-JSON readers/writers provided in this library. I am committed to supporting all code/tests related to this feature if accepted.

Let me know what you think

@fangq fangq requested a review from nlohmann March 13, 2022 14:36
nlohmann added a commit to nlohmann/json_test_data that referenced this pull request Apr 3, 2022
Add bjdata test files for unit testing of nlohmann/json#3336
@nlohmann
Owner

nlohmann commented Apr 3, 2022

There is now version 3.1.0 of the test data available: https://github.com/nlohmann/json_test_data/releases/tag/v3.1.0

@fangq
Contributor Author

fangq commented Apr 4, 2022

In e6b2b73, I've updated the test data version so that it downloads the latest files. I also enabled the file-based unit tests for bjdata; it compiles and runs fine.

@nlohmann
Owner

nlohmann commented Apr 4, 2022

Can you please update to the latest develop branch - this should fix the MSVC 2017 jobs.

@nlohmann nlohmann added the "please rebase" label Apr 4, 2022
@nlohmann nlohmann removed the "please rebase" label Apr 4, 2022
@fangq
Contributor Author

fangq commented Apr 4, 2022

@nlohmann, the MSVC 2017 tests work now after rebasing - but MSVC 2019-2022 still fail; the error is caused by the new *.bjdata files missing from the test dataset. I suspect that some of the MSVC 2019 CI settings were not updated and are still checking out test data v3.0.0.

Can you let me know where these CI files might be located?

@nlohmann
Owner

nlohmann commented Apr 6, 2022

@nlohmann, the MSVC 2017 tests work now after rebasing - but MSVC 2019-2022 still fail; the error is caused by the new *.bjdata files missing from the test dataset. I suspect that some of the MSVC 2019 CI settings were not updated and are still checking out test data v3.0.0.

Can you let me know where these CI files might be located?

The stack overflow can be fixed by adjusting this line https://github.com/nlohmann/json/blob/develop/test/CMakeLists.txt#L80 and adding the new test binary there.

The version of the test data is set only in one place: https://github.com/nlohmann/json/blob/develop/cmake/download_test_data.cmake

@fangq
Contributor Author

fangq commented Apr 16, 2022

@nlohmann, I am wondering whether you are considering accepting this PR. I see more PRs related to the binary reader/writer being prepared. I can resolve the current conflict, although I worry that more conflicts will come if we leave this open.

Please let me know. Thanks!

@nlohmann
Owner

@nlohmann, I am wondering whether you are considering accepting this PR. I see more PRs related to the binary reader/writer being prepared. I can resolve the current conflict, although I worry that more conflicts will come if we leave this open.

Please let me know. Thanks!

I will approve it once I am back from vacation. Sorry for the inconvenience. Thanks for your work and patience.

@falbrechtskirchinger
Contributor

@fangq I had a chance to think about potential implications for alternate string types some more and do not believe there are any issues.

@fangq
Contributor Author

fangq commented Apr 26, 2022

@falbrechtskirchinger, thanks - let me fix the remaining CI errors resulting from the merge of fd2eb29.

Owner

@nlohmann nlohmann left a comment


Looks good to me.

@nlohmann nlohmann added the "release item: ✨ new feature" label and removed the "state: please discuss" label Apr 29, 2022
@nlohmann nlohmann added this to the Release 3.11.0 milestone Apr 29, 2022
@nlohmann nlohmann merged commit ee51661 into nlohmann:develop Apr 29, 2022
@nlohmann
Owner

Thanks a lot @fangq for your work and patience!

@falbrechtskirchinger Let's fix the remaining issues in separate PRs.

@fangq
Contributor Author

fangq commented Apr 29, 2022

Thank you @nlohmann and everyone who helped review this PR! I am super excited that this format is now supported in this extremely popular (and extremely well-crafted) library. Thank you for all the wonderful work!

@nlohmann
Owner

nlohmann commented May 1, 2022

| Format | canada.json | twitter.json | citm_catalog.json | jeopardy.json |
| --- | ---: | ---: | ---: | ---: |
| JSON | 2251051 | 631515 | 1727204 | 55554625 |
| JSON (mini, reference) | 2090235 | 466907 | 500300 | 52508729 |
| UBJSON | 1112348 | 361370 | 382342 | 52044479 |
| UBJSON % | 53.2% | 77.4% | 76.4% | 99.1% |
| BJData | 894934 | 360864 | 381939 | 51880363 |
| BJData % | 42.8% | 77.3% | 76.3% | 98.8% |

I am trying to reconstruct the data above.

I come to the following vector sizes:

canada.json

  • CHECK(ubjson_1_size == 1112030);
  • CHECK(ubjson_2_size == 1224148);
  • CHECK(ubjson_3_size == 1169069);

twitter.json

  • CHECK(bjdata_1_size == 425342);
  • CHECK(bjdata_2_size == 429970);
  • CHECK(bjdata_3_size == 429970);

citm_catalog.json

  • CHECK(bjdata_1_size == 390781);
  • CHECK(bjdata_2_size == 433557);
  • CHECK(bjdata_3_size == 432964);

jeopardy.json

  • CHECK(bjdata_1_size == 50710965);
  • CHECK(bjdata_2_size == 51144830);
  • CHECK(bjdata_3_size == 51144830);

The numbers were calculated with:

const auto ubjson_1_size = json::to_ubjson(j).size();
const auto ubjson_2_size = json::to_ubjson(j, true).size();
const auto ubjson_3_size = json::to_ubjson(j, true, true).size();
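The bjdata_* sizes were presumably obtained with the corresponding to_bjdata overloads (those calls were not shown above):

const auto bjdata_1_size = json::to_bjdata(j).size();
const auto bjdata_2_size = json::to_bjdata(j, true).size();
const auto bjdata_3_size = json::to_bjdata(j, true, true).size();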

@fangq Any idea why my numbers differ so much?

@nlohmann nlohmann mentioned this pull request May 1, 2022
@fangq
Contributor Author

fangq commented May 2, 2022

@nlohmann, regarding the file-size comparison: as you can see from the results above, the current BJData output is nearly identical to that of UBJSON. This is expected - although the current reader can read all valid BJData files and the writer can write BJData-compliant files, the writer does not yet take full advantage of the BJData features to produce more compact output.

As I mentioned in #3336 (comment), the main size-reduction mechanism in BJData is the ND-array optimized format, which compresses the type markers of all data elements (assumed to be of uniform type) in an otherwise nested array construct.

To perform such compression automatically, we will need to extend the same_prefix lambda function to recognize/compress the markers not only in sub-arrays one level lower, but at all nested levels, and to 1) verify that the sub-arrays at each level all have the same length, and 2) verify that they have the same prefix (a numeric type at the lowest level, and '[' at all upper levels).

The files I used to produce the previously mentioned file-size comparison are attached below (jeopardy is not included because of its large size). All of these files were produced by my MATLAB/Octave json/ubjson/bjdata/msgpack parser, JSONLab - because MATLAB/Octave natively supports ND-arrays, all data encoding and decoding can use that array information to fully compress.

bjd_ubj_sizetest.zip

I can do a little bit more in-depth comparison between the .ubj and .bjd files (especially canada.ubj and canada.bjd in the above zip file) once I get a debugging tool updated.

@nlohmann
Owner

nlohmann commented May 2, 2022

Thanks for the clarification - good to know I did not make any wrong calls.

So what needs to be done to have BJData really use its potential for the benchmarks? Just extend the same_prefix function?

@fangq
Contributor Author

fangq commented May 3, 2022

So what needs to be done to have BJData really use its potential for the benchmarks? Just extend the same_prefix function?

Yes, I believe so. Other than that, to trigger such "deep compression", an extra parameter, const bool use_ndarray, should be added to to_bjdata().

Currently, this PR supports the round-trip conversion below:

BJData ND-array ([ $ ◌ # [ $ ◌ # ◌ ... ....) <-> {"_ArrayType_":"...","_ArraySize_":[...],"_ArrayData_":[....]}

The reason we had to use a structured object on the JSON side is that there is no native C++ data structure that can hold this ND-array without losing the dimensional/type info (unless you are aware of a data structure that is more suitable).
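For example, a rough sketch of this round trip (the decoded _ArrayType_ string shown in the comment is illustrative):

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

using json = nlohmann::json;

// the 2x3 int8 ND-array example from my first post
const std::vector<std::uint8_t> bjdata_2x3 = {
    '[', '$', 'i', '#', '[', '$', 'i', '#', 'i', 2, 2, 3, 1, 2, 3, 4, 5, 6};

// decoding produces the JData-annotated object, e.g.
// {"_ArrayData_":[1,2,3,4,5,6],"_ArraySize_":[2,3],"_ArrayType_":"int8"}
json j = json::from_bjdata(bjdata_2x3);

// writing it back (with size/type optimization enabled) keeps the ND-array form
std::vector<std::uint8_t> out = json::to_bjdata(j, /*use_size*/ true, /*use_type*/ true);
```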

If a nested vector is to be recognized and packed as a BJData ND-array stream, then we may need to add a parameter to from_bjdata() to decide whether the decoder should unpack the data into the above structured format or into a nested-vector format (which would need to be recognized/compressed again when encoding).

@nlohmann
Owner

nlohmann commented May 8, 2022

So basically, the reader is feature complete as it can read BJData created with this optimization, but the writer currently only creates output without that optimization?

@fangq
Contributor Author

fangq commented May 9, 2022

So basically, the reader is feature complete as it can read BJData created with this optimization, but the writer currently only creates output without that optimization?

that is correct, with the exception that, if a bjdata ND-array is read by the reader, it stays optimized when writing to disk.

The only thing currently missing from the writer is recognizing a packed nd-array in un-optimized input, such as a JSON input like [[1,2],[3,4],[5,6]].

@nlohmann
Owner

nlohmann commented May 9, 2022

that is correct, with the exception that, if a bjdata ND-array is read by the reader, it stays optimized when writing to disk.

How do you achieve this? Where do you store this information?

@falbrechtskirchinger
Contributor

that is correct, with the exception that, if a bjdata ND-array is read by the reader, it stays optimized when writing to disk.

How do you achieve this? Where do you store this information?

Is that because ND-arrays are stored as objects which hold metadata in _ArraySize_, etc.?

@nlohmann
Owner

nlohmann commented May 9, 2022

Oof. I really need a more detailed look at the specification. I don't like the way metadata is leaked into the JSON values.

@falbrechtskirchinger
Contributor

You can find the specification for the annotations here:
https://github.com/fangq/jdata/blob/master/JData_specification.md#data-annotation-keywords

@fangq
Contributor Author

fangq commented May 10, 2022

Oof. I really need a more detailed look at the specification. I don't like the way metadata is leaked into the JSON values.

Yes, it is not ideal. In MATLAB/Octave, I could use the native matrix data structure to store the ND-array information (including dimension and type info), and the same holds in Python (numpy.ndarray) and JavaScript (numjs/ndarray, a 3rd-party lib). But for C++, at least from my limited reading of the std data structures, I still could not find a native ND-array container to hold such information - one could use nested vector/array objects for ND-arrays, like vector<vector<T>> or array<array<T,N2>,N1>, but the depth/dimensions are not programmable at run time (again, I may be wrong).
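For example (just to illustrate the point; both the sizes and the nesting depth are baked into the type):

```cpp
#include <array>
#include <cstdint>

// the sizes (2, 3) and the nesting depth are template arguments, i.e. fixed at compile time
std::array<std::array<std::int8_t, 3>, 2> fixed_2x3{};
```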

The JData spec aims to define a language-neutral approach to represent scientific data structures (ND arrays, complex-valued/sparse arrays, tables, maps, binary, compressed binary, etc.) so that it can be used as a data-exchange layer between programming environments. It uses JSON tags to serialize the metadata associated with each type of data structure, and it can be used as a fallback for specialized data structures that are not supported in a given environment, for example, to support sparse arrays in Python without installing scipy.

@nlohmann
Owner

Thanks for clarifying.

It would be great if you could have a look at #3464 - in particular the files

  • docs/mkdocs/docs/api/basic_json/from_bjdata.md
  • docs/mkdocs/docs/api/basic_json/to_bjdata.md
  • docs/mkdocs/docs/features/binary_formats/bjdata.md (in particular in comparison to, e.g., docs/mkdocs/docs/features/binary_formats/cbor.md)

I'd really like to have the documentation on par with the other binary formats - in particular regarding the limitations of the implementation.

@fangq
Contributor Author

fangq commented May 11, 2022

@nlohmann, do you want me to fill in the Serialization and Deserialization sections of binary_formats/bjdata.md? Or have you already drafted it? Happy to adapt those from ubjson.md.

@nlohmann
Owner

nlohmann commented May 11, 2022

@nlohmann, do you want me to fill in the Serialization and Deserialization sections of binary_formats/bjdata.md?

Yes, exactly.

Or have you already drafted it? Happy to adapt those from ubjson.md.

No. #3464 (which I will merge into develop soon) has the current state. It would be best if you made a PR against #3464.
