Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Unify model format customize string or Json #4887

Open
tongwu-sh opened this issue Dec 13, 2021 · 7 comments
Open

[RFC] Unify model format customize string or Json #4887

tongwu-sh opened this issue Dec 13, 2021 · 7 comments

Comments

@tongwu-sh
Copy link
Contributor

Summary

Currently we have 2 model formats: customize string and json. It is hard to maintain 2 serialize/deserialize for model. Specially for some components with complex parameters like category encoders... (almost 1/3 code is about serialize/deserialize and hard to extend). We'd better only support 1 model format.

Motivation

Json format is more reasonable to me, it is standard format and easy to handle hierarchy structure. Backward should be a big problem, I would suggest for new features we can use DumpToString() == DumpToJson(), and replace old component to use json as default format and keep customize string as deprecate method.

@tongwu-sh
Copy link
Contributor Author

tongwu-sh commented Dec 13, 2021

@jameslamb , @shiyu1994 , @StrikerRUS , @guolinke, @hzy46 not sure your opinion on this? thanks!

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Dec 13, 2021

I'm for JSON too! However, I guess we should compare performance of these two formats to make a thoughtful choice.

Note, XGBoost is also using JSON as their main serialization format.

dmlc/xgboost#3980
dmlc/xgboost#5046
dmlc/xgboost#5772

@StrikerRUS
Copy link
Collaborator

Linking original issue: #2604.

@guolinke
Copy link
Collaborator

I remember these is a multi-threading speed-up for string format model file loading.
If the loading speed of json format is comparable, I am okay with the change.
And we may need to consider backward compatibility.

@jameslamb jameslamb changed the title Unify model format customize string or Json [RFC] Unify model format customize string or Json Dec 17, 2021
@jameslamb
Copy link
Collaborator

@guolinke 's comments make sense to me, that it would be harder to take advantage of parallelism when parsing JSON than when parsing the existing text format. But I think it's worth trying and benchmarking!

I'll also note here that @lemire, author of fast_double_parser who was so helpful with #3405, has worked on a fast JSON parser that we could consider vendoring in (similarly to fmt, fast_double_parser, `etc.): https://github.com/simdjson/simdjson

@lemire
Copy link

lemire commented Dec 17, 2021

@jameslamb We would be there to help if you choose to adopt simdjson. At this point, it is a mature library, with extensive tests and documentation. We support two APIs: our On Demand front-end (high perf. default) and a conventional DOM approach.

@StrikerRUS
Copy link
Collaborator

Our friends at XGBoost are trying to adopt Universal Binary JSON format as a binary serialization format: dmlc/xgboost#7545.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants