Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] On-demand serialization + standardization of attributes #9924

Merged
merged 53 commits into from
Jan 10, 2024

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Dec 26, 2023

ref #9810

This PR changes the serialization logic of xgb.Booster objects to trigger only on-demand, by using an R altrep list class on which serialization methods are implemented, and not all of the expected altrep methods are implemented so as to avoid potential unwanted conversions that might lose the serializers.

Since the serialization logic changes with this PR, the way in which attributes are kept in model objects also needed to be changed:

  • Now there is a clear division between R-specific attributes, which can be arbitrary objects and are accessible and settable through attributes(model); and C-level attributes which are kept in the model JSON, which can be accessed and set through xgb.attributes(model).
  • Some attributes which were previously part of the R class were now moved to C-level attributes.
  • As serialization is now on-demand and R attributes are optional, there's no further need for an xgb.Booster.handle class. I removed it as part of this PR, since it would not have any use-case where it'd have some advantage over xgb.Booster without optional attributes.

This separation means that now accessing attributes is not as simple as calling model$<attribute> - now one needs to explicitly use either attributes or xgb.attributes.

Since the logic for keeping track of attributes was changed here, this PR required doing changes throughout pretty much all the R code. As it is right now, all of the tests are passing, but I am not entirely confident that this PR won't break something not covered in the tests, and I am not sure if I have updated all the docs that became outdated after these changes.

The changes here also now make xgboost incompatible with {caret}, so I've removed all the references to it. I also removed suggestions around it since this package is not in active development anymore as it was superceded by {tidymodels}. Haven't tested {mlr} but I have a feeling it might also break.

A couple notes about the PR and about many things I noticed - would be ideal if maintainers could open independent issues about some of these if needed:

Prioritization

As this PR creates merge conflicts with all others and it's the hardest to keep track of, would be very helpful to merge it before others like inplace predict or quantile dmatrices or roxygen updates.

Shallow and deep copies

Before this PR, serialization and de-serialization was triggered multiple times throughout the runs of different functions. Doing this was very inefficient, and had the potential to create inconsistencies between the R attributes and the C booster.

After this PR, there are no such duplications - objects are updated in-place, at the C level. It's now also possible to incrementally update a booster in-place through xgb.train, which is controlled by an optional parameter training_continuation, but I wasn't sure how this parameter should play along with xgboost parameter process_type.

  • Note that there's one edge case that I didn't know how to cover: if the user sends an interrupt signal, a booster that's being modified in-place will be left in an inconsistent state between the R attributes (like evaluation log) and the C booster. The R attributes are nevertheless not used in predict or similar methods.

Since one might now need to make copies of the booster, a helper function xgb.copy.Booster was also implemented in the public interface. I could not find any C-level function to duplicate a booster so I used the ubj to-bytes serializer for it, which I am guessing is not the most efficient way.

Information lost during serialization

It seems that the serialization functions that save models to disk (e.g. XGBoosterSaveModel), regardless of the format that they end up using (json/ubj), will never save custom user attributes that one sets to the booster outside of training, except for some particular attributes like 'niter'; unlike the serialization functions that save them to bytes like XGBoosterSaveModelToRaw which keep all of the custom attributes.

Additionally, these serializers also seem to lose feature names and types if they were set in the booster through XGBoosterSetStrFeatureInfo. As such, I modified some of the tests to manually remove the feature names for the comparisons, but I have a feeling that something here could be improved.

Serialization compatibility

After this PR, it will not be possible to load models that were saved with R serializers like saveRDS with a previous xgboost version, and it will not be possible to load models saved with saveRDS after this PR in previous xgboost versions. I updated the compatibility note to mention the breakpoint being xgboost version 2.1.0, which I suppose will be the next release.

There's a test which downloads serialized files from the internet and which will need to be updated. For now, I simply modified the test to skip the Rds files, but in the future would be more logical to update the files there.

I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.

Serialization advise

XGBoost's own docs advise users to not use R-specific serializers, but lots of the functionalities from the public interface actually require using R serializers, such as function xgb.gblinear.history, which uses attributes from callbacks that aren't part of the standard C booster and thus do not get saved in functions like XGBoosterSaveModel or XGBoosterSaveModelToRaw. Moreover, from the point above, even C-level information is lost when using to-disk serializers, such as the feature names in the booster, which can also lead to very unexpected behaviors (e.g. they are saved with xgb.save.raw, but not with xgb.save).

I was thinking that perhaps the booster could add serialized versions of the R attributes (as obtained by base::serialize), but those raw bytes would not conform to a JSON-loadable string, would add latency and memory usage, are not usable in other interfaces, among other minuses.

It leaves me wondering what should be the actual suggestion in the doc, given that a user following that advise might find that things actually break after the fact, and there's advantages to using xgb.save.raw + R's writeBin compared to directly using xgb.save.

Plotting multi-valued trees

I am not sure if the plotting functionalities need any kind of adjustment for models like multi-quantile regression, multi-target objectives, and so on. I didn't add any modification here.

In the case of multi-quantile regression, functions seem to produce some output, but I am not sure about the correctness of such output, since there's one value per leaf visible there. Should this perhaps produce an error at the C++ level?

For mutli-output regression, functions like xgb.dump will error out from the C side - I would guess that multi-quantile and multi-class might share some structural similarities but am not familiar with tree plotting.

Retrieving booster attributes

As part of this PR, I added internal functions to access fields that are part of the C booster, such as the booster type.

I implemented them by producing full dumps of the booster JSON through XGBoosterSaveJsonConfig, then parsing them with R's jsonlite, and accessing the field in that parsed JSON.

This is rather inefficient, but I couldn't find any C-level function to extract only one particular field from the booster JSON. Would be quite useful to add more functions to retrieve and set commonly used attributes like the booster type and the number of threads, or to retrieve only one particular attribute from the JSON if given a path like field1.subfield2.<etc>.

Missing functions in core library

Both the R and Python interfaces now resort to parsing text dumps in order to extract information such as model coefficients in a linear booster. If such functionalities are going to be required throughout different interfaces, would be ideal to create C-level accessors for them that would for example avoid the loss of precision from the conversion from float to string and back.

Furthermore, some of these parsings broke after adding feature names, and the regexes needed to be updated. I am not sure that I've pushed updated everywhere necessary.

Attribute 'niter'

Handling of this attribute was rather strange before this PR. This attribute was kept both in the booster attributes and in the R attributes, with the caveat that the C one used base-0 indexing, while the R one used base-1. Further, they were not updated in synch everywhere.

After this PR, I preferred to remove this attribute, sticking instead to 'nrounds' as returned by function XGBoosterBoostedRounds.

There is one particular issue with this function however: after calling XGBoosterSetParam on a fitted booster, depending on what the parameters there contain, function XGBoosterBoostedRounds will afterwards return zero, but the trees will still be there when looking at e.g. learner.gradient_booster.gbtree_model_param.num_trees in the JSON config.

Linear intercepts

When creating a gblinear booster, it will have two intercepts:

  • One given by base_score.
  • Another given by the last coefficient in the JSON.

Both of those intercepts are added together in the prediction, which is quite confusing. Would be ideal if this booster type could auto-adjust itself after the fact to contain only one intercept - for example by forcibly setting base_score to zero and adding it to the last coefficient.

It would also improve things for some objective like multi:softmax as it gets an automatically added a base_score of 0.5 which, logically speaking, does not help with convergence / loss decrease.

Legacy R code

Changing things throughout multiple places made me notice that the logic for callbacks is very hard to follow and rather unidiomatic - for example, it requires defining variables inside the function call that do not seem to be used and one cannot tell from a look at that function if they could be removed or not. I am guessing it will also be the hardest thing to review in this PR, and the one that's most prone to errors.

Since these callbacks are part of the public interface (meaning: the user can pass custom callbacks), it'd be ideal to:

  • Document them (current docs do not cover key aspects like the contents of the environments).
  • Change the logic from using environments where variables are assigned to arbitrary names (e.g. the user might not know that there's a bst variable in the environment that needs to be accessed), towards using function keyword arguments.
    • Ideally, there should be a factory function like xgb.Callback that would take arguments like before_training, after_training, before_iteration, after_iteration, each taking functions expecting a defined function signature.
    • It should also standardize the logic for keeping things from a callback in booster attributes or not.
  • Avoid usage of global variables and <<- setters - better to keep a shared R environment-class variable for such purposes.

R-specific functionality

I also noticed that there are some functionalities in the R interface that the Python one lacks, such as function xgb.gblinear.history. I think this function could also be moved into a C-level functionality to keep that coefficient history as an internal booster property.

@trivialfis
Copy link
Member

Thank you for introducing the new serialization! This is a major upgrade to the existing interface and I'm excited about it.

I will look into the PR deeper in the coming days and learn more about the changes.

Since this is related to serialization, I will assist in writing some testing guidelines, as mentioned earlier, serialization has been a foot gun and we still have issues with it today. I would like to be extra careful from the beginning.

@trivialfis
Copy link
Member

Related #9908

@david-cortes
Copy link
Contributor Author

Looks like the windows build from buildkite is still using R 3.6:

-- Found LibR: C:/Program Files/R/R-3.6.3
...
C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\R-package\src\xgboost_R.cc(679,27): error C3861: 'R_make_altlist_class': identifier not found [C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\build\R-package\xgboost-r.vcxproj]
C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\R-package\src\xgboost_R.cc(681,3): error C3861: 'R_set_altlist_Elt_method': identifier not found [C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\build\R-package\xgboost-r.vcxproj]

@hcho3
Copy link
Collaborator

hcho3 commented Dec 26, 2023

What's the minimum R version do we need for ALTREP? I can update the R version on the Windows side.

R-package/R/utils.R Outdated Show resolved Hide resolved
R-package/src/xgboost_R.cc Outdated Show resolved Hide resolved
R-package/src/xgboost_R.cc Outdated Show resolved Hide resolved
@david-cortes
Copy link
Contributor Author

What's the minimum R version do we need for ALTREP? I can update the R version on the Windows side.

It's R 4.3.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 26, 2023

I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.

Currently, we have a policy of not including binary files in the git repository, since git isn't suited for handling diffs in binary (non-text) files. Some alternatives:

  • Use Git LFS. This can get expensive pretty fast, as GitHub charges per storage and per bandwidth.
  • Keep hosting model files externally, but version the files (by adding suffix), so that old tests don't break when we update the model files later.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 26, 2023

Does it mean that the new XGBoost R package will only be compatible with R 4.3+ ? We probably should document the requirement.

@david-cortes
Copy link
Contributor Author

Does it mean that the new XGBoost R package will only be compatible with R 4.3 ? We probably should document the requirement.

It's already stated in the DESCRIPTION file, so a user installing it with install.packages will not be able to install the new xgboost version in an older R version.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 26, 2023

Sorry, I missed the conversation in #9847, as I was gone for vacation. I will go ahead and update the R version in the Windows CI pipeline.

@mayer79
Copy link
Contributor

mayer79 commented Dec 27, 2023

Awesome work!

I noted that we will have a mix of non-Markdown and Markdown help files in "xgb.Booster.R" and "xgb.model.dt.tree.R".

E.g., backticks instead of \code, two asterisks instead of \bold,

Furthermore, I am avoiding long titles, @title tags, @description tags (except when the description has multiple sections).

Of course, I can go over these files after merging. Currently, I am not working on the help files until most of the PRs will be merged to avoid such conflicts.

@david-cortes
Copy link
Contributor Author

I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.

Currently, we have a policy of not including binary files in the git repository, since git isn't suited for handling diffs in binary (non-text) files. Some alternatives:

  • Use Git LFS. This can get expensive pretty fast, as GitHub charges per storage and per bandwidth.
  • Keep hosting model files externally, but version the files (by adding suffix), so that old tests don't break when we update the model files later.

There's also the option of pre-downloading it as part of the R package build script; and/or to look for the file under somewhere like inst when running tests without creating a package plus adding it to .gitignore, and save the file there instead of tmp if it detects such a folder.

Copy link
Collaborator

@hcho3 hcho3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections from me

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge thanks for the work on serialization! It will take some time to process it (I'm still on PTO). Can we extract some of the unrelated changes and merge them first?

In addition, a brief introduction to how it works as code comment would be nice. Something that can get new comers up to speed.

R-package/src/xgboost_R.cc Show resolved Hide resolved
R-package/src/xgboost_R.cc Show resolved Hide resolved
@david-cortes
Copy link
Contributor Author

But that's also the case for UBJ serialization:

We have two distinguished cases as mentioned previously:

  • saving the model.
  • saving everything.

The xgb.save saves the model, and nothing else. It's used to "export" a model for other tasks like inference, explanation. xgb.serialize is used to save everything, that the user can continue training without any intervention, including things like whether to use GPU.

Thanks for pointing this out.

What would be the python equivalent of this function?

@trivialfis
Copy link
Member

pickle.

@trivialfis
Copy link
Member

specifically, __setstate__ and __getstate__.

@trivialfis
Copy link
Member

I'm quite excited about this PR, and am comfortable with merging it regarding the interface change, this way we can unblock other PRs and avoid too many rebase work. Do you want to merge it and leave the RDS issue in a follow-up PR or do you want to have it solved in this one (I can help with the serialization)?

@david-cortes
Copy link
Contributor Author

I've added a small note about the R attribute params.

I am wondering however if there's some C-level function that would automatically gather and serialize all C attributes that aren't handled by XGBoosterSaveModelToBuffer.

This step could be done in the same ALTREP serializer if we know what else needs to be there, but I see that the python class has many things that it keeps in attributes and I'm not sure which ones are C things that need to be re-assigned. Does it need to also gather xgb.parameters, for example? (there's currently no getter for that in R).

Do you want to merge it and leave the RDS issue in a follow-up PR

Let's better do it in this PR to keep all serialization-related things together.

@trivialfis
Copy link
Member

@david-cortes See #9924 (comment) .

XGBoosterSerializeToBuffer
XGBoosterUnserializeFromBuffer

It saves everything that's defined in C, including parameters and attributes. It's used by the now removed xgb.serialize R function, and:

_check_call(_LIB.XGBoosterUnserializeFromBuffer(handle, ptr, length))

@trivialfis
Copy link
Member

I think we can use it here: https://github.com/dmlc/xgboost/pull/9924/files#r1445141247 .

@david-cortes
Copy link
Contributor Author

Changed it to use XGBoosterSerializeToBuffer and XGBoosterUnserializeFromBuffer instead of XGBoosterSaveModelToBuffer.

Seems to pass the tests and work as expected.

@hcho3
Copy link
Collaborator

hcho3 commented Jan 10, 2024

We should probably retain the note about backward compatibility: if saveRDS is used, the model may not be fully usable in a future version of XGBoost.

@david-cortes
Copy link
Contributor Author

We should probably retain the note about backward compatibility: if saveRDS is used, the model may not be fully usable in a future version of XGBoost.

Is that because XGBoosterSerializeToBuffer doesn't have such guarantee? If so, what happens when a user tries to load an incompatible model? Is there such an incompatible serialized booster that I could download from somewhere?

@trivialfis
Copy link
Member

When loading from a different version of XGBoost, the internal parameters are automatically discarded, in which case, it's the same as xgb.save.

@david-cortes
Copy link
Contributor Author

When loading from a different version of XGBoost, the internal parameters are automatically discarded, in which case, it's the same as xgb.save.

Ok, then I guess it's no big deal, and the compatibility note makes sense to keep.

@trivialfis
Copy link
Member

trivialfis commented Jan 10, 2024

See #9734 (comment)

Ok, then I guess it's no big deal, and the compatibility note makes sense to keep.

sounds good!

@hcho3

This comment has been minimized.

@david-cortes
Copy link
Contributor Author

Then let's drop the warning about saveRDS ?

I've rewritten the compatibility note to reflect the current situation per my understanding. Would be ideal if you and @trivialfis could take a look at it. It doesn't have a warning to not use R's serializers, but describes a bit the differences.

I think it's still important to have it, since now we have an incompatibility with models that were created before this PR, and an incompatibility with older versions of qs which is also used for serialization.

R-package/R/utils.R Outdated Show resolved Hide resolved
@hcho3
Copy link
Collaborator

hcho3 commented Jan 10, 2024

Thanks for the new wording. I recognize that some use cases do call for the use of saveRDS.

R-package/R/utils.R Outdated Show resolved Hide resolved
R-package/R/utils.R Outdated Show resolved Hide resolved
david-cortes and others added 3 commits January 10, 2024 20:44
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@trivialfis trivialfis merged commit d3a8d28 into dmlc:master Jan 10, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants