[RFC] [R-package] Replace "info" interface in lgb.Dataset with keyword arguments #4543

Closed
12 tasks done
jameslamb opened this issue Aug 21, 2021 · 4 comments

Comments

@jameslamb
Collaborator

jameslamb commented Aug 21, 2021

Summary

The following changes should be made to lgb.Dataset() in the R package.

"deprecated" = "supported, but raises a warning if used".

In release 3.3.0 (#4310)

In release 4.0.0

Motivation

  • reduces maintenance burden by making the R package more closely resemble the Python package
  • improves usability, especially for users working in IDEs like RStudio
  • reduces the volume of deprecation warnings for users of version 3.3.0 (since weight, init_score, etc. will match keyword args and not be part of ...)
  • reduces the risk of bugs by simplifying the interface
    • for example, would allow the removal of this logic:
      for (key in names(additional_params)) {
        # Is this key one of the known "info" fields?
        if (key %in% .INFO_KEYS()) {
          # Store as info
          info[[key]] <- additional_params[[key]]
        } else {
          # Store as param
          params[[key]] <- additional_params[[key]]
        }
      }
    • and would remove the need to worry about problems like "what happens if you provide init_score as an argument passed through ... and a different init_score in the info list?"
  • adding deprecation warnings now, plus support for the pattern we want to support from 4.0.0 onwards, gives users time (probably on the order of months) to change their code before the breaking changes in 4.0.0 are released
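The ambiguity described in the bullets above can be sketched in a few lines. This is a rough Python analogue of the quoted R loop, not the package's actual code: `INFO_KEYS`, `split_additional_params`, and `make_dataset_args` are hypothetical names, and the contents of `INFO_KEYS` are an assumption about what `.INFO_KEYS()` returns.

```python
# Hypothetical stand-in for the R package's .INFO_KEYS(); contents are assumed.
INFO_KEYS = ("label", "weight", "init_score", "group")


def split_additional_params(additional_params, info=None, params=None):
    """Mimic the quoted R loop: route each extra argument to info or params."""
    info = dict(info or {})
    params = dict(params or {})
    for key, value in additional_params.items():
        if key in INFO_KEYS:
            # Silently overwrites any value already supplied through `info`.
            info[key] = value
        else:
            params[key] = value
    return info, params


def make_dataset_args(data, label=None, weight=None, group=None,
                      init_score=None, params=None):
    """Keyword-argument style: each field has exactly one way in."""
    return {
        "data": data,
        "label": label,
        "weight": weight,
        "group": group,
        "init_score": init_score,
        "params": dict(params or {}),
    }


# Two different init_score values reach the same constructor call.
info, params = split_additional_params(
    {"init_score": [0.5, 0.5], "max_bin": 63},
    info={"init_score": [0.1, 0.9]},
)
print(info, params)
```

In this sketch the value passed as an extra argument replaces the one in `info` without any warning. With explicit keyword arguments, as in `make_dataset_args`, there is only one place to pass `init_score`, so the conflict cannot arise.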

Description

LightGBM training involves some preprocessing like bucketing continuous features into histograms and filtering out unsplittable features. That work is done one time before training begins, in the construction of a Dataset object.

In addition to the raw data (i.e. features) used, LightGBM Dataset objects can also contain the following:

  • label = an array of values for the target (e.g. 0s and 1s for binary classification)
  • weight = an array of sample weights, used to tell LightGBM that some samples should be considered more important during training
  • group = a vector of integers, describing how samples should be grouped together into "query results" (only relevant in the learning-to-rank task)
  • init_score = a matrix of per-sample initial scores to boost from. This can be used, for example, to start the boosting process from predictions created by another model.
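As a rough illustration of how the fields listed above relate to the rows of the raw data, the sketch below checks their shapes in plain Python. It is not LightGBM code: `check_dataset_fields` is a hypothetical helper, and it assumes `label`, `weight`, and `init_score` each carry one entry (or one row) per sample, while `group` holds query sizes that add up to the number of samples.

```python
def check_dataset_fields(n_rows, label=None, weight=None, group=None,
                         init_score=None):
    """Return a list of problems; an empty list means the fields line up."""
    problems = []
    per_sample = (("label", label), ("weight", weight), ("init_score", init_score))
    for name, values in per_sample:
        # Assumed: one entry (or one row) per sample for these fields.
        if values is not None and len(values) != n_rows:
            problems.append(f"{name} has {len(values)} entries, expected {n_rows}")
    # Assumed: group lists the size of each query, in row order.
    if group is not None and sum(group) != n_rows:
        problems.append(f"group sizes sum to {sum(group)}, expected {n_rows}")
    return problems


# Five samples: rows 1-2 belong to the first query, rows 3-5 to the second.
print(check_dataset_fields(
    n_rows=5,
    label=[0, 1, 0, 1, 1],
    weight=[1.0, 1.0, 2.0, 1.0, 1.0],
    group=[2, 3],
))
```

Here `group=[2, 3]` means the first two rows form one query and the next three form another, which is why the sizes must account for every row.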

References

  • implementation of Dataset class on the Python side:
    class Dataset:
        """Dataset in LightGBM."""
        def __init__(self, data, label=None, reference=None,
                     weight=None, group=None, init_score=None, silent=False,
                     feature_name='auto', categorical_feature='auto', params=None,
                     free_raw_data=True):

Other Notes

Sorry I didn't write this up sooner. Didn't really think of it until I started working on adding deprecation warnings for uses of ... (e.g. in #4522).

@Laurae2 and I have already talked about this privately, but I would still like to open this as a Request for Comment (RFC) to give everyone who's interested a chance to voice their opinions.

@Laurae2
Contributor

Laurae2 commented Aug 23, 2021

Agree with all the proposed changes. Not only will this make the package easier to maintain, it will also make it easier for users to work with. 👍

@jameslamb
Collaborator Author

This work is now complete. See the list of linked pull requests above for details.

Thanks very much @StrikerRUS for thorough reviews of so many PRs!

@StrikerRUS
Collaborator

@jameslamb Thanks a lot for splitting the work into many small PRs! It was a pleasure to review them.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023