The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme. This algorithm identifies the optimal way to split a continuous variable into discrete bins, while maximizing the predictive value of those bins with respect to some binary dependent variable.
The Rust code for the algorithm implementation can be found in the crates/discrust_core directory. The code for the Python bindings can be found in the src directory.
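To make the idea of an information-value split concrete, here is a minimal pure-Python sketch of a single split search. It is not the package's Rust implementation: the helper names `information_value` and `best_split` are hypothetical, the information value uses the conventional definition, and using the left-hand value as the cut point is an assumption.

```python
import math


def information_value(ones, zeros, total_ones, total_zeros):
    """IV contribution of one bin: (dist_ones - dist_zeros) * WoE."""
    dist_ones = ones / total_ones
    dist_zeros = zeros / total_zeros
    woe = math.log(dist_ones / dist_zeros)
    return (dist_ones - dist_zeros) * woe


def best_split(x, y, min_obs=1):
    """Return (split_value, iv) for the best single binary split, or None.

    Hypothetical helper: values <= split_value go to the left bin.
    """
    pairs = sorted(zip(x, y))
    total_ones = sum(y)
    total_zeros = len(y) - total_ones
    best = None
    left_ones = left_zeros = 0
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left_ones += label
        left_zeros += 1 - label
        # Only consider cut points between distinct x values.
        if value == pairs[i + 1][0]:
            continue
        right_ones = total_ones - left_ones
        right_zeros = total_zeros - left_zeros
        # Skip candidates that leave a bin with too few observations.
        if min(left_ones + left_zeros, right_ones + right_zeros) < min_obs:
            continue
        # Skip candidates where a bin lacks one class (the log is undefined).
        if 0 in (left_ones, left_zeros, right_ones, right_zeros):
            continue
        iv = information_value(
            left_ones, left_zeros, total_ones, total_zeros
        ) + information_value(right_ones, right_zeros, total_ones, total_zeros)
        if best is None or iv > best[1]:
            best = (value, iv)
    return best


# Toy data: low x values are mostly zeros, high x values mostly ones.
print(best_split([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 1, 0, 1, 1, 1]))
```

A decision tree would presumably apply a search like this recursively to each resulting bin, stopping when a constraint such as min_iv or max_bins is hit.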
The package has a single user-facing class, Discretizer, that can be instantiated with the following arguments.
- min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
- max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
- min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
- min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
- mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x and y variables, while a value of 1 will result in a positive correlation between the binned x variable and the y variable. Specifying a value of 0 will result in binning x with no monotonicity constraint. If a value of None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
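One way to picture the mono settings is as a check on the per-bin weight of evidence values, ordered by x. The sketch below is an illustration only: `satisfies_mono` is a hypothetical helper, and both the use of weight of evidence as the monotonic quantity and the reading of None as "follow the direction of the first step" are assumptions, not the package's confirmed logic.

```python
def satisfies_mono(bin_woes, mono):
    """Check per-bin WoE values (ordered by x) against a mono setting.

    mono=1 requires strictly increasing values, mono=-1 strictly decreasing,
    mono=0 imposes no constraint, and mono=None (assumed reading) takes the
    direction from the first pair of bins.
    """
    diffs = [b - a for a, b in zip(bin_woes, bin_woes[1:])]
    if mono == 0 or not diffs:
        return True
    if mono is None:
        mono = 1 if diffs[0] > 0 else -1
    return all(d * mono > 0 for d in diffs)


print(satisfies_mono([-0.5, 0.1, 0.9], 1))     # increasing, so True
print(satisfies_mono([0.3, -0.2, 0.5], None))  # direction flips, so False
```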
The fit method can be called on data and accepts the following parameters.
- x (ArrayLike): An array-like numeric field that will be discretized based on the values of y and the constraints the Discretizer was initialized with.
- y (ArrayLike): An array-like binary field.
- sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.
- exception_values (Optional[List[float]], optional): Optional list specifying exception values. These values are held out of the binning process. Their respective weight of evidence and summary information can be found in the exception_values_ attribute once the discretizer has been fit.
A np.nan value may be present in the list of possible exception values. If there are np.nan values present in the x variable, and np.nan is not listed as a possible exception value, an error will be raised. Additionally, an error will be raised if np.nan is found in y or in the sample_weight array.
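The NaN rules above can be summarized as a small validation routine. This is a hedged sketch in plain Python rather than the package's own check: `check_inputs` is a hypothetical name, and raising ValueError is an assumption, since the text only says that an error is raised.

```python
import math


def _has_nan(values):
    """True if any element is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


def check_inputs(x, y, sample_weight=None, exception_values=None):
    """Apply the NaN rules described for fit (illustrative only)."""
    exception_values = exception_values or []
    # NaN in x is only allowed when NaN is listed as an exception value.
    if _has_nan(x) and not _has_nan(exception_values):
        raise ValueError("x contains NaN, but NaN is not an exception value")
    # NaN is never allowed in y or in the sample weights.
    if _has_nan(y):
        raise ValueError("y contains NaN")
    if sample_weight is not None and _has_nan(sample_weight):
        raise ValueError("sample_weight contains NaN")


# Passes: NaN in x is declared as an exception value.
check_inputs([1.0, float("nan")], [0, 1], exception_values=[float("nan")])
```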
This method will fit the decision tree and find the optimal split values for the feature given the constraints. After being fit, the discretizer will have a splits_ attribute with the optimal split values.
import numpy as np
import seaborn as sns
df = sns.load_dataset("titanic")
from discrust import Discretizer
ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
ds.splits_
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]
Here we show what the results are if exception values are also specified. These exception values will be held out when calculating the bins.
ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])
ds.exception_values_
# {'vals_': [nan, 1.0],
# 'totals_ct_': [177.0, 7.0],
# 'iv_': [0.03054206173541801, 0.015253257689460616],
# 'ones_ct_': [52.0, 5.0],
# 'woe_': [-0.40378231427394834, 1.3895784363210804],
# 'zero_ct_': [125.0, 2.0]}
The exception_values_ dictionary has the following keys.
- vals_: The exception values passed to the Discretizer.
- totals_ct_: The total number of each respective exception value present in the x variable used for fitting.
- ones_ct_: Total count of the positive class for each exception value.
- zero_ct_: Total count of zeros for each respective value.
- woe_: The weight of evidence for each respective exception value.
- iv_: The information value for each respective exception value.
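As a worked check, the woe_ and iv_ values shown above can be reproduced from the counts. The sketch assumes the conventional definitions (weight of evidence as the log ratio of the share of ones to the share of zeros, information value as the difference of those shares times the weight of evidence) and assumes dataset totals of 342 ones and 549 zeros for the Titanic survived column. The helper `woe_iv` is hypothetical and not part of the package.

```python
import math


def woe_iv(ones, zeros, total_ones, total_zeros):
    """Weight of evidence and information value for one group of records."""
    dist_ones = ones / total_ones      # share of all ones in this group
    dist_zeros = zeros / total_zeros   # share of all zeros in this group
    woe = math.log(dist_ones / dist_zeros)
    iv = (dist_ones - dist_zeros) * woe
    return woe, iv


# Assumed totals for the Titanic "survived" column: 342 ones, 549 zeros.
TOTAL_ONES, TOTAL_ZEROS = 342, 549

# Counts taken from ones_ct_ and zero_ct_ in the example output.
nan_woe, nan_iv = woe_iv(52, 125, TOTAL_ONES, TOTAL_ZEROS)
one_woe, one_iv = woe_iv(5, 2, TOTAL_ONES, TOTAL_ZEROS)

print(nan_woe, nan_iv)  # approximately -0.4038 and 0.0305
print(one_woe, one_iv)  # approximately 1.3896 and 0.0153
```

The totals include the exception records themselves, which is the only way these particular numbers line up under the assumed formula.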
The predict method discretizes the feature and then either performs weight of evidence substitution on each binned level or returns the bin index. This method takes the following arguments.
- x (ArrayLike): An array-like numeric field.
- prediction_type (str, optional): A string specifying which prediction type should be returned. The string must be one of "woe" or "index". Defaults to "woe".
  - If "woe" is supplied, weight of evidence substitution will be performed for each value, and the weight of evidence of the bin the value falls in will be returned. For exception values found in x, the calculated weight of evidence for that exception value will be returned. If the exception value was never present in the x variable when the Discretizer was fit, then the returned weight of evidence will be zero for the exception value.
  - If "index" is specified, each value will be converted to the relevant bin index. These bins are created from the splits_ attribute and are zero-indexed. Exception values are encoded from -1 to -N, where N is the number of exception values present in the exception_values_ attribute. The order of the exception values matches the vals_ key in this attribute.
ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])
Setting prediction_type to "index" is equivalent to using the pandas cut function with the splits_ attribute of the Discretizer object as the bins.
import pandas as pd
ds = Discretizer(min_obs=5, max_bins=5, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
pd.cut(df["fare"], bins=ds.splits_).value_counts().sort_index()
# (-inf, 6.95] 26
# (6.95, 7.125] 16
# (7.125, 10.462] 297
# (10.462, 73.5] 455
# (73.5, inf] 97
# Name: fare, dtype: int64
pd.value_counts(ds.predict(df["fare"], prediction_type="index")).sort_index()
# 0 26
# 1 16
# 2 297
# 3 455
# 4 97
# dtype: int64
One of the main benefits of using the predict method over the pandas cut function directly is the built-in support for exception values.
ds = Discretizer(min_obs=5, max_bins=4, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])
pd.value_counts(ds.predict(df["age"], prediction_type="index")).sort_index()
# -2 7
# -1 177
# 0 6
# 1 34
# 2 654
# 3 13
# dtype: int64
ds.exception_values_["vals_"]
# [nan, 1.0]
ds.exception_values_["totals_ct_"]
# [177.0, 7.0]
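The index encoding used in the examples above (right-closed bins as with the pandas cut defaults, exception values coded -1 to -N in the order of vals_) can be sketched in plain Python. This is an illustration under those assumptions, not the package's code: `to_index` is a hypothetical helper and the split values are made up for the example.

```python
import bisect


def _same(a, b):
    """Equality that also treats NaN as equal to NaN."""
    return a == b or (a != a and b != b)


def to_index(value, splits, exception_vals):
    """Map one value to a zero-based bin index, or -1..-N for exceptions."""
    # Exception values take priority and are coded by their position.
    for i, exc in enumerate(exception_vals):
        if _same(value, exc):
            return -(i + 1)
    # Right-closed bins: splits[i] < value <= splits[i + 1].
    return bisect.bisect_left(splits, value) - 1


# Illustrative split points, not taken from a fitted Discretizer.
splits = [float("-inf"), 10.0, 20.0, float("inf")]
exceptions = [float("nan"), 1.0]

print([to_index(v, splits, exceptions) for v in [10.0, 10.5, 25.0]])
print(to_index(float("nan"), splits, exceptions), to_index(1.0, splits, exceptions))
```

Checking exception values before the bin lookup is what lets NaN pass through safely, since NaN cannot be ordered against the split points.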
The package can be installed directly from PyPI with the following command.
python -m pip install discrust
The package can also be built from source; it uses the maturin tool as a build backend. This tool requires that you have Python and a working Rust compiler installed, see here for details. If these two requirements are met, you can clone this repository and run the following command in the repository's root directory.
python -m pip install . -v
This should invoke the maturin tool, which will handle the building of the Rust code and installation of the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.
maturin build --release
I have had some problems building packages with maturin directly in a conda environment; this is actually a bug on anaconda's side that will hopefully be resolved. If this gives you any problems, it's usually easiest to build a wheel inside of a venv and then install the wheel.
The package draws heavily from the ivpy package, both in the algorithm and the parameter controls. Why make another package? This package serves as a proof of concept of building a Python package using Rust and pyo3, and offers cleaner methods for dealing with exception values. Additionally, the goal is for this package to better align with the scikit-learn API, and possibly be used in other Rust-based credit score building tools.