shaynweidner/vogel (forked from usaa/vogel)

Vogel is an ML project flow tool whose primary objective is simplifying actuarial ML processes. It tracks and manages model development from data preparation through results analysis and visualization.

Install

  • Clone the Vogel repo
  • From the repo root, install in editable mode:
    • pip install -e .

Features

  • Visualization
    • One-way plots (observed vs. predicted values)
    • Multi-variate plots (individual feature analysis)
    • Pareto charts
    • Model stats comparison chart
  • Custom Variable Transformations
    • Maintains metadata
    • Multiple binning mechanisms
  • Model Comparison Statistics
    • Available statistics vary by model type
  • Interfaces with multiple modeling platforms

Example

Pandas-in, pandas-out pipelines: all metadata is carried through to the transformed data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from IPython.display import display, HTML

import vogel.preprocessing as v_prep
import vogel.utils as v_utils
import vogel.utils.stats as v_stats
import vogel.train as v_train

# Test Data
df = pd.DataFrame({
      'a': [200., 40., 60., 100., 10., 10., 10.]
    , 'b': [100., 20., 30., np.nan, 5., 5., 5.]
    , 'c': ['texas', 'texas', 'michigan', 'colorado', 'michigan', 'michigan', 'michigan']
    , 'd': ['texas', 'texas', 'michigan', np.nan, 'michigan', 'michigan', 'michigan']
    , 'e': [1., 1., 1., 1., 1., 1., 1.]
    , 'f': [0., 10., 20, 1., 2., 20., 1000.]
})

display(df)

data_dict = {
    'grp_numeric': ['a', 'b']
  , 'grp_cat': ['c', 'd']
  , 'grp_other': ['a', 'c']
}
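In data_dict, a group name stands in for a list of columns, and extractors can take a mix of group names and raw column names (such as ['grp_numeric', 'd'] below). A rough sketch of how that lookup might work — expand_columns is a hypothetical helper, not vogel's API:

```python
def expand_columns(names, data_dict):
    """Expand group names via data_dict; pass raw column names through."""
    cols = []
    for name in names:
        cols.extend(data_dict.get(name, [name]))
    return cols

data_dict = {
    'grp_numeric': ['a', 'b'],
    'grp_cat': ['c', 'd'],
}

# 'grp_numeric' expands to its member columns; 'd' is already a column name
print(expand_columns(['grp_numeric', 'd'], data_dict))  # ['a', 'b', 'd']
```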

pipeline = v_utils.make_pipeline(
    v_prep.FeatureUnion([
        # numeric branch: extract numeric columns, flag and impute nulls, bin 'a'
        ('numeric', v_utils.make_pipeline(
            v_prep.ColumnExtractor(['grp_numeric', 'd'], data_dict, want_numeric=True),
            v_prep.NullEncoder(),
            v_prep.Imputer(),
            v_prep.Binning(bin_type='qcut', bins=3, bin_id='mean', drop='replace', feature_filter=['a'])
        )),
        # categorical branch: extract non-numeric columns and label-encode them
        ('cats', v_utils.make_pipeline(
            v_prep.ColumnExtractor(['grp_numeric', 'd'], data_dict, want_numeric=False),
            v_prep.LabelEncoder()
        ))
    ])
)

train_X = pipeline.fit_transform(df)

display(train_X)
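The Binning step with bin_type='qcut' and bin_id='mean' appears to split a column at sample quantiles and replace each value with the mean of its bin. A rough pure-Python sketch of that idea — qcut_mean is illustrative only, not vogel's implementation:

```python
def qcut_mean(values, bins):
    """Quantile-bin values and replace each with its bin's mean."""
    s = sorted(values)
    n = len(s) - 1
    # linearly interpolated quantile edges, with duplicate edges merged
    edges = []
    for i in range(bins + 1):
        pos = i * n / bins
        lo, frac = int(pos), pos - int(pos)
        q = s[lo] if frac == 0 else s[lo] + frac * (s[lo + 1] - s[lo])
        if not edges or q > edges[-1]:
            edges.append(q)

    def bin_of(v):
        for j in range(1, len(edges)):
            if v <= edges[j]:
                return j - 1
        return len(edges) - 2

    labels = [bin_of(v) for v in values]
    groups = {}
    for lab, v in zip(labels, values):
        groups.setdefault(lab, []).append(v)
    means = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    return [means[lab] for lab in labels]

# Column 'a' from the test data: the tied low values collapse the
# three requested bins into two, as pandas.qcut would with duplicate edges
print(qcut_mean([200., 40., 60., 100., 10., 10., 10.], 3))
# [150.0, 26.0, 26.0, 150.0, 26.0, 26.0, 26.0]
```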

We can now run a few models on this transformed data. We will ignore the validation and hyperparameter tuning options for now.

train_y = df['f'] 

run_list = [
    {
        'model_type': v_train.V_SM_GLM,
        'model_name': 'simple' + '_SM_glm_gaussian',
        'model_params': {
            'family': sm.families.Gaussian()
        },
        'fit_params': {
        }
    }, 
    {
        'model_type': v_train.V_xgb,
        'model_name': 'simple_1' + '_xgb',
        'model_params': {
            'objective': 'reg:linear',
            'n_estimators': 1,
            'n_jobs': -1
        },
        'fit_params': {
            'eval_set': [(train_X, train_y)],
            'verbose': False
        }
    },
    {
        'model_type': v_train.V_xgb,
        'model_name': 'simple_80' + '_xgb',
        'model_params': {
            'objective': 'reg:linear',
            'n_estimators': 80,
            'n_jobs': -1
        },
        'fit_params': {
            'eval_set': [(train_X, train_y)],
            'verbose': False
        }
    }
]
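Each run_list entry pairs a model class with its constructor arguments ('model_params') and fit arguments ('fit_params'), so a runner only needs a simple loop. A minimal sketch of that convention, not ModelRunner's actual code; MeanModel is a hypothetical stand-in estimator:

```python
def run_models(run_list, X, y):
    """Instantiate and fit each model spec; return fitted models by name."""
    fitted = {}
    for spec in run_list:
        model = spec['model_type'](**spec['model_params'])
        model.fit(X, y, **spec['fit_params'])
        fitted[spec['model_name']] = model
    return fitted

# Stand-in model class that predicts the (optionally shifted) target mean
class MeanModel:
    def __init__(self, offset=0.0):
        self.offset = offset

    def fit(self, X, y, **fit_params):
        self.mu = sum(y) / len(y) + self.offset
        return self

    def predict(self, X):
        return [self.mu] * len(X)

specs = [{'model_type': MeanModel, 'model_name': 'mean',
          'model_params': {'offset': 1.0}, 'fit_params': {}}]
models = run_models(specs, X=[[0], [0], [0]], y=[1.0, 2.0, 3.0])
print(models['mean'].predict([[0]]))  # [3.0]
```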

train_data_dict = {
    'X': train_X, 
    'y': train_y
}

model_runner = v_train.ModelRunner('reg', run_list, train_data_dict,
                                   None, pipeline)

eval_set = model_runner.evaluate_models()
display(eval_set)

With the stats package we can visualize how well our models fit. We will use the GLM, since it is the simplest of the well-fitting models.

v_stats.plot_compare_stats(eval_set, valid_only=False)

We can see how individual features fit in our model.

mdl_glm = model_runner.models[0]
print('b')
v_stats.plot_one_way_fit(train_X['b'], mdl_glm.predict(train_X), target=train_y, target_error=True, pad_bar_chart=True)
mdl_glm.plot_glm_one_way_fit(plot_error=False)
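A one-way fit plot compares the average observed target with the average model prediction at each level of a feature. The underlying aggregation can be sketched as follows — one_way_means is a hypothetical helper, not v_stats' API:

```python
def one_way_means(feature, predicted, target):
    """Average predicted and observed target per feature level."""
    sums = {}
    for level, pred, obs in zip(feature, predicted, target):
        p_sum, t_sum, n = sums.get(level, (0.0, 0.0, 0))
        sums[level] = (p_sum + pred, t_sum + obs, n + 1)
    return {level: {'predicted': p / n, 'observed': t / n}
            for level, (p, t, n) in sums.items()}

levels = ['low', 'low', 'high']
pred = [2.0, 4.0, 6.0]
obs = [1.0, 3.0, 5.0]
print(one_way_means(levels, pred, obs))
# {'low': {'predicted': 3.0, 'observed': 2.0}, 'high': {'predicted': 6.0, 'observed': 5.0}}
```

Plotting these two series side by side per level is what surfaces where the model over- or under-predicts.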

More examples
