Adding simple, multiple and hierarchical regression plots #512

Open
GWeindel opened this issue Jan 8, 2019 · 25 comments
Comments

@GWeindel
Contributor

GWeindel commented Jan 8, 2019

I have written some functions to draw regression plots from mixed models fitted in pure Stan. I wonder whether creating a branch in arviz for such plots would be interesting (guess so by seeing request #313). The basic idea would be to have the possibility of plotting linear (at least) regression, simple effects, interaction effects, with or without random effects (like sjPlot in R ).

If such a project would fit in the arviz package I could begin to code it, but I would definitely benefit from people with stronger skills.

@ahartikainen
Contributor

Sounds great. I think we need to think the API through carefully.

@GWeindel
Contributor Author

Great. If the other devs agree, I can start thinking about it, but it will surely take some time.

@ahartikainen
Contributor

Sure, no problem.

Do you have some model(s) you can share that could be used as a reference?

Also, do you think observed_data is a suitable location for the data, or do we need some other structure?

@aloctavodia
Contributor

This will be a great addition to ArviZ. @GWeindel please be sure to check plot_hpd function in case you find it useful for this project.

@GWeindel
Contributor Author

I am starting to have some doubts about the feasibility. It appears to me that either one constructs a post-fit structure, which then needs a lot of information about the fitted object (increasing users' time and effort), or one has to master what goes in (e.g. Stan or PyMC3 code) and what comes out in order to draw these plots.
Hence I would suggest that this project be built on top of a specialized package, perhaps bambi (https://github.com/bambinos/bambi).

@ahartikainen
Contributor

What if we start with regression plot done with InferenceData?

So the user would need to provide the following information at creation time:

Scatter information:

  • x-variable_name --> observed_data
  • y-variable_name --> posterior_predictive
  • y-data --> observed_data

Model information (Line plot)

  • x-model --> observed_data
  • y-model --> posterior or a function that reads InferenceData (user defined)

Then we need to define the same options as in ppcplot (do we take a subsample, etc.).

@ahartikainen
Contributor

After that we could implement multiple regression (where each dimension is either a new axis, or something similar)

And later add hierarchical structure.

Let's assume user can provide data.

@ahartikainen
Contributor

I was doing something simple today: linear regression...

It does get complicated fast.

We need a better interface to describe our models

Getting the following to work is not hard:

x
y_data
y_ppc

What is harder:

y_model

It would be great to give a function or something similar

y_model = "m*x+b"
y_model = "y = m*x+b"
y_model = "y ~ x"

and then m, x, and b are found from posterior.

Also I'm not sure, but there could still be better interface:

plot_lm("y ~ x", param=["m","b"], data=data)
plot_lm("y ~ m*x+b", data=data)

Could this work with glm also? If we assume InferenceData has all the needed data, we just need to parse the function and also accept numpy functions inside the formula:

plot_lm("exp(y) ~ m*x + log(x) + sqrt(b)", data=data)

How hard would it be to do that parsing with re?

  • ~ splits x and y
  • functions have ()
  • other names are parameters in InferenceData

Then after that we have y (and possibly an added pair for the ppc, e.g. {"y_hat": "y"}).
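A minimal sketch of that re-based split. The helper name parse_formula and the exact tokenization rules here are my own assumptions, not an agreed API:

```python
import re


def parse_formula(formula):
    """Split a formula like "y ~ m*x + b" into the response name and the
    sorted candidate parameter/variable names on the right-hand side.

    Tokens immediately followed by "(" (e.g. log, sqrt) are treated as
    functions and dropped; every remaining name would then be looked up
    in the InferenceData groups (posterior, observed_data, ...).
    """
    lhs, rhs = (part.strip() for part in formula.split("~"))
    funcs = set(re.findall(r"(\w+)\s*\(", rhs))
    names = set(re.findall(r"[A-Za-z_]\w*", rhs)) - funcs
    return lhs, sorted(names)
```

So parse_formula("y ~ m*x + b") would give ("y", ["b", "m", "x"]), and those names would then be searched for in the posterior.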

import matplotlib.pyplot as plt
import numpy as np


def plot_lm(x, y_ppc, y_data, y_model, data,
            x_group=None, y_ppc_group=None, y_data_group=None,
            num_ppc_samples=100):
    """Plot lm

    Parameters
    ----------
    x : str or Sequence
    y_ppc : str
    y_data : str or Sequence
    y_model : str or Sequence
    data : obj or list[obj]
        Any object that can be converted to an az.InferenceData object.
        Refer to the documentation of az.convert_to_dataset for details.
    x_group : str, optional
    y_ppc_group : str, optional
    y_data_group : str, optional
    num_ppc_samples : int

    Returns
    -------
    axes
    """
    # resolve variable names against the InferenceData groups,
    # preferring the most natural group for each variable
    if isinstance(x, str):
        if x_group is None:
            groups = data._groups
            if hasattr(data, "observed_data"):
                groups = ["observed_data"] + [group for group in groups if group != "observed_data"]
            for group in groups:
                item = getattr(data, group)
                if x in item and x_group is None:
                    x_group = group
                elif x in item:
                    print("Warning, duplicate variable names for x, using variable from group {}".format(x_group))
        x_values = getattr(data, x_group)[x]

    if isinstance(y_ppc, str):
        if y_ppc_group is None:
            groups = data._groups
            if hasattr(data, "posterior_predictive"):
                groups = ["posterior_predictive"] + [group for group in groups if group != "posterior_predictive"]
            for group in groups:
                item = getattr(data, group)
                if y_ppc in item and y_ppc_group is None:
                    y_ppc_group = group
                elif y_ppc in item:
                    print("Warning, duplicate variable names for y_ppc, using variable from group {}".format(y_ppc_group))
        y_ppc_values = getattr(data, y_ppc_group)[y_ppc]

    if isinstance(y_data, str):
        if y_data_group is None:
            groups = data._groups
            if hasattr(data, "observed_data"):
                groups = ["observed_data"] + [group for group in groups if group != "observed_data"]
            for group in groups:
                item = getattr(data, group)
                if y_data in item and y_data_group is None:
                    y_data_group = group
                elif y_data in item:
                    print("Warning, duplicate variable names for y_data, using variable from group {}".format(y_data_group))
        y_data_values = getattr(data, y_data_group)[y_data]

    fig, ax = plt.subplots(1, 1, figsize=(6, 4), dpi=100)

    # plot data
    ax.plot(x_values, y_data_values, marker='.', color='C3', lw=0, zorder=10)

    # plot uncertainty in y: a random subsample of posterior predictive draws
    y_ppc_values_ = y_ppc_values.stack(sample=("chain", "draw"))
    slicer = np.random.choice(y_ppc_values_.sizes["sample"], size=num_ppc_samples, replace=False)
    y_ppc_values_ = y_ppc_values_[..., slicer]
    for i in range(num_ppc_samples):
        ax.plot(x_values, y_ppc_values_[..., i], marker='.', lw=0, alpha=0.1, color='C1')

    # plot uncertainty in the regression line
    y_model_values = y_model.stack(sample=("chain", "draw"))[..., slicer]
    for i in range(num_ppc_samples):
        ax.plot(x_values, y_model_values[..., i], lw=0.5, alpha=0.2, c='k')

    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.grid(True)
    return ax

[screenshot of the resulting plot]

@jankaWIS

Speaking of which, I was wondering: is there currently anything in arviz like regplot in seaborn? That would be great, and it could also serve as a starting point for what has been asked here.

@utkarsh-maheshwari
Contributor

Just a thought.
Instead of asking for y_model, can't we calculate m and b inside the plot_lm function?
It would increase the complexity of the function, but it would reduce the complexity at the input end and make it more user-friendly.

@OriolAbril
Member

Instead of asking for y_model, can't we calculate m and b inside the plot_lm function?

The problem is that there is no way to know what y_model is in ArviZ (it could be possible at a higher level like in bambi, but not in ArviZ). It depends on the model: it can be y ~ b1*x+b0, but it could also have multiple covariates, higher-order terms, splines...

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Jun 2, 2021

@ahartikainen
What assumptions do we make about the data groups that should be present in the InferenceData passed as input?

This is an example, kidiq, that I am taking from posteriordb, but there is no posterior predictive group here, just the posterior. Can it be used as an example?
[screenshot of the kidiq InferenceData]

@ahartikainen
Contributor

Good question. I think we need to calculate the posterior predictive with Python.
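For a simple linear model like kidiq this can be sketched directly in numpy. A minimal version, assuming a Normal likelihood and flattened posterior draws of intercept, slope and sigma (the function name is made up for illustration, not an ArviZ API):

```python
import numpy as np


def posterior_predictive_lm(intercept, slope, sigma, x, rng=None):
    """Draw posterior predictive samples for y ~ Normal(intercept + slope * x, sigma).

    intercept, slope, sigma : flattened posterior draws, shape (n_draws,)
    x : observed predictor, shape (n_obs,)

    Returns an array of shape (n_draws, n_obs).
    """
    rng = np.random.default_rng() if rng is None else rng
    # broadcast parameter draws against the predictor: (n_draws, 1) * (1, n_obs)
    mu = intercept[:, None] + slope[:, None] * x[None, :]
    return rng.normal(mu, sigma[:, None])
```

The resulting array could then be packed back into the posterior_predictive group, e.g. via az.from_dict.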

@utkarsh-maheshwari
Contributor

https://gist.github.com/utkarsh-maheshwari/8d4cd2fd84c763bf85291c3f0881d588

Here is my initial try at visualizing linear regression models, inspired by plot_posterior_predictive_glm from pymc3. There are still lots of things that need to be considered, though.

@OriolAbril
Member

Use

with pm.Model() as model:
    mom_iq = pm.Data("mom_iq", data["mom_iq"])
    
    sigma = pm.HalfNormal('sigma', sd=10)
    intercept = pm.Normal('Intercept', 0, sd=10)
    x_coeff = pm.Normal('slope', 0, sd=10)
    
    mean = intercept + x_coeff * mom_iq
    likelihood = pm.Normal('kid_score', mu=mean, 
                        sd=sigma, observed=data["kid_score"])
    
    idata = pm.sample(1000, return_inferencedata=True)

so mom_iq gets automatically stored as constant data. Moreover, we should definitely not convert to a dataframe:

idata.posterior["Intercept"] + idata.posterior["slope"] * idata.constant_data["mom_iq"]

will work with xarray out of the box and avoid the need to loop for the computation. Ari's function above has an example with stacking to get a random subsample.

Also, a bit of a side note: eval is a built-in function in Python, so it's not a good idea to use it as a variable name.
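To illustrate the out-of-the-box broadcasting and the stacking subsample, a small self-contained sketch (the toy posterior values and the dimension name mom_iq_dim are made up for the example):

```python
import numpy as np
import xarray as xr

# Toy posterior with the usual (chain, draw) dimensions, mimicking idata.posterior
posterior = xr.Dataset(
    {
        "Intercept": (("chain", "draw"), np.full((4, 100), 2.0)),
        "slope": (("chain", "draw"), np.full((4, 100), 0.5)),
    }
)
mom_iq = xr.DataArray(np.linspace(70, 140, 10), dims="mom_iq_dim")

# xarray broadcasts (chain, draw) against mom_iq_dim automatically, no loop needed
y_model = posterior["Intercept"] + posterior["slope"] * mom_iq

# Stack chain/draw into a single sample dimension and take a random subsample
stacked = y_model.stack(sample=("chain", "draw"))
idx = np.random.default_rng(0).choice(stacked.sizes["sample"], size=50, replace=False)
subset = stacked.isel(sample=idx)
```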

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Jun 7, 2021

@OriolAbril Thank you for the suggestions. Made the suggested changes.
I think visualizing uncertainty in the y points is insignificant here because the points are closely packed. (Should we include an option to show it?)

Now there are many points that need to be considered for the plot_lm function:

  • Other input parameters
  • Initial checks on the input data
  • Fill area of uncertainty?

Open to suggestions

@utkarsh-maheshwari
Contributor

I guess using plot_hdi, as suggested by @aloctavodia, would make it look great.

@ahartikainen, about the y_model, I think we can do it like this?

Should I open a new issue to discuss plot_lm and its visualization specifically? Otherwise, this issue will stretch very long.

@utkarsh-maheshwari
Contributor

[Four screenshots of work-in-progress plots]
@utkarsh-maheshwari
Contributor

Also, do you think observed_data is a suitable location for the data or do we need some other structure?

I think data could be in constant_data as well.

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Jun 10, 2021

I have tried to modify Ari's plot_lm function with some added features.

  • Made a parser for y_model. I think it would work for simple linear models (irrespective of the order of terms in y_model ) and can be extended to glm as well.
  • Visualized uncertainty in mean, and uncertainty in data.

Achieved this:

[screenshot of the resulting plot]

input :

plot_lm(
    x="mom_iq", 
    y_ppc="kid_score",
    y_data="kid_score", 
    data = idata, 
    y_model = "kid_score ~ Intercept + slope * mom_iq"
)

@utkarsh-maheshwari
Contributor

I think we need to calculate the posterior predictive with python.

Can we use pm.sample_posterior_predictive() to calculate it?

@ahartikainen
Contributor

I think we need to calculate the posterior predictive with python.

Can we use pm.sample_posterior_predictive() to calculate it?

It depends on what PPL you use for the model.
