timeseries / regression plot #313

Closed
ahartikainen opened this issue Oct 1, 2018 · 20 comments

Comments

@ahartikainen
Contributor

I think we need a timeseries / regression plot.

Should it go under the ppc plot?

We accept x and y
x:

  • ndarray (1D) = user provided numpy array
  • str = user provided parameter (data, posterior or posterior predictive)

y:

  • ndarray (1D up to 3D with shape (chain, draw, N)) = user provided numpy array
  • str = user provided parameter (data, posterior or posterior predictive)

There are multiple ways to visualize uncertainty:

  • scatter (error-bars, violin-points)
  • line (area fill, percentile lines)
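As a rough illustration of the "line (area fill, percentile lines)" option, here is a minimal NumPy sketch that reduces a 3D `y` of shape (chain, draw, N) to a median line and a percentile band. The function name `percentile_band`, the 5/95 defaults, and the fake data are assumptions made for this example, not part of any agreed API.

```python
import numpy as np


def percentile_band(y, lower=5.0, upper=95.0):
    """Reduce draws of shape (chain, draw, N) to lower, median and upper arrays of length N."""
    y = np.asarray(y)
    if y.ndim != 3:
        raise ValueError("expected y with shape (chain, draw, N)")
    # Pool chains and draws into a single sample axis.
    samples = y.reshape(-1, y.shape[-1])
    lo, mid, hi = np.percentile(samples, [lower, 50.0, upper], axis=0)
    return lo, mid, hi


rng = np.random.default_rng(0)
x = np.arange(20)
# Fake draws for illustration: 4 chains, 250 draws, 20 points along x.
y = 0.2 + 0.3 * x + rng.normal(scale=0.5, size=(4, 250, x.size))
lo, mid, hi = percentile_band(y)
```

With matplotlib, `ax.fill_between(x, lo, hi)` plus `ax.plot(x, mid)` would give the area-fill variant, and plotting `lo` and `hi` as separate lines would give the percentile-lines variant.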
@ahartikainen
Contributor Author

Also, random draws from the posterior are one good way to visualize the uncertainty, at least for static images.

There was some discussion about the quantiles in #2

@canyon289
Member

canyon289 commented Oct 5, 2018

This visualization from Stitch Fix is a nice example in my opinion
[Image: TimeSeries]

https://multithreaded.stitchfix.com/blog/2016/04/21/forget-arima/

@utkarsh-maheshwari
Contributor

I believe line plots are a good representation, but the best representation obviously depends on the type of data. I propose that in the short term we focus on integrating line plots for time series analysis, and later we can add more plots to the library. I would love to work on this feature.

@OriolAbril
Member

I think the best way to begin is probably to create a small database of regression and timeseries models. We could take examples from ROS https://avehtari.github.io/ROS-Examples/examples.html (it would not be too much work to port them to cmdstanpy or pystan) or use https://github.com/bambinos/Bambi_resources/tree/master/ROS, and then see how they could be reproduced from ArviZ and InferenceData objects. There are many things to take into account for the plots, and I think it will be useful to get a better picture of what could be supported before deciding what will be supported and how.

@utkarsh-maheshwari
Contributor

Sure.
Right now, I am not very familiar with R, so I'll try to re-implement some of the bambi examples.

@OriolAbril
Member

You probably won't need to reimplement them, since bambi already uses ArviZ. It is mostly to get an idea of the different possibilities regarding regression and timeseries plots and to get familiar with ArviZ+xarray usage, which can be quite different from ArviZ development, where xarray does not play such an important role.

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Mar 16, 2021

Okay. I looked at some of the ROS examples too, and I don't think they are hard to understand. I am going through the examples, trying to get familiar with the plots and ArviZ+xarray usage. I will keep in mind that we need to create a small database to get started.

@utkarsh-maheshwari
Contributor

I have gone through the examples in https://github.com/bambinos/Bambi_resources/tree/master/ROS. Many of them generate fake data. I think the datasets generated or used in these 2 examples are good for time series/regression analysis:
https://github.com/bambinos/Bambi_resources/blob/master/ROS/Unemployment/unemployment.ipynb
https://github.com/bambinos/Bambi_resources/tree/master/ROS/ElectionsEconomy
We can get an idea from these datasets.

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Mar 17, 2021

What are the things we need to keep in mind while creating the database?
For univariate linear regression, 2 fields (for example, date/year and unemployment) are enough to demonstrate the example. But for multivariate regression, we need more fields. Do we need to consider it? Are there other such points to be considered?

@OriolAbril
Member

Off the top of my head (I'll try to get back here and keep adding things that may come to me later), these are some of the things to consider for the design:

  • Predictors/predictions: we can have one or many of both predictors and predictions, not only multiple predictors.
  • hierarchies: there could be group level predictions in addition to observation level ones
  • interpolation/extrapolation or forecasting. This is also similar to posterior predictive checks vs visualization of predictions.
  • info to show: spaghetti plots, hdi bands, quantile bands. This is related to the point above: we could generate spaghetti plots or bands from posterior predictive samples (assuming the data used for fitting is on a fine enough grid), but we may also want to generate prediction lines/curves from the posterior samples instead of using posterior predictive samples.
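To make the last point concrete, here is a small sketch of generating prediction lines directly from posterior samples rather than from posterior predictive samples, for a simple linear model with intercept `a` and slope `b`. The helper name `draw_regression_lines`, the parameter names, and the fake posterior draws are all hypothetical and only serve this illustration.

```python
import numpy as np


def draw_regression_lines(a, b, x_grid, n_lines=50, seed=None):
    """Return an (n_lines, len(x_grid)) array of lines a + b * x from random posterior draws."""
    a = np.ravel(a)  # flatten (chain, draw) into a single sample axis
    b = np.ravel(b)
    rng = np.random.default_rng(seed)
    # Pick a random subset of draws without replacement, one per line.
    idx = rng.choice(a.size, size=min(n_lines, a.size), replace=False)
    # Broadcasting (n_lines, 1) against (1, len(x_grid)) gives one row per line.
    return a[idx, None] + b[idx, None] * np.asarray(x_grid)[None, :]


rng = np.random.default_rng(1)
# Fake posterior draws for illustration: 4 chains, 500 draws each.
a_draws = rng.normal(0.2, 0.05, size=(4, 500))
b_draws = rng.normal(0.3, 0.01, size=(4, 500))
# The grid is independent of the observed x, so it can also cover extrapolation.
x_grid = np.linspace(0, 30, 100)
lines = draw_regression_lines(a_draws, b_draws, x_grid, n_lines=30, seed=2)
```

Each row of `lines` could be drawn as one thin line of a spaghetti plot. Because the lines come from the posterior of the parameters, they show uncertainty in the mean only, not the observation noise that posterior predictive samples would include.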

@utkarsh-maheshwari
Contributor

Speaking of time series analysis, one compulsory field is the date/year (let's say 100 years). We can have one or multiple monitored variables (monitored over those 100 years). These could be generated or taken from real databases. I think generating them would be a better idea, as we could then cross-validate the model. Do we need more fields?

@OriolAbril
Member

I don't think the origin of the data matters. The goal is to visualize the results of the models; we don't need to check that the model is correct, as the visualization should work either way. After all, one of its goals is to check the models and see if they are working.

What were you thinking of when you mentioned cross validation? I may be missing something. We also have another project about refitting models that would allow implementing k-fold cross-validation, approximate leave-future-out, and so on, which will probably need some plots of their own, but I think this is outside the scope of the timeseries/regression plots. I am not sure all the points above can be covered in a single project either; you may need to select a subset of cases to support.

@utkarsh-maheshwari
Contributor

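By cross validation I meant, for example, that we generate y like this:
import numpy as np
from scipy import stats

x = np.arange(1, 21)  # predictor values 1..20
n = x.size  # number of observations
a = .2  # true intercept
b = .3  # true slope
sigma = .5  # noise standard deviation
y = a + b*x + sigma*stats.norm().rvs(n)  # noisy linear response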

Then, in the example, we'll probably find the distributions of a_hat and b_hat. We can then cross-validate them against the original values (.2 and .3 here).
Never mind, I realize I went off track. Sorry for that. That cross-validation probably doesn't matter.
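For reference, the check described here (comparing estimates against the known generating values) could be sketched as follows. This is only an illustration: it uses an ordinary least squares fit via `np.polyfit` as a stand-in for a Bayesian fit, so it yields point estimates rather than distributions of a_hat and b_hat, and the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
x = np.arange(1, 21)
a, b, sigma = 0.2, 0.3, 0.5  # true values used to generate the data
y = a + b * x + sigma * rng.standard_normal(x.size)

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
b_hat, a_hat = np.polyfit(x, y, deg=1)
print(f"a_hat={a_hat:.3f} (true {a}), b_hat={b_hat:.3f} (true {b})")
```

With only 20 points and sigma = .5, the slope is typically recovered much more tightly than the intercept.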

I think a better way is to just start by creating a database with 2 fields and then add fields to it when required.

@OriolAbril
Member

Don't worry about going off track. I am just trying to keep an eye on the prize. Especially this year, with the reduced coding period, it is crucial to define what is part of the project and what is not (even when it is useful and interesting too).

I am not sure we have the same idea in mind when thinking about the database. My proposal was to have a "database" of inferencedata objects (local files are fine, or on figshare if we want it to be public) so that when you are implementing plot_regression (or plot_timeseries or whatever name is chosen eventually) you can easily run plot_regression(idata1...), plot_regression(idata2...)... and ensure that the api allows generating all the different plots we are interested in. I also thought that gathering these idata objects would be a good way to get familiar with the different possible visualizations involved in the project and thus help with the proposal and design phase.

I proposed looking into ROS because it has many examples covering a wide range of cases and already provides an implementation for all of them, so getting from there to inferencedata objects should be less work than trying to come up with the models/data from scratch. The bambi port is still a work in progress, so I don't know how many can be taken as inferencedata "for free" from there; maybe @canyon289 can help with that. But looking at other examples/books/packages is also perfectly fine.

@utkarsh-maheshwari
Contributor

My proposal was to have a "database" of inferencedata objects

Can we take some dicts/dataframes defined in ROS examples, convert them to inferencedata using az.convert_to_inference_data?

@OriolAbril
Member

Can we take some dicts/dataframes defined in ROS examples, convert them to inferencedata using az.convert_to_inference_data?

I guess so. It depends on what data is inside the dicts: is the whole posterior stored as a dict? Posterior plus observations?

@ahartikainen
Contributor Author

Maybe we could take data from posteriordb?

@utkarsh-maheshwari
Contributor

I looked at posteriordb. There are lots of models. I filtered for some that have "time series" in their keywords, for example https://github.com/stan-dev/posteriordb/blob/master/posterior_database/posteriors/rstan_downloads-prophet.json

I also took a quick look at the Prophet library developed by Facebook. I think we can take ideas for time series plots from there too. Can we?

@utkarsh-maheshwari
Contributor

utkarsh-maheshwari commented Apr 1, 2021

Do we need a separate function like plot_lm in #512 to tackle regression that does not include time series analysis?

@OriolAbril
Member

I think we can close this now with plot_lm and plot_ts? @ahartikainen @canyon289
