[New] Relationship of OxCGRT index and parameter values (short-term prediction) #280

Closed
Inglezos opened this issue Oct 26, 2020 · 29 comments
Labels
brainstorming Discussion to get creative ideas documentation Improvements or additions to documentation enhancement New feature or request

Comments

@Inglezos
Collaborator

Inglezos commented Oct 26, 2020

What do we need to document?

I am referring to https://lisphilar.github.io/covid19-sir/usage_policy.html, specifically to the section "(Experimental): Relationship of OxCGRT index and parameter values".
What do all these results actually mean? Could you provide more detailed documentation and analysis of how the OxCGRT index is used and how it affects the parameter values for each country, with examples and practical explanations?

  • For example, what does the correlation table mean in practice, how should the results be interpreted, and what is their significance for the analysis of countries in general?
  • What does the scatter plot at the bottom, which shows the relationship between the reproduction number (Rt) and the OxCGRT stringency index for every country, mean in practice? Does it mean, for example, that Rt is lower for higher index values?
    The results have to be reworked/scaled so that the points in the 0-100 range are displayed properly and the high outliers are ignored.
  • More detailed examples (preferably in notebook .ipynb format) should be provided to show how the OxCGRT index and the policy measures affect specific countries (e.g. Japan, USA, India, Greece, Italy, Spain) and how the results are interpreted for each country.
  • Why is this feature "Experimental"? Does it not fully work? Is there an open issue for it, and is it under active development?
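To make the scatter-plot question concrete, here is a minimal sketch of how a correlation between the stringency index and Rt could be read, and how a single extreme Rt can hide it. The data, the `rt_max` cut-off and the helper names are assumptions for illustration only; this is not the code behind the documentation page.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_rt_outliers(pairs, rt_max):
    """Keep only (index, Rt) pairs whose Rt does not exceed rt_max."""
    return [(idx, rt) for idx, rt in pairs if rt <= rt_max]


# Hypothetical (stringency index, Rt) pairs, one per phase; not real data.
pairs = [(20, 2.4), (35, 2.0), (50, 1.6), (65, 1.2), (80, 0.9), (40, 19.5)]

kept = drop_rt_outliers(pairs, rt_max=5.0)  # rt_max is an arbitrary cut-off
r_all = pearson([p[0] for p in pairs], [p[1] for p in pairs])
r_kept = pearson([p[0] for p in kept], [p[1] for p in kept])
print(f"r with outlier: {r_all:.2f}, r without outlier: {r_kept:.2f}")
```

A strongly negative coefficient would mean that phases with a higher index tend to have a lower Rt. It says nothing about causality or about the delay between measures and their effect.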
@Inglezos Inglezos added the documentation Improvements or additions to documentation label Oct 26, 2020
@lisphilar
Owner

This is just an experimental analysis to find the relationship between the parameters and government responses. It is related to the discussion with @ilyasst and @joydisette in #3.
They are the authors of https://ilylabs.github.io/projects/COVID-trackers/

We need to perform machine/deep learning. #204 and #205 are also related.
We should have the dataset of parameter values to enhance this experimental analysis.

However, parameter estimation for all countries is a very time-consuming task. We have too many countries and the number of phases is increasing every day.

#225 must be solved in advance.

@lisphilar lisphilar added brainstorming Discussion to get creative ideas enhancement New feature or request labels Oct 27, 2020
@Inglezos
Collaborator Author

Yes, I think #225 must definitely be solved in advance. You mean machine learning for pattern recognition of the relationships?
We don't have to do this for all the countries, but for a few at first.

I am referring to https://lisphilar.github.io/covid19-sir/usage_policy.html, specifically to the section "(Experimental): Relationship of OxCGRT index and parameter values".
What do all these results actually mean? Could you provide more detailed documentation and analysis of how the OxCGRT index is used and how it affects the parameter values for each country, with examples and practical explanations?

  • For example, what does the correlation table mean in practice, how should the results be interpreted, and what is their significance for the analysis of countries in general?
  • What does the scatter plot at the bottom, which shows the relationship between the reproduction number (Rt) and the OxCGRT stringency index for every country, mean in practice? Does it mean, for example, that Rt is lower for higher index values?
    The results have to be reworked/scaled so that the points in the 0-100 range are displayed properly and the high outliers are ignored.
  • So, what do those results mean in practice?

@lisphilar
Owner

As it stands, this is just an experimental analysis aimed at finding a way to predict the parameter values from government responses with deep learning.

The results are for feature selection.

I do not know which index is necessary for the solution. The correlation table, the scatter plot and deep learning are helpful for finding the useful index.

@Inglezos
Collaborator Author

Inglezos commented Oct 28, 2020

  • I would suggest using the StringencyIndexForDisplay, GovernmentResponseIndexForDisplay and ContainmentHealthIndexForDisplay indexes. The final index would be the sum of all three. We could start simple by looking for patterns between basic parameters, such as Rt, and this index. Perhaps we could apply trend() in the Rt-index plane to see how the reproduction number, and the parameters in general, change with respect to this index.
  • Or we could check retrospectively which single index best represented a specific country's effectiveness in an older phase when the pandemic was successfully contained, for example Italy during June-September or China after March. That way we can rely on only one index (in case we decide not to sum all three into a single one).
  • Then this index could, for example, affect the weights of the estimator's cost function or something like that.

@Inglezos
Collaborator Author

For example, if a country applies quarantine measures on day X, we know that the daily new cases will most probably decrease around 2-3 weeks after that day. But the estimation analysis will give us increased cases for that period, since the parameters will be different and we simply cannot forecast this, unless we somehow insert this index into the simulator analysis as an extra input parameter that affects the simulated cases. Otherwise the simulated cases are not realistic and do not reflect our current knowledge that extra measures are in effect. We need to let the model know about that, and the best way would be that index.
Do you have any ideas how something like this could be implemented soon? Could we start with a simple solution?

@lisphilar
Owner

I intended to analyse the relationship with the PolicyMeasures class, but that requires solving the difficult issue #225 first.
One solution with the Scenario class is as follows: predict the parameter values and simulate the number of cases with these predicted values.

  • Perform parameter estimation and get a dataframe with index Date and columns theta, rho, sigma, kappa. (Please see the Scenario._track_param() method.)
  • Calculate rolling mean values of the estimated parameter values, because they are discrete values for dates (cf. continuous for phases).
  • Combine the OxCGRT records for the country with the rolled estimated parameter values.
  • Split this data into train/validation data.
  • Select one OxCGRT index for each parameter using the correlation coefficient, taking the delay you mentioned into account.
  • Predict the future values of each parameter from the corresponding OxCGRT index values, using fbprophet (with additional regressors) or the darts package.
  • Validate this prediction with the validation data.
  • Simulate the number of cases with the predicted parameter values.

What do you think about these steps? Sentences in bold will be the most difficult part.
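The smoothing, delay and index-selection steps above could be sketched roughly as follows. This is a pure-Python illustration on made-up series, assuming a fixed delay of 3 days; `rolling_mean`, `select_index` and the two index names are hypothetical and not part of covsirphy.

```python
from math import sqrt


def rolling_mean(values, window):
    """Trailing rolling mean; early positions average over fewer points."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_index(param, indexes, delay):
    """Pick the index that correlates most strongly with the delayed parameter."""
    best_name, best_r = None, 0.0
    for name, series in indexes.items():
        # The index on day t is paired with the parameter on day t + delay.
        r = pearson(series[:len(series) - delay], param[delay:])
        if abs(r) > abs(best_r):
            best_name, best_r = name, r
    return best_name, best_r


# Hypothetical daily series (20 days); not real OxCGRT or estimation output.
stringency = [20] * 5 + [60] * 5 + [80] * 10
testing = [1, 2] * 10
delay = 3
# Step-like rho that follows the stringency index 3 days later.
rho = [0.18] * delay + [0.20 - 0.001 * s for s in stringency[:-delay]]

smoothed_rho = rolling_mean(rho, window=3)
indexes = {"stringency": stringency, "testing": testing}
name, r = select_index(smoothed_rho, indexes, delay)
print(name, round(r, 2))
```

In a real run the selection would be repeated for each of theta, kappa, rho and sigma, and only on the training part of the data.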

@Inglezos
Collaborator Author

Inglezos commented Nov 25, 2020

I intended to analyse the relationship with PolicyMeasures class, but solving difficult issue #225 is necessary.

Why is it necessary to have a web service/RESTful API for such a relationship? Can't we run the estimation once in advance for a specific set of countries and then find this relationship on the fly?

Regarding the algorithm above, I think it would suffice as a standalone solution and would enable the model to take the various government measures in effect into account. I think it is vital to implement this soon. Even if it is not a complete solution, as a starting point it would be enough to apply it to a single future phase (one predicted set of parameters) or to the next month, for short-term impact.

Another, more general question: what is the physical meaning of the estimated parameters? Do they make sense, are they logical? For example, for Greece the Rt is now 19.5. What does this mean? Is it logical that one individual can infect 19 other people? Or is it just a value with no realistic meaning that only serves to fit the data to the model?

@lisphilar
Owner

Which do you want to use for this analysis, PolicyMeasures class or Scenario class? Does "Standalone solution" mean that we will create a new class?

If the PolicyMeasures class: we can use a small number of countries with the .countries setter (i.e. a property users can change), but it would be helpful if we could run many countries. I have not tried it, but machine learning needs a lot of data to predict the results while avoiding over-fitting. (However, we can try it. If you agree, please move on to a discussion of the detailed code or algorithms.)

If the Scenario class: we can implement the function with the steps I mentioned in the previous comment. Please discuss the code to implement.

If another class, please explain the detailed steps of your idea.

Another question more general, what is the physical meaning of the estimated parameters?

The reproduction number is an index that tells us whether an outbreak is growing (Rt > 1) or not.
Parameter values have physical/logical meanings and have units [1/min]. E.g. rho is the effective contact rate.
Please refer to the model description in my Kaggle notebook.
https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model#SIR-to-SIR-F

Rho, sigma and kappa are functions of control factors as explained in Factors of model parameters section of my Kaggle Notebook.
https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model#Factors-of-model-parameters
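As a rough illustration of how Rt follows from the parameters, the sketch below assumes the SIR-F relation Rt = rho * (1 - theta) / (sigma + kappa), which is my reading of the linked notebook. The parameter values are hypothetical and are not estimates for any country.

```python
def reproduction_number(theta, kappa, rho, sigma):
    """Rt under the assumed SIR-F relation: rho * (1 - theta) / (sigma + kappa)."""
    return rho * (1 - theta) / (sigma + kappa)


# Hypothetical parameter sets for two phases.
rt_moderate = reproduction_number(theta=0.002, kappa=0.005, rho=0.2, sigma=0.075)
# Same contact rate, but a much smaller recovery rate in the denominator.
rt_extreme = reproduction_number(theta=0.0, kappa=0.005, rho=0.2, sigma=0.005)
print(round(rt_moderate, 3), round(rt_extreme, 1))
```

Because sigma + kappa is in the denominator, a small estimated recovery rate inflates Rt even when the contact rate is unchanged. Mathematically, that is one way a very large Rt can come out of the fit.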

@Inglezos
Collaborator Author

Inglezos commented Dec 8, 2020

I think a Scenario class/method would suffice for a first implementation. Simple pattern recognition, or even trend analysis in the {Rt or parameter set} - response_index plane, could probably be enough to predict short-term future model parameters after some measures are applied, based on analysis of a few specific, representative countries.

@lisphilar lisphilar added this to the Release v2.14.0 milestone Dec 13, 2020
@lisphilar lisphilar changed the title Relationship of OxCGRT index and parameter values [New] Relationship of OxCGRT index and parameter values Dec 19, 2020
@lisphilar
Owner

lisphilar commented Dec 28, 2020

[MEMO]
pre-test: https://gist.github.com/lisphilar/637d248376eb9fb7511ba9c037aae9b2
Updated idea

  1. User specification or time-series prediction of OxCGRT scores in the future phases
  2. Linear regression: X = OxCGRT scores, y = rho values etc. (theta, kappa, sigma, rho)
  3. Evaluation of linear regression (RMSLE etc.)
  4. Predict rho values in the future phases with linear regression above
  5. Set future phases using the predicted parameter values
  6. Simulate the number of cases
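Steps 2 to 4 could look roughly like the sketch below. It uses a single-regressor least-squares fit written in plain Python rather than whatever the pre-test gist uses, and the scores and rho values are made up; `fit_line` and `rmsle` are hypothetical helper names.

```python
from math import log1p, sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    squared = [(log1p(p) - log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical OxCGRT scores (X) and rho values (y) for past phases.
scores_train, rho_train = [30, 45, 60, 75], [0.171, 0.154, 0.141, 0.124]
scores_valid, rho_valid = [90], [0.110]

slope, intercept = fit_line(scores_train, rho_train)
valid_pred = [slope * x + intercept for x in scores_valid]
valid_rmsle = rmsle(rho_valid, valid_pred)

# Step 4: predict rho for a future phase with an assumed score of 80.
predicted_rho = slope * 80 + intercept
print(round(slope, 5), round(valid_rmsle, 4), round(predicted_rho, 4))
```

The same fit would be repeated for theta, kappa and sigma, and the predicted values would then define the future phase used for simulation.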

@Inglezos
Collaborator Author

Inglezos commented Dec 28, 2020

May I suggest another way to predict future values?
What if, for a moment, we forget the OxCGRT index and focus solely on Rt (and the other parameters)? Essentially we need to find a function that fits the estimated values of Rt (and the remaining parameters). If we find such a fitting function, then we can extrapolate the next values. We could use the index only if we want to estimate the delay period (if this is needed). What do you think?

@lisphilar
Owner

Yes, time series forecasting only with parameter values is an alternative. However, I tried a prototype of this solution at the bottom of the notebook I mentioned in the previous comment, and the forecasting failed, as shown in line 97. How can we improve it?

@Inglezos
Collaborator Author

You mean a prototype of which solution, the alternative I described or the one you had in mind with the index?
As a first attempt I think it would be easier to try the alternative one.
Have you tried other values for the delay? Or other parameters?
I think the major problem is that you used a linear regressor. I wouldn't expect the values to follow such a distribution.

@Inglezos
Collaborator Author

Perhaps a time varying autoregressive model would be more appropriate for fitting
https://arxiv.org/pdf/1711.05204.pdf
https://icasas.github.io/tvReg/reference/tvAR.html
(I haven't searched into this yet)

@lisphilar
Owner

The linear regression part was for the idea with OxCGRT scores. It is not related to the alternative you mentioned.
The bottom lines with the darts package are for the alternative (time series forecasting of parameter values).

@lisphilar
Owner

MEMO: https://gist.github.com/lisphilar/8f492770cd4c306b081873ca71b7871d
It will be required to predict OxCGRT scores using time series forecasting, but this is the next step.

@lisphilar
Owner

lisphilar commented Dec 29, 2020

I tried time series forecasting without OxCGRT scores, but it seems difficult to forecast parameter values with this solution because the parameter values show wild ups and downs.
https://gist.github.com/lisphilar/30cb8d615659948334fb3aa5faa20aca

@Inglezos
Collaborator Author

MEMO: https://gist.github.com/lisphilar/8f492770cd4c306b081873ca71b7871d
It will be required to predict OxCGRT scores using time series forecasting, but this is the next step.

This is very good I think!

I tried time series forecasting without OxCGRT scores, but it seems difficult to forecast parameter values with this solution because the parameter values show wild ups and downs.
https://gist.github.com/lisphilar/30cb8d615659948334fb3aa5faa20aca

It is a nice first approach, I think. Also try AutoARIMA in the first model selection (it has the same score as exponential smoothing).

A general note: I think the RMSLE by itself is not very reliable, because the parameter values are very small.
These ups and downs probably cannot be forecast with good accuracy. They probably depend on the index.
Also, I don't think there is any point in predicting long-term. We should aim to predict the parameters for the next phase only, i.e. short-term, for 2-6 weeks at most into the future.
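A small numeric sketch of the RMSLE concern, using made-up values: when the true values are around 0.01, a prediction that is twice too large still gives a tiny RMSLE, while the relative error shows the miss. `mean_relative_error` is a hypothetical helper, shown here only as one possible complementary metric.

```python
from math import log1p, sqrt


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    squared = [(log1p(p) - log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def mean_relative_error(y_true, y_pred):
    """Mean of |prediction - truth| / truth."""
    errors = [abs(p - t) / t for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Hypothetical small parameter values; every prediction is double the truth.
y_true = [0.010, 0.012, 0.011]
y_pred = [0.020, 0.024, 0.022]

score_rmsle = rmsle(y_true, y_pred)
score_relative = mean_relative_error(y_true, y_pred)
print(f"RMSLE: {score_rmsle:.4f}, mean relative error: {score_relative:.2f}")
```

For values this close to zero, log1p(x) is approximately x, so RMSLE behaves like a plain RMSE on tiny numbers and looks good regardless of the relative miss.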

@Inglezos
Collaborator Author

How is the OxCGRT index combined and used in forecasting?

@lisphilar
Owner

@Inglezos
Collaborator Author

This seems very promising indeed!!

@lisphilar
Owner

lisphilar commented Dec 29, 2020

I am trying to use the recovery period (= 17 days) rather than 14 days as the delay. Do you have any ideas?

@Inglezos
Collaborator Author

Inglezos commented Dec 29, 2020

To calculate the delay? Hmm... what if you compared, per country, the dates when the index changed rapidly or critical measures were imposed with the dates of the phases (from S-R trend analysis) or the dates when the parameters changed rapidly?

It would be like applying change point analysis, but in the parameters-index plane instead of S-R.

Averaging the durations between these change points could then indicate such a delay period.
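A minimal sketch of the averaging idea, assuming the change points on both sides have already been detected by some other means (for example S-R trend analysis for the parameters). The dates are made up and `mean_delay` is a hypothetical helper.

```python
from datetime import date


def mean_delay(index_changes, param_changes):
    """Average gap in days from each index change to the next parameter change."""
    ordered_params = sorted(param_changes)
    gaps = []
    for changed in sorted(index_changes):
        # First parameter change point on or after the index change.
        later = [d for d in ordered_params if d >= changed]
        if later:
            gaps.append((later[0] - changed).days)
    return sum(gaps) / len(gaps) if gaps else None


# Hypothetical change points for one country; not real data.
index_changes = [date(2020, 4, 7), date(2020, 7, 20), date(2020, 11, 5)]
param_changes = [date(2020, 4, 19), date(2020, 8, 6), date(2020, 11, 24)]

delay_days = mean_delay(index_changes, param_changes)
print(delay_days)
```

This pairing is naive: it ignores index changes that have no visible effect and parameter changes caused by something other than policy, so it is only a starting point.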

@lisphilar
Owner

lisphilar commented Dec 29, 2020

It seems a difficult issue and this will be solved in the future versions...

I created pull request #471 as the first step.
I will check the outputs for some countries tomorrow (UTC).

Usage:

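import covsirphy as cs

# jhu_data, population_data and oxcgrt_data are assumed to be loaded beforehand
snl = cs.Scenario(jhu_data, population_data, "Japan")
# S-R trend analysis to split the records into phases
snl.trend()
# Estimate the SIR-F parameters for each phase
snl.estimate(cs.SIRF)
# Predict future parameter values from the OxCGRT data (added in #471)
snl.predict(oxcgrt_data)
snl.summary()
snl.simulate()
snl.history("Rt")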

I may rename .predict() to .fit_predict() and create .fit() and .predict().

@lisphilar lisphilar changed the title [New] Relationship of OxCGRT index and parameter values [New] Relationship of OxCGRT index and parameter values (short-term prediction) Dec 30, 2020
@lisphilar
Owner

#471 was merged and tutorial of .fit_predict() etc. was documented.
https://lisphilar.github.io/covid19-sir/usage_quick.html#Short-term-prediction-of-parameter-values

@rebeccadavidsson
Collaborator

In this example notebook, this code was used to include the delay period of 14 days:

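from datetime import timedelta

# Assume OxCGRT score impact on parameter values with 14 days delay
delay = 14
df = oxcgrt_df.set_index("Date")
# Shift the OxCGRT dates forward so each score lines up with the parameters 14 days later
df.index += timedelta(days=delay)
merged_df = param_df.join(df, how="right")
merged_df.tail()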

However, this delay is different for each country and the 'end' of the effects from Policy Measures is also very different. I made a short overview at the bottom of this notebook to identify the 'ending' effect of policy measures:
https://github.com/rebeccadavidsson/SIR_LSTM/blob/main/corr_oxf.ipynb

Just wanted to share this for any new ideas of implementations.
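One simple way to get a country-specific delay instead of a fixed 14 days could be to scan candidate delays and keep the one with the strongest correlation. The sketch below is an assumption-laden illustration on synthetic series, not the method used in the notebook; `best_delay` is a hypothetical helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_delay(index, param, max_delay):
    """Return (delay, r) with the largest |r| between index[t] and param[t + delay]."""
    best = (0, 0.0)
    for delay in range(max_delay + 1):
        r = pearson(index[:len(index) - delay], param[delay:])
        if abs(r) > abs(best[1]):
            best = (delay, r)
    return best


# Synthetic daily stringency index (30 days); not real OxCGRT data.
index = [20, 20, 20, 30, 45, 60, 70, 80, 80, 80,
         75, 70, 60, 55, 50, 50, 40, 35, 30, 30,
         45, 60, 75, 85, 90, 90, 85, 70, 60, 50]
true_delay = 5
# Synthetic rho that reacts to the index exactly 5 days later.
rho = [0.18] * true_delay + [0.20 - 0.001 * v for v in index[:-true_delay]]

delay, r = best_delay(index, rho, max_delay=10)
print(delay, round(r, 3))
```

On real data the peak will be far less sharp, and this does not address the separate question of when the effect of a measure ends.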

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

Yes, as far as I know the delay should not be a fixed value, but should be calculated dynamically for each country. In this first implementation the delay is set to the recovery period just to have a first working version. This will have to be revised.
I refer you to my previous comment:

To calculate the delay? Hmm... what if you compared, per country, the dates when the index changed rapidly or critical measures were imposed with the dates of the phases (from S-R trend analysis) or the dates when the parameters changed rapidly?
It would be like applying change point analysis, but in the parameters-index plane instead of S-R.
Averaging the durations between these change points could then indicate such a delay period.

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

The delay period will be reworked with issue #513.

@lisphilar
Owner

Very very interesting. We will move forward to the new issue.
