Training material from Optum

Hello All,

 As discussed, below are the pre-requisites –

 

Thumb Rule: When to Do Log Transformation

A log transformation is done when there are no negative or missing values, and when we want to use models where a non-normal distribution will throw off the model's predictions / accuracy.

Other reasons are:

  • Highly skewed distribution
  • Want to change how the same difference in age is weighted: for example, the difference between a 30-year-old and a 40-year-old versus the difference between a 70-year-old and an 80-year-old; after a log transformation the same 10-year gap carries more weight at the younger ages (30 to 40) than at the older ages (70 to 80).
    • Beware: after the log transformation the interpretation changes (differences are in log units, i.e. multiplicative).
  • Need to reduce the impact of outliers

Note: If the skewed data is expected in the real world, then don't create a normally distributed column using a log transformation; instead use a different model, such as Poisson regression or another regression technique that works better with skewed data.
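A minimal sketch of the thumb rule in R; the cost vector below is made-up, right-skewed data used purely for illustration:

```r
## Hypothetical right-skewed, strictly positive cost values (illustration only)
cost <- c(120, 150, 180, 200, 250, 300, 5000, 12000)

## Pre-requisites from the thumb rule: no missing and no negative (or zero) values
stopifnot(all(!is.na(cost)), all(cost > 0))

log_cost <- log(cost)

hist(cost)       # highly skewed on the original scale
hist(log_cost)   # closer to normal; note the interpretation is now in log units
```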

Working of SIR Modeling

Section 1: The SIR model's main purpose is only to understand:

1.       How many days it takes for the epidemic to peak.

2.       After how many days from the start of infection can we expect the number of new cases to be less than five (5).

3.       We can also estimate the hospitalization rate and the number of beds and ventilators required.

4.       Important Note:

a.       The limitation of this model is that once a person gets infected, the model assumes that the same person will not be infected again (lifelong immunity). Because of this assumption the SIR model is not accurate and will not reflect the actual number of cases.

b.      The SIR model also cannot estimate the effects of quarantine (for example, a state lockdown).

c.       The SIR model cannot tell us the improvement in outcomes if a vaccine is introduced or developed.

d.      The SIR model assumes a well-mixed population, i.e. everyone is of the same age and has the same immunity (homogeneous). The birth rate and death rate are also assumed to be constant, so any change in the population mix is only because of the ongoing epidemic.

e.       Adding additional complexity is the fact that the model is a continuous-time process, whereas the data are generally collected on a daily, weekly or monthly basis.

Section 2: To use the SIR model, epidemiologists estimate the following

1)      Starting date of infection in a county, state or country.
2)      Susceptible (S) - The population susceptible to this disease (in the case of COVID-19, we estimate the entire population to be susceptible, as this disease is novel and there is no prior immunity to it).
a.       We can express the susceptible population as a percentage and give 100%.
3)      Infected (I) - We also need to estimate how many people were infected on the starting date (in the case of COVID-19 we don't know how many people were infected on the first day, so we can give the value as 1 or 10).

a.       We can also give the number infected on the starting date of infection as a fraction by using a very small number like 0.0001 or 0.001.

4)      Recovered(R) - The number of patients recovered at the start of infection.

a.       We can estimate this value to be zero for COVID-19 at the start of infection.
5)      The main parameters of the infection are Beta (effective contact rate) and Gamma (inverse of the mean recovery time).
6)      Using a differential equation solver, we can initialize the values of S, I and R, supply the values of Beta and Gamma, and ask the solver to return S(t), I(t) and R(t) for the number of iterations (a minimal sketch is given below).
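A minimal sketch of point 6 in R, assuming the deSolve package and working in proportions of the population; the starting values, Beta and Gamma below are illustrative placeholders, not calibrated estimates (see Section 3 for how Beta and Gamma are chosen):

```r
library(deSolve)

## SIR system in proportions (S + I + R = 1)
sir_model <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * S * I
    dI <-  beta * S * I - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

## Illustrative starting values (Section 2): almost everyone susceptible,
## a very small infected fraction, and no one recovered yet
init   <- c(S = 1 - 1e-4, I = 1e-4, R = 0)
params <- c(beta = 0.25, gamma = 1 / 14)   # placeholder values; see Section 3
times  <- seq(0, 200, by = 1)              # simulate 200 days

out <- ode(y = init, times = times, func = sir_model, parms = params)
head(out)   # columns: time, S(t), I(t), R(t)
```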

Section 3: To calculate the SIR model outputs, the human in the loop who will run the model needs to know the following:

1)      The starting date (t = 0) on which the epidemic started in the county/state/country.

2)      The number of days we want to run the simulation.

3)      The value of Susceptible (S) (at time t = 0); if we don't know, we can use the entire population.

4)      The value of Infected (I) (at time t = 0); if we don't know, we can use 1, as we need at least one infected person.

5)      The value of Recovered (R) (at time t = 0); if we don't know, we can use 0, as the initial number of recovered can be assumed to be zero.

6)      The most important inputs to the SIR model are the constant values of Beta and Gamma

a.       Recovery Rate - Gamma = 1 / (mean number of days it takes to recover)

          i.      As per the CDC: the mean number of days is 14, so Gamma = 1/14 (some experts argue 20 days is the time needed for virus shedding)

                   ii.      As per John's R code: the mean number of days is 7, so Gamma = 1/7

b.      Infection Rate - Beta (effective contact rate) is the challenging parameter to calculate. Because the infection has not ended, we cannot estimate the value of Beta directly.

                  i.      An approximate value of Beta can be calculated by

1.       Beta = Gamma + G,

2.       where G = 2^(1/Td) - 1,

a.       where Td is the number of days it takes for the number of new cases to double from the start of infection (t = 0).

3.       The American Hospital Association (AHA) initially projected a doubling time (Td) between 7 and 10 days. The doubling time is applied to the number of infections, not the number of confirmed cases. This distinction may explain the discrepancy between the AHA's doubling time estimates and the observed doubling time of confirmed cases (currently 2 - 4 days).

4.       There is a relation between Beta and Gamma where R0 = Beta / Gamma. The value of R0 for COVID-19 is suggested to be between 1.4 and 3.28 (a worked calculation of Gamma, Beta and R0 is sketched after this list).

7)      Using the values from points 2, 3, 4, 5, 6a and 6b we define the SIR model and use any solver package to create the model.

a.       John has used the R package deSolve.

8)      The output of the model will be the values of S(t), I(t) and R(t) for each day of the simulation.

9)      We can use the output values from point 8.
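A minimal worked calculation of points 6a and 6b in R; the doubling time Td below is an assumed placeholder picked from the 7 to 10 day range mentioned above:

```r
## Recovery rate (6a): Gamma = 1 / (mean number of days it takes to recover)
gamma_cdc  <- 1 / 14   # CDC: mean recovery time ~14 days
gamma_john <- 1 / 7    # John's R code: mean recovery time 7 days

## Infection rate (6b): approximate Beta from the doubling time Td
Td   <- 7                # assumed doubling time in days (AHA projected 7-10)
G    <- 2^(1 / Td) - 1   # G = 2^(1/Td) - 1
beta <- gamma_cdc + G    # Beta = Gamma + G

## Basic reproduction number: R0 = Beta / Gamma (suggested range 1.4 to 3.28 for COVID-19)
R0 <- beta / gamma_cdc
round(c(G = G, beta = beta, R0 = R0), 3)
```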

Section 4: Output of SIR models;

1)      We can use the output values of SIR i.e. S(t), I(t) and R(t) as a time series matrix.

2)      The number of output rows will be the number of days we wanted to run the simulation.

a.       John’s R-Code has used 200 as the default number.

b.      Each row will contain the output values of S(t), I(t) and R(t)

3)      We can plot each row's values in a time series plot starting from the first day of infection (a plotting sketch is given after this list).

a.       Using the graph we can understand points 1, 2 and 3 from Section 1, as given below:

                   i.      How many days it takes for the epidemic to peak.

                   ii.      After how many days from the start of infection can we expect the number of new cases to be less than five (5).

                   iii.      We can also estimate the hospitalization rate and the number of beds and ventilators required.
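A minimal plotting sketch in R, assuming out is the deSolve output from the Section 2 sketch; the peak-day calculation answers point i above:

```r
## One row per simulated day, with columns time, S, I, R
out_df <- as.data.frame(out)

## Time series plot of S(t), I(t) and R(t) from the first day of infection
matplot(out_df$time, out_df[, c("S", "I", "R")], type = "l", lty = 1,
        xlab = "Days since start of infection", ylab = "Fraction of population")
legend("topright", legend = c("S(t)", "I(t)", "R(t)"), col = 1:3, lty = 1)

## Day on which the epidemic peaks (point i)
out_df$time[which.max(out_df$I)]
```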

Family of Regression Techniques

The family of regression techniques that I have listed out to study:

  1. Hedonic Regression
  2. Linear Regression - OLS
    1. When the dependent variable 'Y' is measured on a ratio scale, i.e. there is a meaningful unit increase and values cannot go below zero (no negative numbers).
    2. When 'Y' is normally distributed
    3. When multiple relationships define an outcome
    4. Using OLS we get (a worked lm() example is sketched at the end of this list);
      1. An estimate (coefficient) for each independent variable; along with the estimate we get the Standard Error, t-statistic and p-value
      2. The Standard Error should be as small as possible
      3. The t-value should be large to indicate relationship between IV and DV
      4. The p-value should be very small so that we can infer the relationship is not by chance
    5. Fit Statistics - which one to use depends on the problem we are trying to solve
      1. R-squared and Adjusted R-squared
      2. RMSE
      3. AIC - Akaike Information Criterion (the lower the better and the more parsimonious the model)
      4. MAE - Mean Absolute Error
      5. MAPE - Mean Absolute Percentage Error
      6. Min-Max Accuracy
  3. 2 Stage Least Squares Regression (2SLS) - This technique is the extension of the OLS method. It is used when the dependent variable's error terms are correlated with the independent variables. 
  4. Spline Regression
  5. Multi Variate Adaptive Regression Splines (MARS)
  6. Polynomial Regression 
  7. Generalized Linear Regression and Extensions
  8. Generalized Additive Models (GAM) and Local Regression - basically generalized linear regression with smoothing; it has non-linear smoothing plus other covariates.
  9. Vector Generalized Linear and Additive Models
  10. Ordinal Regression
  11. Survival Analysis - Non Negative Regression (Right Censoring)
  12. Probit Regression
  13. Quantile Regression
  14. Poisson Regression
  15. Stepwise Regression
  16. Least Absolute Shrinkage and Selection Operator (LASSO) Regression - L1 regularization
  17. Ridge Regression - L2 regularization
  18. Elastic Net Regression - has both L1 and L2 regularization
  19. Support Vector Regression - Decision Tree Regression - Random Forest Regression 
  20. Logistic Regression
  21. PLS: Partial least squares or projection to latent structures 
  22. Nonlinear regression
  23. Flexible Regression and Smoothing
  24. Bayesian Linear Regression
  25. Principal Component Regression
  26. Locally Weighted Regression (LWL)
  27. Least Angle Regression (LARS)
  28. Neural Net Regression
  29. Gradient Descent Regression
  30. Locally Estimated Scatterplot Smoothing Regression - LOESS Regression (similar to K-NN Regression)
  31. K-NN: K Nearest Neighbor Regression
  32. Zero Inflated Poisson Regression
  33. Isotonic Regression
  34. Nearly-Isotonic Regression
  35. Censored Regression - Using Tobit Model
  36. SoftMax Regression
  37. Sliced Inverse Regression (SIR) - a tool for dimension reduction in the field of multivariate statistics
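A minimal OLS example for item 2, using R's lm() and the built-in mtcars dataset purely for illustration; the chosen variables (mpg, wt, hp) are arbitrary:

```r
## OLS fit: mpg as the dependent variable, wt and hp as independent variables
fit <- lm(mpg ~ wt + hp, data = mtcars)

## Per-variable outputs (item 2.4): Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)

## Fit statistics (item 2.5)
summary(fit)$r.squared                        # R-squared
summary(fit)$adj.r.squared                    # Adjusted R-squared
sqrt(mean(residuals(fit)^2))                  # RMSE
AIC(fit)                                      # AIC (lower is better)
mean(abs(residuals(fit)))                     # MAE
mean(abs(residuals(fit) / mtcars$mpg)) * 100  # MAPE (%)
```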