Training material from Optum

Hello All,

 As discussed, below are the pre-requisites –

 

Thumb Rule: When to Do Log Transformation

A log transformation is done when there are no negative or missing values, and when we want to use models where a non-normal distribution will throw off the model's predictions / accuracy.

Other reasons are:

  • Highly skewed distribution
  • Want to change how the same difference in age is weighted: for example, the difference between a 30-year-old and a 40-year-old versus the difference between a 70-year-old and an 80-year-old; after a log transformation the same 10-year gap carries more weight at the younger ages (30 to 40) than at the older ages (70 to 80).
    • Beware: after the log transformation the interpretation changes (differences are in log units, i.e. multiplicative).
  • Need to reduce the impact of outliers

Note: If the skewed data is expected in the real world, then don't create a normally distributed column using a log transformation; instead use a different model, such as Poisson regression or another regression technique that works better with skewed data.
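A minimal sketch of the thumb rule in R; the cost vector below is made-up, right-skewed data used purely for illustration:

```r
## Hypothetical right-skewed, strictly positive cost values (illustration only)
cost <- c(120, 150, 180, 200, 250, 300, 5000, 12000)

## Pre-requisites from the thumb rule: no missing and no negative (or zero) values
stopifnot(all(!is.na(cost)), all(cost > 0))

log_cost <- log(cost)

hist(cost)       # highly skewed on the original scale
hist(log_cost)   # closer to normal; note the interpretation is now in log units
```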

Working of SIR Modeling

Section 1: The SIR model's main purpose is only to understand:

1.       How many days it takes for the epidemic to peak.

2.       After how many days from the start of infection can we expect the number of new cases to be less than five (5).

3.       We can also estimate the hospitalization rate and the number of beds and ventilators required.

4.       Important Note:

a.       The limitation of this model is that once a person gets infected, the model assumes that the same person will not be infected again (lifelong immunity). Because of this assumption the SIR model is not accurate and will not reflect the actual number of cases.

b.      The SIR model also cannot estimate the effects of quarantine (for example, a state lockdown).

c.       The SIR model cannot tell us the improvement in outcomes if a vaccine is introduced or developed.

d.      The SIR model assumes a well-mixed population, i.e. everyone is of the same age and has the same immunity (homogeneous). The birth rate and death rate are also assumed to be constant, so any change in the population mix is only because of the ongoing epidemic.

e.       Adding additional complexity is the fact that the model is a continuous-time process, whereas the data are generally collected on a daily, weekly or monthly basis.

Section 2: To use the SIR model, epidemiologists estimate the following

1)      Starting date of infection in a county, state or country.
2)      Susceptible (S) - The population susceptible to this disease (in the case of COVID-19, we estimate the entire population to be susceptible, as this disease is novel and there is no prior immunity to it).
a.       We can express the susceptible population as a percentage and give 100%.
3)      Infected (I) - We also need to estimate how many people were infected on the starting date (in the case of COVID-19 we don't know how many people were infected on the first day, so we can give the value as 1 or 10).

a.       We can also give the number infected on the starting date of infection as a fraction by using a very small number like 0.0001 or 0.001.

4)      Recovered(R) - The number of patients recovered at the start of infection.

a.       We can estimate this value to be zero for COVID-19 at the start of infection.
5)      The main parameters of the infection are Beta (effective contact rate) and Gamma (inverse of the mean recovery time).
6)      Using a differential equation solver, we can initialize the values of S, I and R, supply the values of Beta and Gamma, and ask the solver to return S(t), I(t) and R(t) for the number of iterations (a minimal sketch is given below).
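A minimal sketch of point 6 in R, assuming the deSolve package and working in proportions of the population; the starting values, Beta and Gamma below are illustrative placeholders, not calibrated estimates (see Section 3 for how Beta and Gamma are chosen):

```r
library(deSolve)

## SIR system in proportions (S + I + R = 1)
sir_model <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    dS <- -beta * S * I
    dI <-  beta * S * I - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

## Illustrative starting values (Section 2): almost everyone susceptible,
## a very small infected fraction, and no one recovered yet
init   <- c(S = 1 - 1e-4, I = 1e-4, R = 0)
params <- c(beta = 0.25, gamma = 1 / 14)   # placeholder values; see Section 3
times  <- seq(0, 200, by = 1)              # simulate 200 days

out <- ode(y = init, times = times, func = sir_model, parms = params)
head(out)   # columns: time, S(t), I(t), R(t)
```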

Section 3: To calculate the SIR model outputs, the human in the loop who will run the model needs to know the following:

1)      The starting date (t = 0) on which the epidemic started in the county/state/country.

2)      The number of days we want to run the simulation.

3)      The value of Susceptible (S) (at time t = 0); if we don't know, we can use the entire population.

4)      The value of Infected (I) (at time t = 0); if we don't know, we can use 1, as we need at least one infected person.

5)      The value of Recovered (R) (at time t = 0); if we don't know, we can use 0, as the initial number of recovered can be assumed to be zero.

6)      The most important inputs to the SIR model are the constant values of Beta and Gamma

a.       Recovery Rate - Gamma = 1 / (mean number of days it takes to recover)

          i.      As per the CDC: the mean number of days is 14, so Gamma = 1/14 (some experts argue 20 days is the time needed for virus shedding)

                   ii.      As per John's R code: the mean number of days is 7, so Gamma = 1/7

b.      Infection Rate - Beta (effective contact rate) is the challenging parameter to calculate. Because the infection has not ended, we cannot estimate the value of Beta directly.

                  i.      An approximate value of Beta can be calculated by

1.       Beta = Gamma + G,

2.       where G = 2^(1/Td) - 1,

a.       where Td is the number of days it takes for the number of new cases to double from the start of infection (t = 0).

3.       The American Hospital Association (AHA) initially projected a doubling time (Td) between 7 and 10 days. The doubling time is applied to the number of infections, not the number of confirmed cases. This distinction may explain the discrepancy between the AHA's doubling time estimates and the observed doubling time of confirmed cases (currently 2 - 4 days).

4.       There is a relation between Beta and Gamma where R0 = Beta / Gamma. The value of R0 for COVID-19 is suggested to be between 1.4 and 3.28 (a worked calculation of Gamma, Beta and R0 is sketched after this list).

7)      Using the values from points 2, 3, 4, 5, 6a and 6b we define the SIR model and use any solver package to create the model.

a.       John has used the R package deSolve.

8)      The output of the model will be the values of S(t), I(t) and R(t) for each day of the simulation.

9)      We can use the output values from point 8.
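A minimal worked calculation of points 6a and 6b in R; the doubling time Td below is an assumed placeholder picked from the 7 to 10 day range mentioned above:

```r
## Recovery rate (6a): Gamma = 1 / (mean number of days it takes to recover)
gamma_cdc  <- 1 / 14   # CDC: mean recovery time ~14 days
gamma_john <- 1 / 7    # John's R code: mean recovery time 7 days

## Infection rate (6b): approximate Beta from the doubling time Td
Td   <- 7                # assumed doubling time in days (AHA projected 7-10)
G    <- 2^(1 / Td) - 1   # G = 2^(1/Td) - 1
beta <- gamma_cdc + G    # Beta = Gamma + G

## Basic reproduction number: R0 = Beta / Gamma (suggested range 1.4 to 3.28 for COVID-19)
R0 <- beta / gamma_cdc
round(c(G = G, beta = beta, R0 = R0), 3)
```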

Section 4: Output of SIR models;

1)      We can use the output values of SIR i.e. S(t), I(t) and R(t) as a time series matrix.

2)      The number of output rows will be the number of days we wanted to run the simulation.

a.       John’s R-Code has used 200 as the default number.

b.      Each row will contain the output values of S(t), I(t) and R(t)

3)      We can plot each row's values in a time series plot starting from the first day of infection (a plotting sketch is given after this list).

a.       Using the graph we can understand points 1, 2 and 3 from Section 1, as given below:

                   i.      How many days it takes for the epidemic to peak.

                   ii.      After how many days from the start of infection can we expect the number of new cases to be less than five (5).

                   iii.      We can also estimate the hospitalization rate and the number of beds and ventilators required.
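A minimal plotting sketch in R, assuming out is the deSolve output from the Section 2 sketch; the peak-day calculation answers point i above:

```r
## One row per simulated day, with columns time, S, I, R
out_df <- as.data.frame(out)

## Time series plot of S(t), I(t) and R(t) from the first day of infection
matplot(out_df$time, out_df[, c("S", "I", "R")], type = "l", lty = 1,
        xlab = "Days since start of infection", ylab = "Fraction of population")
legend("topright", legend = c("S(t)", "I(t)", "R(t)"), col = 1:3, lty = 1)

## Day on which the epidemic peaks (point i)
out_df$time[which.max(out_df$I)]
```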

Family of Regression Techniques

The family of regression techniques that I have listed out to study:

  1. Hedonic Regression
  2. Linear Regression - OLS
    1. When the dependent variable 'Y' is measured on a ratio scale, i.e. there is a meaningful unit increase and values cannot go below zero (no negative numbers).
    2. When 'Y' is normally distributed
    3. When multiple relationships define an outcome
    4. Using OLS we get (a worked lm() example is sketched at the end of this list);
      1. An estimate (coefficient) for each independent variable; along with the estimate we get the Standard Error, t-statistic and p-value
      2. The Standard Error should be as small as possible
      3. The t-value should be large to indicate relationship between IV and DV
      4. The p-value should be very small so that we can infer the relationship is not by chance
    5. Fit Statistics - which one to use depends on the problem we are trying to solve
      1. R-squared and Adjusted R-squared
      2. RMSE
      3. AIC - Akaike Information Criterion (the lower the better and the more parsimonious the model)
      4. MAE - Mean Absolute Error
      5. MAPE - Mean Absolute Percentage Error
      6. Min-Max Accuracy
  3. 2 Stage Least Squares Regression (2SLS) - This technique is the extension of the OLS method. It is used when the dependent variable's error terms are correlated with the independent variables. 
  4. Spline Regression
  5. Multi Variate Adaptive Regression Splines (MARS)
  6. Polynomial Regression 
  7. Generalized Linear Regression and Extensions
  8. Generalized Additive Models (GAM) and Local Regression - basically generalized linear regression with smoothing; it has non-linear smoothing plus other covariates.
  9. Vector Generalized Linear and Additive Models
  10. Ordinal Regression
  11. Survival Analysis - Non Negative Regression (Right Censoring)
  12. Probit Regression
  13. Quantile Regression
  14. Poisson Regression
  15. Stepwise Regression
  16. Least Absolute Shrinkage and Selection Operator (LASSO) Regression - L1 regularization
  17. Ridge Regression - L2 regularization
  18. Elastic Net Regression - has both L1 and L2 regularization
  19. Support Vector Regression - Decision Tree Regression - Random Forest Regression 
  20. Logistic Regression
  21. PLS: Partial least squares or projection to latent structures 
  22. Nonlinear regression
  23. Flexible Regression and Smoothing
  24. Bayesian Linear Regression
  25. Principal Component Regression
  26. Locally Weighted Regression (LWL)
  27. Least Angle Regression (LARS)
  28. Neural Net Regression
  29. Gradient Descent Regression
  30. Locally Estimated Scatterplot Smoothing Regression - LOESS Regression (similar to K-NN Regression)
  31. K-NN: K Nearest Neighbor Regression
  32. Zero Inflated Poisson Regression
  33. Isotonic Regression
  34. Nearly-Isotonic Regression
  35. Censored Regression - Using Tobit Model
  36. SoftMax Regression
  37. Sliced Inverse Regression (SIR) - a tool for dimension reduction in the field of multivariate statistics
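A minimal OLS example for item 2, using R's lm() and the built-in mtcars dataset purely for illustration; the chosen variables (mpg, wt, hp) are arbitrary:

```r
## OLS fit: mpg as the dependent variable, wt and hp as independent variables
fit <- lm(mpg ~ wt + hp, data = mtcars)

## Per-variable outputs (item 2.4): Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)

## Fit statistics (item 2.5)
summary(fit)$r.squared                        # R-squared
summary(fit)$adj.r.squared                    # Adjusted R-squared
sqrt(mean(residuals(fit)^2))                  # RMSE
AIC(fit)                                      # AIC (lower is better)
mean(abs(residuals(fit)))                     # MAE
mean(abs(residuals(fit) / mtcars$mpg)) * 100  # MAPE (%)
```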