Statistics Notes

Statistical Learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.

Supervised learning tools are used for predicting an output or drawing inferences about the relationship between inputs and output. With unsupervised learning there is no output variable, so instead we learn relationships and structure from the inputs alone.

If the problem involves predicting a continuous or quantitative output value, it is referred to as a regression problem. In other cases we may wish to predict a non-numerical value, that is, a categorical or qualitative output; this is known as a classification problem.

Generalized linear models include both linear and logistic regression as special cases.

Regression and classification trees were among the first methods to demonstrate the power of a detailed practical implementation, including cross-validation for model selection. Generalized additive models, which extend generalized linear models to non-linear relationships, followed.

In the regression setting, we assume a model of the form

Y = f(X) + e

where f is some fixed but unknown function of one or more predictors X, and e is a random error term which is independent of X and has mean zero.

f represents the systematic information that X provides about Y.

Statistical learning refers to a set of approaches for estimating f, with two goals in mind: prediction and inference.

Errors in the regression are of two types:

  1. Reducible: the error that arises because our estimate of f is imperfect; it can be reduced, bringing the estimate of Y closer to the true Y, by improving the model (see the sketch after this list).
  2. Irreducible: the error that is inherent in Y itself (Y is also a function of e) and cannot be explained by X, no matter how well we estimate f.
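
A minimal simulation of Y = f(X) + e to show the two components; the true f, the noise level and the deliberately poor estimate below are made-up values, purely for illustration:

```python
# A minimal simulation of Y = f(X) + e with an assumed f(x) = 2 + 3x.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 1, n)
eps = rng.normal(0, 0.5, n)        # irreducible error, Var(e) = 0.25
y = 2 + 3 * x + eps                # Y = f(X) + e

f_hat_poor = 2.5 + 1.0 * x         # a bad estimate of f: large reducible error
f_true = 2 + 3 * x                 # the true f: zero reducible error

print("MSE of poor estimate:", np.mean((y - f_hat_poor) ** 2))
print("MSE of true f       :", np.mean((y - f_true) ** 2))   # ~ Var(e), the floor
```

Improving the model can only push the MSE down towards Var(e); it can never go below it.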

Two ways of estimating f are parametric (for example linear regression, typically fit with the commonly used least squares approach) and non-parametric.

Making a model more flexible generally requires fitting a larger number of parameters, which makes the model more complex. But a more complex model can lead to a phenomenon known as overfitting the data, which essentially means the model follows the errors, or noise, too closely.

Non-parametric approaches do not make explicit assumptions about the functional form of f. They have the advantage over parametric approaches of avoiding the assumption of a particular functional form for f. But non-parametric approaches suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate of f.
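
A rough sketch (not from the notes' own examples) contrasting a parametric fit, which assumes a linear form for f, with a non-parametric k-nearest-neighbours fit on the same simulated non-linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.2, 200)          # non-linear truth

param = LinearRegression().fit(X, y)                       # parametric: assumes f is linear
nonparam = KNeighborsRegressor(n_neighbors=10).fit(X, y)   # non-parametric: no assumed form

print("linear regression in-sample R^2:", round(param.score(X, y), 3))
print("KNN regression    in-sample R^2:", round(nonparam.score(X, y), 3))
```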

As a general rule, for p variables we need p*(p-1)/2 scatter plots to study the pairwise relationships between the variables.
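
For example, assuming the variables sit in a pandas DataFrame (the column names x1..x4 here are hypothetical), all pairwise scatter plots can be drawn in one call:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])

pd.plotting.scatter_matrix(df, figsize=(6, 6))   # 4*3/2 = 6 distinct pairs
plt.show()
```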

Quality of fit : can be measured by the mean squared error (MSE), i.e. the average of (y_i - f_hat(x_i))^2 over the observations.

For train and test MSE, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical learning method being used. As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
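
A sketch of this pattern, using polynomial degree as an assumed stand-in for "flexibility" on simulated data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 300)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.3, 300)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in [1, 2, 5, 10, 15]:
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}  test MSE {mse_te:.3f}")
# Training MSE keeps falling with degree; test MSE falls and then rises again.
```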

A statistical learning method that simultaneously achieves low variance and low bias is preferred. Variance refers to the amount by which the estimated f would change if it were estimated using a different training data set. High variance is expected if the model fits one particular training set very closely.

Bias refers to the error introduced by approximating a complicated real-world problem with a much simpler model.

We need to find a statistical method which reduces variance as well as bias. Good test-set performance of a statistical learning method requires both low variance and low squared bias.

High bias and low variance corresponds to fitting a horizontal line through the data, whereas low bias and high variance corresponds to fitting a highly flexible non-linear curve.
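
A small simulation of this trade-off: refit two models on many training sets and inspect their predictions at one test point x0. The true f, the noise level and the degree-10 polynomial are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(4 * x)
x0, n, reps = 0.5, 50, 500
flat_preds, flex_preds = [], []

for _ in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, 0.3, n)
    flat_preds.append(np.mean(y))                 # horizontal line: high bias, low variance
    coefs = np.polyfit(x, y, 10)                  # flexible curve: low bias, high variance
    flex_preds.append(np.polyval(coefs, x0))

for name, preds in [("horizontal line", flat_preds), ("degree-10 poly ", flex_preds)]:
    preds = np.array(preds)
    print(name, " bias^2:", round(np.mean(preds - f(x0)) ** 2, 4),
          " variance:", round(preds.var(), 4))
```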

A synergy effect, in business and marketing language, is called an interaction effect in statistics.

A single sample mean may give only a rough approximation of the population mean, but if we take a large number of samples and average their sample means, that average will be very close to the population mean.
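
A quick numerical check of this on a simulated population (the population mean of 10 is a made-up value):

```python
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(loc=10.0, scale=3.0, size=1_000_000)

one_sample_mean = rng.choice(population, size=30).mean()
many_sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

print("population mean        :", round(population.mean(), 3))
print("one sample mean        :", round(one_sample_mean, 3))
print("average of sample means:", round(np.mean(many_sample_means), 3))
```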

When assessing the coefficients of a linear regression, we test the hypotheses

H0 : there is no relationship between X and Y.

Ha : there is a relationship between X and Y.

Mathematically, this corresponds to testing

H0 : B1 is equal to 0

Ha : B1 not equal to 0

How far the estimated B1 has to be from zero to indicate that the true population B1 is non-zero can be judged using the standard error of B1, SE(B1). In practice, we calculate the t-statistic given by


t = (B1 - 0) / SE(B1)

If there is no relationship between X and Y, then we expect the t-statistic above to follow a t-distribution with n - 2 degrees of freedom. Consequently, it is a simple matter to compute the probability of observing any value equal to |t| or larger, assuming B1 = 0. We call this probability the p-value. We interpret the p-value as follows: a small p-value indicates that it is unlikely we would observe such a substantial association between the predictor and the response purely by chance, in the absence of any real association.
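
A sketch, on simulated data with arbitrary true coefficients, of computing t and the p-value by hand and checking them against what statsmodels reports:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()

b1, se_b1 = fit.params[1], fit.bse[1]
t = (b1 - 0) / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value, n - 2 degrees of freedom

print("by hand     : t =", round(t, 3), " p =", round(p, 4))
print("statsmodels : t =", round(fit.tvalues[1], 3), " p =", round(fit.pvalues[1], 4))
```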

Assessing the accuracy of the model:

  1. Residual Standard Error (RSE): an absolute measure of lack of fit, expressed in the units of Y; it may not always be clear what constitutes a good RSE.
  2. R^2 and the F-statistic: the R^2 statistic takes the form of a proportion; a value close to one indicates a good fit.

RSE = sqrt( (sum of (y_i - predicted y_i)^2) / (n - 2) )

R^2 = (TSS - RSS) / TSS

For simple linear regression with a single X and Y, R^2 is equal to the squared correlation [Cor(X, Y)]^2, but this does not extend to more than one predictor, since Cor(X, Y) works on pairs of variables whereas R^2 can describe the relationship between Y and multiple predictors.
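
A sketch, again on simulated data, computing RSE and R^2 directly from the residuals and confirming the squared-correlation identity for the one-predictor case:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
rss = np.sum(fit.resid ** 2)
tss = np.sum((y - y.mean()) ** 2)

print("RSE        :", round(np.sqrt(rss / (n - 2)), 3))
print("R^2        :", round((tss - rss) / tss, 3))
print("Cor(X,Y)^2 :", round(np.corrcoef(x, y)[0, 1] ** 2, 3))   # matches R^2
```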


For Multiple Linear Regression  

H0 : B1 = B2 = B3 = ... = Bp = 0

vs

Ha : at least one Bj is non-zero

The hypothesis test is performed by computing the F-Statistic

F = ( (TSS - RSS) / p ) / ( RSS / (n - p - 1) ). If there is no relationship between the response and the predictors we expect F to take a value close to 1; otherwise we expect a much larger value.

One important point to note when reading the summary of a linear regression: how large does F need to be to reject H0? The answer depends on the parameters n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0; consequently, a larger F-statistic is needed to reject H0 when n is small.

The t-statistic and p-value provide information about whether each individual predictor is related to the response, after adjusting for the other predictors. It turns out that each of these is exactly equivalent to the F-statistic that omits that single variable from the model, leaving all the others in (i.e. the partial F-test with q = 1); the square of each t-statistic is the corresponding F-statistic.

Why use the F-statistic at all? When p is very large, for example p = 100, there is always a chance that about 5% of the predictors will have p-values below 0.05 just by chance, even if there is no real association between the predictors and the response. Hence the individual t-statistics and p-values for a very large number of predictors can lead to the wrong conclusion. The F-statistic does not suffer from this problem, because it adjusts for the number of predictors and observations.
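
A simulation sketch of exactly this situation, with p = 100 pure-noise predictors and a response unrelated to all of them:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                          # unrelated to every predictor

fit = sm.OLS(y, sm.add_constant(X)).fit()
n_small = np.sum(fit.pvalues[1:] < 0.05)        # skip the intercept

print("predictors with p < 0.05:", n_small, "of", p)   # roughly 5 by chance
print("F-statistic             :", round(fit.fvalue, 2))
print("F-test p-value          :", round(fit.f_pvalue, 3))
```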


But when p > n, the above t-statistics, p-values and F-statistic can no longer be computed with least squares. To work with such high-dimensional problems, another method such as forward selection is used.

Deciding factors in the selection of variables for a model (see the sketch after this list):

  1. Akaike Information Criterion (AIC)
  2. Mallows Cp
  3. Bayesian Information Criterion (BIC)
  4. Adjusted R^2
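
Apart from Mallows Cp, which statsmodels OLS does not report directly, these criteria are available on a fitted OLS result; a sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3))
y = 1 + X @ np.array([0.5, 0.0, -0.8]) + rng.normal(size=200)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print("AIC         :", round(fit.aic, 1))
print("BIC         :", round(fit.bic, 1))
print("Adjusted R^2:", round(fit.rsquared_adj, 3))
```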

For p predictors there are 2^p possible subsets (models), so it is infeasible to check them all. Instead, one of the three approaches below is used:


Forward Selection : a greedy approach; it might include variables early on that later become redundant (a forward-selection sketch follows below).

Backward Selection : Cannot be used if p > n

Mixed Selection : can remedy the shortcomings of forward selection.
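
A rough sketch of forward selection using RSS as the criterion on simulated data; in practice one of the criteria above, or cross-validation, would be used to decide when to stop adding variables:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2 + 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

selected, remaining = [], list(range(p))
for _ in range(p):
    # at each step, add the single predictor that gives the largest drop in RSS
    rss = {j: np.sum(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().resid ** 2)
           for j in remaining}
    best = min(rss, key=rss.get)
    selected.append(best)
    remaining.remove(best)
    print("added x%d, RSS = %.1f" % (best, rss[best]))
```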

RSE = sqrt (RSS / (n-p-1))

Prediction intervals are always wider than confidence intervals because they incorporate both the error in the estimate of f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).
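
A sketch, on simulated data, showing both intervals side by side as reported by statsmodels (the new observation at x = 5 is an arbitrary choice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(scale=2.0, size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
new = [[1.0, 5.0]]                                   # intercept column, then x = 5
frame = fit.get_prediction(new).summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # confidence interval
             "obs_ci_lower", "obs_ci_upper"]])            # wider prediction interval
```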

Two assumptions of the standard linear model: additivity (the effect of one predictor on Y does not depend on the values of the other predictors, i.e. no interaction effect) and linearity (the change in Y for a one-unit change in a predictor is constant).

The hierarchical principle states that if we include an interaction effect in the model, then the original (main-effect) predictors also have to be in the model, even if their p-values are not significant.
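
A sketch of fitting an interaction (synergy) term with the statsmodels formula API on simulated data; the formula "y ~ x1 * x2" expands to x1 + x2 + x1:x2, so the main effects stay in the model, in line with the hierarchical principle:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x1 + 0.5 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(size=n)

fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)   # Intercept, x1, x2 and the x1:x2 interaction coefficient
```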


Potential problems in linear regression:

  1. Non-linearity of the response-predictor relationships: a Box-Cox transformation (or another non-linear transformation of the variables) can help in this regard.
  2. Correlation of error terms: in such cases confidence intervals and prediction intervals will be narrower than they should be, and p-values will be lower than they should be. This often arises with time-series data. It is crucial for a good statistical experiment to mitigate the risk of such correlations.
  3. Non-constant variance of error terms (heteroscedasticity): the solution is a Box-Cox transformation of Y, or the use of weighted least squares.
  4. Outliers: use studentized residuals, computed by dividing each residual by its estimated standard error. If a value is greater than +3 or less than -3, the observation is a candidate outlier and may be removed.
  5. High-leverage points: unusual or extreme values of X. To detect them we compute the leverage statistic; if it greatly exceeds (p+1)/n (the average leverage), we may suspect that the corresponding point has high leverage and investigate or remove such observations.
  6. Collinearity: use the VIF, which is the ratio of the variance of Bj when fitting the full model to the variance of Bj if fit on its own. If the VIF exceeds 5 or 10, the variable is a candidate for dropping, or the two collinear predictors can be combined into a single predictor (see the sketch after this list).
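
A sketch, on simulated data with deliberately collinear predictors, of the diagnostics mentioned above (studentized residuals, leverage and the VIF) as exposed by statsmodels; the 3*(p+1)/n leverage cutoff is an illustrative choice:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(13)
n, p = 200, 2
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)      # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1 + x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

stud_res = influence.resid_studentized_external    # outliers: |value| > 3
leverage = influence.hat_matrix_diag               # average leverage is (p+1)/n
print("possible outliers   :", int(np.sum(np.abs(stud_res) > 3)))
print("high-leverage points:", int(np.sum(leverage > 3 * (p + 1) / n)))
print("VIF for x1          :", round(variance_inflation_factor(X, 1), 1))
print("VIF for x2          :", round(variance_inflation_factor(X, 2), 1))
```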

If the form of the function f is known and the relationship really is linear, then a parametric approach will generally perform better than a non-parametric one.