Random Forest models and finding the most important variables using R

One of the benefits of using a Random Forest model is:

1. In regression, when the explanatory variables may be highly correlated with each other, Random Forest really helps in understanding feature importance. The trick is that Random Forest selects a random subset of the explanatory variables at each split in the learning process, rather than considering every feature at every split. This is called feature bagging. It reduces the correlation between the trees: without it, a few strong predictors would be selected by many of the trees, making the trees correlated. The sketch below illustrates the idea.
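For illustration only, the minimal sketch below (using the built-in mtcars dataset as a stand-in) shows how the mtry argument of randomForest() controls how many predictors are sampled as split candidates at each node, which is exactly the feature bagging described above.

library(randomForest)

set.seed(42)

# mtry = number of predictors randomly sampled as candidates at each split;
# a smaller mtry decorrelates the trees more aggressively
rf_small_mtry <- randomForest(mpg ~ ., data = mtcars, mtry = 2, ntree = 500)

rf_large_mtry <- randomForest(mpg ~ ., data = mtcars, mtry = 10, ntree = 500) # mtry = all predictors, i.e. plain bagging

rf_small_mtry # compare the out-of-bag error of the two settings

rf_large_mtry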

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

How to find the most important variables in R

Find the most important variables that contribute most significantly to a response variable

Selecting the most important predictor variables, the ones that explain the major part of the variance of the response variable, can be key to identifying and building high-performing models.

1. Random Forest Method

Random Forest can be very effective for finding a set of predictors that best explains the variance in the response variable.

library(caret)

library(randomForest)

library(varImp)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

varImp(regressor) # get variable importance, based on mean decrease in accuracy

varImp(regressor, conditional = TRUE) # conditional = TRUE adjusts for correlations between predictors (conditional importance is designed for party::cforest fits)

varimpAUC(regressor) # AUC-based importance, more robust towards class imbalance (also intended for cforest fits)
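As a usage illustration (again on the built-in mtcars data rather than the data object above), the importance scores of a fitted randomForest can also be inspected and plotted directly:

library(randomForest)

set.seed(42)

rf_fit <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

importance(rf_fit) # raw importance matrix (%IncMSE and IncNodePurity)

varImpPlot(rf_fit) # dot plots of both importance measures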


2. xgboost Method

library(caret)

library(xgboost)

regressor <- train(Target ~ ., data = data, method = "xgbTree", trControl = trainControl("cv", number = 10), scale = TRUE) # fit a 10-fold cross-validated xgboost (xgbTree) model
 

varImp(regressor)
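As a follow-up, caret's varImp object can be printed and plotted; the snippet below assumes the regressor fitted above and at least 10 predictors.

imp <- varImp(regressor) # importances are scaled to 0-100 by default

print(imp)

plot(imp, top = 10) # plot the 10 most important features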


3. Relative Importance Method

Using calc.relimp() from the relaimpo package, the relative importance of the variables fed into an lm() model can be determined as relative percentages.

library(relaimpo)

regressor <- lm(Target ~ ., data = data) # fit the lm() model

relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance; with rela = TRUE the lmg shares sum to 1
 

sort(relImportance$lmg, decreasing=TRUE) # relative importance
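A minimal runnable sketch of the same idea on the built-in mtcars dataset (mpg as the response), including a quick bar plot of the lmg shares:

library(relaimpo)

lm_fit <- lm(mpg ~ ., data = mtcars)

rel_imp <- calc.relimp(lm_fit, type = "lmg", rela = TRUE)

sort(rel_imp$lmg, decreasing = TRUE) # relative importance shares, largest first

barplot(sort(rel_imp$lmg, decreasing = TRUE), las = 2, main = "Relative importance (lmg)")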


4. MARS (earth package) Method

The earth package implements variable importance based on generalized cross-validation (GCV), the number of subset models in which the variable occurs (nsubsets), and the residual sum of squares (RSS).

library(earth)

regressor <- earth(Target ~ ., data = data) # build the MARS model

ev <- evimp(regressor) # estimate variable importance

plot(ev) # plot the importance measures
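For a concrete example of the three criteria, the sketch below fits a MARS model to the built-in mtcars data; the printed evimp table reports nsubsets, gcv and rss per predictor.

library(earth)

mars_fit <- earth(mpg ~ ., data = mtcars)

ev <- evimp(mars_fit)

print(ev) # nsubsets, gcv and rss importance per predictor

plot(ev)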

5. Step-wise Regression Method

If you have a large number of predictors, split the data into chunks of 10 predictors, with each chunk also holding the response variable.

base.mod <- lm(Target ~ 1, data = data) # base intercept-only model

all.mod <- lm(Target ~ ., data = data) # full model with all predictors

stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform the step-wise algorithm

shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables

shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept

The output might include individual levels of categorical variables, since step-wise selection is a linear-regression-based technique.

If you have a large number of predictor variables, the above code may need to be placed in a loop that runs step-wise selection on sequential chunks of predictors, accumulating the shortlisted variables at the end of each iteration (a minimal sketch follows the list below). This can be a very effective method if you want to:

· Be highly selective about discarding valuable predictor variables.

· Build multiple models on the response variable.
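A minimal sketch of that chunking loop, assuming data is a data frame whose response column is named Target and using a chunk size of 10 predictors:

predictorNames <- setdiff(names(data), "Target")

chunks <- split(predictorNames, ceiling(seq_along(predictorNames) / 10)) # blocks of up to 10 predictors

shortlistedVars <- c()

for (chunk in chunks) {
  chunkData <- data[, c("Target", chunk), drop = FALSE]
  base.mod <- lm(Target ~ 1, data = chunkData)
  all.mod <- lm(Target ~ ., data = chunkData)
  stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 0, steps = 1000)
  vars <- names(coef(stepMod))
  shortlistedVars <- union(shortlistedVars, vars[vars != "(Intercept)"])
}

shortlistedVars # accumulated shortlist across all chunks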


6. Boruta Method

The ‘Boruta’ method can be used to decide if a variable is important or not.

library(Boruta)

# Decide if a variable is important or not using Boruta

boruta_output <- Boruta(Target ~ ., data = data, doTrace = 2) # perform Boruta search

boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

# for faster calculation (classification only)

library(rFerns)

boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2, getImp = getImpFerns, holdHistory = FALSE)
boruta.train
 
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
 
boruta_signif

getSelectedAttributes(boruta_output, withTentative = FALSE) # names of the confirmed attributes

boruta.df <- attStats(boruta_output) # importance statistics per attribute (requires holdHistory = TRUE, the default)

print(boruta.df)
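As a usage follow-up (assuming the boruta_output object from above), any attributes left Tentative can be resolved with TentativeRoughFix(), and the per-attribute importance distributions can be plotted:

final_boruta <- TentativeRoughFix(boruta_output) # force a decision on the Tentative attributes

print(final_boruta)

plot(boruta_output, las = 2, cex.axis = 0.7) # importance box plots per attribute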

7. Information value and Weight of evidence Method

library(devtools) # the woe and riv packages below are typically installed from GitHub, hence devtools

library(woe)

library(riv)

iv_df <- iv.mult(data, y="Target", summary=TRUE, verbose=TRUE)

iv <- iv.mult(data, y="Target", summary=FALSE, verbose=TRUE)

iv_df

iv.plot.summary(iv_df) # Plot information value summary

Calculate the weight of evidence (WOE) variables:

data_iv <- iv.replace.woe(data, iv, verbose=TRUE) # add woe variables to original data frame.

The newly created WOE variables can alternatively be used in place of the original factor variables.


8. Learning Vector Quantization (LVQ) Method

library(caret)
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model

regressor <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control) # LVQ is a classification method, so Target should be a factor

# estimate variable importance

importance <- varImp(regressor, scale=FALSE)
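The resulting caret importance object can then be printed and plotted (scores are reported per class, since LVQ is a classification model):

print(importance) # importance of each predictor

plot(importance) # caret's dot plot of the importance scores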

9. Recursive Feature Elimination (RFE) Method

library(caret)

# define the control using a random forest selection function

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm

n <- ncol(data) # assuming the Target column is the last column of data

results <- rfe(data[, 1:(n-1)], data[, n], sizes = c(1:8), rfeControl = control)

# summarize the results
print(results)

# list the chosen features
predictors(results)

# plot the results
plot(results, type=c("g", "o"))

10. DALEX Method

library(randomForest)

library(DALEX)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters


# Variable importance with DALEX

explained_rf <- explain(regressor, data = data, y = data$Target) # wrap the model in a DALEX explainer



# Get the variable importances

varimps <- variable_dropout(explained_rf, type = "raw")



print(varimps)

plot(varimps)
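If variable_dropout() is not available in your DALEX version, recent releases expose the same permutation-based importance through model_parts(); a minimal sketch with the same explainer:

vi <- model_parts(explained_rf) # permutation-based drop-out loss per variable

print(vi)

plot(vi)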

11. VITA

library(vita)

regressor <- randomForest(Target ~ ., data = data, importance = TRUE) # fit the random forest with default parameters

pimp.varImp.reg <- PIMP(data[, setdiff(names(data), "Target")], data$Target, regressor, S = 10, parallel = TRUE) # PIMP takes the predictors (X) and the response (y) separately
pimp.varImp.reg

pimp.varImp.reg$VarImp # permutation importance scores per variable

sort(pimp.varImp.reg$VarImp[, 1], decreasing = TRUE) # sorted, most important first
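The vita package also provides PimpTest() to attach p-values to the permutation importances; a short sketch, assuming the pimp.varImp.reg object from above:

pimp_test <- PimpTest(pimp.varImp.reg) # test the PIMP scores for significance

summary(pimp_test)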


12. Genetic Algorithm

library(caret)

# Define control function

ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`.

            method = "cv",

            repeats = 3)



# Genetic Algorithm feature selection

ga_obj <- gafs(x = data[, 1:(n-1)],  # n = ncol(data), assuming Target is the last column

        y = data[, n], 

        iters = 3,  # normally much higher (100+)

        gafsControl = ga_ctrl)



ga_obj

# Optimal variables

ga_obj$optVariables


13. Simulated Annealing

library(caret)

# Define control function

sa_ctrl <- safsControl(functions = rfSA,

            method = "repeatedcv",

            repeats = 3,

            improve = 5) # n iterations without improvement before a reset



# Simulated Annealing Feature Selection

set.seed(100)

sa_obj <- safs(x = data[, 1:(n-1)],  # n = ncol(data), assuming Target is the last column

        y = data[, n],

        safsControl = sa_ctrl)



sa_obj

# Optimal variables

print(sa_obj$optVariables)


14. Correlation Method

library(caret)

# calculate correlation matrix

correlationMatrix <- cor(data[, 1:(n-1)]) # correlations among the predictors (n = ncol(data), assuming Target is the last column)

# summarize the correlation matrix

print(correlationMatrix)

# find attributes that are highly correlated (ideally > 0.75)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes
 

print(highlyCorrelated)
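A typical follow-up is to drop the flagged columns before modelling; the sketch below assumes the predictors occupy columns 1 to n-1 of data, as above.

data_reduced <- data[, 1:(n-1)][, -highlyCorrelated] # remove the highly correlated predictors

dim(data_reduced)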