One of the benefits of using a Random Forest model is
1. In regression, when the explanatory variables are highly correlated with each other, Random Forest is particularly helpful for understanding feature importance. The trick is that Random Forest selects explanatory variables at each split in the learning process, i.e., each tree is trained on a random subset of the features rather than on the full feature set. This is called feature bagging. It reduces the correlation between trees: without it, a few strong predictors would be selected by many of the trees, making them correlated.
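For intuition, here is a minimal sketch of feature bagging in R; the built-in mtcars data and the mtry value are assumptions for illustration only:
library(randomForest)
set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars,
                   mtry = 3,          # sample only 3 of the 10 predictors at each split
                   importance = TRUE)
importance(rf)                        # importance is averaged over many decorrelated trees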
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
How to find the most important variables in R
Find the most important variables that contribute most significantly to a response variable
Selecting the predictor variables that explain the major part of the variance of the response variable can be key to identifying and building high-performing models.
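All of the snippets below assume a data frame named data whose response column is Target (and, in the later snippets, n as the column index of the response). A minimal stand-in, built here from the built-in mtcars data purely for illustration:
data <- mtcars[, c(setdiff(names(mtcars), "mpg"), "mpg")] # put the response last
names(data)[names(data) == "mpg"] <- "Target"
n <- ncol(data) # predictors are in columns 1:(n-1), the response in column n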
1. Random Forest Method
Random forest can be very effective at finding a set of predictors that best explains the variance in the response variable.
library(caret)
library(randomForest)
library(varImp)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit a random forest with default parameters
varImp(regressor) # variable importance, based on mean decrease in accuracy
varImp(regressor, conditional = TRUE) # conditional = TRUE adjusts for correlations between predictors (implemented for party::cforest fits)
varimpAUC(regressor) # AUC-based variant, more robust to class imbalance
2. xgboost Method
library(caret)
library(xgboost)
regressor <- train(Target ~ ., data = data, method = "xgbTree",
                   trControl = trainControl("cv", number = 10), scale = TRUE)
varImp(regressor) # variable importance from the fitted xgboost model
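Optionally, caret's plot method for varImp objects gives a quick visual ranking (top limits the plot to the highest-ranked features):
plot(varImp(regressor), top = 10) # dot plot of the 10 most important features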
3. Relative Importance Method
Using calc.relimp {relaimpo}, the relative importance of the variables fed into an lm() model can be determined as relative percentages.
library(relaimpo)
regressor <- lm(Target ~ . , data = data) # fit an lm() model
relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance, scaled to sum to 100%
sort(relImportance$lmg, decreasing = TRUE) # predictors ranked by relative importance
4. MARS (earth package) Method
The earth package implements variable importance based on generalized cross-validation (GCV), the number of subset models in which the variable occurs (nsubsets), and the residual sum of squares (RSS).
library(earth)
regressor <- earth(Target ~ . , data = data) # build the MARS model
ev <- evimp(regressor) # estimate variable importance
plot(ev)
5. Step-wise Regression Method
If you have a large number of predictors, split the data into chunks of 10 predictors, with each chunk also holding the response variable.
base.mod <- lm(Target ~ 1 , data = data) # base intercept-only model
all.mod <- lm(Target ~ . , data = data) # full model with all predictors
stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod),
                direction = "both", trace = 1, steps = 1000) # perform the step-wise algorithm
shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept
The output might include levels within categorical variables, since ‘stepwise’ is a linear regression based technique.
If you have a large number of predictor variables, the above code can be placed in a loop that runs stepwise on sequential chunks of predictors, accumulating the shortlisted variables at the end of each iteration for further analysis (see the sketch after this list). This can be a very effective method if you want to:
· be highly selective about discarding valuable predictor variables;
· build multiple models on the response variable.
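A sketch of that loop, under the same data/Target assumptions as above, with a chunk size of 10 per the text:
predictors <- setdiff(names(data), "Target")
chunks <- split(predictors, ceiling(seq_along(predictors) / 10)) # chunks of up to 10 predictors
shortlistedVars <- c()
for (vars in chunks) {
  chunk_data <- data[, c(vars, "Target")] # each chunk also holds the response
  base.mod <- lm(Target ~ 1, data = chunk_data)
  all.mod <- lm(Target ~ ., data = chunk_data)
  stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod),
                  direction = "both", trace = 0, steps = 1000)
  picked <- setdiff(names(coef(stepMod)), "(Intercept)")
  shortlistedVars <- union(shortlistedVars, picked) # accumulate across chunks
}
print(shortlistedVars)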
6. Boruta Method
The ‘Boruta’ method can be used to decide if a variable is important or not.
library(Boruta)
# Decide whether each variable is important using Boruta
boruta_output <- Boruta(Target ~ . , data = data, doTrace = 2) # perform the Boruta search
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
# For faster calculation (classification only), use random ferns as the importance source
library(rFerns)
boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2,
                       getImp = getImpFerns, holdHistory = FALSE)
boruta.train
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
boruta_signif
getSelectedAttributes(boruta.train, withTentative = FALSE) # equivalent helper for the selected attributes
boruta.df <- attStats(boruta.train) # importance statistics per attribute
print(boruta.df)
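A quick visual check of the Boruta result (plot.Boruta ships with the package; the graphical parameters here are just suggestions):
plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Boruta variable importance")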
7. Information Value and Weight of Evidence Method
library(devtools)
library(woe)
library(riv)
iv_df <- iv.mult(data, y = "Target", summary = TRUE, verbose = TRUE)
iv <- iv.mult(data, y = "Target", summary = FALSE, verbose = TRUE)
iv_df
iv.plot.summary(iv_df) # plot the information value summary
# Calculate weight-of-evidence variables
data_iv <- iv.replace.woe(data, iv, verbose = TRUE) # add woe variables to the original data frame
The newly created woe variables can alternatively be used in place of the original factor variables.
8. Learning Vector Quantization (LVQ) Method
library(caret)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# train the model (LVQ is a classification method, so Target should be a factor)
model <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control)
# estimate variable importance
importance <- varImp(model, scale = FALSE)
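To inspect the result:
print(importance) # importance scores (per class for classification)
plot(importance)  # dot plot of variable importance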
9. Recursive Feature Elimination (RFE) Method
library(caret)
# define the control using a random-forest selection function
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# run the RFE algorithm (n is the column index of Target, as in the earlier snippets)
results <- rfe(data[, 1:(n-1)], data[, n], sizes = c(1:8), rfeControl = control)
# summarize the results
results
# list the chosen features
predictors(results)
# plot the results
plot(results, type = c("g", "o"))
10. DALEX Method
library(randomForest)
library(DALEX)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit a random forest with default parameters
# Variable importance with DALEX
explained_rf <- explain(regressor, data = data, y = data$Target)
# Get the variable importances
varimps <- variable_dropout(explained_rf, type = 'raw') # model_parts() in newer DALEX releases
print(varimps)
plot(varimps)
11. VITA
library(vita)
library(randomForest)
regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit a random forest with default parameters
# PIMP takes the predictor matrix and the response separately
pimp.varImp.reg <- PIMP(data[, 1:(n-1)], data$Target, regressor, S = 10, parallel = TRUE)
pimp.varImp.reg
pimp.varImp.reg$VarImp
sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
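If you also want p-values for the permutation importances, vita provides PimpTest (usage as in the package vignette; the pless threshold is just an example):
pimp.test <- PimpTest(pimp.varImp.reg)
summary(pimp.test, pless = 0.1) # variables with small p-values are the important ones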
12. Genetic Algorithm
library(caret)
# Define the control function
ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`
                       method = "cv", repeats = 3)
# Genetic-algorithm feature selection
ga_obj <- gafs(x = data[, 1:(n-1)], y = data[, n],
               iters = 3, # normally much higher (100+)
               gafsControl = ga_ctrl)
ga_obj
# Optimal variables
ga_obj$optVariables
13. Simulated Annealing
library(caret)
# Define the control function
sa_ctrl <- safsControl(functions = rfSA, method = "repeatedcv", repeats = 3,
                       improve = 5) # n iterations without improvement before a reset
# Simulated-annealing feature selection
set.seed(100)
sa_obj <- safs(x = data[, 1:(n-1)], y = data[, n], safsControl = sa_ctrl)
sa_obj
# Optimal variables
print(sa_obj$optVariables)
14. Correlation Method
library(caret)
# calculate the correlation matrix of the predictors
correlationMatrix <- cor(data[, 1:(n-1)])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (cutoff = 0.5 here; 0.75 is a common choice)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
# print the indexes of the highly correlated attributes
print(highlyCorrelated)
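As a follow-up, one way to use the result (a sketch, assuming the same data/n setup as above; the indexes refer to the predictor columns):
if (length(highlyCorrelated) > 0) {
  data_reduced <- data[, -highlyCorrelated] # drop one variable from each highly correlated pair
} else {
  data_reduced <- data
}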