Statistics Notes 3 - Other Terms

The Pearson / Wald / Score chi-square tests can be used to test the association between the independent variables and the dependent variable. 

The Wald and Score chi-square tests can be used for both continuous and categorical variables, whereas the Pearson chi-square test is used for categorical variables. The p-value indicates whether a coefficient is significantly different from zero. In logistic regression, we can select the top variables based on their high Wald chi-square values.
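A minimal sketch of a Pearson chi-square test of association on a contingency table, using scipy; the counts below are made up purely for illustration.

```python
# Pearson chi-square test of association between a categorical predictor
# and a binary outcome (illustrative counts only).
import numpy as np
from scipy.stats import chi2_contingency

# Rows = categories of the independent variable, columns = event / non-event
observed = np.array([[120,  80],
                     [ 90, 110],
                     [ 60, 140]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# A small p-value suggests the predictor is associated with the outcome.
```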

Gain : Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire data set. In finance and credit risk scoring this is also called the CAP (Cumulative Accuracy Profile).

Interpretation: the % of targets (events) covered at a given decile level. For example, 80% of targets are covered in the top 20% of data ranked by the model. In the case of a propensity-to-buy model, we can say that we can identify and target 80% of the customers who are likely to buy the product by sending email to just 20% of the total customers.

Lift : It measures how much better one can expect to do with the predictive model compared to not using a model. It is the ratio of the gain % to the random expectation % at a given decile level. The random expectation at the xth decile is x%.
Interpretation: A cumulative lift of 4.03 for the top two deciles means that when selecting 20% of the records based on the model, one can expect 4.03 times the number of targets (events) found by randomly selecting 20% of the file without a model.

Gain / Lift Analysis
  1. Randomly split data into two samples: 70% = training sample, 30% = validation sample. 
  2. Score (predicted probability) the validation sample using the response model under consideration. 
  3. Rank the scored file, in descending order by estimated probability 
  4. Split the ranked file into 10 sections (deciles) 
  5. Count the number of observations in each decile. 
  6. Count the number of actual events in each decile. 
  7. Compute the cumulative number of actual events in each decile. 
  8. Compute the percentage of cumulative actual events in each decile. This is called the gain score. 
  9. Divide the gain score by the cumulative % of data up to that decile to get the cumulative lift. For example, in the second decile, divide the gain score by 20.
Decile | Cases | Responses | Cumulative Responses | % of Events | Cumulative Gain | Cumulative Lift | Cumulative % of Data
1      | 2500  | 2179      | 2179                 | 44.71%      | 44.71%          | 4.47            | 10%
2      | 2500  | 1753      | 3932                 | 35.97%      | 80.67%          | 4.03            | 20%
3      | 2500  | 396       | 4328                 | 8.12%       | 88.80%          | 2.96            | 30%
4      | 2500  | 111       | 4439                 | 2.28%       | 91.08%          | 2.28            | 40%
5      | 2500  | 110       | 4549                 | 2.26%       | 93.33%          | 1.87            | 50%
6      | 2500  | 85        | 4634                 | 1.74%       | 95.08%          | 1.58            | 60%
7      | 2500  | 67        | 4701                 | 1.37%       | 96.45%          | 1.38            | 70%
8      | 2500  | 69        | 4770                 | 1.42%       | 97.87%          | 1.22            | 80%
9      | 2500  | 49        | 4819                 | 1.01%       | 98.87%          | 1.10            | 90%
10     | 2500  | 55        | 4874                 | 1.13%       | 100.00%         | 1.00            | 100%
Total  | 25000 | 4874      |                      |             |                 |                 |
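A rough pandas sketch of steps 3-9 above; the column names `score` (predicted probability) and `target` (1 = event, 0 = non-event) are assumptions, not fixed names.

```python
# Build a gain / lift table by decile from a scored validation sample.
# Assumes a dataframe with hypothetical columns 'score' and 'target'.
import pandas as pd

def gain_lift_table(df, score_col="score", target_col="target", n_bins=10):
    d = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    # Decile 1 = highest predicted probabilities
    d["decile"] = pd.qcut(d.index, n_bins, labels=list(range(1, n_bins + 1)))
    grp = d.groupby("decile")[target_col].agg(cases="count", responses="sum")
    grp["cum_responses"] = grp["responses"].cumsum()
    grp["gain_pct"] = 100 * grp["cum_responses"] / grp["responses"].sum()   # gain score
    grp["cum_data_pct"] = 100 * grp["cases"].cumsum() / grp["cases"].sum()  # 10, 20, ...
    grp["cum_lift"] = grp["gain_pct"] / grp["cum_data_pct"]                 # step 9
    return grp
```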

Detecting Outliers 

There are a few simple ways you can detect an outlier problem:

1. Box Plot Method : If a value is higher than 1.5*IQR above the upper quartile (Q3), the value is considered an outlier. Similarly, if a value is lower than 1.5*IQR below the lower quartile (Q1), the value is considered an outlier.
IQR is the interquartile range. It measures dispersion or variation. IQR = Q3 - Q1.
Lower limit of acceptable range = Q1 - 1.5* (Q3-Q1)
Upper limit of acceptable range = Q3 + 1.5* (Q3-Q1)
Some researchers use 3 times the interquartile range instead of 1.5 as the cutoff. If a high percentage of values appear as outliers when you use 1.5*IQR as the cutoff, you can use the following rule instead:
Lower limit of acceptable range = Q1 - 3* (Q3-Q1)
Upper limit of acceptable range = Q3 + 3* (Q3-Q1)
2. Standard Deviation Method: If a value lies outside the mean plus or minus three standard deviations, it is considered an outlier. This is based on the characteristics of a normal distribution, for which about 99.73% of the data fall within this range. 
Acceptable Range : The mean plus or minus three standard deviations
This method has several shortcomings :
  1. The mean and standard deviation are strongly affected by outliers.
  2. It assumes that the distribution is normal (outliers included)
  3. It does not detect outliers in small samples
3. Percentile Capping (Winsorization): In layman's terms, winsorization (winsorizing) at the 1st and 99th percentiles means that values below the value at the 1st percentile are replaced by the value at the 1st percentile, and values above the value at the 99th percentile are replaced by the value at the 99th percentile. Winsorization at the 5th and 95th percentiles is also common. 

The box-plot method is less affected by extreme values than the standard deviation method, but if the distribution is heavily skewed, the box-plot method can break down. Winsorization is an industry-standard technique for treating outliers and works well in practice; in contrast, the box-plot and standard deviation methods are more traditional approaches.
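A small sketch of the three rules above on a single numeric column; the data are randomly generated stand-ins.

```python
# Outlier detection / treatment on a numeric pandas Series (stand-in data).
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).lognormal(size=1000))

# 1. Box-plot (IQR) method
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # use 3 * iqr for the stricter rule
iqr_outliers = x[(x < lower) | (x > upper)]

# 2. Standard deviation method (mean +/- 3 SD)
mean, sd = x.mean(), x.std()
sd_outliers = x[(x < mean - 3 * sd) | (x > mean + 3 * sd)]

# 3. Winsorization at the 1st and 99th percentiles
p01, p99 = x.quantile([0.01, 0.99])
x_winsorized = x.clip(lower=p01, upper=p99)
```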

4. Weight of Evidence: Logistic regression is one of the most commonly used statistical techniques for solving binary classification problems and is an accepted technique in almost all domains. Two related concepts, weight of evidence (WOE) and information value (IV), evolved from the same logistic regression technique. These terms have existed in the credit scoring world for more than four or five decades. They have been used as a benchmark to screen variables in credit risk modeling projects such as probability of default, and they help to explore data and screen variables. They are also used in marketing analytics projects such as customer attrition and campaign response models.

What is Weight of Evidence (WOE)?

The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers. "Bad customers" refers to customers who defaulted on a loan, and "good customers" refers to customers who paid back the loan.

Formula: WOE = ln(% of Good Customers / % of Bad Customers)
Distribution of Goods - % of Good Customers in a particular group
Distribution of Bads - % of Bad Customers in a particular group
ln - Natural Log
Positive WOE means Distribution of Goods > Distribution of Bads
Negative WOE means Distribution of Goods < Distribution of Bads
Hint: the log of a ratio greater than 1 is positive; the log of a ratio less than 1 is negative.

Many people do not understand the terms goods/bads because they come from a different background than credit risk, so it helps to think of WOE in terms of events and non-events. WOE is calculated by taking the natural logarithm (log to base e) of the ratio of the % of non-events to the % of events.
Weight of Evidence for a category = ln(% of non-events / % of events) in that category

Outlier Treatment with Weight of Evidence: outlier classes are grouped with other categories based on their WOE values.


Steps of Calculating WOE
  1. For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
  2. Calculate the number of events and non-events in each group (bin).
  3. Calculate the % of events and % of non-events in each group.
  4. Calculate WOE by taking the natural log of the ratio of the % of non-events to the % of events.
Note: For a categorical variable, you do not need to split the data (ignore step 1 and follow the remaining steps).
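A minimal pandas sketch of the four steps above for one predictor; the column names `feature` and `target` are assumptions, and the per-bin information value (IV) contribution uses the standard credit-scoring formula (% of non-events - % of events) * WOE.

```python
# WOE (and per-bin IV) for a single continuous predictor.
# Assumes a dataframe with hypothetical columns 'feature' and 'target' (1 = event).
import numpy as np
import pandas as pd

def woe_table(df, feature="feature", target="target", bins=10):
    d = df[[feature, target]].copy()
    d["bin"] = pd.qcut(d[feature], q=bins, duplicates="drop")  # step 1 (skip if categorical)
    grp = d.groupby("bin")[target].agg(events="sum", total="count")            # step 2
    grp["non_events"] = grp["total"] - grp["events"]
    grp["pct_events"] = grp["events"] / grp["events"].sum()                    # step 3
    grp["pct_non_events"] = grp["non_events"] / grp["non_events"].sum()
    grp["woe"] = np.log(grp["pct_non_events"] / grp["pct_events"])             # step 4
    grp["iv"] = (grp["pct_non_events"] - grp["pct_events"]) * grp["woe"]       # per-bin IV
    return grp
```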


Terminologies related to WOE

1. Fine Classing: Create 10/20 bins/groups for a continuous independent variable and then calculate the WOE and IV of the variable.
2. Coarse Classing: Combine adjacent categories with similar WOE scores.

Usage of WOE

Weight of Evidence (WOE) helps to transform a continuous independent variable into a set of groups or bins based on the similarity of the dependent variable distribution, i.e. the number of events and non-events.

For continuous independent variables: First, create bins (categories/groups) for the continuous independent variable, then combine categories with similar WOE values and replace the categories with their WOE values. Use WOE values rather than the input values in your model.

Categorical independent variables: Combine categories with similar WOE and then create new categories of the independent variable with continuous WOE values. In other words, use WOE values rather than raw categories in your model. The transformed variable will be a continuous variable with WOE values, the same as any other continuous variable.

Why combine categories with similar WOE?

It is because categories with similar WOE have almost the same proportion of events and non-events. In other words, the behavior of those categories is the same.
Rules related to WOE
  1. Each category (bin) should have at least 5% of the observations.
  2. Each category (bin) should be non-zero for both non-events and events.
  3. The WOE should be distinct for each category; similar groups should be aggregated.
  4. The WOE should be monotonic, i.e. either increasing or decreasing across the groups.
  5. Missing values are binned separately.
FEATURE SELECTION : SELECT IMPORTANT VARIABLES WITH BORUTA PACKAGE

This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project; it is also called 'feature selection'. Every private and public agency has started tracking data and collecting information on various attributes, which results in access to too many predictors for a predictive model. But not every variable is important for predicting a particular outcome, so it is essential to identify important variables and remove redundant ones. Before building a predictive model, the exact list of important variables that returns an accurate and robust model is generally not known.

Why Variable Selection is important?
  1. Removing a redundant variable helps to improve accuracy. Similarly, inclusion of a relevant variable has a positive effect on model accuracy.
  2. Too many variables might result in overfitting, which means the model is not able to generalize the pattern
  3. Too many variables lead to slow computation, which in turn requires more memory and hardware.

Why Boruta Package?

There are a lot of packages for feature selection in R. The question arises: "What makes the boruta package so special?" See the following reasons to use the boruta package for feature selection.
  1. It works well for both classification and regression problems.
  2. It takes into account multi-variable relationships.
  3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
  4. It follows an all-relevant variable selection method, in which it considers all features that are relevant to the outcome variable, whereas most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features which yields a minimal error on a chosen classifier.
  5. It can handle interactions between variables.
  6. It can deal with the fluctuating nature of a random forest importance measure.
Basic Idea of Boruta Algorithm
Shuffle the predictors' values, join the shuffled copies with the original predictors, and then build a random forest on the merged dataset. Then compare the original variables with the randomised variables to measure variable importance. Only variables having higher importance than that of the randomised variables are considered important.

How Boruta Algorithm Works

Follow the steps below to understand the algorithm -
  1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.
  2. Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
  3. Combine the original variables with their shuffled copies.
  4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable, where higher means more important.
  5. Then a Z score is computed: the mean of the accuracy loss divided by the standard deviation of the accuracy loss.
  6. Find the maximum Z score among shadow attributes (MZSA)
  7. Tag a variable as 'unimportant' when its importance is significantly lower than MZSA, and permanently remove it from the process.
  8. Tag a variable as 'important' when its importance is significantly higher than MZSA.
  9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
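This is not the boruta R package itself, just a simplified Python sketch of the shadow-feature comparison described above, using scikit-learn's impurity-based importances as a stand-in for Mean Decrease Accuracy; the dataset and names are made up.

```python
# Simplified shadow-feature illustration of the Boruta idea (not the boruta package).
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])

rng = np.random.default_rng(0)
shadows = X.apply(rng.permutation)                    # shuffled (shadow) copies
shadows.columns = ["shadow_" + c for c in X.columns]
combined = pd.concat([X, shadows], axis=1)            # originals + shadows

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(combined, y)
importances = pd.Series(rf.feature_importances_, index=combined.columns)

mzsa = importances.filter(like="shadow_").max()       # max importance among shadows
real = importances.loc[X.columns]
print(real[real > mzsa].sort_values(ascending=False)) # candidates tagged 'important'
```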

Major disadvantage: Boruta does not treat collinearity while selecting important variables. This is because of the way the algorithm works.

In Linear Regression:

There are two important metrics that help evaluate the model - Adjusted R-Square and Mallows' Cp statistic.

Adjusted R-Square: It penalizes the model for inclusion of each additional variable. Adjusted R-square would increase only if the variable included in the model is significant. The model with the larger adjusted R-square value is considered to be the better model.

Mallows' Cp Statistic: It helps detect model bias, which refers to either underfitting or overfitting the model.

Formula: Mallows' Cp = (SSE/MSE) - (n - 2p) 

where SSE is the sum of squared errors of the candidate model, MSE is the mean squared error of the model with all independent variables included, and p is the number of estimated parameters in the candidate model (i.e. the number of independent variables plus the intercept).

Rule to select the best model: Look for models where Cp is less than or equal to p, which is the number of independent variables plus the intercept.

A final model should be selected based on the following two criteria -

First step: Shortlist models where Cp is less than or equal to p.

Second step: Among these, select the model with the fewest parameters. Suppose two models have Cp less than or equal to p: the first model has 5 variables and the second has 6. We should select the first model as it contains fewer parameters.

Important Note : 

To select the best model for parameter estimation, you should use Hocking's criterion for Cp.

For parameter estimation, Hocking recommends a model where Cp <= 2p - pfull + 1, where p is the number of parameters in the model, including the intercept, and pfull is the total number of parameters (initial variable list) in the full model.

To select the best model for prediction, you should use Mallows' criterion for Cp.
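A small numpy sketch of the Cp formula above; the data and the candidate subset are made up, and the MSE comes from the full model as described.

```python
# Mallows' Cp = SSE_p / MSE_full - (n - 2p) for a candidate subset of predictors.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
X_full = rng.normal(size=(n, k))
y = 2 + X_full[:, 0] - 0.5 * X_full[:, 1] + rng.normal(size=n)

def fit_sse(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])        # add intercept
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return resid @ resid, Xc.shape[1]                 # SSE and p (params incl. intercept)

sse_full, p_full = fit_sse(X_full, y)
mse_full = sse_full / (n - p_full)                    # MSE of the full model

sse_p, p = fit_sse(X_full[:, [0, 1]], y)              # candidate model with 2 predictors
cp = sse_p / mse_full - (n - 2 * p)
print(f"p = {p}, Cp = {cp:.2f}  (look for Cp <= p)")
```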


How to check non-linearity
In linear regression analysis, an important assumption is that there is a linear relationship between the independent variables and the dependent variable, whereas logistic regression assumes a linear relationship between the independent variables and the logit function.
  • Pearson correlation is a measure of linear relationship. The variables must be measured on interval scales. It is sensitive to outliers. If the Pearson correlation coefficient of a variable is close to 0, it means there is no linear relationship between the variables.
  • Spearman's correlation is a measure of monotonic relationship. It can be used for ordinal variables. It is less sensitive to outliers. If the Spearman correlation coefficient of a variable is close to 0, it means there is no monotonic relationship between the variables.
  • Hoeffding's D correlation is a measure of linear, monotonic and non-monotonic relationships. It takes values between -0.5 and 1. The sign of the Hoeffding coefficient has no interpretation.
  • If a variable has a very low rank for Spearman (coefficient close to 0) and a very high rank for Hoeffding, this indicates a non-monotonic relationship.
  • If a variable has a very low rank for Pearson (coefficient close to 0) and a very high rank for Hoeffding, this indicates a non-linear relationship.
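A quick scipy check of the first two measures on made-up data (Hoeffding's D is not in scipy; it is available, for example, in SAS PROC CORR).

```python
# Pearson vs. Spearman on a non-linear, non-monotonic relationship (made-up data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.5, size=500)

r_pearson, p_p = pearsonr(x, y)
r_spearman, p_s = spearmanr(x, y)
print(f"Pearson r  = {r_pearson:.3f}")   # near 0: no linear relationship
print(f"Spearman r = {r_spearman:.3f}")  # near 0: no monotonic relationship
```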

Appendix:


marketing mix notes

Marketing Mix

Product : It includes all product items marketed by the marketer, their features, quality, brand, packaging, labelling, product life cycle, and all decisions related to the product

Product assortment: the set of products offered to customers by the entire industry

 

Product line is a group of similar featured items marketed by a marketer

The total number of lines is referred to as the breadth (width) of the product mix

 

Product depth or item depth refers to the number of versions offered of each product in the line

 

Distribution channel – is very important to Netflix

 

 

Price : brings revenue, act of determining value of a product

Includes pricing objectives, price setting strategies, general pricing policies, discount, allowance, rebate, etc. price mix also includes cash and credit policy, price discrimination, cost and contribution

 

Place : location distance , transport

 

Direct marketing: no intermediary is involved

 

 

Promotion: is defined as a combination of all activities concerned with informing and persuading actual and potential customers about the merits of a product, with the intention of achieving sales goals

 

Sales promotion involves offering short-term incentive to promote buying and increase sales

 

The most popular forms of sales promotion are free gifts, discounts, exchange offers, free home delivery, after-sales services, guarantees, warranties, various purchase schemes, etc.

 

Public relations: favourable relations between the organization and the public

 

Modification and extensions to 4 p’s

Product, price, place and promotion (marketer-oriented approach)

 

Consumer oriented approach (4c’s)

Commodity - Product

Cost - Price

Channel - Place

Communication - Promotion

 

Services are fundamentally different from products

Process : procedures / mechanisms for delivering services and monitoring

People : human factor as they interact with the consumer using the services

Physical Evidences :

 

 

Extension of 4c’s

Consumer solution

Cost convenience

Communication

 

Elements of the marketing mix are mutually dependent

Marketing mix elements are meant for attaining the target markets

Essence of the marketing mix is ensuring profitability through customer satisfaction

 

Elements help the marketer in attaining marketing objectives

 

Customer is the central focus of marketing mix

 

Purpose and objectives of marketing mix

Marketing mix aims at customer satisfaction

Success of each and every product

Aims at assisting the marketers in creating effective marketing strategy

Profit maximization, image building, creation of goodwill, maintaining better customer relations


Marketing mix is the link between business and customers

Marketing mix helps to increase sales and profit

 

For Netflix: a reduction in price could be attributed to diminishing returns from advertising

 

Market Mix Modelling

Marketing Mix Modelling (MMM) is a method that helps quantify the impact of several marketing inputs on sales or market share. The purpose of MMM is to understand how much each marketing input contributes to sales, and how much to spend on each marketing input.

MMM relies on statistical analysis such as multivariate regressions on sales and marketing time series data to estimate the impact of various marketing tactics (marketing mix) on sales and then forecast the impact of future sets of tactics. It is often used to optimize the advertising mix and promotional tactics with respect to sales and profits.

Marketing Mix Modeling (MMM) is one of the most popular analyses under Marketing Analytics. It helps organisations estimate the effects of spend on different advertising channels (TV, Radio, Print, Online Ads etc.) as well as other factors (price, competition, weather, inflation, unemployment) on sales. In simple words, it helps companies optimize the marketing investments they make across different marketing mediums (both online and offline).

Uses of Marketing Mix Modeling
It answers the following questions which management generally wants to know.
  1. Which marketing medium (TV, radio, print, online ads) returns the maximum return on investment (ROI)?
  2. How much should be spent on marketing activities to increase sales by a given percentage (e.g. 15%)?
  3. Predict future sales from the investment spent on marketing activities
  4. Identify the key drivers of sales (including marketing mediums, price, competition, weather and macro-economic factors)
  5. How to optimize marketing spend?
  6. Is online marketing medium better than offline?
Types of Marketing Mediums
Let's break it into two parts - offline and online.
Offline Marketing:
  • Print media: newspapers, magazines
  • TV
  • Radio
  • Out-of-home (OOH) advertising such as billboards and ads in public places
  • Direct mail such as catalogs and letters
  • Telemarketing
  • Below-the-line promotions such as free product samples or vouchers
  • Sponsorship

Online Marketing:
  • Search engine marketing: content marketing, backlink building, etc.
  • Pay per click, pay per impression
  • Email marketing
  • Social media marketing (Facebook, YouTube, Instagram, LinkedIn ads)
  • Affiliate marketing



Marketing Spend as a percent of companies revenues by industry

Marketing Mix Modeling

MMM has had a place in marketers’ analytics toolkit for decades. This is due to the unique insights marketing mix models can provide. By leveraging regression analysis, MMM provides a “top down” view into the marketing landscape and the high-level insights that indicate where media is driving the most impact.

For example: by gathering long-term, aggregate data over several months, marketers can identify the mediums consumers engage with the most. MMM provides a report of where and when media is engaged over a long stretch of time.

Background: Marketing Mix Modeling (MMM)

The beginning of the offline measurement

Marketing Mix Modelling is a decades-old process, developed in the earliest days of modern marketing, that applies regression analysis to historical sales data to analyse the effects of changing marketing activities. Many marketers still use MMM for top-level media planning and budgeting; it delivers a broad view into variables both inside and outside of the marketer's control.

Some of the factors are:

  1. Price
  2. Promotions
  3. Competitor Activity
  4. Media Activity
  5. Economic Conditions

Analytical and statistical methods used to quantify the effect of media and marketing efforts on a product's performance are collectively called Marketing Mix Modeling.


"It helps to maximize investment and grow ROI"

ROI = (Incremental returns from investment) / Cost of Investment

Marketing ROI = (Incremental Dollar Sales from Marketing Investment) / Spend on Marketing Investment

Why is MMM Needed? Guiding Decisions for Improved Effectiveness

  1. How do I change the mix to increase sales with my existing budget?
  2. Where am I over-spending or under-spending?
  3. Which marketing channels are effective but lack the efficiency for positive ROI?
  4. To what degree do non-marketing factors influence sales?

How does MMM work?

  1. Correlate marketing to sales
  2. Factor in lag time
  3. Test interaction effects
  4. Attribute sales by input
  5. Model to most predictive
  6. Maximize significance - to empower decisions

Example Marketing Mix Model Output

Detailed output includes:

  1. Weekly sales lift
  2. More marketing channels
  3. Contribution by tactic
  4. Contribution by campaign
  5. Non-Marketing impact


Market Contribution vs. Base


ROI Assessment:

We measure ROI because not all ads will convert to sales; we want to know which ones are cost-effective and deliver the most bang for the buck.

MMM Strengths:

  • Complete set of marketing tactics
  • Impact of non-marketing factors
  • High Statistical Reliability
  • Guides change in the marketing mix
  • Guides change in spend
  • Optimizes budget allocation

MMM Limitations:

  • More Tactical than Strategic
  • Short-Term impact only
  • Dependent on variance over time
  • Average Effectiveness
  • No diagnostics to improve
  • Hard to refresh frequently

Critical Success Factors of MMM:

  • Use a Strategic approach (not tactical)
  • Disclose gaps and limitations
  • Add Diagnostic measures
  • Integrate into robust measurement plan
  • Make marketing more measurable
  • Create ROI simulation tools

Media Mix Modeling as Econometric Modeling:

Strengths:

  1. It reduces the biases
  2. It correctly or accurately isolates the impact of media on sales from the impact of all other factors that influence sales.

Weaknesses:

  1. If two types of media are highly correlated in the historical record, then the ability to isolate and separate the effect of each media type on sales is reduced.

For working with Market Mix Modeling, a good understanding of the relevant types of econometric modelling is needed.

The objective before starting this approach is to understand how we can maximize the value and minimize the harm of marketing mix models, such as store-based models or shopper-based multi-user attribution models.

Marketing end users are at the root cause of marketing mix model problems.

Tip: Most attribution projects begin long after the strategy has already been set. So it's important to understand what the client did, why they did it, and what they expected to happen. Only then can you answer their questions in a way they'll be happy with. Remember they hired you because the results weren't what they expected... or because they never thought about how to measure them in the first place.

As we all know weekly variation is the lifeblood of marketing mix models.


Some of the problems include continuity bias.

Very interesting article on using Market Mix Modelling during COVID-19.

Market Mix Modeling (MMM) in times of Covid-19 | by Ridhima Kumar | Aryma Labs | Medium

In the article, I read that there will be a sudden demand for essential items during the pandemic, but this deviation cannot be attributed to existing advertisement factors.

In the regression model we can see that there could be:

  • Heteroscedasticity: The sales trend could show significant changes from the beginning to the end of the series, so the model could have heteroscedasticity. Reasons for heteroscedasticity include the presence of outliers in the data or a large range between the largest and smallest observed values.
  • Autocorrelation: The model could also show signs of autocorrelation due to a missing independent variable (the missing variable being a Covid-19 indicator).
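A small statsmodels sketch of how these two issues are commonly checked on a fitted regression; the sales and spend series are made up.

```python
# Check a fitted sales regression for heteroscedasticity (Breusch-Pagan)
# and autocorrelation (Durbin-Watson). Data are made up.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 150
ad_spend = rng.uniform(0, 100, size=n)
sales = 50 + 0.8 * ad_spend + rng.normal(scale=1 + ad_spend / 20, size=n)

X = sm.add_constant(ad_spend)
model = sm.OLS(sales, X).fit()

bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")           # small -> heteroscedasticity
print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")  # far from 2 -> autocorrelation
```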

Another very interesting article on Marketing Analytics using Markov chain

Marketing Analytics through Markov Chain | LinkedIn

In the article, I read how we can use a transition matrix to understand the change in states. It explains it very neatly.

Article on Conjoint Analysis : Conjoint Analysis: What type of chocolates do the Indian customers prefer? | LinkedIn

Marketing Mix Modeling (MMM) is the use of statistical analysis to estimate the past impact and predict the future impact of various marketing tactics on sales. Your Marketing Mix Modeling project needs to have goals, just like your marketing campaigns.

The main goal of any Marketing Mix Modeling project is to measure past marketing performance so you can use it to improve future Marketing Return on Investment (MROI).

The insights you gain from your project can help you reallocate your marketing budget across your tactics, products, segments, time and markets for a better future return. All of the marketing tactics you use should be included in your project, assuming there is high-quality data with sufficient time, product, demographic, and/or market variability. Each project has four distinct phases, starting with data collection and ending with optimization of future strategies. Let’s take a look at each phase in depth:

Phase 1 : Data Collection and Integrity : It can be tempting to request as much data as possible, but it's important to note that every request has a very real cost to the client. In this case the task could be simplified down to just marketing spend by day, by channel, as well as sales revenue.

Phase 2 : Modeling: Before modelling we need to;

  • Identify Baseline and Incremental Sales

  • Identify Drivers of Sales

  • Identify Drivers of Growth

  • Sales Drivers by Week

  • Optimal Media Spend
  • Understanding Brand Context: Understanding the clients marketing strategy & its implementation is key for succeeding in the delivery of the MMM project.
    • The STP Strategy (Segmentation, Targeting and Positioning) impacts the choice of the target audience and influences the interpretation of the model results.
    • The company context and 4P's determine the key datasets that needed to be collected and influence the key factors. Eg: Impact of Seasonality , Distribution of Channels

Phase 3 : Model-Based Business Measures

Phase 4 : Optimization & Strategies

Pitfalls in Market Mix Modeling: 

1. Why MMM vendors being “personally objective” is not the same as their being “statistically unbiased”.
2. How to clear the distortions that come from viewing “today’s personalized continuity marketing” through “yesterday’s mass-market, near-term focused lens”.
3. Why “statistically controlling” for a variable (seasonality, trend, etc.) does NOT mean removing its influence on marketing performance.


Some points about Marketing Mix Modeling:

Your Marketing Return on Investment (MROI) will be a key metric to look at during your Marketing Mix Modeling project, whether that be Marginal Marketing Return on Investment for future planning or Average Marketing Return on Investment for past interpretation. The best projects also gauge the quality of their marketing mix model, using Mean Absolute Percent Error (MAPE) and R^2

1. Ad creative is very important to your sales top line and your MROI, especially if you can tailor it to a segmented audience. This paper presents five best Spanish language creative practices to drive MROI, which should also impact top-of-the-funnel marketing measures. 

 2. The long-term impact of marketing on sales is hard to nail down, but we have found that ads that don’t generate sales lift in the near-term usually don’t in the long-term either. You can also expect long-term Marketing Return on Investment to be about 1.5 to 2.5 times the near-term Marketing Return on Investment. 

3. Modeled sales may not be equivalent to total sales. Understand how marketing to targeted segments will be modeled.

4. Brand size matters. As most brand managers know firsthand, the economics of advertising favors large brands over small brands. The same brand TV expenditure and TV lift produce larger incremental margin dollars, and thus a larger Marketing Return on Investment, for the large brand than for the small brand.

5. One medium's Marketing Return on Investment does not dominate consistently. Since flighting, media weight, targeted audience, timing, copy and geographic execution vary by media for a brand, each medium's Marketing Return on Investment can also vary significantly.

Some more background into Marketing Mix Models:

Product : A product can be either a tangible product or an intangible service that meets a specific customer need or demand
Price : Price is the actual amount the customer is expected to pay for the product
Promotion : Promotion includes marketing communication strategies like advertising, offers, public relations etc.
Place : Place refers to where a company sells their product and how it delivers the product to the market.

Marketing Objectives:
For the different marketing types (TV, Radio, Print, Outdoor, Internet, Search Engine, Mobile Apps), we would like to:

1. Measure ROI by media type
2. Simulate alternative media plans

Research Objectives:

1. Measure ROI by media type
2. Simulate alternative media plans
3. Build a User-Friendly simulation tool
4. Build User-Friendly optimization tool

First Step: Building the Modeling Data Set 
  1. Cross-Sectional Unit:
      • Regions
      • Markets
      • Trade Spaces
      • Channels
      • Your brands
      • Competitor brands
  2. Unit of Time:
      • Months
      • Weeks
  3. Length of History:
      • At least 5 years of monthly data
      • At least 2 years of weekly data


    Define the Variables

    Sales

      • Dependent Variables
      • units (not currency)

    Media Variables: 

      • TV, Radio, Internet, Social, etc.
      • Measure as units of activity (e.g., GRPs, impressions)

    Control Variables

      • Macroeconomic factors
      • Seasonality
      • Price
      • Trade Promotions
      • Retail Promotions
      • Competitor Activity

    Pick Functional Form of Demand Equation

    Quantity Demanded is a function f of:
      • Price
      • Economic Conditions
      • Size of Market
      • Customer Preferences
      • Strength of Competition
      • Marketing Activity

    Most Common Functional Forms

      • Linear
      • Log-Linear - stronger assumptions
      • Double Log - even stronger assumptions (used by a large percentage of models)
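A tiny sketch of the double-log form, where the fitted slopes read directly as elasticities; the price, GRP and quantity series are made up.

```python
# Double-log (log-log) demand form: ln(Q) = b0 + b1*ln(price) + b2*ln(GRPs) + e
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
price = rng.uniform(5, 15, size=n)
grps = rng.uniform(10, 200, size=n)                   # media activity, e.g. TV GRPs
quantity = 1000 * price ** -1.2 * grps ** 0.15 * np.exp(rng.normal(scale=0.1, size=n))

X = sm.add_constant(np.column_stack([np.log(price), np.log(grps)]))
fit = sm.OLS(np.log(quantity), X).fit()
print(fit.params)  # slope on ln(price) ~ price elasticity, on ln(grps) ~ media elasticity
```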

    Modelling Issues

      • Omitted variables (try to get as many variables as possible which are considered to have a big impact on demand)
      • Endogeneity bias (instrumental variable approach; if the variable is in our predictor variables and also in our dependent variable, this creates bias and we need to account for the bias)
      • Serial correlation (almost all time series data have serial correlation, which creates bias)
      • Counterintuitive results (if the time series is short, we may not have enough history to look back on, so we try to add more cross-sectional variables at a more granular level)
      • Short time series

    Market-Mix Modeling Econometrics

      • Mixed Modeling: fixed effects, random effects
      • Parks Estimator
      • Bayesian Methods: Random effects
      • Adstock variables: can be split up into multiple variables for different types of advertisements like promotion, equity, etc.
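A short sketch of a geometric adstock transformation, a common way of building carry-over advertising effects before regression; the decay rate and the weekly GRP series are assumptions.

```python
# Geometric adstock: out[t] = x[t] + decay * out[t-1]  (carry-over of ad effect).
import numpy as np

def adstock(x, decay=0.5):
    out = np.zeros_like(x, dtype=float)
    carry = 0.0
    for t, spend in enumerate(x):
        carry = spend + decay * carry
        out[t] = carry
    return out

weekly_grps = np.array([100, 0, 0, 50, 0, 0, 0], dtype=float)
print(adstock(weekly_grps, decay=0.5))
# -> 100, 50, 25, 62.5, 31.25, 15.625, 7.8125
```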

    Multiple Factors that Affect Outcome (Incremental Sales) :

    1. Campaign
    2. Pricing
    3. Other Campaigns
    4. Competitor Effects
    5. Seasonality
    6. Regulatory Factors

    Market Mix Modelling is designed to pick up short-term effects; it is not able to model long-term effects such as the effect of the brand. Advertising helps in building a brand, but this is difficult to model.

    Attribution Modeling is different from Media/Market Mix Modeling as it offers additional insight. In this type of modelling, we measure the contribution of earlier touchpoints in the customer's digital journey to the final sale. Attribution Modeling is a bottom-up approach, but it is becoming difficult to do because third-party cookies are being phased out.

    Multi-Touch Attribution modelling is more advanced than top-down Market Mix Modeling because there is an instant feedback loop to understand what is working, whereas in Market Mix Modeling we would just determine the percentage change in x needed to drive sales and then make the adjustment again in next year's model, without getting any real on-the-ground feedback on whether we reached the target we set out to achieve.


    Nielsen Marketing Mix Modeling is the largest Market Mix Modeling provider in the world.

    The Pros and Cons of Marketing Mix Modeling

    When it comes to initial marketing strategy or understanding external factors that can influence the success of a campaign, marketing mix modeling shines. Given that MMM leverages long-term data collection to provide its insights, marketers can measure the impact of holidays, seasonality, weather, brand authority, etc. on overall marketing success.

    As consumers engage with brands across a variety of print, digital, and broadcast channels, marketers need to understand how each touchpoint drives consumers toward conversion. Simply put, marketers need measurements at the person-level that can measure an individual consumer’s engagement across the entire customer journey in order to tailor marketing efforts accordingly.

    Unfortunately, marketing mix modeling can’t provide this level of insight. While MMM has a variety of pros and cons, the biggest pitfall of MMM is its inability to keep up with the trends, changes, and online and offline media optimization opportunities for marketing efforts in-campaign.

    Research on IT Certifications

    Top IT management certifications

    The most valuable certifications for 2021

    • Google Certified Professional Data Engineer
    • Google Certified Professional Cloud Architect
    • AWS Certified Solutions Architect Associate
    • Certified in Risk and Information Systems Control (CRISC)
    • Project Management Professional (PMP)

    Top agile certifications

    • PMI-ACP

    Top 15 data science certifications

    • Certified Analytics Professional (CAP)
    • Cloudera Certified Associate (CCA) Data Analyst
    • Cloudera Certified Professional (CCP) Data Engineer
    • Data Science Council of America (DASCA) Senior Data Scientist (SDS)
    • Data Science Council of America (DASCA) Principal Data Scientist (PDS)
    • Dell EMC Data Science Track (EMCDS)
    • Google Professional Data Engineer Certification
    • IBM Data Science Professional Certificate
    • Microsoft Certified: Azure AI Fundamentals
    • Microsoft Certified: Azure Data Scientist Associate
    • Open Certified Data Scientist (Open CDS)
    • SAS Certified AI & Machine Learning Professional
    • SAS Certified Big Data Professional
    • SAS Certified Data Scientist
    • Tensorflow Developer Certificate
    • Mining Massive Data Sets Graduate Certificate by Stanford

    Top 10 business analyst certifications

    • Certified Analytics Professional (CAP)
    • IIBA Entry Certificate in Business Analysis (ECBA)
    • IIBA Certification of Competency in Business Analysis (CCBA)
    • IIBA Certified Business Analysis Professional (CBAP)
    • IIBA Agile Analysis Certification (AAC)
    • IIBA Certification in Business Data Analytics (CBDA)
    • IQBBA Certified Foundation Level Business Analyst (CFLBA)
    • IREB Certified Professional for Requirements Engineering (CPRE)
    • PMI Professional in Business Analysis (PBA)
    • SimpliLearn Business Analyst Masters Program

    The top 11 data analytics and big data certifications

    • Associate Certified Analytics Professional (aCAP)
    • Certification of Professional Achievement in Data Sciences
    • Certified Analytics Professional
    • Cloudera Data Platform Generalist
    • EMC Proven Professional Data Scientist Associate (EMCDSA)
    • IBM Data Science Professional Certificate
    • Microsoft Certified Azure Data Scientist Associate
    • Microsoft Certified Data Analyst Associate
    • Open Certified Data Scientist
    • SAS Certified Advanced Analytics Professional Using SAS 9
    • SAS Certified Data Scientist

    Chartered Data ScientistTM

    This distinction is provided by the Association of Data Scientists (ADaSci). This designation is awarded to those candidates who pass the CDS exam and hold a minimum of two years of work experience as a data scientist. However, the candidates who do not have experience can also take the exam and carry the results. But their charter, in this case, is put on hold until they attain the two years of experience. There is no training or course required to earn this award. The cost of taking this exam is 250 US Dollar. This charter has lifetime validity and hence it does not expire. 

    Chartered Financial Data Scientist

    The Chartered Financial Data Scientist program is organized by the Society of Investment Professionals in Germany. They first provide a training course conducted by the Swiss Training Centre for Investment Professionals. After completing this training, the candidates are allowed to earn this designation. It costs around 8,690 Euro. 

    Certified Analytics Professional

    This professional certification is offered by INFORMS. It is supported by the Canadian Operational Research Society and 3 more professional societies. There are various levels of certification. Each level has different eligibility requirements, from graduate to postgraduate etc. To earn this certification, the cost starts from 495 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres. It is valid for three years only.

    Cloudera Certified Associate Data Analyst

    This certification program is organized by Cloudera. It is more specific towards SQL and databases and is more suitable for Data Analysts. It costs around 295 US Dollar and there is no specific eligibility requirement for this certification. This certification is valid only for two years.

    EMC Proven Professional Data Scientist Associate

    This certification program is organized by Dell EMC. To earn this distinction, it is mandatory to attend a training program, either in-class or online. It costs around 230 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres.

    Open Certified Data Scientist

    It is organized by the Open Group. The members of the Open Group include HCL, Huawei, IBM, Oracle etc. There are 3 levels of this certification, and a different level of experience is required for each level. The cost for this certification starts from 295 US Dollar. To take this exam, the candidate needs to be available in person at the specified place.

    Senior Data Scientist

    This certification program is provided by the Data Science Council of America (DASCA). It requires 6+ years of experience of Big Data Analytics / Big Data Engineering. It costs around 650 US Dollar. This certification has 5 years of validity. 

    Principal Data Scientist

    This certification program is provided by the Data Science Council of America (DASCA). It requires 10+ years of experience of Big Data Analytics / Big Data Engineering. There are various tracks of this exam. It costs between 850-950 US Dollar depending on the track.

    SAS Certified Data Scientist

    It is organized by SAS. To get this certification, you need to first pass two other exams: SAS Big Data Professional and SAS Advanced Analytics Professional. Along with this, you need to take 18 courses as well. It costs around 4,400 US Dollar.

    Financial Data Professional 

    Financial Data Professional program is organized by Financial Data Professional Institute (FDPI). It is more suitable for financial professionals who apply AI and data science in finance. It opens the exam window with a fixed registration period. The cost of the FDP exam is 1350 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres.

    So, here we have listed the top certification exams in data science across the world. To choose from the list, a candidate should analyze the requirements in the coming future, the suitability of certification, contents covered in the exam so that it can meet the job requirements, exam cost, exam dates and time flexibility etc. The candidate should take one such certification which meets all their expectations instead of taking multiple certification exams. 


    Also, there are many more certifications provided by insurance bodies such as the IFoA and CAS, which are in development but need strong insurance domain knowledge.

    If you are a member of Pega Academy - then Pega has their own Data Science Program








    Machine Learning - Basic Starting Notes

    Machine Learning Problem Framing - 

    Define a  ML Problem and propose a solution

    1. Articulate a problem
    2. See if any labeled data exists
    3. Design your data for the model
    4. Determine where the data comes from
    5. Determine easily obtained inputs
    6. Determine quantifiable inputs


    We have major three types of models:

    1. Supervised Learning
    2. Un-Supervised Learning
    3. Reinforcement Learning : There is no requirement for labeled data, and the model acts like an agent which learns. It works on the foundation of a reward function. The challenge lies in defining a good reward function. Also, RL models are less stable and predictable than supervised approaches. Additionally, you need to provide a way for the agent to interact with the game to produce data, which means either building a physical agent that can interact with the real world or a virtual agent and a virtual world, either of which is a big challenge.


    Type of ML Problem | Description | Example
    Classification | Pick one of N labels | Cat, dog, horse, or bear
    Regression | Predict numerical values | Click-through rate
    Clustering | Group similar examples | Most relevant documents (unsupervised)
    Association rule learning | Infer likely association patterns in data | If you buy hamburger buns, you're likely to buy hamburgers (unsupervised)
    Structured output | Create complex output | Natural language parse trees, image recognition bounding boxes
    Ranking | Identify position on a scale or status | Search result ranking


    In traditional software engineering, you can reason from requirements to a workable design, but with machine learning, it will be necessary to experiment to find a workable model.

    Models will make mistakes that are difficult to debug, due to anything from skewed training data to unexpected interpretations of data during training. Furthermore, when machine-learned models are incorporated into products, the interactions can be complicated, making it difficult to predict and test all possible situations. These challenges require product teams to spend a lot of time figuring out what their machine learning systems are doing and how to improve them.


    Know the Problem Before Focusing on the Data

    If you understand the problem clearly, you should be able to list some potential solutions to test in order to generate the best model. Understand that you will likely have to try out a few solutions before you land on a good working model.

    Exploratory data analysis can help you understand your data, but you can't yet claim that patterns you find generalize until you check those patterns against previously unseen data. Failure to check could lead you in the wrong direction or reinforce stereotypes or bias.


    AI - 900 Azure AI fundamentals prep notes

    The Layers of AI

    • What is Artificial Intelligence (AI) ?
    • Machines that perform jobs that mimic human behavior.

    • What is Machine Learning (ML) ?
    • Machines that get better at a task without explicit programming. It is a subset of artificial intelligence that uses technologies (such as deep learning) that enable machines to use experience to improve at tasks. 

    • What is Deep Learning (DL) ?
    • Machines that have an artificial neural network inspired by the human brain to solve complex problems. It is a subset of machine learning that's based on artificial neural network.

    • What is a Data Scientist ?
    • A person with Multi-Disciplinary skills in math, statistics, predictive modeling and machine learning to make future predictions.

     

    Principle of AI

    Challenges and Risks with AI
    • Bias can affect results
    • Errors can cause harm
    • Data could be exposed
    • Solutions may not work for everyone
    • Users must trust a complex system
    • Who's liable for AI driven decision ?

    1.  Reliability and Safety : Ensure that AI systems operate as they were originally designed, respond to unanticipated conditions and resist harmful manipulation. If AI is making mistakes, it is important to release a report quantifying risks and harms to end-users so they are informed of the shortcomings of an AI solution.

    • AI-based software application development must be subjected to rigorous testing and deployment management processes to ensure that they work as expected before release.
    • Good example: developing an AI system for a self-driving car.

    2.  Fairness : Implementing processes to ensure that decisions made by AI systems can be overridden by humans.

    • Harm of Allocation : AI Systems that are used to Allocate or Withhold:
      • Opportunities
      • Resources
      • Information
    • Harm of Quality-of-Service : AI systems can reinforce existing stereotypes.
      • An AI system may not work as well for one group of people as it does for another. An example is a voice recognition system which works well for men but not well for women.
    • Reduce bias in the model, as we live in an unfair world.
      • Fairlearn is an open-source Python package that allows machine learning system developers to assess their systems' fairness and mitigate the observed fairness issues.

    3.  Privacy and Security : Provide customers with information and controls over the collection, use and storage of the data. 

    • Example: On device machine learning
    • AI security Aspects: Data Origin and Lineage , Data Use : Internal vs External
    • The Anomaly Detection API is a good example for the above use case.

    4.  Inclusiveness: AI systems should empower everyone and engage people especially minority groups based on:

    • Physical Ability
    • Gender
    • Sexual orientation
    • Ethnicity
    • Other factors
    • Microsoft Statement- "We firmly believe everyone should benefit from intelligent technology, meaning it must incorporate and address a broad range of human needs and experiences. For the 1 billion people with disabilities around the world, AI technologies can be a game-changer."

    5. Transparency : AI systems should be understandable. Interpretability / intelligibility is when end-users can understand the behavior of AI. Adopting an open-source framework for AI can provide transparency (at least from the technical perspective) on the internal workings of an AI system.

    • AI systems should be understandable. Users should be made fully aware of the purpose of the system, how it works, and what limitations may be expected.
    • Example : Detail Documentation of Code for debugging 

    6. Accountability : People should be responsible for AI systems. Structures should be put in place to consistently enact AI principles and take them into account. AI systems should work with the:

    • A framework of governance
    • Organizational principles
    • Ethical and legal standards that are clearly defined
    • AI-based solutions must meet ethical and legal standards that protect people's civil liberties, and must work within a framework of governance and organizational principles.
    • Designers and developers of AI-based solutions should work within a framework of governance and organizational principles that ensures the solution meets clearly defined ethical and legal standards.
    • Ensure that AI systems are not the final authority on any decision that impacts people's lives and that humans maintain meaningful control over otherwise highly autonomous AI systems.

    Dataset : A dataset is a logical grouping of units of data that are closely related and/or share the same data structure.

    Data labeling : process of identifying raw data and adding one or more meaningful and informative labels to provide context so that a machine learning model can learn.

    Ground Truth : a properly labeled dataset that you use as the objective standard to train and assess a given model is often called the ‘ground truth’. The accuracy of your trained model will depend on the accuracy of the ground truth.


    Machine learning in Microsoft Azure

    Microsoft Azure provides the Azure Machine Learning service - a cloud-based platform for creating, managing, and publishing machine learning models. Azure Machine Learning provides the following features and capabilities:

    Feature : Capability

    • Automated machine learning : Enables non-experts to quickly create an effective machine learning model from data.
    • Azure Machine Learning designer : A graphical interface enabling no-code development of machine learning solutions.
    • Data and compute management : Cloud-based data storage and compute resources that professional data scientists can use to run data experiment code at scale.
    • Pipelines : Data scientists, software engineers, and IT operations professionals can define pipelines to orchestrate model training, deployment, and management tasks.

    Other Features of Azure Machine Learning Services :

    A service that simplifies running AI/ML-related workloads and allows you to build flexible, automated ML pipelines. It supports Python and R, and can run deep learning workloads with frameworks such as TensorFlow.

    1. Jupyter Notebooks
    • Build and document your machine learning models as you create them, and share and collaborate on them.

    2. Azure Machine Learning SDK for Python

    • An SDK designed specifically to interact with the Azure Machine Learning service.

    3. MLOps

    • End-to-end automation of ML model pipelines, e.g. CI/CD, training, and inference.

    4. Azure Machine Learning Designer

    • A drag-and-drop interface to visually build, test, and deploy machine learning models.

    5. Data Labeling Service

    • Assemble a team of human labelers to label your training data.

    6. Responsible Machine Learning

    • Assess model fairness through disparity metrics and mitigate unfairness.

    Performance/Evaluation Metrics are used to evaluate different Machine Learning Algorithms

    For different types of problems, different metrics matter:

    • Classification Metrics (accuracy, precision, recall, F1-Score, ROC, AUC)
    • Regression Metrics (MSE, RMSE, MAE)
    • Ranking Metrics (MRR, DCG, NDCG)
    • Statistical Models (Correlation)
    • Computer Vision Models (PSNR, SSIM, IoU)
    • NLP Metrics (Perplexity, BLEU, METEOR, ROUGE)
    • Deep Learning Related Metrics (Inception Score, Frechet Inception Distance)

    There are two categories of evaluation metrics:

    • Internal Metrics : metrics used to evaluate the internals of the ML Model
      • The Famous Four - Accuracy, Precision, Recall, F1-Score 
    • External Metrics : metrics used to evaluate the final prediction of the ML Model
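
    As a minimal sketch of the "famous four" (assuming hypothetical 0/1 vectors actual and predicted, used only for illustration; replace them with your own labels), the classification metrics can be computed directly from a confusion matrix in base R:

    # Minimal sketch: the "famous four" classification metrics from a confusion matrix.
    # `actual` and `predicted` are hypothetical 0/1 vectors used only for illustration.
    actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
    predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)

    cm <- table(Predicted = predicted, Actual = actual) # confusion matrix

    TP <- cm["1", "1"]; TN <- cm["0", "0"]
    FP <- cm["1", "0"]; FN <- cm["0", "1"]

    accuracy  <- (TP + TN) / sum(cm)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    f1        <- 2 * precision * recall / (precision + recall)

    c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)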

    Random Forest Model and find the most important variables using R

    One of the benefits of using a Random Forest model is:

    1. In regression, when the explanatory variables may be highly correlated with each other, the Random Forest approach really helps in understanding feature importance. The trick is that Random Forest selects a random subset of the explanatory variables at each split in the learning process, i.e., it trains each split on a random subset of the features instead of the full set of features. This is called feature bagging. This process reduces the correlation between trees: without it, the strongest predictors would be selected by many of the trees, making the trees correlated.
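
    A minimal sketch of the feature-bagging idea (assuming, as in the code further below, a data frame `data` whose response column is named Target): the mtry argument of randomForest() controls how many randomly chosen predictors are considered at each split.

    library(randomForest)

    # Feature bagging: only `mtry` randomly selected predictors are considered at each
    # split, which de-correlates the individual trees in the forest.
    p <- ncol(data) - 1 # number of predictors (`data` and its Target column are assumed, as in the examples below)

    rf_default <- randomForest(Target ~ ., data = data) # default mtry (p/3 for regression, sqrt(p) for classification)
    rf_small   <- randomForest(Target ~ ., data = data, mtry = max(1, floor(p / 5))) # a smaller subset per split

    importance(rf_default) # compare how the importance ranking shifts between the two settings
    importance(rf_small)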

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    How to find the most important variables in R

    Find the most important variables that contribute most significantly to a response variable

    Selecting the most important predictor variables, i.e., those that explain the major part of the variance of the response variable, can be key to identifying and building high-performing models.

    1. Random Forest Method

    Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.

    library(caret)
    library(randomForest)
    library(varImp)

    regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters

    varImp(regressor) # get variable importance, based on mean decrease in accuracy

    varImp(regressor, conditional = TRUE) # conditional = TRUE adjusts for correlations between predictors

    varimpAUC(regressor) # AUC-based importance, more robust towards class imbalance (classification only)

    # Note: conditional and AUC-based importance are designed for conditional inference
    # forests (party::cforest); with a plain randomForest fit they may not be available.


    2. xgboost Method

    library(caret)
    library(xgboost)

    regressor <- train(Target ~ ., data = data, method = "xgbTree",
                       trControl = trainControl("cv", number = 10))

    varImp(regressor) # variable importance from the fitted xgboost model


    3. Relative Importance Method

    Using calc.relimp {relaimpo}, the relative importance of variables fed into lm model can be determined as a relative percentage.

    library(relaimpo)

    regressor <- lm(Target ~ . , data = data) # fit lm() model

    relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE) # calculate relative importance scaled to 100

    sort(relImportance$lmg, decreasing = TRUE) # relative importance of each predictor


    4. MARS (earth package) Method

    The earth package implements variable importance based on Generalized cross validation (GCV), number of subset models the variable occurs (nsubsets) and residual sum of squares (RSS).

    library(earth)

    regressor <- earth(Target ~ . , data = data) # build the MARS model

    ev <- evimp(regressor) # estimate variable importance

    plot(ev)

    5. Step-wise Regression Method

    If you have a large number of predictors, split the data into chunks of 10 predictors, with each chunk also holding the response variable.

    base.mod <- lm(Target ~ 1, data = data) # base intercept-only model

    all.mod <- lm(Target ~ . , data = data) # full model with all predictors

    stepMod <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = "both", trace = 1, steps = 1000) # perform the step-wise algorithm

    shortlistedVars <- names(unlist(stepMod[[1]])) # get the shortlisted variables (stepMod[[1]] is the coefficient vector)

    shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove the intercept
    

    The output might include levels within categorical variables, since ‘stepwise’ is a linear regression based technique.

    If you have a large number of predictor variables, the above code may need to be placed in a loop that runs stepwise on sequential chunks of predictors; a sketch of such a loop is shown below. The shortlisted variables can be accumulated for further analysis at the end of each iteration. This can be a very effective method if you want to:

    • Be highly selective about discarding valuable predictor variables.

    • Build multiple models on the response variable.
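
    A minimal sketch of such a loop (assuming `data` holds the response in a column named Target and that chunks of 10 predictors are acceptable); the shortlisted variables from each chunk are accumulated for a final pass:

    predictorVars <- setdiff(names(data), "Target")
    chunks <- split(predictorVars, ceiling(seq_along(predictorVars) / 10)) # chunks of 10 predictors

    shortlisted <- character(0)
    for (vars in chunks) {
      chunk_data <- data[, c(vars, "Target")]
      base.mod <- lm(Target ~ 1, data = chunk_data) # intercept-only model for this chunk
      all.mod  <- lm(Target ~ ., data = chunk_data) # all predictors in this chunk
      stepMod  <- step(base.mod, scope = list(lower = base.mod, upper = all.mod),
                       direction = "both", trace = 0, steps = 1000)
      vars_kept <- setdiff(names(coef(stepMod)), "(Intercept)")
      shortlisted <- union(shortlisted, vars_kept) # accumulate the shortlist
    }

    shortlisted # candidate predictors for a final model built on the full shortlist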


    6. Boruta Method

    The ‘Boruta’ method can be used to decide if a variable is important or not.

    library(Boruta)

    # Decide if a variable is important or not using Boruta
    boruta_output <- Boruta(Target ~ . , data = data, doTrace = 2) # perform Boruta search

    boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables

    # for faster calculation (classification only)
    library(rFerns)

    boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2, getImp = getImpFerns, holdHistory = FALSE)
    boruta.train

    boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")]) # collect Confirmed and Tentative variables
    boruta_signif

    getSelectedAttributes(boruta_output, withTentative = FALSE) # selected attributes (takes the Boruta object, not the names vector)

    boruta.df <- attStats(boruta_output) # importance statistics per attribute (needs the run that kept the importance history)
    print(boruta.df)
    

    7. Information value and Weight of evidence Method

    library(devtools)
    library(woe)
    library(riv)

    iv_df <- iv.mult(data, y = "Target", summary = TRUE, verbose = TRUE) # information value summary
    iv <- iv.mult(data, y = "Target", summary = FALSE, verbose = TRUE)

    iv_df

    iv.plot.summary(iv_df) # plot information value summary

    # Calculate weight-of-evidence variables
    data_iv <- iv.replace.woe(data, iv, verbose = TRUE) # add WOE variables to the original data frame
    

    The newly created WOE variables can then be used in place of the original factor variables.


    8. Learning Vector Quantization (LVQ) Method

    library(caret)

    control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

    # train the model (the "lvq" method is classification-only, so Target should be a factor)
    regressor <- train(Target ~ ., data = data, method = "lvq", preProcess = "scale", trControl = control)

    # estimate variable importance
    importance <- varImp(regressor, scale = FALSE)
    print(importance)
    

    9. Recursive Feature Elimination (RFE) Method

    library(caret)

    n <- ncol(data) # assuming the Target is the last column of data

    # define the control using a random forest selection function
    control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

    # run the RFE algorithm (note the parentheses: 1:(n-1), not 1:n-1)
    results <- rfe(data[, 1:(n - 1)], data[, n], sizes = c(1:8), rfeControl = control)

    # summarize the results
    print(results)

    # list the chosen features
    predictors(results)

    # plot the results
    plot(results, type = c("g", "o"))
    

    10. DALEX Method

    library(randomForest)
    library(DALEX)

    regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters

    # Variable importance with DALEX
    explained_rf <- explain(regressor, data = data, y = data$Target)

    # Get the variable importances (newer DALEX versions expose this as model_parts())
    varimps <- variable_dropout(explained_rf, type = 'raw')

    print(varimps)
    plot(varimps)
    

    11. VITA Method

    library(randomForest)
    library(vita)

    regressor <- randomForest(Target ~ . , data = data, importance = TRUE) # fit the random forest with default parameters

    # PIMP expects the predictors and the response separately, so drop Target from X
    X <- data[, setdiff(names(data), "Target")]
    pimp.varImp.reg <- PIMP(X, data$Target, regressor, S = 10, parallel = TRUE)
    pimp.varImp.reg

    pimp.varImp.reg$VarImp
    sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
    


    12. Genetic Algorithm

    library(caret)

    n <- ncol(data) # assuming the Target is the last column of data

    # Define the control function
    ga_ctrl <- gafsControl(functions = rfGA, # another option is `caretGA`
                           method = "cv",
                           repeats = 3)

    # Genetic Algorithm feature selection
    ga_obj <- gafs(x = data[, 1:(n - 1)],
                   y = data[, n],
                   iters = 3, # normally much higher (100+)
                   gafsControl = ga_ctrl)

    ga_obj

    # Optimal variables
    ga_obj$optVariables
    


    13. Simulated Annealing

    library(caret)

    n <- ncol(data) # assuming the Target is the last column of data

    # Define the control function
    sa_ctrl <- safsControl(functions = rfSA,
                           method = "repeatedcv",
                           repeats = 3,
                           improve = 5) # number of iterations without improvement before a reset

    # Simulated Annealing feature selection
    set.seed(100)
    sa_obj <- safs(x = data[, 1:(n - 1)],
                   y = data[, n],
                   safsControl = sa_ctrl)

    sa_obj

    # Optimal variables
    print(sa_obj$optVariables)
    
    
    

    14. Correlation Method

    library(caret)

    n <- ncol(data) # assuming the Target is the last column of data

    # calculate the correlation matrix of the predictors
    correlationMatrix <- cor(data[, 1:(n - 1)])

    # summarize the correlation matrix
    print(correlationMatrix)

    # find attributes that are highly correlated (cutoff = 0.5 here; 0.75 is another common choice)
    highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)

    # print indexes of highly correlated attributes
    print(highlyCorrelated)
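
    As a small follow-up sketch (using the same `data`, `n`, and `highlyCorrelated` objects from above), the flagged predictors can simply be dropped before modeling:

    # drop the highly correlated predictors identified by findCorrelation()
    # (the indexes refer to the predictor columns 1:(n-1), so Target in the last column is kept)
    data_reduced <- data[, -highlyCorrelated]
    dim(data_reduced) # fewer predictor columns than the original data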