Markov Boundary Feature Selection - Causal Feature Selection

GitHub page of Sisi Ma : https://github.com/SisiMa1729/Causal_Feature_Selection

GitHub demo on lung cancer : https://github.com/SisiMa1729/Causal_Feature_Selection/tree/master/Demo/Bhattacharjee2001

GitHub demo on survival analysis : https://github.com/SisiMa1729/Causal_Feature_Selection/tree/master/Demo/Vijver2002

Book chapter to refer to : Guyon I, Aliferis C. Causal feature selection. In: Computational Methods of Feature Selection. Chapman and Hall/CRC; 2007 (pp. 75-97).

Predictive modeling uses statistics to predict an outcome : it could be the raw data, the standard deviation of the data, or in fact any statistic that carries information about the outcome.

Types of models :

Diagnostic models : determining sub-types of breast cancer

Prognostic models : predicting time-to-death / time-to-relapse for cancer patients

Risk assessment models : predicting the risk of PTSD after trauma

The goal of a model is to use a high-dimensional feature space to predict an outcome. Two problems arise: (1) underfitting and (2) overfitting. Overfitting is the more severe problem when the number of features is high and the sample size is limited or small.

Overfitting is dealt with via regularization : support vector machines, random forests and lasso regression all have built-in regularization that penalizes model complexity. An overfitted model may not work well on new data.

Feature Selection : Selecting a subset of features from all the available features for constructing the predictive model.

Question : How many subsets can I make out of N total features?

Answer : 2^N distinct feature subsets.

Follow-up question : Which subset of features would be the best to predict the outcome?

Answer : This can be found by cross-validating the candidate subsets.
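As a toy illustration of cross-validating subsets, here is a minimal sketch assuming scikit-learn and a small synthetic dataset; exhaustive search like this is only feasible for a handful of features, since there are 2^N candidate subsets:

```python
# Toy sketch: score every feature subset with cross-validation and keep the best.
# Only feasible for small N, since there are 2^N candidate subsets.
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

best_score, best_subset = -np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        # 5-fold cross-validated accuracy of a simple model on this subset.
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best subset: {best_subset}, CV accuracy: {best_score:.3f}")
```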

Goals of Feature Selection : 

  • Improve model predictive performance
  • Enhance model interpretability 
  • Increase cost-efficiency of model development


For selecting features, an intuitive and popular strategy is to select features that are univariately associated with the target of interest.

However, features that are useless for prediction by themselves can become useful when combined with other features (i.e. interactions among features can be predictive). Selection based on the predictivity/relevance of individual features may therefore miss useful features.

It is critical to consider combinations of features and the information they carry jointly with respect to the target of interest. (This looks to me like an LDA or supervised-PCA style approach; need to confirm.)
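A toy illustration of this point on hypothetical synthetic data: each of two binary features is individually uninformative about the target, yet together they determine it exactly (an XOR relationship):

```python
# Synthetic illustration: two features that are individually useless for
# predicting T but perfectly predictive when combined (T = XOR(X1, X2)).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10_000)
x2 = rng.integers(0, 2, size=10_000)
t = x1 ^ x2  # the target depends only on the interaction of x1 and x2

# Univariate association with T is essentially zero for both features ...
print(np.corrcoef(x1, t)[0, 1], np.corrcoef(x2, t)[0, 1])
# ... yet the two features together determine T exactly.
print(np.mean((x1 ^ x2) == t))  # 1.0
```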


What would be really nice is if we could determine in advance how small the optimal feature set will be.


If a feature is strongly relevant, it means that the part of the information it provides about the target cannot be replaced by any other feature.

 

By defining features as strongly relevant, weakly relevant or irrelevant, we can formally put every feature into one of three categories and define a selection strategy.

Weak features do not improve predictive performance if all strong features are already selected, but keeping a few weak features does not deteriorate the performance either.
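For reference, the three categories can be written formally (roughly following Kohavi & John's definitions, which the chapter cited above also uses). Here X_i is a feature, X_{-i} the set of all remaining features, and T the target:

```latex
% Strong / weak relevance and irrelevance of a feature X_i for target T.
\begin{align*}
\text{strongly relevant:} \quad & P(T \mid X_i, X_{-i}) \neq P(T \mid X_{-i}) \\
\text{weakly relevant:}   \quad & \text{not strongly relevant, and } \exists\, S \subseteq X_{-i}:\;
                                  P(T \mid X_i, S) \neq P(T \mid S) \\
\text{irrelevant:}        \quad & \forall\, S \subseteq X_{-i}:\; P(T \mid X_i, S) = P(T \mid S)
\end{align*}
```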


How to select features: methodology

  • In recursive feature selection we rank the features by importance/strength using general-purpose methods such as the following (a sketch follows this list);
    • Support vector machines can be used for feature selection: the coefficients on normalized features indicate feature importance.
    • Random forests provide Gini importance scores, which help separate strong vs. weak vs. irrelevant features.
    • This approach requires a nested cross-validation setup to extract the features.
    • Sometimes features/variables are dropped/eliminated one at a time, but we should also try dropping multiple weak or irrelevant features at once, with cross-validation for testing, to get to the optimal subset.
    • Lasso regularization presses for a simple model by penalizing complex models.
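A minimal sketch of this kind of recursive elimination, assuming scikit-learn and a synthetic dataset: RFECV repeatedly drops the lowest-weight features of a linear SVM and uses cross-validation to pick the best subset size:

```python
# Minimal sketch: recursive feature elimination with cross-validation (RFECV),
# ranking features by the coefficients of a linear SVM (assumes scikit-learn;
# the dataset is synthetic).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# RFECV repeatedly drops the lowest-weight features and uses cross-validation
# to choose the subset size with the best validation performance.
selector = RFECV(LinearSVC(C=1.0, max_iter=10_000), step=1, cv=5)
selector.fit(X, y)

print("Number of features selected:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```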


Recursive feature selection, random forests, SVMs and lasso regression are non-causal feature selection methods: they try to maximize predictive performance while minimizing the feature subset used by the model. These methods do not directly use the concept of different types of feature relevance, but in practice they produce models with excellent predictive performance in various application domains.

But is this the optimal subset? The above methods do not check whether weak or irrelevant features might become relevant in the presence of other features. This motivates causal feature selection methods. With causal feature selection we can improve interpretability, but not necessarily predictive performance, because the non-causal methods above already produce models with high predictive performance. With causal feature selection we are trying to see whether we can determine the optimal feature set with similar predictive performance, such that all other features become independent of the target (that is, the variables outside the Markov boundary provide no additional information).

Finding the Markov boundary is an NP-complete problem, even for linear regression.

The Markov boundary (MB) depends on the learner and on the evaluation metric: the MB can change if the learning algorithm changes or if the evaluation metric/threshold changes.


Causal Feature Selection

Causal Feature Selection (CFS) works with the concepts of feature relevance and the joint distribution. 

Ideally we would know the data-generating process when using CFS: we could then draw the causal structure among the variables and obtain the joint distribution.

The graph below shows that, if we know the data-generating process, CFS can find the optimal subset of features to describe the variable of interest (i.e. the target).

For the above graph, if I know X1, X2, X3, X4 and X5 then I do not need to know the other variables: information flows through these variables, which lie inside the Markov boundary and explain the variable of interest (i.e. the target). The information from variables outside the Markov boundary is already embedded in the variables inside it.

The Markov boundary (direct causes + direct effects + direct causes of the effects) is the minimal feature set that contains all the information regarding the target of interest. Causal feature selection is also sometimes identified with the Markov boundary; the two overlap when a special condition is met.

In the above image the direct causes are X1, X2 and X3, and the direct effects are X4 and X5. These variables give the best performance because they capture all the information about the target variable.
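When the causal DAG is known, the Markov boundary can be read directly off the graph. A minimal sketch, assuming networkx; the edges below are hypothetical and only mirror the X1–X5 example described above:

```python
# Sketch: reading the Markov boundary (parents + children + spouses) of the
# target off a known causal DAG. Assumes networkx; the edges below are
# hypothetical, chosen to mirror the X1..X5 example described in the text.
import networkx as nx

G = nx.DiGraph([
    ("X1", "T"), ("X2", "T"), ("X3", "T"),    # direct causes of T
    ("T", "X4"), ("T", "X5"),                 # direct effects of T
    ("X6", "X1"), ("X5", "X7"), ("X4", "X8")  # variables outside the boundary
])

def markov_boundary(g: nx.DiGraph, target) -> set:
    parents = set(g.predecessors(target))
    children = set(g.successors(target))
    # Spouses: the other parents of the target's children (direct causes of effects).
    spouses = {p for c in children for p in g.predecessors(c)} - {target}
    return parents | children | spouses

print(markov_boundary(G, "T"))  # {'X1', 'X2', 'X3', 'X4', 'X5'} (set order varies)
```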

Benefits of Causal Feature Selection:

  • Selected features have causal interpretations.
  • The selected feature set is generally smaller than with non-causal methods such as random forests, SVMs and lasso regression.
  • Models generalize better under certain types of distribution shift.


Causal feature selection selects variables that are closely situated around the target and does a better job than non-causal feature selection. In the image below we see an example of lasso (recursive feature selection), where the selected variables are spread all around the network, in comparison to causal feature selection, shown in the small zoomed-in box on the right-hand side of the image. The Markov blanket / causal feature selection tries to find the features that make all other features independent of the target. These features are neighbors of the target variable and cannot be blocked, since they have parent-child relationships with it.

Benefits of Causal Feature Selection in comparison to Non Causal Feature Selection


Assumptions:

  1. Markov condition : every variable is independent of its non-descendants given its parents.
  2. Faithfulness assumption : (in)dependencies stem only from the network structure, not from the parameterization of the distribution.
    1. Some independencies are explicitly stated by the Markov condition; others are entailed from them using probability theory.
    2. In Bayesian networks, the d-separation criterion determines all conditional (in)dependencies entailed by the Markov condition: it identifies the paths through which information passes (open paths) and the paths through which it does not (blocked or closed paths). A small check of this is sketched after this list.
    3. Under the faithfulness assumption the Markov blanket / boundary is unique; the variables inside it cannot be replaced.
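A small check of the d-separation point above, on the same hypothetical DAG as in the earlier sketch; this assumes networkx exposes a d-separation routine (nx.is_d_separator in newer releases, nx.d_separated in older ones):

```python
# Sketch: checking d-separation statements entailed by the Markov condition on
# the same hypothetical DAG as above. Assumes networkx exposes a d-separation
# routine (nx.is_d_separator in newer releases, nx.d_separated in older ones).
import networkx as nx

G = nx.DiGraph([
    ("X1", "T"), ("X2", "T"), ("X3", "T"),
    ("T", "X4"), ("T", "X5"),
    ("X6", "X1"), ("X5", "X7"), ("X4", "X8"),
])
mb = {"X1", "X2", "X3", "X4", "X5"}          # Markov boundary of T in this DAG

d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

# X6 lies outside the boundary: conditioning on the boundary blocks (closes)
# every path from X6 to T, so X6 carries no additional information about T.
print(d_sep(G, {"X6"}, {"T"}, mb))           # True
# Without conditioning, the path X6 -> X1 -> T is open.
print(d_sep(G, {"X6"}, {"T"}, set()))        # False
```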


We need to learn about the partial correlation test to understand whether a variable is connected to another variable through an intermediate variable.

Partial correlation looks at residuals: we regress each of the two variables on the conditioning variable(s), and if the residuals are still correlated there is remaining structure, i.e. a relationship not explained by the conditioning variable(s).
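A minimal sketch of this residual view of partial correlation, on synthetic data (numpy only): X and Y are both driven by Z, so they are marginally correlated but conditionally independent given Z:

```python
# Sketch: partial correlation via residuals, on synthetic data (numpy only).
# X and Y are both driven by Z, so they are marginally correlated, but the
# correlation of their residuals after regressing out Z is near zero.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)   # X depends on Z
y = -1.0 * z + rng.normal(size=n)  # Y depends on Z, but not directly on X

def residuals(a, cond):
    """Residuals of a least-squares regression of `a` on `cond` (with intercept)."""
    design = np.column_stack([np.ones_like(cond), cond])
    beta, *_ = np.linalg.lstsq(design, a, rcond=None)
    return a - design @ beta

r_marginal = np.corrcoef(x, y)[0, 1]
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(f"marginal corr ~ {r_marginal:.2f}, partial corr given Z ~ {r_partial:.2f}")
```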

How to find the Markov boundary, i.e. the strongly relevant features:

Conditional Independence Tests: (What this means is that two dependent quantities can become conditionally independent when we learn about a third variable. The opposite can also happen, with two independent quantities becoming conditionally dependent in the context of other variables.) Common tests include (a sketch of one such test follows this list):
  • Fisher's test
  • X^2 (chi-square) and G^2 (G-square) for categorical variables
  • Conditional mutual information and distance correlation (for non-linear tests)
  • Comparison of nested models
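One common instantiation for continuous variables is Fisher's z-test of partial correlation. A minimal sketch, assuming numpy and scipy; the data are synthetic:

```python
# Sketch: Fisher's z-test of partial correlation as a conditional-independence
# test for continuous variables (assumes numpy and scipy; data are synthetic).
import numpy as np
from scipy import stats

def fisher_z_ci_test(x, y, Z=None):
    """p-value for H0: X is independent of Y given the columns of Z."""
    n = len(x)
    if Z is None:
        Z = np.empty((n, 0))
    design = np.column_stack([np.ones(n), Z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = np.clip(np.corrcoef(rx, ry)[0, 1], -0.9999, 0.9999)
    # Fisher z-transform: the statistic is approximately standard normal under H0.
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - Z.shape[1] - 3)
    return 2 * stats.norm.sf(abs(z_stat))

rng = np.random.default_rng(1)
c = rng.normal(size=1_000)
x = c + rng.normal(size=1_000)
y = c + rng.normal(size=1_000)
print(fisher_z_ci_test(x, y))                      # small p-value: dependent
print(fisher_z_ci_test(x, y, c.reshape(-1, 1)))    # large p-value: independent given C
```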


The statistical tests above can be read as follows:

  • A small p-value means dependence is highly likely.
  • A large p-value means independence, or that we simply don't know.
  • Conditioning on large sets makes the p-values unreliable (the test loses statistical power).


Some algorithms implemented are:

Forward-Backward Selection Algorithm

Working (a sketch in code follows this list):

  • We start the forward selection with all variables and compute, from the data, the p-value of each variable with the target T.
  • Select the variable with the smallest p-value.
    • At every step, recompute the p-values of all the other variables, conditioned on the selected ones.
    • Try to skip the variables that have become independent, as indicated by large p-values.
  • Then, in the backward phase, we remove the false-positive variables that were selected in the forward phase.
    • A variable is removed if it has become independent of the target with respect to the other selected variables, again by checking for a large p-value.
  • The entire algorithm works by calculating statistical p-values from the data.
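A minimal, illustrative sketch of this forward-backward loop for continuous data, using a Fisher-z partial-correlation p-value as the conditional independence test; this is a reconstruction from the description above, not the MXM implementation:

```python
# Illustrative reconstruction of forward-backward selection for continuous data,
# using a Fisher-z partial-correlation p-value as the CI test (numpy + scipy).
# This is a sketch of the idea described above, not the MXM package's code.
import numpy as np
from scipy import stats

def ci_pvalue(x, y, Z):
    """p-value for: x independent of y given the columns of Z (Fisher z-test)."""
    n = len(x)
    design = np.column_stack([np.ones(n), Z]) if Z.size else np.ones((n, 1))
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = np.clip(np.corrcoef(rx, ry)[0, 1], -0.9999, 0.9999)
    return 2 * stats.norm.sf(abs(0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - Z.shape[1] - 3)))

def forward_backward(X, t, alpha=0.05):
    selected = []
    # Forward phase: repeatedly add the feature with the smallest p-value,
    # conditioning on the features already selected.
    while True:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        pvals = {j: ci_pvalue(X[:, j], t, X[:, selected]) for j in remaining}
        if not pvals:
            break
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:          # nothing left that looks dependent
            break
        selected.append(best)
    # Backward phase: drop false positives that are now conditionally
    # independent of t given the other selected features.
    for j in list(selected):
        rest = [k for k in selected if k != j]
        if ci_pvalue(X[:, j], t, X[:, rest]) >= alpha:
            selected.remove(j)
    return selected

# Toy data: t depends on features 0 and 1; feature 2 is a noisy copy of feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=1_000)
t = X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=1_000)
print(forward_backward(X, t))  # expect features 0 and 1 (order may vary)
```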

Explanation:

  • Simple, easy, general algorithm.
  • Suitable when one has the statistical power to condition on all the selected features.
  • Theorem: Forward-Backward Search returns the Markov blanket of T in distributions faithful to a Bayesian network with latent variables, given perfect tests of conditional independence (CI).
  • Complexity (in number of CI tests): O(n·s), where n is the total number of features and s is the number of selected features.
  • Rediscovered as Incremental Association Markov Blanket (IAMB; Tsamardinos et al. 2003a) and Grow-Shrink (Margaritis & Thrun, 2000), the latter with a static ordering of features.
  • When one has enough sample size to condition on all features, just perform Backward Search.
  • A variation of Forward-Backward Selection is Forward-Backward Selection with Early Dropping (single run) (Borboudakis & Tsamardinos, 2017).
    • The disadvantage of this faster algorithm is the appearance of false negatives.

Max-Min Parents and Children (MMPC) : Conditioning on Subsets (Tsamardinos et al. 2003b) 

  • For each candidate variable, conditional independence with the target is checked given subsets (of size up to k) of the already selected variables; if the resulting p-value is high, the variable is dropped.
  • k is typically kept small (values greater than 5 are rarely needed).
  • It may return features that are not parents or children (false positives), but it is computationally faster.
  • The algorithm is greedy in some sense.
  • Explanation (a sketch of the max-min heuristic follows this list):
    • Suitable when one has the statistical power to condition only on up to k features.
    • Theorem: MMPC (with symmetry correction and large enough k) returns the neighbors (parents and children) of T in distributions faithful to a Bayesian network with latent variables.
      • Without the symmetry correction it returns a small superset.
    • A simple extension (MMMB) returns the full Markov blanket.
    • Complexity (in number of CI tests): O(n·s^k).
    • Typically conditioning on only k = 3, 4 suffices for excellent results.
    • Max-Min heuristic: select the variable that has the largest minimum conditional association (or the smallest maximum p-value) with T, conditioned on all possible subsets of the already selected features.
      • The smallest maximum p-value minimizes a bound on the false discovery rate of the neighbors (Tsamardinos and Brown 2008).
    • Works very well for binary prediction, survival analysis, and dimensionality reduction in high-dimensional data such as genomics.
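An illustrative sketch of the max-min heuristic described above (forward phase only, continuous data, Fisher-z test); again a reconstruction, not the MXM implementation:

```python
# Illustrative reconstruction of the max-min heuristic (forward phase of MMPC)
# for continuous data, again using a Fisher-z CI test (numpy + scipy).
# A candidate's score is its *largest* p-value over conditioning subsets (up to
# size k) of the selected set; the candidate with the *smallest* such maximum
# is added next. Not the MXM package's implementation.
from itertools import chain, combinations

import numpy as np
from scipy import stats

def ci_pvalue(x, y, Z):
    """Fisher-z partial-correlation p-value, as in the earlier sketch."""
    n = len(x)
    design = np.column_stack([np.ones(n), Z]) if Z.size else np.ones((n, 1))
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r = np.clip(np.corrcoef(rx, ry)[0, 1], -0.9999, 0.9999)
    return 2 * stats.norm.sf(abs(0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - Z.shape[1] - 3)))

def subsets_up_to(items, k):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(min(k, len(items)) + 1))

def mmpc_forward(X, t, k=3, alpha=0.05):
    selected, candidates = [], set(range(X.shape[1]))
    while candidates:
        # Worst-case association: the maximum p-value over all conditioning subsets.
        max_p = {j: max(ci_pvalue(X[:, j], t, X[:, list(s)]) for s in subsets_up_to(selected, k))
                 for j in candidates}
        best = min(max_p, key=max_p.get)      # the max-min choice
        if max_p[best] >= alpha:              # even the best candidate looks independent
            break
        selected.append(best)
        candidates = {j for j in candidates if j != best and max_p[j] < alpha}
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))
t = X[:, 0] - X[:, 3] + 0.5 * rng.normal(size=1_000)
print(mmpc_forward(X, t))  # expect features 0 and 3 (order may vary)
```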

The PC-Simple algorithm is the simplest implementation:

  • It examines the univariate association (or correlation) of every variable with the target.
  • It removes all variables that are not univariately associated with the target.
  • It then runs conditional independence tests of each remaining variable conditioned on one other variable.
  • It then iterates with conditional independence tests conditioned on sets of two or more variables.

In the end the Markov boundary gives the variables B, C, F and H to explain the target T.


The R package is MXM. The algorithms implemented in the package include Backward Search, Forward-Backward Search, FBED, MMPC, MMMB, and SES (for multiple solutions).


For very large data, the challenge is how to compute the local p-values efficiently.

  1. Early Dropping
    1. As in Forward-Backward Selection with Early Dropping.
    2. Filter out features as soon as they are deemed conditionally independent of T.
  2. Early Stopping
    1. Stop computing local statistics as soon as it is deemed that a feature will not be selected for inclusion/exclusion in the forward/backward phase.
  3. Early Return
    1. Stop computations as soon as enough samples have been seen to determine a good-enough feature for inclusion/exclusion.

Causal feature selection models help us understand the feature selection problem in a non-parametric way, and they are arguably the main tool for knowledge discovery and for designing new algorithms.