Top AI Certifications in 2024

Top AI Certifications for 2024. In the ever-changing world of… | by Philip Smith | Blockchain Council | Nov, 2023 | Medium

10 Valuable Artificial Intelligence Certifications for 2024 (analyticsinsight.net)

10 AI Certifications for 2024: Build Your Skills and Career | Upwork


Intel® Edge AI Certification

Jetson AI Courses and Certifications | NVIDIA Developer

Microsoft Certified: Azure AI Engineer Associate - Certifications | Microsoft Learn

Artificial Intelligence Certification | AI Certification | ARTIBA

Certified Artificial Intelligence Scientist | CAIS™ | USAII®

Good introduction (draft cover email):

Dear Norma,

I hope this email finds you well. I am writing to express my strong interest in the HealthCare roles. I came across the job opening and was immediately drawn to the opportunity to collaborate with diverse lines of business and leverage data analytics and machine learning capabilities to drive actionable insights.

With my background in Engineering/Technology and extensive knowledge of Data Engineering, Data Analytics, and Advanced Analytics, I am confident in my ability to uncover valuable enterprise insights and implement data management applications that contribute to operational effectiveness. My passion for Machine Learning and Artificial Intelligence has led me to develop end-to-end ML workflows, including data collection, feature engineering, model training, and deploying models in production.

Throughout my career, I have used Python, PySpark, and SQL to build robust backend solutions and employed visualization tools such as Power BI and Tableau to communicate data insights effectively. Additionally, I have hands-on experience with cloud platforms like Azure, along with expertise in creating ETL pipelines and leveraging distributed computing for scalability. One of the aspects that excites me most about this role is the opportunity to operationalize and monitor machine learning models using MLflow and Kubeflow while applying DevOps principles to ensure smooth deployment and management. I am also experienced in designing executive dashboards that provide actionable insights, empowering decision-making at all levels.

With a bachelor's degree in mathematics, a master's degree in a quantitative field (Artificial Intelligence), and 11+ years of experience in data science settings, I am well equipped to tackle complex data challenges and provide innovative solutions. My functional and technical competencies encompass a wide array of skills, including data analytics, data engineering, cloud technologies, and data science, making me confident in my ability to contribute effectively to the success of Global Solutions.

If possible, I would like to discuss further how my qualifications align with the role's requirements and how I can be a valuable addition to the team. I am eagerly looking forward to the opportunity to connect and explore this exciting career prospect further.

Best regards,
Salman Ahmed

+44-7587652115

AI Prompt Engineering

Understanding Large Language Models:

1. DALL-E 2 (OpenAI)
2. Stable Diffusion (Stability AI)
3. Midjourney (Midjourney)
4. Codex / GitHub Copilot (OpenAI)
5. You.com (You.com)
6. Whisper (OpenAI)
7. GPT-3 models (175B) (OpenAI)
8. OPT (175B and 66B) (Meta)
9. BLOOM (176B) (Hugging Face)
10. GPT-NeoX (20B) (EleutherAI)

Topics where users can contribute:

  • Retrieval-augmented in-context learning
  • Better benchmarks
  • "Last Mile" for productive applications
  • Faithful, human-interpretable explanations. 

Prompt Engineering Overview:

At the most basic level, we have an interface for interacting with a language model: we pass in an instruction and the language model generates and returns a response.

A prompt is composed of the following components (a small example follows this list):

  • Instructions
  • Context (this is not always given but is part of more advanced techniques)
  • Input Data
  • Output Indicator
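
As an illustration (a hypothetical sentiment-classification prompt; the wording of each component is made up), the four parts can be assembled into a single prompt string:

```python
# Hypothetical example: the four prompt components assembled into one prompt string.
instruction = "Classify the sentiment of the review as positive, negative, or neutral."
context = "You are labelling customer reviews for an electronics store."  # optional
input_data = 'Review: "The battery died after two days and support never replied."'
output_indicator = "Sentiment:"

prompt = "\n\n".join([instruction, context, input_data, output_indicator])
print(prompt)
```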

Settings to keep in mind:

  • When prompting a new language model, keep a few settings in mind.
  • You can get very different results from the same prompt under different settings.
  • One important setting controls how deterministic the model is when generating completions:
    • Temperature and top_p are the two important parameters to keep in mind (see the short sketch after this list).
    • Generally, keep these low if you are looking for exact answers, such as the answer to a mathematical equation,
    • ... and keep them high for more diverse responses, such as text generation or poetry generation.
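
A minimal sketch of these two settings, assuming the OpenAI Python SDK (v1+) with an API key in the environment; the model name is illustrative, and any chat-completion API that exposes temperature/top_p works the same way:

```python
# Minimal sketch: a low-temperature (near-deterministic) call vs. a high-temperature
# (more diverse) call. Assumes the OpenAI Python SDK >= 1.0 and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float, top_p: float) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",    # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # ~0 = deterministic, ~1 = creative
        top_p=top_p,              # nucleus-sampling cutoff
    )
    return response.choices[0].message.content

# Exact answer: keep temperature/top_p low.
print(ask("What is 17 * 23? Answer with the number only.", temperature=0.0, top_p=1.0))

# Creative output: raise the temperature (and/or top_p).
print(ask("Write a two-line poem about gradient descent.", temperature=0.9, top_p=0.95))
```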

Designing prompts for Different Tasks:

Tasks Covered:

  • Text Summarization
  • Question Answering
  • Text Classification
  • Role Playing
  • Code Generation
  • Reasoning
Prompt Engineering Techniques: Many advanced prompting techniques have been designed to improve performance on complex tasks (a short few-shot / chain-of-thought illustration follows this list).
      • Few-Shot prompts
      • Chain-of-Thought (CoT) prompting
      • Self-Consistency
      • Knowledge Generation prompting
      • ReAct
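
A small illustration of the first two techniques (the task and examples are made up): a few-shot prompt supplies worked examples, while the chain-of-thought variant also spells out the reasoning before the answer:

```python
# Hypothetical few-shot prompt: the solved examples teach the task and the output format.
few_shot_prompt = """Q: The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: False

Q: The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: True

Q: The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""

# Chain-of-thought variant: the worked example also shows the reasoning,
# nudging the model to reason step by step before answering.
cot_prompt = """Q: The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The odd numbers are 9, 15 and 1. Their sum is 25, which is odd. The answer is False.

Q: The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""
```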


Tools & IDEs: Tools, libraries and platforms with different capabilities and functionalities support:

• Developing and experimenting with prompts
• Evaluating prompts
• Versioning and deploying prompts

Examples include Dyno, Dust, LangChain and PROMPTABLE.

      Example of LLMs with external tools:

      • The generative capabilities of LLMs can be combined with an external tool to solve complex problems.
• The components you need (a toy sketch follows this list):
  • An agent powered by an LLM to determine which action to take
  • A tool used by the agent to interact with the world (e.g. a search API, Wolfram, a Python REPL, a database lookup)
  • The LLM that powers the agent
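
A minimal, library-free sketch of this pattern; llm_decide is a stand-in for a real LLM call (e.g. one prompted in the ReAct style to emit an action), and the tools are toy implementations:

```python
# Toy sketch of the agent + tools pattern described above.
def search(query: str) -> str:
    # Stand-in for a search API call.
    return f"(pretend search results for: {query})"

def calculator(expression: str) -> str:
    # Stand-in for a Python REPL / Wolfram-style tool (eval is fine for a toy demo).
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"search": search, "calculator": calculator}

def llm_decide(question: str):
    # Stand-in for an LLM completion that returns (tool_name, tool_input).
    if any(ch.isdigit() for ch in question):
        expression = "".join(ch for ch in question if ch in "0123456789+-*/(). ")
        return "calculator", expression.strip()
    return "search", question

def agent(question: str) -> str:
    tool_name, tool_input = llm_decide(question)         # the LLM chooses the action
    observation = TOOLS[tool_name](tool_input)            # the tool interacts with the world
    return f"Tool '{tool_name}' returned: {observation}"  # fed back to the LLM in a real loop

print(agent("What is 23 * 7?"))
print(agent("Who won the 2022 World Cup?"))
```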

      Opportunities and Future Directions:

• Model Safety: Prompt engineering can be used not only to improve performance but also to improve the reliability of responses from a safety perspective.
  • Prompt engineering can help identify risky behaviours of LLMs, which helps reduce harmful behaviours and the risks that may arise from language models.
  • There is also a part of the community performing prompt injection to understand the vulnerabilities of LLMs.
• Prompt Injection: It turns out that building LLMs, like building any other system, comes with challenges and safety considerations. Prompt injection aims to find vulnerabilities in LLMs.
  • Some common issues include:
    • Prompt Injection
    • Prompt Leaking: Aims to force the model to spit out information about its own prompt. This can lead to leaking of sensitive, private or confidential information.
    • Jailbreaking: Another form of prompt injection where the goal is to bypass safety and moderation features.
      • LLMs provided via APIs may be coupled with safety features or content moderation, which can be bypassed with harmful prompts/attacks.
• RLHF: Train LLMs to meet specific human preferences; this involves collecting high-quality prompt datasets.
        • Popular Examples : 
        • Claude (Anthropic)
        • ChatGPT (OpenAI)
      • Future Directions include:
  • Augmented LLMs
  • Emergent abilities of LLMs
        • Acting / Planning - Reinforcement Learning
        • Multimodal Planning
        • Graph Planning





A token in ChatGPT is roughly 4 characters, or about three-quarters of a word.

LLMs and ChatGPT

Some notes on Recurrent Neural Networks: an RNN maintains a high-dimensional hidden state; when a new observation arrives, it updates that hidden state.

In machine learning there is a lot of unity in the principles applied to different data modalities: we use the same neural network architectures, backpropagated gradients and the Adam optimizer, plus a few extra tools (for RNNs, for example, techniques to reduce the variance of the gradients). We use CNNs for image learning and Transformers for NLP problems, whereas years ago in NLP there was a different architecture for every tiny problem.

Question: Where does vision stop and language begin?

1. One proposed future direction is to develop Reinforcement Learning techniques that help supervised learning perform better.
2. Another area of active research is spike-timing-dependent plasticity (STDP). STDP has been shown to work as a learning algorithm for forward-connected artificial neural networks in pattern recognition. A general approach, replicated from the core biological principles, is to apply a window function (Δw) to each synapse in a network. The window function increases the weight (and therefore the connection strength) of a synapse when the parent neuron fires just before the child neuron, and decreases it otherwise.

With deep learning we are looking at a static problem: there is a fixed probability distribution and we fit the model to that distribution.

Backpropagation is a useful algorithm and is not going away, because it solves the problem of finding a neural circuit subject to some constraints.

For natural language modelling, very large datasets are proven to work because the model starts by predicting the next word from broad strokes and surface-level patterns. As the language model becomes larger, it learns the characters, spacing, punctuation and words, and finally it learns the semantics and the facts.

The Transformer is the most important advance in neural networks. It combines multiple ideas, of which attention is the key one. The Transformer is designed so that it runs well on very fast GPUs, and because it is not recurrent it is shallower (less deep) and much easier to optimize.

After Transformers, research toward building AGI is ongoing in self-play and active learning.

GANs do not have a mathematical cost function that they optimize by gradient descent. Instead there is a game between networks, and training tries to find an equilibrium of that game.

Another example of deep learning without a fixed cost function is reinforcement learning with self-play and surprise-driven (curiosity) objectives.


      Double Descent:

When we make a neural network larger it becomes better, which runs counter to classical statistical ideas. But there is a phenomenon called the double descent bump, described below.

Double descent occurs in all practical deep learning systems. Take a neural network and slowly increase its size while keeping the dataset size fixed. If you keep increasing the network size and do not use early stopping, performance first improves and then gets worse; the point where the model is worst is precisely the point at which it reaches zero training error (zero training loss). As the model is made larger still, it starts to get better again. This is counter-intuitive because we expect deep learning performance to improve monotonically with size.

      The intuition is as follows:

      "When we have a large data and a small model then small model is not sensitive to randomness/uncertainty in the training dataset. As the model gets large it achieves zero training error at approximately the point with the smallest norm in that subspace. At the point the dimensionality of the training data is equal to the dimensionality of the neural network model (one-to-one correspondence or degrees of freedom of dataset is same as degrees of freedom of model) at that point random fluctuation in the data worsens the performance (i.e. small changes in the data leads to noticeable changes in the model). But this double descent bump can be removed by regularization and early stopping."

If we have more data than parameters, or more parameters than data, the model will be insensitive to random changes in the dataset.

Overfitting: when the model is very sensitive to small, random, unimportant details of the training dataset.

Early stopping: we train the model while monitoring validation performance, and when the validation performance starts to get worse we stop training (i.e. we decide the model is good enough and stop).
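
A minimal sketch of the early-stopping rule (framework-agnostic; train_one_epoch and validation_loss are placeholder callables provided by whatever training setup is in use):

```python
# Early stopping: stop when validation loss has not improved for `patience` epochs.
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)

        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # validation improved: keep going
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation kept getting worse: stop

    return model, best_loss
```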


      ChatGPT:

ChatGPT has become a watershed moment for organizations because all companies are inherently language-based companies: text, video, audio and financial records can all be described as tokens, which can be fed to large language models.

A good example of this: when a language model was trained on Amazon reviews, it was found that after a large amount of training the model became an excellent sentiment classifier. From merely predicting the next word (token) in a sentence, the model started to understand the semantics of the sentence and could tell whether a review was positive or negative.

With the advancement of AI, the likeness of a particular person can exist as a separate bot, and that person can get a say, a cut, and licensing opportunities for their likeness.

      Great Google Analytics courses and Google Material on Udemy

All the material is for getting certified in Google Universal Analytics (GA3), but it will also help to prepare for GA4. Unfortunately GA4 is very new and very few people are using it.

      Udemy:

      https://www.udemy.com/share/101YUA3@1ZQpoeanMxxthiBi3TRUePtvhK8jpKedLNfathrLsI_5x8FtERy5aZusAp5R/


This one is an excellent resource before the exam:

      https://www.udemy.com/share/1057WK3@B0vqy8cXKsPzaotyxGtf8OMJUbk6LabDRa9MvahhOqCaaXBprgawEPRvwRFK/


      Google Material

      https://skillshop.exceedlms.com/student/catalog/list?category_ids=6431-google-analytics-4

      https://skillshop.exceedlms.com/student/path/2938

      Statistics Notes 3 - Other Terms

      The Pearson / Wald / Score Chi-Square Test can be used to test the association between the independent variables and the dependent variable. 

The Wald/score chi-square tests can be used for continuous and categorical variables, whereas the Pearson chi-square test is used for categorical variables. The p-value indicates whether a coefficient is significantly different from zero. In logistic regression, we can select the top variables based on their (high) Wald chi-square values.

Gain: gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire dataset. This is also called the CAP (Cumulative Accuracy Profile) in finance and credit risk scoring.

Interpretation: the % of targets (events) covered at a given decile level. For example, 80% of targets are covered in the top 20% of the data as ranked by the model. In the case of a propensity-to-buy model, we can identify and target 80% of the customers who are likely to buy the product by sending email to just 20% of all customers.

Lift: it measures how much better one can expect to do with the predictive model compared to not using a model. It is the ratio of the gain % to the random expectation % at a given level; the random expectation for the top x% of the file is x%.
Interpretation: a cumulative lift of 4.03 for the top two deciles means that, when selecting 20% of the records based on the model, one can expect 4.03 times the number of targets (events) found by randomly selecting 20% of the file without a model.

Gain / Lift Analysis
1. Randomly split the data into two samples: 70% training sample, 30% validation sample.
2. Score (predicted probability) the validation sample using the response model under consideration.
3. Rank the scored file in descending order of estimated probability.
4. Split the ranked file into 10 sections (deciles).
5. Count the number of observations in each decile.
6. Count the number of actual events in each decile.
7. Compute the cumulative number of actual events at each decile.
8. Compute the cumulative percentage of actual events at each decile. This is the gain.
9. Divide the cumulative gain by the cumulative % of data up to that decile to get the lift. For example, at the second decile, divide the cumulative gain (80.67%) by 20 to get a lift of 4.03.
Decile  Cases   Responses  Cum. Responses  % of Events  Cum. Gain  Cum. Lift  Cum. % of Data
1       2500    2179       2179            44.71%       44.71%     4.47       10
2       2500    1753       3932            35.97%       80.67%     4.03       20
3       2500    396        4328            8.12%        88.80%     2.96       30
4       2500    111        4439            2.28%        91.08%     2.28       40
5       2500    110        4549            2.26%        93.33%     1.87       50
6       2500    85         4634            1.74%        95.08%     1.58       60
7       2500    67         4701            1.37%        96.45%     1.38       70
8       2500    69         4770            1.42%        97.87%     1.22       80
9       2500    49         4819            1.01%        98.87%     1.10       90
10      2500    55         4874            1.13%        100.00%    1.00       100
Total   25000   4874
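
A pandas sketch of how such a decile table can be computed, assuming a validation DataFrame with a predicted-probability column and a 0/1 event column (column names are illustrative):

```python
import pandas as pd

def gain_lift_table(df: pd.DataFrame, score_col: str = "score", target_col: str = "target") -> pd.DataFrame:
    # Rank by predicted probability (descending) and split into 10 deciles.
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    ranked["decile"] = pd.qcut(ranked.index, 10, labels=range(1, 11))

    # Cases and responses (events) per decile, then cumulative gain and lift.
    table = (ranked.groupby("decile", observed=True)[target_col]
                   .agg(cases="count", responses="sum")
                   .reset_index())
    table["cum_responses"] = table["responses"].cumsum()
    table["gain_pct"] = 100 * table["cum_responses"] / table["responses"].sum()  # cumulative gain %
    table["data_pct"] = 100 * table["cases"].cumsum() / table["cases"].sum()     # cumulative % of data
    table["cum_lift"] = table["gain_pct"] / table["data_pct"]                    # cumulative lift
    return table

# e.g. gain_lift_table(validation_df, score_col="predicted_prob", target_col="bought")
```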

      Detecting Outliers 

There are a few simple ways to detect outliers:

1. Box Plot Method: if a value is more than 1.5*IQR above the upper quartile (Q3), or more than 1.5*IQR below the lower quartile (Q1), it is considered an outlier.
IQR is the interquartile range; it measures dispersion or variation. IQR = Q3 - Q1.
Lower limit of acceptable range = Q1 - 1.5*(Q3 - Q1)
Upper limit of acceptable range = Q3 + 1.5*(Q3 - Q1)
Some researchers use 3 times the interquartile range instead of 1.5 as the cutoff. If a high percentage of values appear as outliers when you use 1.5*IQR, you can use the following rule instead:
Lower limit of acceptable range = Q1 - 3*(Q3 - Q1)
Upper limit of acceptable range = Q3 + 3*(Q3 - Q1)
2. Standard Deviation Method: if a value lies outside the mean plus or minus three standard deviations, it is considered an outlier. This is based on the normal distribution, for which about 99.7% of the data fall within this range.
Acceptable range: the mean plus or minus three standard deviations.
This method has several shortcomings:
1. The mean and standard deviation are strongly affected by outliers.
2. It assumes that the distribution is normal (outliers included).
3. It does not detect outliers in small samples.
3. Percentile Capping (Winsorization): In layman's terms, winsorizing at the 1st and 99th percentiles means that values below the 1st percentile are replaced by the value at the 1st percentile, and values above the 99th percentile are replaced by the value at the 99th percentile. Winsorization at the 5th and 95th percentiles is also common.

The box-plot method is less affected by extreme values than the standard deviation method, but it fails if the distribution is skewed. Winsorization is an industry-standard technique for treating outliers and works well; the box-plot and standard deviation methods are the more traditional approaches. A short Python sketch of the box-plot rule and winsorization follows.
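
A small pandas sketch of the box-plot (IQR) rule and winsorization; the 1.5 multiplier and the 1st/99th percentile cutoffs are the conventional defaults mentioned above:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    # Box-plot rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR] (use k=3 for a looser rule).
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    # Percentile capping: clip values below the 1st and above the 99th percentile.
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# e.g. flags = iqr_outliers(df["income"]); df["income_capped"] = winsorize(df["income"])
```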

4. Weight of Evidence: the logistic regression model is one of the most commonly used statistical techniques for solving binary classification problems, and it is accepted in almost all domains. Two concepts, weight of evidence (WOE) and information value (IV), evolved from the same logistic regression technique. These terms have existed in the credit scoring world for more than 4-5 decades and have been used as a benchmark to screen variables in credit risk modelling projects such as probability of default. They help to explore data and screen variables, and are also used in marketing analytics projects such as customer attrition and campaign response models.

      What is Weight of Evidence (WOE)?

The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers. "Bad customers" refers to customers who defaulted on a loan, and "good customers" refers to customers who paid back their loan.

Formula: WOE = ln(Distribution of Goods / Distribution of Bads)
Distribution of Goods = % of good customers in a particular group
Distribution of Bads = % of bad customers in a particular group
ln = natural log
Positive WOE means Distribution of Goods > Distribution of Bads
Negative WOE means Distribution of Goods < Distribution of Bads
Hint: the log of a number greater than 1 is positive; the log of a number less than 1 is negative.

Many people do not understand the terms goods/bads because they come from a background other than credit risk, so it helps to think of WOE in terms of events and non-events. It is calculated by taking the natural logarithm (log to base e) of the ratio of % of non-events to % of events:
Weight of Evidence for a category = ln(% of non-events / % of events) in that category

Outlier treatment with Weight of Evidence: outlier classes are grouped with other categories based on their WOE values.


1. For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
2. Calculate the number of events and non-events in each group (bin).
3. Calculate the % of events and % of non-events in each group.
4. Calculate WOE by taking the natural log of (% of non-events / % of events).
Note: for a categorical variable, you do not need to split the data (ignore step 1 and follow the remaining steps). A pandas sketch of these steps follows.
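
A pandas sketch of these steps for a single variable (column names are illustrative; target is 1 for an event and 0 for a non-event):

```python
import numpy as np
import pandas as pd

def woe_table(df: pd.DataFrame, var: str, target: str, bins: int = 10) -> pd.DataFrame:
    # Step 1: bin the continuous variable (skip this for a categorical variable).
    groups = pd.qcut(df[var], bins, duplicates="drop")

    # Steps 2-3: events / non-events and their percentages per bin.
    tab = df.groupby(groups, observed=True)[target].agg(events="sum", total="count")
    tab["non_events"] = tab["total"] - tab["events"]
    tab["pct_events"] = tab["events"] / tab["events"].sum()
    tab["pct_non_events"] = tab["non_events"] / tab["non_events"].sum()

    # Step 4: WOE = ln(% of non-events / % of events); IV sums the weighted WOE.
    tab["woe"] = np.log(tab["pct_non_events"] / tab["pct_events"])
    tab["iv"] = (tab["pct_non_events"] - tab["pct_events"]) * tab["woe"]
    return tab

# e.g. woe_table(loans, var="income", target="default_flag")
```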

      Terminologies related to WOE

1. Fine classing: create 10/20 bins (groups) for a continuous independent variable and then calculate the WOE and IV of the variable.
2. Coarse classing: combine adjacent categories with similar WOE scores.

Usage of WOE

Weight of Evidence (WOE) helps to transform a continuous independent variable into a set of groups or bins based on the similarity of the dependent variable's distribution, i.e. the number of events and non-events.

For continuous independent variables: first create bins (categories/groups) for the variable, then combine categories with similar WOE values and replace the categories with their WOE values. Use the WOE values rather than the raw input values in your model.

Categorical independent variables: combine categories with similar WOE and then create new categories of the independent variable with continuous WOE values. In other words, use WOE values rather than raw categories in your model. The transformed variable will be a continuous variable with WOE values, the same as any other continuous variable.

Why combine categories with similar WOE?

Because categories with similar WOE have almost the same proportion of events and non-events; in other words, the categories behave the same way.

Rules related to WOE
1. Each category (bin) should have at least 5% of the observations.
2. Each category (bin) should be non-zero for both events and non-events.
3. The WOE should be distinct for each category; similar groups should be aggregated.
4. The WOE should be monotonic, i.e. either increasing or decreasing across the groupings.
5. Missing values are binned separately.
      FEATURE SELECTION : SELECT IMPORTANT VARIABLES WITH BORUTA PACKAGE


This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project; it is also called feature selection. Every private and public agency has started tracking data and collecting information on various attributes, which results in access to too many predictors for a predictive model. But not every variable is important for a particular prediction task, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

Why is Variable Selection important?
1. Removing a redundant variable helps to improve accuracy; similarly, including a relevant variable has a positive effect on model accuracy.
2. Too many variables may result in overfitting, which means the model is not able to generalize patterns.
3. Too many variables lead to slow computation, which in turn requires more memory and hardware.

      Why Boruta Package?

There are many packages for feature selection in R. The question arises: what makes the boruta package so special? See the following reasons to use it for feature selection.
1. It works well for both classification and regression problems.
2. It takes into account multi-variable relationships.
3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
4. It follows an all-relevant variable selection method, in which it considers all features that are relevant to the outcome variable, whereas most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features that yields a minimal error on a chosen classifier.
5. It can handle interactions between variables.
6. It can deal with the fluctuating nature of random forest importance measures.
Basic Idea of Boruta Algorithm
Shuffle the predictors' values, join the shuffled copies with the original predictors, and then build a random forest on the merged dataset. Then compare the original variables with the randomised variables to measure variable importance. Only variables with higher importance than the randomised variables are considered important.

      How Boruta Algorithm Works

Follow the steps below to understand the algorithm:
1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using the existing variables.
2. Shuffle the values of the duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
3. Combine the original variables with the shuffled copies.
4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) for each variable, where higher means more important.
5. Compute the Z score: the mean of the accuracy loss divided by the standard deviation of the accuracy loss.
6. Find the maximum Z score among the shadow attributes (MZSA).
7. Tag a variable as 'unimportant' when its importance is significantly lower than MZSA, and permanently remove it from the process.
8. Tag a variable as 'important' when its importance is significantly higher than MZSA.
9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.

Major disadvantage: Boruta does not treat collinearity while selecting important variables, because of the way the algorithm works.
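
A single-pass sketch of the shadow-feature idea behind Boruta, written directly with scikit-learn rather than the boruta package; it illustrates the comparison against the best shadow importance but not the full iterative statistical test:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# Steps 1-3: shuffle each column independently to create shadow features, then stack.
X_shadow = rng.permuted(X, axis=0)
X_combined = np.hstack([X, X_shadow])

# Step 4: random forest importance on the combined dataset.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_combined, y)
importances = rf.feature_importances_
real_imp, shadow_imp = importances[: X.shape[1]], importances[X.shape[1]:]

# Steps 6-8 (simplified): keep real features that beat the best shadow importance (MZSA analogue).
selected = np.where(real_imp > shadow_imp.max())[0]
print("Selected feature indices:", selected)
```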

      In Linear Regression:

There are two important metrics that help evaluate the model: Adjusted R-square and Mallows' Cp statistic.

      Adjusted R-Square: It penalizes the model for inclusion of each additional variable. Adjusted R-square would increase only if the variable included in the model is significant. The model with the larger adjusted R-square value is considered to be the better model.

Mallows' Cp Statistic: It helps detect model bias, which refers to either underfitting or overfitting the model.

Formula: Mallows' Cp = (SSE / MSE) - (n - 2p)

where SSE is the sum of squared errors of the candidate model, MSE is the mean squared error of the model with all independent variables (the full model), n is the number of observations, and p is the number of estimates in the candidate model (i.e. the number of independent variables plus the intercept).

      Rules to select best model: Look for models where Cp is less than or equal to p, which is the number of independent variables plus intercept.

A final model should be selected based on the following two criteria:

First step: consider models in which Cp is less than or equal to p.

Second step: among those, select the model with the fewest parameters. Suppose two models have Cp less than or equal to p, the first with 5 variables and the second with 6 variables; we should select the first model as it contains fewer parameters.

      Important Note : 

      To select the best model for parameter estimation, you should use Hocking's criterion for Cp.

For parameter estimation, Hocking recommends a model where Cp <= 2p - p_full + 1, where p is the number of parameters in the model (including the intercept) and p_full is the total number of parameters in the full model (the initial variable list).

      To select the best model for prediction, you should use Mallows' criterion for Cp.
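
A small numpy/scikit-learn sketch of computing Mallows' Cp for a candidate subset of predictors against the full model (function and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mallows_cp(X_full: np.ndarray, X_subset: np.ndarray, y: np.ndarray) -> float:
    # Cp = SSE_subset / MSE_full - (n - 2p), where p = subset variables + intercept.
    n = len(y)
    p = X_subset.shape[1] + 1

    full = LinearRegression().fit(X_full, y)
    sse_full = np.sum((y - full.predict(X_full)) ** 2)
    mse_full = sse_full / (n - X_full.shape[1] - 1)   # MSE of the full model

    sub = LinearRegression().fit(X_subset, y)
    sse_sub = np.sum((y - sub.predict(X_subset)) ** 2)

    return sse_sub / mse_full - (n - 2 * p)

# A candidate model is worth a look when mallows_cp(...) <= number of variables + 1.
```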


      How to check non-linearity
In linear regression analysis, an important assumption is that there is a linear relationship between the independent variables and the dependent variable, whereas logistic regression assumes a linear relationship between the independent variables and the logit function.
• Pearson correlation is a measure of linear relationship. The variables must be measured on interval scales, and it is sensitive to outliers. If the Pearson correlation coefficient of a variable is close to 0, there is no linear relationship between the variables.
• Spearman's correlation is a measure of monotonic relationship. It can be used for ordinal variables and is less sensitive to outliers. If the Spearman correlation coefficient of a variable is close to 0, there is no monotonic relationship between the variables.
• Hoeffding's D correlation is a measure of linear, monotonic and non-monotonic relationships. It takes values between -0.5 and 1, and the sign of the Hoeffding coefficient has no interpretation.
• A variable with a very low rank for Spearman (coefficient close to 0) and a very high rank for Hoeffding indicates a non-monotonic relationship.
• A variable with a very low rank for Pearson (coefficient close to 0) and a very high rank for Hoeffding indicates a non-linear relationship (a quick sketch of the Pearson/Spearman checks follows).
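
A quick scipy illustration of the Pearson and Spearman checks on a deliberately non-monotonic relationship (Hoeffding's D is not in scipy; it is available in, e.g., SAS PROC CORR or R's Hmisc package):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)  # non-linear, non-monotonic dependence

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
# Both coefficients are close to 0 even though x clearly drives y -- exactly the
# situation where a high Hoeffding's D would flag the (non-monotonic) relationship.
```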

      Appendix: