10 Valuable Artificial Intelligence Certifications for 2024 (analyticsinsight.net)
10 AI Certifications for 2024: Build Your Skills and Career | Upwork
Jetson AI Courses and Certifications | NVIDIA Developer
Microsoft Certified: Azure AI Engineer Associate - Certifications | Microsoft Learn
Artificial Intelligence Certification | AI Certification | ARTIBA
Certified Artificial Intelligence Scientist | CAIS™ | USAII®
Fundamentals of Machine Learning for Healthcare | Coursera
Bioinformatics Specialization [7 courses] (UCSD) | Coursera
rabbit in a hat
soggy
perseus
usagi
jackalope
atlas
athena
eden academy ohdsi
Understanding Large Language Models:
1. DALL-E 2 (OpenAI)
Topics where a user can contribute:
Prompt Engineering Overview:
At the most basic level we have an interface to interact with a language model: we pass in some instruction and the model returns a response generated by the language model.
A prompt is composed with the following components:
Settings to keep in mind:
Designing prompts for Different Tasks:
Tasks Covered:
Tools & IDEs: Tools, libraries and platforms with different capabilities and functionalities include:
Example of LLMs with external tools:
Opportunities and Future Directions:
A token in ChatGPT is roughly 4 characters (about three-quarters of a word).
Some notes on Recurrent Neural Networks: a neural network that maintains a high-dimensional hidden state. When a new observation arrives, it updates this hidden state.
In machine learning there is a lot of unity in the principles applied to different data modalities: we use the same neural-net architectures, gradients and the Adam optimizer to update the weights. For RNNs we add some extra tools to reduce the variance of the gradients. Examples of modality-specific choices are CNNs for image learning or Transformers for NLP problems. Years back in NLP, every tiny problem had its own architecture.
Question: Where does vision stop and language begin?
With deep learning we are looking at a static problem: there is a probability distribution and we apply the model to that distribution.
Backpropagation is a useful algorithm and is not going away, because it helps in finding a neural circuit subject to some constraints.
For natural language modelling, very large datasets work because we are predicting the next word first from broad strokes and surface-level patterns. As the language model becomes large, it understands characters, spacing, punctuation and words, and finally it learns the semantics and the facts.
Transformers are the most important advance in neural networks. A Transformer is a combination of multiple ideas, of which attention is the key one. It is designed to run really fast on a GPU. It is not recurrent, so it is shallower (less deep) and much easier to optimize.
After Transformers, to build AGI, research is ongoing in self-play and active learning.
GANs don't have a mathematical cost function that they optimize by gradient descent. Instead there is a game between two networks, defined through mathematical functions, which tries to find an equilibrium.
Another example of deep learning without an explicit cost function is reinforcement learning with self-play and surprise-based actions.
Double Descent:
When we make a neural network larger it becomes better, which runs contrary to classical statistical intuition. But there is a phenomenon called the double descent bump, described below:
Double descent occurs for all practical deep learning systems. Take a neural network and start increasing its size slowly while keeping the dataset size fixed. If you keep increasing the network size and don't do early stopping, performance first improves and then gets worse. The point where the model is worst is precisely the point where it reaches zero training error (zero training loss); as you make it larger still, it starts to get better again. It is counter-intuitive because we expect deep learning performance to improve monotonically with model size.
The intuition is as follows:
"When we have a large data and a small model then small model is not sensitive to randomness/uncertainty in the training dataset. As the model gets large it achieves zero training error at approximately the point with the smallest norm in that subspace. At the point the dimensionality of the training data is equal to the dimensionality of the neural network model (one-to-one correspondence or degrees of freedom of dataset is same as degrees of freedom of model) at that point random fluctuation in the data worsens the performance (i.e. small changes in the data leads to noticeable changes in the model). But this double descent bump can be removed by regularization and early stopping."
If we have more data than parameters, or more parameters than data, then the model will be insensitive to random changes in the dataset.
Overfitting: when the model is very sensitive to small, random, unimportant details in the training dataset.
Early Stopping: we train the model while monitoring validation performance, and when the validation performance starts to get worse we stop training (i.e. we decide the model is good enough).
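Early stopping is easiest to see with a learner that reports validation error as it trains. A minimal R sketch using xgboost (the `data` frame, `Target` column and 80/20 split are hypothetical): training halts once validation RMSE has not improved for 20 rounds.

```r
library(xgboost)

set.seed(42)
idx    <- sample(nrow(data), 0.8 * nrow(data))                 # hypothetical train/validation split
feats  <- setdiff(names(data), "Target")
dtrain <- xgb.DMatrix(as.matrix(data[idx,  feats]), label = data$Target[idx])
dvalid <- xgb.DMatrix(as.matrix(data[-idx, feats]), label = data$Target[-idx])

# Stop once validation RMSE has not improved for 20 consecutive rounds
model <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.1),
                   data = dtrain,
                   nrounds = 1000,
                   watchlist = list(train = dtrain, valid = dvalid),
                   early_stopping_rounds = 20,
                   verbose = 0)
model$best_iteration   # the round at which validation performance peaked
```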
ChatGPT:
ChatGPT has become a watershed moment for organizations because all companies are inherently language-based companies. Whether it is text, video, audio or financial records, all of it can be described as tokens which can be fed to large language models.
A good example of this: during the training of ChatGPT on Amazon reviews, they found that after a large amount of training the model became an excellent classifier of sentiment. So the model went from predicting the next word (token) in a sentence to understanding the semantics of the sentence, and could tell whether a review was positive or negative.
With the advancement of AI, we can have the likeness of a particular person as a separate bot, and that person will get a say, a cut and licensing opportunities for their likeness.
All the material is for getting certified in Google Universal Analytics (GA3), but it will also help to prepare for GA4. Unfortunately GA4 is very new and very few people are using it.
Udemy:
https://www.udemy.com/share/101YUA3@1ZQpoeanMxxthiBi3TRUePtvhK8jpKedLNfathrLsI_5x8FtERy5aZusAp5R/
This one is an excellent resource before the exam:
https://www.udemy.com/share/1057WK3@B0vqy8cXKsPzaotyxGtf8OMJUbk6LabDRa9MvahhOqCaaXBprgawEPRvwRFK/
Google Material
https://skillshop.exceedlms.com/student/catalog/list?category_ids=6431-google-analytics-4
https://calendly.com/yourknowledgebuddyuk/1-2-1?month=2022-08
https://www.efinancialcareers.com/
https://www.jobs.nhs.uk/xi/search_vacancy/
A Wald/score chi-square test can be used for continuous and categorical variables, whereas the Pearson chi-square is used for categorical variables. The p-value indicates whether a coefficient is significantly different from zero. In logistic regression, we can select the top variables based on their high Wald chi-square values, as sketched below.
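A minimal R sketch of screening variables by their Wald chi-square in a logistic regression, assuming a hypothetical data frame `data` with a binary `Target` column; the squared z value reported by summary() is the Wald chi-square for each coefficient.

```r
fit   <- glm(Target ~ ., data = data, family = binomial)
coefs <- summary(fit)$coefficients

wald_chisq <- coefs[, "z value"]^2            # Wald chi-square statistic per coefficient
p_values   <- coefs[, "Pr(>|z|)"]             # significance of each coefficient
sort(wald_chisq, decreasing = TRUE)           # rank variables by Wald chi-square
```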
Gain: Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire dataset. This is also called the CAP (Cumulative Accuracy Profile) in finance / credit risk scoring.
Interpretation: the % of targets (events) covered at a given decile level. For example, 80% of targets are covered in the top 20% of data ranked by the model. In the case of a propensity-to-buy model, we can identify and target 80% of the customers who are likely to buy the product by sending email to just 20% of all customers.
Interpretation: a cumulative lift of 4.03 for the top two deciles means that when selecting 20% of the records based on the model, one can expect 4.03 times the number of targets (events) found by randomly selecting 20% of the file without a model.
| Decile Rank | Number of Cases | Number of Responses | Cumulative Responses | % of Events | Gain | Cumulative Lift | Cumulative % of Data (divides Gain to give Lift) |
|---|---|---|---|---|---|---|---|
| 1 | 2500 | 2179 | 2179 | 44.71% | 44.71% | 4.47 | 10% |
| 2 | 2500 | 1753 | 3932 | 35.97% | 80.67% | 4.03 | 20% |
| 3 | 2500 | 396 | 4328 | 8.12% | 88.80% | 2.96 | 30% |
| 4 | 2500 | 111 | 4439 | 2.28% | 91.08% | 2.28 | 40% |
| 5 | 2500 | 110 | 4549 | 2.26% | 93.33% | 1.87 | 50% |
| 6 | 2500 | 85 | 4634 | 1.74% | 95.08% | 1.58 | 60% |
| 7 | 2500 | 67 | 4701 | 1.37% | 96.45% | 1.38 | 70% |
| 8 | 2500 | 69 | 4770 | 1.42% | 97.87% | 1.22 | 80% |
| 9 | 2500 | 49 | 4819 | 1.01% | 98.87% | 1.10 | 90% |
| 10 | 2500 | 55 | 4874 | 1.13% | 100.00% | 1.00 | 100% |
| Total | 25000 | 4874 | | | | | |
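A minimal R sketch that reproduces this kind of gain/lift table, assuming a hypothetical scored data frame `scored` with predicted probabilities `prob` and actual outcomes `target` (0/1):

```r
library(dplyr)

gain_table <- scored %>%
  mutate(decile = ntile(-prob, 10)) %>%               # decile 1 = highest scores
  group_by(decile) %>%
  summarise(cases = n(), responses = sum(target)) %>%
  arrange(decile) %>%
  mutate(cum_responses = cumsum(responses),
         gain     = cum_responses / sum(responses),   # cumulative % of events captured
         cum_pop  = cumsum(cases) / sum(cases),       # cumulative % of records targeted
         cum_lift = gain / cum_pop)                   # e.g. 0.8067 / 0.20 = 4.03
gain_table
```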
Detecting Outliers
IQR is the interquartile range; it measures dispersion or variation. IQR = Q3 - Q1. Some researchers use 3 times the interquartile range instead of 1.5 as the cutoff. If a high percentage of values appear as outliers when you use 1.5*IQR as the cutoff, then you can use the wider rule below.
Standard rule (1.5 × IQR):
Lower limit of acceptable range = Q1 - 1.5 * (Q3 - Q1)
Upper limit of acceptable range = Q3 + 1.5 * (Q3 - Q1)
Wider rule (3 × IQR):
Lower limit of acceptable range = Q1 - 3 * (Q3 - Q1)
Upper limit of acceptable range = Q3 + 3 * (Q3 - Q1)
Acceptable range (alternative rule): the mean plus or minus three standard deviations.
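A minimal R sketch of both rules, assuming a hypothetical numeric vector `x`:

```r
q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr          # switch 1.5 to 3 for the wider rule
upper <- q3 + 1.5 * iqr
outliers_iqr <- x[x < lower | x > upper]

# Alternative rule: mean plus or minus three standard deviations
outliers_sd <- x[abs(x - mean(x)) > 3 * sd(x)]
```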
4. Weight of Evidence: Logistic regression is one of the most commonly used statistical techniques for solving binary classification problems, and is an accepted technique in almost all domains. Two concepts - weight of evidence (WOE) and information value (IV) - evolved from this same logistic regression technique. These two terms have existed in the credit scoring world for more than 4-5 decades. They have been used as a benchmark to screen variables in credit risk modeling projects such as probability of default. They help to explore data and screen variables, and are also used in marketing analytics projects such as customer attrition models, campaign response models, etc.
The weight of evidence tells us the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers. "Bad customers" refers to customers who defaulted on a loan, and "good customers" refers to customers who paid back the loan.
Distribution of Goods - % of Good Customers in a particular group
Distribution of Bads - % of Bad Customers in a particular group
ln - Natural Log
Positive WOE means Distribution of Goods > Distribution of Bads
Negative WOE means Distribution of Goods < Distribution of Bads
Hint: a log of a number > 1 gives a positive value; if it is less than 1, it gives a negative value.
WOE = ln(% of non-events / % of events)
1. Fine Classing: Create 10/20 bins/groups for a continuous independent variable and then calculate the WOE and IV of the variable.
2. Coarse Classing: Combine adjacent categories with similar WOE scores.
Categorical independent variables: Combine categories with similar WOE and then create new categories of the independent variable with continuous WOE values. In other words, use WOE values rather than raw categories in your model. The transformed variable will be a continuous variable with WOE values, the same as any other continuous variable.
Why combine categories with similar WOE? Because categories with similar WOE have almost the same proportion of events and non-events; in other words, the behavior of both categories is the same. A sketch of the calculation follows.
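A minimal R sketch of the WOE/IV calculation for one grouped predictor, assuming a hypothetical data frame `data` with a binary `Target` (1 = event/bad, 0 = non-event/good) and a factor column `grp` holding the fine/coarse classes:

```r
library(dplyr)

woe_tab <- data %>%
  group_by(grp) %>%
  summarise(events = sum(Target == 1), non_events = sum(Target == 0)) %>%
  mutate(pct_events     = events / sum(events),
         pct_non_events = non_events / sum(non_events),
         woe = log(pct_non_events / pct_events),        # WOE = ln(% non-events / % events)
         iv  = (pct_non_events - pct_events) * woe)     # each group's contribution to IV
woe_tab
sum(woe_tab$iv)   # total Information Value of the variable
```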
Boruta shuffles the predictors' values, joins these shadow copies with the original predictors, and then builds a random forest on the merged dataset. It then compares the original variables with the randomised (shadow) variables to measure variable importance. Only variables with higher importance than the randomised variables are considered important.
Major Disadvantages: Boruta does not treat collinearity while selecting important variables. It is because of the way algorithm works.
If a variable has a very low rank for Spearman (coefficient close to 0) and a very high rank for Hoeffding, it indicates a non-monotonic relationship.
If a variable has a very low rank for Pearson (coefficient close to 0) and a very high rank for Hoeffding, it indicates a non-linear relationship.
Marketing Mix
Product: includes all product items marketed by the marketer, their features, quality, brand, packaging, labelling, product life cycle, and all decisions related to the product
Product assortment: offered to customers by the entire industry
Product line: a group of similar featured items marketed by a marketer
The total number of lines is referred to as the breadth (width) of the product mix
Product depth (item depth) refers to the number of versions offered of each product in the line
Distribution channel – is very important to Netflix
Price: brings revenue; the act of determining the value of a product
Includes pricing objectives, price-setting strategies, general pricing policies, discounts, allowances, rebates, etc. The price mix also includes cash and credit policy, price discrimination, cost and contribution
Place: location, distance, transport
Direct marketing: no intermediary is involved
Promotion: the combination of all activities concerned with informing and persuading actual and potential customers about the merits of a product, with the intention of achieving sales goals
Sales promotion involves offering short-term incentives to promote buying and increase sales
The most popular forms of sales promotion are free gifts, discounts, exchange offers, free home delivery, after-sales service, guarantees, warranties, various purchase schemes, etc.
Favourable relations between organizations and public
Modifications and extensions to the 4 P's
Product, price, place and promotion (marketer-oriented approach)
Consumer-oriented approach (4 C's)
Commodity - Product
Cost - Cost
Channel - Place
Communication - Promotion
Services are fundamentally different from products, so the mix is extended for services:
Process : procedures / mechanisms for delivering services and monitoring
People : human factor as they interact with the consumer using the services
Physical Evidence:
Extension of 4c’s
Consumer solution
Cost convenience
Communication
Elements of the marketing mix are mutually dependent
Marketing mix elements are meant for attaining the target markets
The essence of the marketing mix is ensuring profitability through customer satisfaction
Elements help the marketer in attaining marketing objectives
Customer is the central focus of marketing mix
Purpose and objectives of marketing mix
Marketing mix aims at customer satisfaction
Success of each and every product
Aims at assisting the marketers in creating effective marketing strategy
Profit maximization, image building, creation of goodwill, maintaining better customer relations
Marketing mix is the link between business and customers
Marketing mix helps to increase sales and profit
For Netflix: a reduction in price could be attributed to diminishing returns from advertising
Marketing Mix Modelling (MMM) is a method that helps quantify the impact of several marketing inputs on sales or market share. The purpose of MMM is to understand how much each marketing input contributes to sales, and how much to spend on each marketing input.
MMM relies on statistical analysis such as multivariate regressions on sales and marketing time series data to estimate the impact of various marketing tactics (marketing mix) on sales and then forecast the impact of future sets of tactics. It is often used to optimize the advertising mix and promotional tactics with respect to sales and profits.
Marketing Mix Modeling (MMM) is one of the most popular analyses under Marketing Analytics. It helps organisations estimate the effects of spend on different advertising channels (TV, radio, print, online ads, etc.) as well as other factors (price, competition, weather, inflation, unemployment) on sales. In simple words, it helps companies optimize the marketing investments they make in different marketing mediums (both online and offline).
Types of Marketing Mediums
Let's break it into two parts - offline and online.

| Offline Marketing | Online Marketing |
|---|---|
| Print Media: Newspaper, Magazine | Search Engine Marketing like Content Marketing, Backlink building etc. |
| TV | Pay per Click, Pay per Impression |
| Radio | Email Marketing |
| Out-of-home (OOH) Advertising like Billboards, ads in public places | Social Media Marketing (Facebook, YouTube, Instagram, LinkedIn Ads) |
| Direct Mail like catalogs, letters | Affiliate Marketing |
| Telemarketing | |
| Below The Line Promotions like free product samples or vouchers | |
| Sponsorship | |
MMM has had a place in marketers’ analytics toolkit for decades. This is due to the unique insights marketing mix models can provide. By leveraging regression analysis, MMM provides a “top down” view into the marketing landscape and the high-level insights that indicate where media is driving the most impact.
For example: by gathering long-term, aggregate data over several months, marketers can identify the mediums consumers engage with the most. MMM provides a report of where and when media is engaged over a long stretch of time.
Background: Marketing Mix Modeling (MMM)
The beginning of the offline measurement
Marketing Mix Modelling is a decades-old process developed in the earliest days of modern marketing that applies regression analysis to historical sales data to analyse the effects of changing marketing activities. Many marketers still use MMM for top-level media planning and budgeting; it delivers a broad view of variables both inside and outside of the marketer's control.
Some of the factors are:
The analytical and statistical methods used to quantify the effect of media and marketing efforts on a product's performance are called Marketing Mix Modeling.
"It helps to maximize investment and grow ROI"
ROI = (Incremental returns from investment) / Cost of Investment
Marketing ROI = (Incremental Dollar Sales from Marketing Investment) / Spend on Marketing Investment
Why is MMM Needed? Guiding Decisions for Improved Effectiveness
How does MMM work?
Example Marketing Mix Model Output
Detailed output includes:
Market Contribution vs. Base
ROI Assessment:
We measure ROI because not all ads convert to sales; the aim is to find which ones are cost-effective and deliver the most bang for the buck.
MMM Strengths:
MMM Limitations:
Critical Success Factors of MMM:
Media Mix Modeling as Econometric Modeling:
Strengths:
Weaknesses:
For working with Marketing Mix Modeling, a good understanding of econometric modelling is needed.
The objective before starting this approach is: how can we maximize the value and minimize the harm of marketing mix models, such as store-based models or shopper-based multi-touch attribution models?
Marketing end users are the root cause of most marketing mix model problems.
Tip: Most attribution projects begin long after the strategy has already been set. So it's important to understand what the client did, why they did it, and what they expected to happen. Only then can you answer their questions in a way they'll be happy with. Remember they hired you because the results weren't what they expected... or because they never thought about how to measure them in the first place.
As we all know weekly variation is the lifeblood of marketing mix models.
Some of the problems are continuity bias
Very interesting article on using Market Mix Modelling during COVID-19.
Market Mix Modeling (MMM) in times of Covid-19 | by Ridhima Kumar | Aryma Labs | Medium
In the article, I read that there was a sudden demand for essential items during the pandemic, but this deviation cannot be attributed to the existing advertisement factors.
In the regression model we can see that there will be;
Another very interesting article on Marketing Analytics using Markov chain
Marketing Analytics through Markov Chain | LinkedIn
In the article, I read how we can use a transition matrix to understand the change in states. It explains this very neatly.
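As an illustration of the idea (not the article's own code), a minimal R sketch of a channel-transition matrix with made-up probabilities, showing how matrix multiplication gives the state distribution after several steps:

```r
# Hypothetical 3-state transition matrix estimated from customer journeys
P <- matrix(c(0.6, 0.3, 0.1,
              0.2, 0.5, 0.3,
              0.1, 0.2, 0.7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Display", "Search", "Email"),
                            c("Display", "Search", "Email")))

# Probability of being in each state after two steps, starting from Display
start <- c(Display = 1, Search = 0, Email = 0)
start %*% P %*% P
```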
Article on Conjoint Analysis : Conjoint Analysis: What type of chocolates do the Indian customers prefer? | LinkedIn
Marketing Mix Modeling (MMM) is the use of statistical analysis to estimate the past impact and predict the future impact of various marketing tactics on sales. Your Marketing Mix Modeling project needs to have goals, just like your marketing campaigns.
The main goal of any Marketing Mix Modeling project is to measure past marketing performance so you can use it to improve future Marketing Return on Investment (MROI).
The insights you gain from your project can help you reallocate your marketing budget across your tactics, products, segments, time and markets for a better future return. All of the marketing tactics you use should be included in your project, assuming there is high-quality data with sufficient time, product, demographic, and/or market variability.
Each project has four distinct phases, starting with data collection and ending with optimization of future strategies. Let's take a look at each phase in depth:
Phase 1 : Data Collection and Integrity : It can be tempting to request as much data as possible, but it's important to note that every request has a very real cost to the client. In this case the task could be simplified down to just marketing spend by day, by channel, as well as sales revenue.
Phase 2 : Modeling. Before modelling we need to:
Phase 4 : Optimization & Strategies
Pitfalls in Market Mix Modeling:
1. Why MMM vendors being "personally objective" is not the same as their being "statistically unbiased".
Some points about Marketing Mix Modeling:
Your Marketing Return on Investment (MROI) will be a key metric to look at during your Marketing Mix Modeling project, whether that be Marginal Marketing Return on Investment for future planning or Average Marketing Return on Investment for past interpretation. The best projects also gauge the quality of their marketing mix model using Mean Absolute Percent Error (MAPE) and R^2.
1. Ad creative is very important to your sales top line and your MROI, especially if you can tailor it to a segmented audience. This paper presents five best Spanish language creative practices to drive MROI, which should also impact top-of-the-funnel marketing measures.
2. The long-term impact of marketing on sales is hard to nail down, but we have found that ads that don’t generate sales lift in the near-term usually don’t in the long-term either. You can also expect long-term Marketing Return on Investment to be about 1.5 to 2.5 times the near-term Marketing Return on Investment.
3. Modeled sales may not be equivalent to total sales. Understand how marketing to targeted segments will be modeled.
4. Brand size matters. As most brand managers know firsthand, the economics of advertising favor large brands over small brands. The same brand TV expenditure and TV lift produce larger incremental margin dollars, and thus a larger Marketing Return on Investment, for the large brand than for the small brand.
5. One medium's Marketing Return on Investment does not dominate consistently. Since flighting, media weight, targeted audience, timing, copy and geographic execution vary by medium for a brand, each medium's Marketing Return on Investment can also vary significantly.
Define the Variables
Sales
Media Variables:
Control Variables
Pick Functional Form of Demand Equation
Quantity Demanded = f(media variables, control variables)
Most Common Functional Forms
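The material under this heading appears to have been an image. As a stand-in, a minimal R sketch of the two functional forms most commonly cited for the demand equation, linear (additive) and multiplicative (log-log), assuming a hypothetical weekly data frame `mmm` with `sales`, `tv_spend`, `digital_spend` and `price` columns:

```r
# Linear (additive) form: Sales = b0 + b1*TV + b2*Digital + b3*Price + error
linear_fit <- lm(sales ~ tv_spend + digital_spend + price, data = mmm)

# Multiplicative (log-log) form: coefficients read directly as elasticities
loglog_fit <- lm(log(sales) ~ log(tv_spend + 1) + log(digital_spend + 1) + log(price),
                 data = mmm)
summary(loglog_fit)
```

In the log-log form the fitted coefficients can be interpreted directly as elasticities, which is one reason it is often preferred in MMM work.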
Modelling Issues
Market-Mix Modeling Econometrics
Multiple Factors that Affect Outcome (Incremental Sales) :
Marketing Mix Modelling is designed to pick up short-term effects; it is not able to model long-term effects such as the effect of the brand. Advertising helps in building a brand, but this is difficult to model.
Attribution Modeling is different from Media/Marketing Mix Modeling as it offers additional insight. In this type of modelling, we measure the contribution of earlier touchpoints in the customer's digital journey to the final sale. Attribution Modeling is a bottom-up approach, but it is becoming difficult to do because third-party cookies are being phased out.
Multi-Touch Attribution modelling is more advanced than top-down Marketing Mix Modeling because there is an instant feedback loop to understand what is working; whereas in Marketing Mix Modeling we would just determine the percentage change in x needed to drive sales, and then in next year's model make the adjustment again, without getting any real on-the-ground feedback on whether we reached the target we set out to achieve.
Nielsen is the largest Marketing Mix Modeling provider in the world.
When it comes to initial marketing strategy or understanding external factors that can influence the success of a campaign, marketing mix modeling shines. Given that MMM leverages long-term data collection to provide its insights, marketers can measure the impact of holidays, seasonality, weather, brand authority, etc. on overall marketing success.
As consumers engage with brands across a variety of print, digital, and broadcast channels, marketers need to understand how each touchpoint drives consumers toward conversion. Simply put, marketers need measurements at the person-level that can measure an individual consumer’s engagement across the entire customer journey in order to tailor marketing efforts accordingly.
Unfortunately, marketing mix modeling can’t provide this level of insight. While MMM has a variety of pros and cons, the biggest pitfall of MMM is its inability to keep up with the trends, changes, and online and offline media optimization opportunities for marketing efforts in-campaign.
This distinction is provided by the Association of Data Scientists (ADaSci). This designation is awarded to candidates who pass the CDS exam and hold a minimum of two years of work experience as a data scientist. However, candidates who do not have the experience can also take the exam and keep the result; their charter, in this case, is put on hold until they attain the two years of experience. There is no training or course required to earn this award. The cost of taking this exam is 250 US Dollars. This charter has lifetime validity and hence does not expire.
Chartered Financial Data Scientist
The Chartered Financial Data Scientist program is organized by the Society of Investment Professionals in Germany. They first provide a training course conducted by the Swiss Training Centre for Investment Professionals. After completing this training, the candidates are allowed to earn this designation. It costs around 8,690 Euro.
Certified Analytics Professional
This professional certification is offered by INFORMS. It is supported by the Canadian Operational Research Society and 3 more professional societies. There are various levels of certification. Each level has different eligibility requirements, from graduate to postgraduate etc. To earn this certification, the cost starts from 495 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres. It is valid for three years only.
Cloudera Certified Associate Data Analyst
This certification program is organized by Cloudera. It is more specific to SQL and databases and more suitable for Data Analysts. It costs around 295 US Dollars and there is no specific eligibility requirement for this certification. This certification is valid only for two years.
EMC Proven Professional Data Scientist Associate
This certification program is organized by Dell EMC. To earn this distinction, it is mandatory to attend a training program, either in-class or online. It costs around 230 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres.
It is organized by the Open Group. The members of the Open Group include HCL, Huawei, IBM, Oracle etc. There are 3 levels of this certification, and a different amount of experience is required for each level. The cost for this certification starts from 295 US Dollars. To take this exam, the candidate needs to be available in-person at the specified place.
This certification program is provided by the Data Science Council of America (DASCA). It requires 6+ years of experience of Big Data Analytics / Big Data Engineering. It costs around 650 US Dollar. This certification has 5 years of validity.
This certification program is provided by the Data Science Council of America (DASCA). It requires 10+ years of experience of Big Data Analytics / Big Data Engineering. There are various tracks of this exam. It costs between 850-950 US Dollar depending on the track.
It is organized by SAS. To get this certification, you need to pass two other exams first: SAS Big Data Professional and SAS Advanced Analytics Professional. Along with this, you need to take 18 courses as well. It costs around 4,400 US Dollars.
Financial Data Professional program is organized by Financial Data Professional Institute (FDPI). It is more suitable for financial professionals who apply AI and data science in finance. It opens the exam window with a fixed registration period. The cost of the FDP exam is 1350 US Dollar. To take this exam, the candidate needs to be available in-person in the designated test centres.
So, here we have listed the top certification exams in data science across the world. To choose from the list, a candidate should analyze the requirements in the coming future, the suitability of certification, contents covered in the exam so that it can meet the job requirements, exam cost, exam dates and time flexibility etc. The candidate should take one such certification which meets all their expectations instead of taking multiple certification exams.
There are also more certifications provided by insurance bodies such as the IFoA and CAS, which are in development but need strong insurance domain knowledge.
If you are a member of Pega Academy, then Pega has its own Data Science Program.
Machine Learning Problem Framing -
Define a ML Problem and propose a solution
We have three major types of models:
| Type of ML Problem | Description | Example |
|---|---|---|
| Classification | Pick one of N labels | Cat, dog, horse, or bear |
| Regression | Predict numerical values | Click-through rate |
| Clustering | Group similar examples | Most relevant documents (unsupervised) |
| Association rule learning | Infer likely association patterns in data | If you buy hamburger buns, you're likely to buy hamburgers (unsupervised) |
| Structured output | Create complex output | Natural language parse trees, image recognition bounding boxes |
| Ranking | Identify position on a scale or status | Search result ranking |
In traditional software engineering, you can reason from requirements to a workable design, but with machine learning, it will be necessary to experiment to find a workable model.
Models will make mistakes that are difficult to debug, due to anything from skewed training data to unexpected interpretations of data during training. Furthermore, when machine-learned models are incorporated into products, the interactions can be complicated, making it difficult to predict and test all possible situations. These challenges require product teams to spend a lot of time figuring out what their machine learning systems are doing and how to improve them.
If you understand the problem clearly, you should be able to list some potential solutions to test in order to generate the best model. Understand that you will likely have to try out a few solutions before you land on a good working model.
Exploratory data analysis can help you understand your data, but you can't yet claim that patterns you find generalize until you check those patterns against previously unseen data. Failure to check could lead you in the wrong direction or reinforce stereotypes or bias.
Artificial Intelligence: machines that perform jobs that mimic human behavior.
Machine Learning: machines that get better at a task without explicit programming. It is a subset of artificial intelligence that uses technologies (such as deep learning) that enable machines to use experience to improve at tasks.
Deep Learning: machines that use an artificial neural network inspired by the human brain to solve complex problems. It is a subset of machine learning based on artificial neural networks.
Data Scientist: a person with multi-disciplinary skills in math, statistics, predictive modeling and machine learning who makes future predictions.
1. Reliability and Safety: Ensure that AI systems operate as they were originally designed, respond to unanticipated conditions and resist harmful manipulation. If AI is making mistakes, it is important to release a report of quantified risks and harms to end-users so they are informed of the shortcomings of an AI solution.
2. Fairness: Implementing processes to ensure that decisions made by AI systems can be overridden by humans.
3. Privacy and Security : Provide customers with information and controls over the collection, use and storage of the data.
4. Inclusiveness: AI systems should empower everyone and engage people especially minority groups based on:
5. Transparency: AI systems should be understandable. Interpretability/intelligibility is when end-users can understand the behavior of the AI. Adopting an open-source framework for AI can provide transparency (at least from the technical perspective) on the internal workings of an AI system.
6. Accountability: People should be responsible for AI systems, with structures put in place to consistently enact AI principles and take them into account. AI systems should work with the:
Dataset: A dataset is a logical grouping of units of data that are closely related and/or share the same data structure.
Data labeling : process of identifying raw data and adding one or more meaningful and informative labels to provide context so that a machine learning model can learn.
Ground Truth: a properly labeled dataset that you use as the objective standard to train and assess a given model is often called the 'ground truth'. The accuracy of your trained model will depend on the accuracy of the ground truth.
Machine learning in Microsoft Azure
Microsoft Azure provides the Azure Machine Learning service - a cloud-based platform for creating, managing, and publishing machine learning models. Azure Machine Learning provides the following features and capabilities:
| Feature | Capability |
|---|---|
| Automated machine learning | This feature enables non-experts to quickly create an effective machine learning model from data. |
| Azure Machine Learning designer | A graphical interface enabling no-code development of machine learning solutions. |
| Data and compute management | Cloud-based data storage and compute resources that professional data scientists can use to run data experiment code at scale. |
| Pipelines | Data scientists, software engineers, and IT operations professionals can define pipelines to orchestrate model training, deployment, and management tasks. |
Other Features of Azure Machine Learning Services :
A service that simplifies running AI/ML-related workloads, allowing you to build flexible automated ML pipelines, use Python or R, and run deep learning workloads such as TensorFlow.
1. Jupyter Notebooks
2. Azure Machine Learning SDK for Python
3. MLOps
4. Azure Machine Learning Designer
5. Data Labeling Service
6. Responsible Machine Learning
Performance/Evaluation Metrics are used to evaluate different Machine Learning Algorithms
For different types of problems, different metrics matter.
There are two categories of evaluation metrics:
One of the benefits of using a Random Forest model is:
1. In regression, when the variables may be highly correlated with each other, the Random Forest approach really helps in understanding feature importance. The trick is that Random Forest selects explanatory variables at each variable split in the learning process, which means it trains on a random subset of the features instead of the full set of features. This is called feature bagging. This process reduces the correlation between trees; without it, strong predictors would be selected by many of the trees, making them correlated.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
How to find the most important variables in R
Find the most important variables that contribute most significantly to a response variable
Selecting the most important predictor variables that explain the major part of the variance of the response variable can be key to identifying and building high-performing models.
1. Random Forest Method
Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.
library(caret)
library(randomForest)
library(varImp)
regressor <- randomForest(Target ~ ., data = data, importance = TRUE)  # fit the random forest with default parameters
varImp(regressor)                      # variable importance, based on mean decrease in accuracy
varImp(regressor, conditional = TRUE)  # conditional = TRUE adjusts for correlations between predictors
varimpAUC(regressor)                   # more robust towards class imbalance
2. xgboost Method
library(caret)
library(xgboost)
regressor <- train(Target ~ ., data = data, method = "xgbTree",
                   trControl = trainControl("cv", number = 10), scale = TRUE)
varImp(regressor)
3. Relative Importance Method
Using calc.relimp {relaimpo}, the relative importance of variables fed into lm model can be determined as a relative percentage.
library(relaimpo)
regressor <- lm(Target ~ ., data = data)                            # fit lm() model
relImportance <- calc.relimp(regressor, type = "lmg", rela = TRUE)  # relative importance scaled to 100
sort(relImportance$lmg, decreasing = TRUE)                          # relative importance
4. MARS (earth package) Method
The earth package implements variable importance based on Generalized cross validation (GCV), number of subset models the variable occurs (nsubsets) and residual sum of squares (RSS).
library(earth)
regressor <- earth(Target ~ ., data = data)  # build model
ev <- evimp(regressor)                       # estimate variable importance
plot(ev)
5. Step-wise Regression Method
If you have a large number of predictors, split the data into chunks of 10 predictors, with each chunk also holding the response variable.
base.mod <- lm(Target ~ 1, data = data)  # base intercept-only model
all.mod  <- lm(Target ~ ., data = data)  # full model with all predictors
stepMod  <- step(base.mod, scope = list(lower = base.mod, upper = all.mod),
                 direction = "both", trace = 1, steps = 1000)           # perform the step-wise algorithm
shortlistedVars <- names(unlist(stepMod[[1]]))                          # get the shortlisted variables
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove intercept
The output might include levels within categorical variables, since ‘stepwise’ is a linear regression based technique.
If you have a large number of predictor variables, the above code may need to be placed in a loop that runs stepwise on sequential chunks of predictors. The shortlisted variables can be accumulated for further analysis at the end of each iteration. This can be a very effective method if you want to:
· Be highly selective about discarding valuable predictor variables.
· Build multiple models on the response variable.
6. Boruta Method
The ‘Boruta’ method can be used to decide if a variable is important or not.
library(Boruta)
# Decide if a variable is important or not using Boruta
boruta_output <- Boruta(Target ~ ., data = data, doTrace = 2)  # perform Boruta search
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables

# For faster calculation (classification only)
library(rFerns)
boruta.train <- Boruta(factor(Target) ~ ., data = data, doTrace = 2,
                       getImp = getImpFerns, holdHistory = FALSE)
boruta.train
boruta_signif <- names(boruta.train$finalDecision[boruta.train$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables
boruta_signif
## getSelectedAttributes(boruta.train, withTentative = FALSE)
boruta.df <- attStats(boruta.train)  # attStats() expects the Boruta object, not the vector of names
print(boruta.df)
7. Information value and Weight of evidence Method
library(devtools)
library(woe)
library(riv)
iv_df <- iv.mult(data, y = "Target", summary = TRUE,  verbose = TRUE)
iv    <- iv.mult(data, y = "Target", summary = FALSE, verbose = TRUE)
iv_df
iv.plot.summary(iv_df)  # plot information value summary
# Calculate weight of evidence variables
data_iv <- iv.replace.woe(data, iv, verbose = TRUE)  # add WOE variables to the original data frame
The newly created WOE variables can alternatively be used in place of the original factor variables.
8. Learning Vector Quantization (LVQ) Method
library(caret)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# train the model
regressor <- train(Target ~ ., data = data, method = "lvq",
                   preProcess = "scale", trControl = control)
# estimate variable importance
importance <- varImp(regressor, scale = FALSE)
9. Recursive Feature Elimination RFE Method
library(caret)
# define the control using a random forest selection function
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# run the RFE algorithm (note the parentheses: 1:(n - 1), not 1:n - 1)
results <- rfe(data[, 1:(n - 1)], data[, n], sizes = c(1:8), rfeControl = control)
# list the chosen features
predictors(results)
# plot the results
plot(results, type = c("g", "o"))
10. DALEX Method
library(randomForest)
library(DALEX)
regressor <- randomForest(Target ~ ., data = data, importance = TRUE)  # fit the random forest with default parameters
# Variable importance with DALEX
explained_rf <- explain(regressor, data = data, y = data$Target)
# Get the variable importances
varimps <- variable_dropout(explained_rf, type = 'raw')
print(varimps)
plot(varimps)
11. VITA
library(vita)
regressor <- randomForest(Target ~ ., data = data, importance = TRUE)  # fit the random forest with default parameters
pimp.varImp.reg <- PIMP(data, data$Target, regressor, S = 10, parallel = TRUE)
pimp.varImp.reg$VarImp
sort(pimp.varImp.reg$VarImp, decreasing = TRUE)
12. Genetic Algorithm
library(caret)
# Define control function
ga_ctrl <- gafsControl(functions = rfGA,  # another option is `caretGA`
                       method = "cv", repeats = 3)
# Genetic Algorithm feature selection
ga_obj <- gafs(x = data[, 1:(n - 1)], y = data[, n],
               iters = 3,  # normally much higher (100+)
               gafsControl = ga_ctrl)
ga_obj
# Optimal variables
ga_obj$optVariables
13. Simulated Annealing
library(caret)
# Define control function
sa_ctrl <- safsControl(functions = rfSA, method = "repeatedcv", repeats = 3,
                       improve = 5)  # n iterations without improvement before a reset
# Simulated Annealing feature selection
set.seed(100)
sa_obj <- safs(x = data[, 1:(n - 1)], y = data[, n], safsControl = sa_ctrl)
sa_obj
# Optimal variables
print(sa_obj$optVariables)
14. Correlation Method
library(caret)
# calculate the correlation matrix of the predictors
correlationMatrix <- cor(data[, 1:(n - 1)])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (ideally > 0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
https://www.cio.com/article/3222879/15-data-science-certifications-that-will-pay-off.html
https://www.codespaces.com/best-data-science-certifications-courses-tutorials.html
https://www.codespaces.com/best-artificial-intelligence-courses-certifications.html
Domo Certificate
Tableau Certificate
Insofe : https://lms.insofe.com/courses
Coursera : Reinforcement Learning at Alberta
1. R for Health Data Science (ed.ac.uk)
3. Data Analysis and Visualization in R for Ecologists (datacarpentry.org)
4. The Effect: An Introduction to Research Design and Causality | The Effect (theeffectbook.net)
5. Chapter 1 Introduction | ISLR tidymodels Labs (emilhvitfeldt.github.io)
6. R for applied epidemiology and public health | The Epidemiologist R Handbook (epirhandbook.com)
7. The lidR package (jean-romain.github.io)
8. Earth Lab: Free, online courses, tutorials and tools | Earth Data Science - Earth Lab
9. Collaborative Data Science for Healthcare
10. https://www.mltut.com/best-online-courses-for-data-science-with-r/
12. https://www.educateai.org/the-most-popular-machine-learning-courses/
14. https://github.com/addy1997/Machine_Learning_Resources
15. https://bookdown.org/mwheymans/bookmi/
16. https://www.routledge.com/go/ids -- paid Book Series
17. https://www.routledge.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER -- paid Book Series
For frequentists, a probability is a measure of the frequency of repeated events → parameters are fixed (but unknown), and data are random.
For Bayesians, a probability is a measure of the degree of certainty about values → parameters are random and data are fixed.
Bayesians: Given our observed data, there is a 95% probability that the true value of θ falls within the credible region.
vs.
Frequentists: There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of θ will fall within it.
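A minimal R sketch of the two kinds of interval for a simple proportion, assuming hypothetical data of 40 successes in 100 trials and a flat Beta(1, 1) prior for the Bayesian version:

```r
x <- 40; n <- 100

# Frequentist 95% confidence interval
binom.test(x, n)$conf.int

# Bayesian 95% credible interval: posterior is Beta(x + 1, n - x + 1) under a flat prior
qbeta(c(0.025, 0.975), x + 1, n - x + 1)
```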
Difference between CHI-Square and Proportions Testing
The chi-squared test of independence (or association) and the two-sample proportions test are related. The main difference is that the chi-squared test is more general while the 2-sample proportions test is more specific. And, it happens that the proportions test is more targeted at specifically the type of data you have.
The chi-squared test handles two categorical variables where each one can have two or more values. And, it tests whether there is an association between the categorical variables. However, it does not provide an estimate of the effect size or a CI. If you used the chi-squared test with the Pfizer data, you’d presumably obtain significant results and know that an association exists, but not the nature or strength of that association.
The two proportions test also works with categorical data but you must have two variables that each have two levels. In other words, you’re dealing with binary data and, hence, the binomial distribution. The Pfizer data you had fits this exactly. One of the variables is experimental group: control or vaccine. The other variable is COVID status: infected or not infected. Where it really shines in comparison to the chi-squared test is that it gives you an effect size and a CI for the effect size. Proportions and percentages are basically the same thing, but displayed differently: 0.75 vs. 75%.
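A minimal R sketch contrasting the two tests on a hypothetical 2×2 table (the counts below are illustrative, not the actual Pfizer data):

```r
# Hypothetical counts: 8 infections among 20,000 vaccinated vs 160 among 20,000 controls
counts <- matrix(c(8,   19992,
                   160, 19840),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("vaccine", "control"),
                                 c("infected", "not_infected")))

chisq.test(counts)                               # association only, no effect size
prop.test(x = c(8, 160), n = c(20000, 20000))    # difference in proportions plus a 95% CI
```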
Difference between 2-Sample t-test and CHI-Square
CHI-Square is for categorical data and the t-test is for continuous data
https://htmlcolorcodes.com/color-picker/
https://www.w3schools.com/colors/colors_hexadecimal.asp
https://sourceforge.net/directory/os:windows/?q=hex+color
https://www.softpedia.com/get/Multimedia/Graphic/Graphic-Others/HEX-RGB-color-codes.shtml
https://www.umsiko.co.za/links/RGB-ColourNamesHex.pdf
http://www.workwithcolor.com/color-chart-full-01.htm
https://weschool.files.wordpress.com/2016/03/rgb-colournameshex.pdf
Sampling Methods | Types and Techniques Explained: https://www.scribbr.com/methodology/sampling-methods/
Introduction to Machine Learning by Duke University: https://exploreroftruth.medium.com/free-coursera-course-introduction-to-machine-learning-offered-by-duke-university-f229534e1e8e
Zero-Inflated Regression: https://towardsdatascience.com/zero-inflated-regression-c7dfc656d8af
Logistic Regression, Sigmoid Function: https://towardsdatascience.com/logistic-regression-cebee0728cbf