Some great books to look at

- The Seven Pillars of Statistical Wisdom (Stigler)
- Philosophy of Social Science (Rosenberg)
- Apollo’s Arrow (Orrell)
- The Model Thinker (Page)
- Artificial Intelligence (Russell and Norvig)
- Uncertainty: The Soul of Modeling, Probability & Statistics (Briggs)
- The Oxford Handbook of Causation (Beebee et al.)
- Probability Theory: The Logic of Science (Jaynes)
- Conjectures and Refutations: The Growth of Scientific Knowledge (Popper)
- The Logic of Scientific Discovery (Popper)
- The Structure of Scientific Revolutions (Kuhn)

https://www.phil.vt.edu/dmayo/personal_website/

SPC

Statistical Process Control and Process Capability: key highlights


1. Reduction of process variability

2. Monitoring and surveillance of a process

3. Estimation of product or process parameters


If a product is to meet or exceed customer expectations, it should generally be produced by a process that is stable or repeatable.


The seven major SPC tools are:

1. Histogram or stem-and-leaf plot

2. Check sheet

3. Pareto chart

4. Cause-and-effect diagram

5. Defect concentration diagram

6. Scatter diagram

7. Control chart (Shewhart chart) - technically the most complicated of the seven


The proper deployment of SPC helps create an environment in which all individuals in an organization seek continuous improvement in quality and productivity. This environment is best developed when management becomes involved in the process. Once this environment is established, routine application of the magnificent seven becomes part of the usual manner of doing business, and the organization is well on its way to achieving its quality improvement objectives.



To understand the statistical concepts that form the basis of SPC, we must first describe Shewhart’s theory of variability.



In any production process, regardless of how well designed or carefully maintained it is, a certain amount of inherent or natural variability will always exist. This natural variability or “background noise” is the cumulative effect of many small, essentially unavoidable causes. In the framework of statistical quality control, this natural variability is often called a “stable system of chance causes.” A process that is operating with only chance causes of variation present is said to be in statistical control. In other words, the chance causes are an inherent part of the process.



Other kinds of variability may occasionally be present in the output of a process. This variability in key quality characteristics usually arises from three sources: improperly adjusted or controlled machines, operator errors, or defective raw material. Such variability is generally large when compared to the background noise, and it usually represents an unacceptable level of process performance. We refer to these sources of variability that are not part of the chance cause pattern as assignable causes of variation. A process that is operating in the presence of assignable causes is said to be an out-of-control process.


From time t1 forward, the presence of assignable causes has resulted in an out-of-control process. Processes will often operate in the in-control state for relatively long periods of time. However, no process is truly stable forever, and, eventually, assignable causes will occur, seemingly at random, resulting in a shift to an out-of-control state where a larger proportion of the process output does not conform to requirements.



The chart contains a center line that represents the average value of the quality characteristic corresponding to the in-control state. (That is, only chance causes are present.) Two other horizontal lines, called the upper control limit (UCL) and the lower control limit (LCL), are also shown on the chart. These control limits are chosen so that if the process is in control, nearly all of the sample points will fall between them.


If the process is in control, all the plotted points should have an essentially random pattern. There is a close connection between control charts and hypothesis testing.





The hypothesis testing framework is useful in many ways, but there are some differences in viewpoint between control charts and hypothesis tests. For example, when testing statistical hypotheses, we usually check the validity of assumptions, whereas control charts are used to detect departures from an assumed state of statistical control.



In general, we should not worry too much about assumptions such as the form of the distribution or independence when we are applying control charts to a process to reduce variability and achieve statistical control. Furthermore, an assignable cause can result in many different types of shifts in the process parameters. For example, the mean could shift instantaneously to a new value and remain there (this is sometimes called a sustained shift); or it could shift abruptly, but the assignable cause could be short-lived and the mean could then return to its nominal or in-control value; or the assignable cause could result in a steady drift or trend in the value of the mean. Only the sustained shift fits nicely within the usual statistical hypothesis testing model.



One place where the hypothesis testing framework is useful is in analyzing the performance of a control chart. For example, we may think of the probability of type I error of the control chart (concluding the process is out of control when it is really in control) and the probability of type II error of the control chart (concluding the process is in control when it is really out of control). It is occasionally helpful to use the operating-characteristic curve of a control chart to display its probability of type II error. This would be an indication of the ability of the control chart to detect process shifts of different magnitudes. This can be of value in determining which type of control chart to apply in certain situations. For more discussion of hypothesis testing, the role of statistical theory, and control charts, see Woodall (2000).
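As a rough illustration (not from the notes above), here is a minimal sketch of these two error probabilities for a three-sigma x-bar chart, assuming normally distributed, independent samples with known in-control mean and standard deviation; scipy supplies the normal distribution function, and the shift sizes are arbitrary examples.

```python
# Minimal sketch: type I and type II error probabilities for a 3-sigma x-bar chart,
# assuming normal, independent data with known in-control parameters.
from scipy.stats import norm

L = 3.0   # control limits at +/- 3 sigma of the plotted statistic
n = 5     # subgroup (sample) size

# Type I error: probability a subgroup average falls outside the limits
# even though the process is still in control.
alpha = 2 * norm.sf(L)

def beta(delta, n=n, L=L):
    """Type II error for a mean shift of delta process standard deviations:
    probability the next subgroup average still plots inside the limits."""
    shift = delta * n ** 0.5
    return norm.cdf(L - shift) - norm.cdf(-L - shift)

print(f"alpha = {alpha:.4f}")
for delta in (0.5, 1.0, 1.5, 2.0):
    print(f"shift = {delta} sigma -> beta = {beta(delta):.3f}")
```

Plotting beta over a range of shift sizes traces out the operating-characteristic curve described above.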


We may give a general model for a control chart. Let w be a sample statistic that measures some quality characteristic of interest, and suppose that the mean of w is μ_w and the standard deviation of w is σ_w. Then the center line and control limits become

UCL = μ_w + Lσ_w
Center line = μ_w
LCL = μ_w - Lσ_w

where L is the distance of the control limits from the center line, expressed in standard deviation units.
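A minimal sketch of this general model applied to an x-bar chart, using simulated subgroup data; estimating σ_w directly from the subgroup averages is a simplification of how the limits are usually computed in practice (typically via R-bar/d2 or s-bar/c4).

```python
# Minimal sketch: three-sigma limits for an x-bar chart using the general
# Shewhart model UCL = mu_w + L*sigma_w, CL = mu_w, LCL = mu_w - L*sigma_w.
# The data are simulated; a real chart would use measured subgroups.
import numpy as np

rng = np.random.default_rng(1)
subgroups = rng.normal(loc=10.0, scale=0.2, size=(25, 5))  # 25 subgroups of n = 5

xbar = subgroups.mean(axis=1)     # the statistic w for each subgroup
mu_w = xbar.mean()                # estimated mean of w
sigma_w = xbar.std(ddof=1)        # simplified estimate of the std. dev. of w
L = 3                             # three-sigma limits

ucl = mu_w + L * sigma_w
cl = mu_w
lcl = mu_w - L * sigma_w
print(f"UCL = {ucl:.3f}, CL = {cl:.3f}, LCL = {lcl:.3f}")
print("points outside the limits:", np.where((xbar > ucl) | (xbar < lcl))[0])
```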



A very important part of the corrective action process associated with control chart usage is the out-of-control-action plan (OCAP). An OCAP is a flow chart or text-based description of the sequence of activities that must take place following the occurrence of an activating event. These are usually out-of-control signals from the control chart. The OCAP consists of checkpoints, which are potential assignable causes, and terminators, which are actions taken to resolve the out-of-control condition, preferably by eliminating the assignable cause. It is very important that the OCAP specify as complete a set as possible of checkpoints and terminators, and that these be arranged in an order that facilitates process diagnostic activities. Often, analysis of prior failure modes of the process and/or product can be helpful in designing this aspect of the OCAP. Furthermore, an OCAP is a living document in the sense that it will be modified over time as more knowledge and understanding of the process is gained. Consequently, when a control chart is introduced, an initial OCAP should accompany it. Control charts without an OCAP are not likely to be useful as a process improvement tool.




Designing a control chart requires specifying three things: in the x-bar chart example, we specified a sample size of five measurements, three-sigma control limits, and a sampling frequency of once every hour. Increasing the sample size reduces the probability of a type II error.
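A quick sketch of that last claim, assuming a three-sigma x-bar chart, normal data, and a fixed one-sigma shift in the mean; the type II error probability shrinks as the sample size grows.

```python
# Minimal sketch: effect of sample size n on the type II error probability
# of a 3-sigma x-bar chart for a fixed 1-sigma mean shift (assumed values).
from scipy.stats import norm

L, delta = 3.0, 1.0   # 3-sigma limits, shift of one process standard deviation

for n in (3, 5, 10, 15):
    beta = norm.cdf(L - delta * n ** 0.5) - norm.cdf(-L - delta * n ** 0.5)
    print(f"n = {n:2d} -> beta = {beta:.3f}")
```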


PMP Notes

implementing integration management

managing scope

Both of the above are covered in Chapter 4 of the PMBOK Guide; we will go into more detail on them there.


domain - initiating - 13 percent


26 questions

conduct project selection methods

define the scope

document project risks, assumptions and constraints

identify and perform stakeholder analysis

develop the project charter

obtain project charter approval


domain - planning - 24 percent


48 questions

define and record requirements, constraints and assumptions

create the WBS

create a budget plan

develop the project schedule and timeline

create the human resource management plan

create the communications plan

develop the project procurement plan

establish the project quality management plan

define the change management plan

create the project risk management plan

present the project management plan to the key stakeholders

host the project kick-off meeting



domain - executing - 30 percent (getting things done; the most heavily weighted domain on the PMP exam)

60 questions

manage project resources for project execution

enforce the quality management plan

implement approved changes as directed by the change management plan

execute the risk management plan to manage and respond to risk events

develop the project team through mentoring, coaching and motivation



domain - monitoring and controlling - 25 percent

50 questions

measure project performance 

verify and manage changes to the project

ensure project deliverables conform to quality standards

monitor all risks and update the risk register

review corrective actions and assess issues

manage project communications to ensure stakeholder engagement







HEOR - study material

https://www.edx.org/course/healthcare-finance-economics-and-risk 

 

https://ocw.mit.edu/courses/economics/14-01sc-principles-of-microeconomics-fall-2011/index.htm

 

https://www.jhsph.edu/academics/online-learning-and-courses/

 

https://www.pce.uw.edu/certificates/health-care-analytics

 

https://www.pce.uw.edu/degrees/masters-health-informatics-health-information-management

 

https://www.pce.uw.edu/degrees/executive-masters-health-administration

 

https://www.jefferson.edu/university/population-health/degrees-programs/applied-health-economics.html

 

https://sop.washington.edu/choice/graduate-education-training-programs/certificates/health-economics-and-outcomes-research/

 

https://marksmanacademy.org/p/certification-programme-in-fundamentals-of-health-economics-and-outcomes-research-heor

 

https://marksmanhealthcare.com/

 

AI Terminology and References

Modeling terms

- Features: A set of explanatory variables collected on subjects or samples. Commonly referred to as the independent variables or covariates in the statistical and epidemiological literature.
- Labels: The outcome or response of interest. Also referred to as the dependent variable or target variable.
- Supervised Learning: Algorithms that map a set of input variables (e.g. features) to output variables (e.g. labels). Describes the vast majority of machine learning tasks in healthcare.
- Unsupervised Learning: Algorithms that attempt to extract hidden or latent structure from a set of features. Popular examples include clustering (e.g. k-means clustering) and dimensionality reduction techniques (e.g. principal components analysis (PCA)). In contrast to supervised learning (see above).
- Causal Inference: Statistical methods that attempt to estimate the effect of an intervention. When using observational (non-experimental) data, these methods require additional modeling assumptions drawn from domain knowledge.
- Zero-shot Learning: Using a model to make predictions for a task despite having no training data for that task.
- Bias (Statistical): Systematic difference between the true value of a parameter in a model and the value of that parameter as estimated from data. Can also refer to the systematic difference between a model's predicted values and the true values of the labels.
- Word Sense Disambiguation: Determining which meaning of a word is intended when the same word can have different meanings. For example, “discharge” can indicate the time a patient leaves the hospital, or it can refer to the flow of fluid from part of the body.
- Generative Adversarial Networks (GANs): A class of machine learning systems that creates synthetic data similar to a provided dataset through the use of two neural networks functioning as a discriminator and a generator.
- Generative Models: A class of models that model the features and the label variables jointly, as opposed to discriminative models, which model the conditional probability of the label given the features.
- Matrix Factorization: A mathematical technique that factorizes one large, dense matrix (e.g. patient biomarker values) into lower-dimensional matrices.
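To make the last entry concrete, here is a minimal sketch of matrix factorization via truncated SVD on a simulated patients-by-biomarkers matrix; the matrix shape, rank, and noise level are illustrative assumptions, not values from any real dataset.

```python
# Minimal sketch: factorize a "patients x biomarkers" matrix into
# lower-dimensional factors with a truncated SVD (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_biomarkers, rank = 200, 30, 3

# Low-rank structure plus noise stands in for real biomarker measurements.
latent = rng.normal(size=(n_patients, rank))
loadings = rng.normal(size=(rank, n_biomarkers))
X = latent @ loadings + 0.1 * rng.normal(size=(n_patients, n_biomarkers))

# Keep only the top-k singular vectors/values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# U[:, :k] (patient scores) and Vt[:k, :] (biomarker loadings) are the
# lower-dimensional matrices referred to in the definition above.
print("relative reconstruction error:", np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```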



Data terms

- Bias (Fairness): Variation in human or model performance based on features of the data that reflect societal biases.
- Confounding: Variables (potentially unmeasured) that affect both the treatment and the outcome of interest. Confounding can cause bias in the statistical sense if not controlled or accounted for.
- Missing Data or Missingness: Portions of the data that are unobserved. Missingness can refer to the scenario where values are missing for certain patients (e.g. a missing lab value for a patient) or to the scenario where a potentially relevant variable is not measured at all for any patient.
- Training Data: The data that were used to build a model.
- Measurement Drift: When the data gathered from a population change noticeably over time (e.g. the world population becoming more obese).
- Imputation: Replacing missing values in the dataset (e.g. with the mean) in order to analyze data points with missing features.
- Sparsity: Rareness of certain events, resulting in few observations of "positive" examples. Sparsity can occur in both the features and the labels.
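A minimal sketch of the imputation entry above, using a toy pandas DataFrame with hypothetical lab columns; mean imputation is shown only because it is the simplest option, not because it is the recommended one.

```python
# Minimal sketch: mean imputation of missing lab values in a toy DataFrame.
import numpy as np
import pandas as pd

labs = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "hemoglobin": [13.5, np.nan, 12.1, 14.0],   # one patient is missing this lab
    "creatinine": [1.0, 0.9, np.nan, 1.2],
})

# Replace each missing value with the column mean (a deliberately naive choice;
# model-based or multiple imputation is often more appropriate).
imputed = labs.fillna(labs.mean(numeric_only=True))
print(imputed)
```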


Common problems in ML (for each: problem, short-term solution, long-term outlook)

Complex Data Challenges

- Data Quality Matters
  Problem: Sparsity, missingness, and biased sampling make modeling difficult.
  Short-term solution: Data aggregation and imputation techniques, such as sparse encoding methods or matrix factorization, can be used to deal with a lack of "full" data; synthetic data that preserves privacy allows the sharing of EHR data.
  Long-term outlook: Creation of high-quality research data containing robust documentation of all aspects of the data generation process.

- Disease Data Imbalances
  Problem: Health conditions are the result of sporadic diseases, leading to highly unbalanced data.
  Short-term solution: Modified loss functions for important classes and data subsampling are often quick fixes.
  Long-term outlook: Patient self-reporting and passive data collection are needed to create a robust understanding of "normal" baselines.

- Data Only For The Few
  Problem: Limited access to datasets stymies research.
  Short-term solution: Standardized performance metrics, learning with anonymized data sharing, and privacy-preserving machine learning are all important areas of research growth.
  Long-term outlook: Engaging patients can create voluntarily shared data pools, and more datasets can be created that respect medical regulations.

Robustness to the Unseen

- Same Name, Different Measure
  Problem: Measurement drift as equipment ages or changes.
  Short-term solution: Transfer learning and domain adaptation have attempted to compensate for these trends.
  Long-term outlook: Better devices should be made to capture additional signals, or self-diagnose when the signal is no longer calibrated.

- Anticipating New Data
  Problem: Generalizability of models to new input data, e.g., "X" values not seen before.
  Short-term solution: Model interpretability, domain adaptation, and manifold learning are used to learn the common spaces that may connect new variables to prior ones.
  Long-term outlook: Regulatory incentives should be created to ensure and fund generalizability of data inputs.

- Handling the Next Zika
  Problem: Zero-shot learning for new disease targets, e.g., "Y" values not seen before.
  Short-term solution: Abnormality detection and human-in-the-loop modeling are used to detect when a model may be poorly calibrated for a novel condition.
  Long-term outlook: Expedited clinical capture is key for detecting new conditions, especially if they are fast-moving.

Unknown Knowns

- Difficult Disease Endotyping
  Problem: Diseases have underlying heterogeneity and may have undiscovered subtypes.
  Short-term solution: Generative modeling and unsupervised clustering with outcome-based loss measures have been previously attempted.
  Long-term outlook: Additional data sources as well as fundamental biomedical research are needed to create robust clinical endophenotyping for machine learning targets.

- Creating Common Ground
  Problem: There is no consensus on meaningful model targets or inputs.
  Short-term solution: Causal inference and diagnostic baselines are often employed to understand potential directionalities of process and establish useful tasks.
  Long-term outlook: Patient self-reporting of outcomes combined with traditional expert-verified diagnoses may be more meaningful for many conditions of interest.




Customer Segmentation and Methods

Methods used: RFM model, k-means clustering, EM clustering, Generalized Differential RFM Method (GDRFM)

Customer segmentation provides a full range of management perspectives, gives enterprises a better chance to communicate with customers, and enhances the rate at which customers return.

Commonly used approaches are the RFM method, the customer value matrix, and the CLV (customer lifetime value) method.

It costs five times more to gain a new customer than to keep an existing one, and ten times more to win back a dissatisfied customer (Marcus C., 1998) - Harvard.

Statistical clustering algorithms include partition-based clustering, density-based clustering, fuzzy clustering, and hierarchical clustering.

In RFM analysis, collinearity is sometimes found between Frequency and Monetary. The founder of RFM suggested using the average purchase value rather than the total sum for Monetary, and converting frequency of purchases to the number of purchases.


A customer value matrix uses Frequency of Purchase (F) and Average Purchase Amount (M) for segmentation in a 2x2 matrix, in the same spirit as the Boston Consulting Group's growth-share matrix.

Data mining consists of more than collecting and managing data; it also includes analysis and prediction. Data mining includes association, sequence or path analysis, classification, clustering, and forecasting of future activities.

Data mining is the main step of the knowledge discovery in databases (KDD) process. Data mining tasks are very distinct and diverse because many patterns exist in a huge database. The data mining functionalities, and the variety of knowledge they discover, are: characterization, discrimination, association analysis, classification, prediction, and clustering.

Clustering methods can be categorized into two types of algorithms: hierarchical algorithms and non-hierarchical (partition) algorithms.

In hierarchical algorithms, the number of clusters does not need to be known at the beginning, which is a strong advantage over non-hierarchical methods. On the other hand, once an instance is assigned to a cluster, the assignment is irrevocable. Therefore, the output of hierarchical methods can be used to generate interpretations of the data set and may be used as input for a non-hierarchical method in order to improve the resulting clusters. (This is similar to what I am proposing: RFM first, then k-means; see the sketch at the end of this section.)


Non-hierarchical or partition algorithms (NHC) typically determine all clusters at the outset, but they can also be used as divisive algorithms within hierarchical clustering. Their advantage is that the algorithm iterates over all possible movements of data points between the formed clusters until a stopping criterion is met. NHC algorithms are sensitive to the initial partition, and because of this there are many local minima.
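A minimal sketch of the RFM-then-k-means segmentation proposed above, assuming a toy transactions table with hypothetical column names and using scikit-learn's KMeans; Monetary is taken as the average purchase amount, per the collinearity note earlier, and the number of clusters is fixed arbitrarily at two.

```python
# Minimal sketch: compute RFM features per customer, then partition them with k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10", "2024-02-20",
                            "2024-03-15", "2024-01-30", "2024-03-10", "2024-03-20"]),
    "amount": [120, 80, 40, 55, 60, 500, 25, 30],
})

snapshot = tx["date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),   # days since last purchase
    frequency=("date", "count"),                             # number of purchases
    monetary=("amount", "mean"),                             # average spend, not total
)

# Standardize the RFM features, then segment the customers with k-means.
X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(rfm)
```

scikit-learn's default k-means++ initialization and the multiple restarts controlled by n_init also help with the sensitivity to initial partitions noted for NHC algorithms; in practice the number of clusters would be chosen with an elbow or silhouette analysis rather than fixed in advance.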