AI Terminology and References

Modeling Terms
Features: A set of explanatory variables collected on subjects or samples. Commonly referred to as independent variables or covariates in the statistical and epidemiological literature.
Labels: The outcome or response of interest. Also referred to as the dependent variable or target variable.
Supervised Learning: Algorithms that map a set of input variables (e.g. features) to output variables (e.g. labels). Describes the vast majority of tasks in machine learning in healthcare.
Unsupervised Learning: Algorithms that attempt to extract hidden or latent structure from a set of features. Popular examples of unsupervised learning include clustering (e.g. k-means clustering) and dimensionality reduction techniques (e.g. principal components analysis (PCA)). In contrast to supervised learning (see above).
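To make the clustering idea concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on made-up one-dimensional values; the data and starting centroids are illustrative only, and real analyses would use a library such as scikit-learn.

```python
# Toy k-means on 1-D data: repeatedly assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.

def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Keep a centroid in place if its cluster happens to be empty.
        centroids = [sum(v) / len(v) if v else centroids[c]
                     for c, v in clusters.items()]
    return centroids

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]   # two obvious groups
print([round(c, 2) for c in kmeans_1d(data, [0.0, 9.0])])  # -> [1.0, 5.07]
```

The algorithm receives no labels: the two groups emerge purely from the structure of the feature values.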
Causal Inference: Statistical methods that attempt to estimate the effect of an intervention. When using observational (non-experimental) data, these methods require additional modeling assumptions drawn from domain knowledge.
Zero-shot Learning: Using a model to make predictions for a task despite having no training data for that task.
Bias (Statistical): Systematic difference between the true value of a parameter in a model and the value of that parameter as estimated from data. Can also refer to the systematic difference between a model's predicted values and the true values of the labels.
Word Sense Disambiguation: Determining which meaning of a word is intended when the same word can carry different meanings. For example, "discharge" can indicate the time a patient leaves the hospital, or it can refer to the flow of fluid from part of the body.
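A crude sketch of the "discharge" example: count how many context words overlap with a hand-written gloss for each sense and pick the best match (a Lesk-style heuristic). The sense names and glosses here are invented for illustration; practical systems learn senses from large corpora.

```python
# Toy word-sense disambiguation by gloss overlap.
SENSES = {
    "leave_hospital": {"patient", "leaves", "hospital", "home", "released"},
    "body_fluid": {"fluid", "wound", "flow", "body", "drainage"},
}

def disambiguate(sentence):
    words = set(sentence.lower().split())
    # Choose the sense whose gloss shares the most words with the context.
    return max(SENSES, key=lambda s: len(SENSES[s] & words))

print(disambiguate("the patient was ready for discharge from hospital"))
# -> leave_hospital
print(disambiguate("fluid discharge from the wound"))
# -> body_fluid
```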
Generative Adversarial Networks (GANs): Class of machine learning systems that creates synthetic data similar to a provided dataset through the use of two neural networks functioning as a discriminator and a generator.
Generative Models: Class of models that jointly model both the features and the label variables, as opposed to discriminative models, which model the conditional probability of the label given the features.
Matrix Factorization: Mathematical technique that factorizes one large and dense matrix (e.g. patient biomarker values) into lower-dimensional matrices.
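A minimal sketch of the idea, assuming a tiny made-up "patients x biomarkers" matrix: approximate the matrix as the product of two rank-1 factor vectors fit by plain gradient descent. Real applications use far larger matrices, higher ranks, and library implementations.

```python
# Factorize X (n x m) into rank-1 factors u and v so that u[i]*v[j] ~ X[i][j].
def factorize(X, steps=3000, lr=0.01):
    n, m = len(X), len(X[0])
    u = [0.5] * n          # per-patient factors
    v = [0.5] * m          # per-biomarker factors
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                err = u[i] * v[j] - X[i][j]   # reconstruction error
                u[i] -= lr * err * v[j]       # gradient step on u
                v[j] -= lr * err * u[i]       # gradient step on v
    return u, v

X = [[1.0, 2.0], [2.0, 4.0]]   # exactly rank 1, so the error can reach ~0
u, v = factorize(X)
approx = [[u[i] * v[j] for j in range(2)] for i in range(2)]
```

Because the factors are lower-dimensional than the original matrix, they can also serve as compact patient and biomarker representations.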



Data Terms
Bias (Fairness): Variation in human or model performance based on features of the data that reflect societal biases.
Confounding: Variables (potentially unmeasured) that affect both the treatment and the outcome of interest. Confounding can cause bias in the statistical sense if not controlled or accounted for.
Missing Data or Missingness: Portions of the data that are unobserved. Missingness can refer to values that are missing for certain patients (e.g. a missing lab value for a patient) or to a potentially relevant variable that is not measured at all, for any patient.
Training Data: Data that was used to build a model.
Measurement Drift: When the data gathered on a population changes noticeably over time (e.g. the world population becoming more obese).
Imputation: Replacing missing values in a dataset (e.g. with the mean) so that data points with missing features can still be analyzed.
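The simplest version, mean imputation, can be sketched in a few lines; the lab values below are made up for illustration, and more careful approaches model the missingness mechanism rather than assuming the mean is representative.

```python
# Replace missing entries (None) in a column with the mean of the
# observed values in that column.
def impute_mean(column):
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

creatinine = [0.9, None, 1.1, None, 1.0]          # illustrative lab values
print(impute_mean(creatinine))   # -> [0.9, 1.0, 1.1, 1.0, 1.0]
```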
Sparsity: Rareness of certain events, resulting in few observations of "positive" examples. Sparsity can occur in both the features and the labels.
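Sparse features are often stored by keeping only the non-zero entries, which also speeds up computation. A minimal sketch with invented feature vectors (e.g. rare diagnosis-code indicators):

```python
# Store a mostly-zero feature vector as a {index: value} dict.
def to_sparse(dense):
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(a, b):
    # Only indices present in both vectors can contribute to the product.
    return sum(v * b[i] for i, v in a.items() if i in b)

x = to_sparse([0, 0, 3, 0, 1, 0, 0, 0])
y = to_sparse([0, 2, 4, 0, 0, 0, 0, 5])
print(x)                 # -> {2: 3, 4: 1}
print(sparse_dot(x, y))  # -> 12
```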


Common Problems in ML
(Each problem is listed with a short-term solution and the long-term outlook.)
Complex Data Challenges

Data Quality Matters
Problem: Sparsity, missingness, and biased sampling make modeling difficult.
Short-Term Solution: Data aggregation and imputation techniques, such as sparse encoding methods or matrix factorization, can be used to deal with a lack of "full" data. Synthetic data that preserves privacy allows the sharing of EHR data.
Long-Term Outlook: Creation of high-quality research data containing robust documentation of all aspects of the data generation process.
Disease Data Imbalances
Problem: Many health conditions are sporadic, leading to highly imbalanced data.
Short-Term Solution: Modified loss functions for important classes and data subsampling are often quick fixes.
Long-Term Outlook: Patient self-reporting and passive data collection are needed to create a robust understanding of "normal" baselines.
Data Only For The Few
Problem: Limited access to datasets stymies research.
Short-Term Solution: Standardized performance metrics, learning with anonymized data sharing, and privacy-preserving machine learning are all important areas of research growth.
Long-Term Outlook: Engaging patients can create voluntarily shared data pools, and more datasets can be created that respect medical regulations.
Robustness to the Unseen

Same Name, Different Measure
Problem: Measurements drift as equipment ages or changes.
Short-Term Solution: Transfer learning and domain adaptation have attempted to compensate for these trends.
Long-Term Outlook: Better devices should be made to capture additional signals, or to self-diagnose when the signal is no longer calibrated.
Anticipating New Data
Problem: Generalizability of models to new input data, e.g., "X" values not seen before.
Short-Term Solution: Model interpretability, domain adaptation, and manifold learning are used to learn the common spaces that may connect new variables to prior ones.
Long-Term Outlook: Regulatory incentives should be created to ensure and fund generalizability of data inputs.
Handling the Next Zika
Problem: Zero-shot learning on new disease targets, e.g., "Y" values not seen before.
Short-Term Solution: Abnormality detection and human-in-the-loop modeling are used to detect when a model may be poorly calibrated for a novel condition.
Long-Term Outlook: Expedited clinical capture is key for detecting new conditions, especially if they are fast-moving.
Unknown Knowns

Difficult Disease Endotyping
Problem: Diseases have underlying heterogeneity and may have undiscovered subtypes.
Short-Term Solution: Generative modeling and unsupervised clustering with outcome-based loss measures have been previously attempted.
Long-Term Outlook: Additional data sources as well as fundamental biomedical research are needed to create robust clinical endophenotyping for machine learning targets.
Creating Common Ground
Problem: There is no consensus on meaningful model targets or inputs.
Short-Term Solution: Causal inference and diagnostic baselines are often employed to understand potential directionalities of process and establish useful tasks.
Long-Term Outlook: Patient self-reporting of outcomes, combined with traditional expert-verified diagnoses, may be more meaningful for many conditions of interest.