This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Causal structure learning refers to the process of identifying causal structures from observational data, and it has multiple applications in biomedicine and health care.
This paper provides a practical review and tutorial on scalable causal structure learning models, with examples of real-world data, to help health care audiences understand and apply them.
We reviewed traditional (combinatorial and score-based) methods for causal structure discovery and machine learning–based schemes. Various traditional approaches have been studied to tackle this problem, the most important among these being the
We also compared the performance of traditional and machine learning–based algorithms for causal discovery on some benchmark data sets. The Directed Acyclic Graph-Graph Neural Network has the lowest structural Hamming distance (19) and false positive rate (0.13) on the Sachs data set, whereas Greedy Equivalence Search and Max-Min Hill Climbing have the best false discovery rate (0.68) and true positive rate (0.56), respectively.
Machine learning–based approaches, including deep learning, have many advantages over traditional approaches, such as scalability, the ability to include a greater number of variables, and potential applicability to a wide range of biomedical problems, such as genetics, if sufficient data are available. Furthermore, these models are more flexible than traditional models and are poised to positively affect many applications in the future.
Many applications in biomedicine require knowledge of the underlying causal relationships between various factors, beyond association or correlation. Randomized controlled trials are widely used to uncover causality, but such experiments can be prohibitively expensive or unethical in many cases. This has sparked an enormous amount of interest in identifying causal effects from observational data [
In this paper, we discuss causal structure learning, that is, learning causal relationships between different factors that are represented as directed graph structures, and its applications in biomedicine. The causal structure is represented by a causal graph (also called a causal Bayesian network), which is a directed acyclic graph (DAG) in which the nodes represent variables and the edges represent causation (
For example, consider the example of a gene regulatory network [
Many researchers in the biomedical field are interested in causality and not just correlation (eg, whether a particular treatment affects a particular outcome). Unlike association or correlation-based studies that simply indicate that 2 variables are correlated, this approach seeks to determine the directional relationship between any 2 variables (eg, between a treatment variable and an outcome variable). In biomedicine, causal structure learning can be applied in a variety of applications.
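To make the representation concrete, a causal DAG over a handful of biomedical variables can be encoded as a binary adjacency matrix, and acyclicity can be verified directly. The following minimal Python sketch is our own illustration (the variables and edge choices are hypothetical):

```python
import numpy as np

# Hypothetical 3-variable example: gene -> protein -> phenotype.
# A[i, j] = 1 encodes a directed edge from variable i to variable j.
A = np.array([
    [0, 1, 0],  # gene -> protein
    [0, 0, 1],  # protein -> phenotype
    [0, 0, 0],
])

def is_dag(adj: np.ndarray) -> bool:
    """A directed graph is acyclic iff no power of its adjacency matrix
    has a nonzero trace (a nonzero trace of adj^k signals a length-k cycle)."""
    d = adj.shape[0]
    power = np.eye(d)
    for _ in range(d):
        power = power @ adj
        if np.trace(power) != 0:
            return False
    return True

print(is_dag(A))  # True: the chain gene -> protein -> phenotype has no cycle
```

Adding an edge from phenotype back to gene would close a cycle and make `is_dag` return False, which is exactly the configuration a causal graph must exclude.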
Example of the causal structure. (A) A gene regulatory network is an abstracted structure (given by the directed graph on the right) of the complex biophysical process shown on the left. (B) A gene regulatory structure from the transmiR database for mice [
A gene regulatory network is a network in which molecular regulators and genes are the nodes, and the directed edges denote the interactions among them [
Causal structure learning algorithms have been used to jointly deduce the phenotype network structure and directional genetic architecture [
The tumor-specific causal inference algorithm proposed by Xue et al [
In our comparative analysis of performance on this data set, we found that machine learning models can also be effective at finding causal structures (details are available in the
Different regions of the brain have distinct functions. Previous studies have used correlation-based methods [
Overview of the methods reviewed and benchmark results sections of the paper. CGNN: Causal Generative Neural Network; DAG-GNN: Directed Acyclic Graph-Graph Neural Network; FCI: Fast Causal Inference; GAE: graph autoencoder; GES: Greedy Equivalence Search; GRAN-DAG: gradient-based neural directed acyclic graph learning; IC: inductive causation; LiNGAM: linear non-Gaussian acyclic model; MMHC: Max-Min Hill Climbing; PC: Peter Spirtes and Clark Glymour; RL-BIC: Reinforcement Learning-Bayesian Information Criterion; SAM: Structural Agnostic Modeling.
Causal structure learning has also been used in epidemiology with patients' medical records. Many complex diseases are multifactorial, in which a combination of factors contributes to disease predisposition. Causal structure learning considers multiple confounders when determining the causal effect of one factor of interest on another. For example, causal structure has been used to disentangle psychological factors that predispose adolescents to smartphone addiction [
However, there are a few challenges. The general approach to solving this problem of learning a DAG from data, which has been studied for a long time [
Several approaches have been used to address this intractable time complexity. Traditionally, constraint-based and score-based methods, which search for the optimal graph in a discrete space of candidate graphs, have been used to learn the DAG from data. Constraint-based methods such as the
Hence, researchers have investigated various score-based methods that assign a score to each candidate DAG based on the data and select the one with the best score. Although score-based methods scale better than constraint-based methods, they do not scale well to several thousand variables. On the other hand, patient medical records in electronic health records or claims data raise severe scalability concerns, because they include up to 144,000 International Classification of Diseases, Ninth Revision or 69,823 International Classification of Diseases, Tenth Revision diagnosis codes, >2000 US Food and Drug Administration–approved drugs, and >10,000 Current Procedural Terminology procedure or laboratory test codes.
To overcome the limited scalability of traditional methods, recent advances in machine learning algorithms have relaxed the problem of finding an optimal DAG into a continuous optimization problem with smooth acyclicity constraints. This enables the use of nonheuristic machine learning (including deep learning) algorithms to determine the optimal causal structure. This is a promising development in the field of biomedicine. In this study, we focus on scalable algorithms.
There are 2 distinct approaches in the context of treatment effect evaluation: the structural approach and potential outcome framework approach [
Summary of various algorithms for causal structure learning.
Algorithm  DS^{a}  CS^{b}  Summary  Remarks  Scalability 
PC^{c}  ✓  ✖  A completed partially directed acyclic graph (CPDAG^{d}) is produced by iteratively checking the conditional independence of adjacent nodes, conditioned on subsets of their neighbors of all sizes.  Outputs usually converge to the same equivalence class; high FPR^{e} on experimental data  +^{f}
IC^{g}  ✓  ✖  Returns the equivalence class of the DAG^{h} based on the estimated probability distribution of random variables and an underlying DAG structure.  Outputs usually converge to the same equivalence class.  +
FCI^{i}  ✓  ✖  Modifies the PC algorithm to detect unknown confounding variables and produces asymptotically correct results.  Faster than PC with similar TPR^{j}; converges to the same asymptotic result; high experimental FPR  ++
GES^{k}  ✓  ✖  Starts with an empty graph and iteratively adds and deletes edges in the graph by optimizing a score function.  Faster than PC with higher TPR; stable result for the same score function  ++ 
Fast GES  ✓  ✖  Improved and parallelized version of GES  Faster than GES; same TPR; stable result for the same score function  ++ 
K2  ✓  ✖  Performs a greedy heuristic search for each node's parents.  Greedy searches might return very suboptimal solutions.  ++
MMHC^{l}  ✓  ✖  Uses the Max-Min Parents and Children algorithm to find the skeleton of the network and a constrained greedy search for edge orientation.  Greedy searches might return suboptimal solutions.  +
LiNGAM^{m}  ✓  ✖  Estimates the linear structural model x = Bx + e, with non-Gaussian noise e, by using independent component analysis.  Works very well on linear data but not on nonlinear data.  ++
NOTEARS  ✖  ✓  Uses the smooth function h(W) = tr(e^{W∘W}) − d to express acyclicity, turning structure learning into a continuous optimization problem.  Might converge to many different DAGs; GPUs^{n} can speed up the process.  +++
NOBEARS  ✖  ✓  Proposed a new acyclicity constraint that allows for faster optimization and scalability, and a polynomial regression loss to infer gene regulatory networks from nonlinear gene expressions.  Might converge to many different DAGs; GPUs can speed up the process.  +++ 
DAG-GNN^{o}  ✖  ✓  Uses an autoencoder framework trained with deep learning, infers the causal structure from the weights of the trained network, and is more scalable than NOTEARS.  Might converge to many different DAGs; GPUs can speed up the process.  ++++
NOFEARS  ✖  ✓  Modifies NOTEARS so that the scoring function remains convex, improving convergence to local minima.  Might converge to many different DAGs; GPUs can speed up the process.  ++++
GAE^{p}  ✖  ✓  Scalable graph autoencoder framework (GAE) whose training time increases linearly with the number of variable nodes.  Good accuracy; might converge to many different DAGs; GPUs can speed up the process.  ++++ 
GRAN-DAG^{q}  ✖  ✓  Extends the NOTEARS algorithm to nonlinear relationships.  Works on nonlinear data; better accuracy than NOTEARS; might converge to many different DAGs; GPUs can speed up the process.  ++++
CGNN^{r}  ✖  ✓  Generative model of the joint distribution of variables that minimizes the MMD^{s} between the graph and the data.  Does not always converge to a single class of equivalent DAGs; GPUs can speed up the process.  ++++
SAM^{t}  ✖  ✓  Structurally agnostic model for causal discovery and penalized adversarial learning.  Does not always converge to a single class of equivalent DAGs; GPUs can speed up the process.  ++++ 
RL-BIC^{u}  ✖  ✓  Reinforcement learning–based algorithm that uses both the acyclicity constraint and the BIC^{v} score.  Very good accuracy; does not always converge to a single class of equivalent DAGs; GPUs can speed up the process.  ++++
^{a}DS: discrete space algorithms.
^{b}CS: continuous space algorithms.
^{c}PC: Peter Spirtes and Clark Glymour.
^{d}CPDAG: completed partially directed acyclic graph.
^{e}FPR: false positive rate.
^{f}The number of + symbols indicates an algorithm's relative scalability (more symbols indicate better scalability).
^{g}IC: inductive causation.
^{h}DAG: directed acyclic graph.
^{i}FCI: Fast Causal Inference.
^{j}TPR: true positive rate.
^{k}GES: Greedy Equivalence Search.
^{l}MMHC: MaxMin Hill Climbing.
^{m}LiNGAM: linear non-Gaussian acyclic model.
^{n}GPUs: graphical processing units.
^{o}DAG-GNN: Directed Acyclic Graph-Graph Neural Network.
^{p}GAE: graph autoencoder.
^{q}GRAN-DAG: gradient-based neural directed acyclic graph learning.
^{r}CGNN: causal generative neural network.
^{s}MMD: maximum mean discrepancy.
^{t}SAM: Structural Agnostic Modeling.
^{u}RL-BIC: Reinforcement Learning-Bayesian Information Criterion.
^{v}BIC: Bayesian information criterion.
This study provides a comparative review, for the biomedicine community, of various scalable algorithms used to discover causal structures from observational data. Some of these traditional and score-based methods have been extensively studied [
This tutorial paper presents algorithms for causal structure identification in biomedical informatics. In the
In this section, we discuss 2 paradigms of algorithms for causal structure learning. First, we consider algorithms that search for the optimal DAG in the discrete space of all possible DAGs (space of all possible discrete DAGs for a given number of variable nodes) or
The first type that we discuss in this section is discrete space algorithms for causal discovery; that is, algorithms that search for the optimal DAG in the discrete space of candidate causal graphs. This is in contrast to continuous space algorithms (discussed in the
The discrete space algorithms can be divided into the following 4 types: combinatorial constraint-based models, score-based models, hybrid models, and functional models. Combinatorial constraint-based methods check the conditional independence relations of 2 adjacent nodes conditioned on subsets of their neighbors. Such methods can be useful when the number of variables is up to a few hundred. Score-based methods optimize a score representing the goodness of fit and can handle more variables than constraint-based methods. Hybrid methods combine constraint-based and score-based algorithms. Functional models find structural equations that describe the causal relationships and are useful mostly when the variables can be assumed to follow some linear or nonlinear equations.
We now focus on combinatorial optimization methods, where conditional independence relationships in the data are used for finding the optimal DAG.
The PC algorithm was proposed by
The PC algorithm is order-dependent; that is, its output can depend on the order in which the variables are provided to the algorithm. To address this problem, Colombo and Maathuis [
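At the core of PC-style algorithms is a conditional independence test. For Gaussian data, the standard choice is the Fisher z-test on partial correlations; the sketch below is our own minimal illustration (the variable names, the simulated chain x → y → z, and the significance level are arbitrary choices, not part of any specific implementation):

```python
import numpy as np
from math import sqrt, log, erf

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in `cond`,
    read off the inverse of the correlation submatrix (precision matrix)."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    return -prec[0, 1] / sqrt(prec[0, 0] * prec[1, 1])

def ci_test(data, i, j, cond, alpha=0.05):
    """Fisher z-test for X_i independent of X_j given X_cond under a
    Gaussian model, as used inside PC-style algorithms.
    Returns True when independence is NOT rejected."""
    n = data.shape[0]
    r = partial_corr(data, i, j, cond)
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_value > alpha

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)   # x -> y
z = y + 0.5 * rng.normal(size=2000)   # y -> z
data = np.column_stack([x, y, z])
print(ci_test(data, 0, 2, []))   # False: x and z are marginally dependent
print(ci_test(data, 0, 2, [1]))  # conditioning on y should screen x off from z
```

The PC algorithm removes the edge between two variables as soon as some conditioning set renders them independent, so on this chain the x–z edge would be deleted once y enters the conditioning set.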
The inductive causation (IC) algorithm uses the estimated probability distribution of random variables with an underlying DAG structure and outputs the equivalence class of the DAG. In contrast, PC provides a systematic search method and is thus considered a refinement of the IC.
The IC* algorithm [
The FCI is a modification of the PC algorithm [
In addition to traditional combinatorial methods such as PC and FCI, score-based methods have also been used to uncover causal structures. In these methods, algorithms determine the optimal DAG by optimizing a particular score.
A typical score function is the Bayesian information criterion (BIC) score, BIC = log p(D | θ̂, G) − (d/2)log n, where D is the data, θ̂ is the maximum likelihood estimate of the model parameters, d is the number of free parameters, and n is the sample size. The GES algorithm uses different score functions for different data types as follows: the BIC score (for continuous data), the likelihood-equivalence Bayesian Dirichlet uniform joint distribution score (for discrete data), and the conditional Gaussian score (for continuous and discrete mixture data).
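To illustrate how such a score is computed in practice, the sketch below evaluates a BIC-style local score for a single node under a linear-Gaussian model; the total score of a graph is the sum of these local scores over all nodes. This is our own simplified illustration, not the exact scoring code of any particular package:

```python
import numpy as np

def node_bic(data, child, parents):
    """BIC of regressing one node on a candidate parent set under a
    linear-Gaussian model: n*log(residual variance) + k*log(n).
    Lower is better in this sign convention; a graph's score sums over nodes."""
    n = data.shape[0]
    y = data[:, child]
    if parents:
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    else:
        X = np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return n * np.log(sigma2) + X.shape[1] * np.log(n)

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)   # true structure: x -> y
data = np.column_stack([x, y])
# The true parent set {x} for y scores better (lower) than the empty set.
print(node_bic(data, 1, [0]) < node_bic(data, 1, []))  # True
```

The log(n) penalty term is what keeps score-based searches from adding spurious parents: an extra parent must reduce the residual variance enough to pay for its parameter.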
Score-based methods include the GES algorithm, the fast GES algorithm, and the K2 algorithm.
The GES algorithm was proposed by Chickering [
Fast GES is an improved and parallelized version of the GES. Significant speedup was achieved by storing the score information during the GES algorithm [
The main idea of the K2 algorithm [
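A minimal sketch of this greedy idea follows. Note that the original K2 algorithm scores discrete data with a Bayesian metric; here we substitute a linear-Gaussian BIC-style local score purely for illustration, and the node ordering, score, and parent limit are our own assumptions:

```python
import numpy as np

def local_score(data, child, parents):
    """Negative BIC of a linear-Gaussian local model (higher is better)."""
    n = data.shape[0]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, data[:, child], rcond=None)
    resid = data[:, child] - X @ beta
    return -(n * np.log(resid @ resid / n) + X.shape[1] * np.log(n))

def k2(data, order, max_parents=2):
    """K2-style greedy search: given a node ordering, repeatedly add the
    single best predecessor as a parent while the local score improves."""
    parents = {v: [] for v in order}
    for pos, v in enumerate(order):
        candidates = list(order[:pos])   # only predecessors may be parents
        best = local_score(data, v, parents[v])
        while candidates and len(parents[v]) < max_parents:
            scores = [(local_score(data, v, parents[v] + [c]), c) for c in candidates]
            s, c = max(scores)
            if s <= best:                # stop when no addition helps
                break
            best = s
            parents[v].append(c)
            candidates.remove(c)
    return parents

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y = x + 0.3 * rng.normal(size=2000)   # x -> y
z = y + 0.3 * rng.normal(size=2000)   # y -> z
data = np.column_stack([x, y, z])
print(k2(data, order=[0, 1, 2]))  # node 1 picks parent 0; node 2 picks 1 first
```

The dependence on a prespecified node ordering is the key limitation of K2: a wrong ordering forces wrong edge directions, which is why order-free score-based searches such as GES are often preferred.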
Hybrid algorithms use a combination of score-based and combinatorial constraint-based optimization methods to determine the optimal DAG. An example is the Max-Min Hill Climbing (MMHC) algorithm
Functional causal models or structural equation models (SEMs) assume structural equations that define the causal relationships. Such structural equations may describe the linear and nonlinear relationships among variables. In addition to the discrete methods discussed here, SEMs are also an important assumption in many machine learning–based methods that use the continuous optimization techniques in the
The linear non-Gaussian acyclic model (LiNGAM) was originally proposed by Shimizu [
A nonlinear additive noise model was proposed in [
Here, each variable is modeled as x_j = f_j(pa(x_j)) + e_j, where f_j is a nonlinear function, pa(x_j) denotes the parents of x_j, and e_j is an additive noise term.
Traditional causal discovery algorithms attempt to discover a causal graph, which is usually a DAG, while searching for an optimal graph in the space of candidate graphs. The score-based optimization problem of DAG learning (discussed in the
Here, the problem is to minimize a score function s(G; X) over graphs G, subject to the constraint that G is a DAG, where X denotes the observed data; the space of candidate DAGs is discrete and grows super-exponentially with the number of nodes.
An alternative approach is to model the problem as a continuous space optimization problem, which then allows the application of various learning techniques. Recently, several publications have explored continuous optimization methods that learn DAGs by adding an acyclicity constraint. In these approaches, the discrete acyclicity constraint (that the graph belongs to the set of DAGs) is replaced by a smooth equality constraint h(W) = 0 on the weighted adjacency matrix W, where h vanishes exactly on acyclic graphs.
The hard constraints on acyclicity can be relaxed and incorporated into the loss function to be optimized. This smooth continuous constraint allows the use of machine learning–based tools, which in turn can make the algorithms scalable in the presence of substantial amounts of data. These algorithms are based on SEMs.
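The best-known smooth acyclicity constraint is the NOTEARS function h(W) = tr(e^{W∘W}) − d, which equals 0 exactly when the weighted adjacency matrix W describes an acyclic graph. The sketch below evaluates it with a truncated power series for the matrix exponential (our own illustration; production code would use a library matrix exponential and add h as a penalty in the training loss):

```python
import numpy as np

def trace_expm(M: np.ndarray, terms: int = 30) -> float:
    """Truncated power series for tr(exp(M)) = sum_k tr(M^k)/k!
    (avoids an external dependency; adequate for small matrices)."""
    d = M.shape[0]
    term = np.eye(d)
    total = float(d)                    # k = 0 term: tr(I) = d
    for k in range(1, terms):
        term = term @ M / k             # accumulates M^k / k!
        total += float(np.trace(term))
    return total

def notears_h(W: np.ndarray) -> float:
    """Smooth acyclicity measure from NOTEARS: h(W) = tr(exp(W ∘ W)) - d.
    h(W) = 0 exactly when the weighted graph W is acyclic, and h is
    differentiable, so it can serve as a penalty in gradient-based training."""
    return trace_expm(W * W) - W.shape[0]

acyclic = np.array([[0.0, 1.5], [0.0, 0.0]])  # single edge 0 -> 1
cyclic = np.array([[0.0, 1.5], [0.7, 0.0]])   # 0 -> 1 and 1 -> 0 form a cycle

print(abs(notears_h(acyclic)) < 1e-8)  # True: no cycle, constraint satisfied
print(notears_h(cyclic) > 0)           # True: the 2-cycle is penalized
```

Because tr((W∘W)^k) sums weighted length-k cycles, any cycle makes some term strictly positive, which is why driving h(W) to 0 forces acyclicity.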
Several other improvements such as the NOBEARS algorithm [
This algorithm replaces the NOTEARS acyclicity constraint h(W) = tr(e^{W∘W}) − d = 0 (where W is the weighted adjacency matrix, ∘ is the elementwise product, and d is the number of nodes) with a constraint based on the spectral radius of W ∘ W, which is 0 exactly when the graph is acyclic and can be approximated more cheaply, allowing faster optimization.
A Directed Acyclic Graph-Graph Neural Network (DAG-GNN) [
This variational framework treats the data as being generated from latent variables through a graph neural network parameterized by the weighted adjacency matrix. An encoder and a decoder are trained jointly so that the negative evidence lower bound on the data likelihood is minimized, subject to the smooth acyclicity constraint on the adjacency matrix; the causal structure is then read off from the learned adjacency weights.
Wei et al [
Ng et al [
Some other similar machine learning–based continuous learning algorithms include gradient-based neural DAG [
Reinforcement learning–based methods have been proposed recently that consider both the acyclicity constraint and the BIC score in the reward function and attempt to learn the DAG [
This section presents results comparing the effectiveness of some causal structure learning algorithms on synthetic and real data.
The synthetic data were generated in the same manner as in the DAG-GNN paper [
Here, A is the weighted adjacency matrix of the ground-truth DAG and Z is a random noise vector; the first data-generating model is the linear structural equation X = A^T X + Z, and the second is a nonlinear function of the parent variables with additive noise.
We considered 5 data sets for both the linear and nonlinear functions. For each data set, we generated n=5000 independent samples according to the above equations. We used 6 algorithms, 4 of which are discrete space algorithms: PC and Greedy Fast Causal Inference (GFCI) are constraint-based methods, GES is a score-based method, and MMHC is a hybrid method. We also considered 2 continuous space methods, DAG-GNN and GAE.
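For readers who want to reproduce this kind of setup, the following sketch samples from a linear SEM given a ground-truth weighted DAG; the graph, weights, and noise model are our own illustrative choices rather than the exact generator used in the experiments:

```python
import numpy as np

def sample_linear_sem(A, n, rng):
    """Draw n samples from the linear SEM X = A^T X + Z with standard
    Gaussian noise Z. Because A is acyclic, (I - A^T) is invertible and
    each sample solves to X = (I - A^T)^{-1} Z."""
    d = A.shape[0]
    Z = rng.normal(size=(n, d))
    # Row-wise form of X = (I - A^T)^{-1} Z: x_row = z_row @ (I - A)^{-1}
    return Z @ np.linalg.inv(np.eye(d) - A)

A = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.8],
              [0.0, 0.0, 0.0]])   # chain 0 -> 1 -> 2 with edge weight 0.8
X = sample_linear_sem(A, n=5000, rng=np.random.default_rng(0))
print(X.shape)  # (5000, 3)
```

With unit-variance noise, node 1's variance should be close to 0.8² + 1 = 1.64, a quick sanity check that the sampler respects the structural equations.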
We also evaluated these algorithms on the publicly available Sachs data set [
Benchmark experiments on the Sachs data set. We evaluated 6 algorithms: Peter Spirtes and Clark Glymour (PC), Greedy Equivalence Search (GES), Greedy Fast Causal Inference (GFCI), Max-Min Hill Climbing (MMHC), Directed Acyclic Graph-Graph Neural Network (DAG-GNN), and graph autoencoder (GAE), on 4 metrics: structural Hamming distance (SHD), true positive rate (TPR), false positive rate (FPR), and false discovery rate (FDR), and show their results in
Metric  PC  GES  GFCI  MMHC  DAG-GNN  GAE
SHD (↓)  24.50  26.50  29.50  22.00  19.00^{a}  22.00
FDR (↓)  0.77  0.68^{a}  0.72  0.79  0.71  0.89
TPR (↑)  0.32  0.44  0.47  0.56^{a}  0.11  0.05
FPR (↓)  0.49  0.64  0.72  0.45  0.13^{a}  0.21
^{a}Best result for each metric.
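The metrics above can be computed directly from the true and estimated adjacency matrices. The sketch below uses one common convention (an edge reversal contributes 1 to SHD; TPR, FPR, and FDR count correctly oriented edges as positives) and is our own illustration, not the exact evaluation code used here:

```python
import numpy as np

def graph_metrics(true_adj, est_adj):
    """Compare an estimated directed graph against the ground truth.
    SHD counts edge insertions, deletions, and reversals (a reversal
    counts once, not twice); TPR/FPR/FDR treat correctly oriented
    edges as positives."""
    d = true_adj.shape[0]
    true_e = {(i, j) for i in range(d) for j in range(d) if true_adj[i, j]}
    est_e = {(i, j) for i in range(d) for j in range(d) if est_adj[i, j]}
    tp = len(true_e & est_e)
    fp = len(est_e - true_e)
    fn = len(true_e - est_e)
    # Extra edges whose reverse is a missing true edge are reversals.
    reversed_ = {(j, i) for (i, j) in est_e - true_e} & (true_e - est_e)
    shd = fp + fn - len(reversed_)
    possible_neg = d * (d - 1) - len(true_e)
    return {"SHD": shd,
            "TPR": tp / len(true_e),
            "FPR": fp / possible_neg,
            "FDR": fp / max(len(est_e), 1)}

truth = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
est   = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])  # edge 0->2 reversed as 2->0
print(graph_metrics(truth, est))  # SHD 1; TPR 2/3; FPR 1/3; FDR 1/3
```

Conventions differ across papers (eg, whether a reversal costs 1 or 2 in SHD, and how CPDAG outputs with undirected edges are scored), so metric values are only comparable under a fixed convention.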
Accuracy comparison. We evaluated 6 algorithms: Peter Spirtes and Clark Glymour (PC), Greedy Equivalence Search (GES), Greedy Fast Causal Inference (GFCI), Max-Min Hill Climbing (MMHC), Directed Acyclic Graph-Graph Neural Network (DAG-GNN), and graph autoencoder (GAE), on 4 metrics: structural Hamming distance (SHD↓), true positive rate (TPR↑), false positive rate (FPR↓), and false discovery rate (FDR↓). In all these evaluations, we counted any bidirectional edge as half discovered. In experiments (A-D), first column, the data are drawn from a distribution according to the underlying causal graph in which the relationships between nodes are linear; experiments (E-H), second column, cover the nonlinear case. In all experiments, the number of nodes in the graph is 10, 20, 50, or 100. For each graph size, we drew 5 different data sets from the graph structure with a sample size of 1000, calculated the 4 evaluation metrics, and report the average.
All algorithms were tested on both linear and nonlinear data. The accuracies of some of these algorithms are shown in
The relative scalability of different algorithms is presented in
In our experiments, we observed that in the worstcase scenario, the running time for a maximum of 100 variables was of the order of hours for MMHC and of the order of minutes for the other algorithms. However, as the number of variables increases to a few thousand, machine learning–based methods such as DAGGNN and GAE can provide solutions in a reasonable time. The tradeoff between complexity (number of iterations) and accuracy can provide a choice between a method that is less accurate but faster or vice versa.
It is clear from the results that the algorithms have different advantages and disadvantages. Although the PC algorithm performs well across both linear and nonlinear data, it has a low TPR and is computationally intensive. The GES, GFCI, and MMHC algorithms show a very high FPR, but their TPR is higher than that of the PC algorithm. The SHD of the 2 machine learning–based methods (DAG-GNN and GAE) was considerably lower for both data sets.
Continuous constraint-based algorithms generally exhibit a very low FDR, except on the benchmark Sachs data set. This is because the synthetic linear and nonlinear data are generated from SEMs with the same causal relationship function at every node, which matches an assumption these algorithms make when learning the causal structure, whereas the same cannot be guaranteed for the Sachs data [
This is corroborated by recent results from Zhu et al [
Machine learning for causal structure learning is not without limitations, which may present several challenges. First, in many applications, there is no ground truth for the causal structure, which makes it difficult to evaluate the performance of these algorithms. Furthermore, many scalable methods use stochastic gradient descent; thus, the final output graph is not always deterministic. When the number of data samples or variables is low, traditional or score-based methods are a better choice, especially when the application requires fewer false positives. For the PC, GES, and GFCI algorithms, we observed that considerable running time is required when the number of variables exceeds 100 [
However, when it comes to large samples of data (eg, more than 100,000 samples) or hundreds of variables (eg, in many gene networks), machine learning methods can provide a reasonable solution, because other methods fail owing to scalability issues. As machine learning algorithms are highly parallelizable, the solutions can be computed much faster, particularly through the use of a graphical processing unit. These algorithms are potentially useful for many applications related to genetics and biomedicine, especially those with an abundance of observational data.
The continuous space machine learning models are more scalable and might be useful in the era of big data. Traditional methods might have complexities that grow exponentially with the number of attributes. Despite the nonconvexity of the optimization proposed by Zheng et al [
The NOBEARS algorithm reduces the computing complexity of NOTEARS from cubic to quadratic in terms of the number of attributes [
Machine learning techniques for causal discovery, which use continuous space optimization, are an emerging area of research, which can lead to more efficient causal discovery, particularly in applications where directed graphs are used to specify causal relations more clearly. With sufficient data, machine learning models can be robust to certain discrepancies such as sample bias, missing data, and erroneous measurements. Many of these applications have also focused on weaker concepts of causality such as pairwise directionality during the analysis of gene networks and brain connectivity networks [
It is noteworthy that machine learning methods are usually black box methods, which might provide less insight into how the causal structures are derived. To retain the interpretability of traditional methods while improving their speed, an option that has been explored is to develop parallel versions of these algorithms, such as PC [
Other challenges arise in finding causal structure from data. When learning causal structure from electronic health record data, there may be several problems, such as missing values or noise in the data, which are very common [
Furthermore, most causal discovery methods assume that the distribution of data is stationary, which may not be true in certain medical applications [
In this paper, we have discussed the motivation for causal structure discovery in biomedicine as well as some interesting applications. Two paradigms of causal discovery algorithms have been reviewed: in the first, combinatorial or score-based algorithms search a discrete space of candidate causal graphs, whereas in the second, machine learning algorithms solve continuous optimization problems with acyclicity constraints. In addition to listing these methods, we have included resources that readers can use to find appropriate applications. Furthermore, we tested several algorithms on synthetic benchmark data sets and on the Sachs real-world data set and evaluated their relative performance. We have also discussed their theoretical time complexity. Our discussion of the limitations and challenges of various algorithms is intended to offer readers a guide for choosing among the many available options. Finally, we highlight several challenges associated with finding causal structure from real-world data (eg, missing values, nonstationarity, noise, and sampling bias).
List of tools for causal discovery.
completed partially directed acyclic graph
directed acyclic graph
Directed Acyclic Graph-Graph Neural Network
Fast Causal Inference
false discovery rate
false positive rate
graph autoencoder
Greedy Equivalence Search
Greedy Fast Causal Interface
inductive causation
linear non-Gaussian acyclic model
Max-Min Hill Climbing
Peter Spirtes and Clark Glymour
structural equation model
structural hamming distance
true positive rate
XJ is a Cancer Prevention and Research Institute of Texas scholar in cancer research (RR180012) and was supported in part by the Christopher Sarofim Family Professorship, University of Texas Stars award, University of Texas Health Science Center startup, and the National Institutes of Health under award numbers R01AG066749 and U01TR002062.
The survey and experiments on deep learning–based methods and the survey on potential applications were conducted by PU. The survey and experiments on traditional methods were conducted by KZ and CL. XJ and YK conceived the study and provided useful inputs for the potential applications of scalable structure learning.
None declared.