JMI

JMIR Med Inform

JMIR Medical Informatics

2291-9694

JMIR Publications

Toronto, Canada

v9i5e27778

34042600

10.2196/27778

Viewpoint

A Roadmap for Automating Lineage Tracing to Aid Automatically Explaining Machine Learning Predictions for Clinical Decision Support

Lovis

Christian

Rajan

Vaibhav

Luo

Gang

DPhil 1

Department of Biomedical Informatics and Medical Education University of Washington

UW Medicine South Lake Union

850 Republican Street, Building C, Box 358047

Seattle, WA, 98195

United States 1 206 221 4596 1 206 221 2671 gangluo@cs.wisc.edu

https://orcid.org/0000-0001-7217-4008

1 Department of Biomedical Informatics and Medical Education University of Washington

Seattle, WA

United States

Corresponding Author: Gang Luo gangluo@cs.wisc.edu

5 2021

27 5 2021

9 5

e27778

6 2 2021 21 3 2021 25 3 2021 14 4 2021

©Gang Luo. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 27.05.2021.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

Using machine learning predictive models for clinical decision support has great potential in improving patient outcomes and reducing health care costs. However, most machine learning models are black boxes that do not explain their predictions, thereby forming a barrier to clinical adoption. To overcome this barrier, an automated method was recently developed to provide rule-style explanations of any machine learning model’s predictions on tabular data and to suggest customized interventions. Each explanation delineates the association between a feature value pattern and an outcome value. Although the association and intervention information is useful, the user of the automated explaining function often requires more detailed information to better understand the patient’s situation and to aid in decision making. More specifically, consider a feature value in the explanation that is computed by an aggregation function on the raw data, such as the number of emergency department visits related to asthma that the patient had in the prior 12 months. The user often wants to rapidly drill through to see certain parts of the related raw data that produce the feature value. This task is frequently difficult and time-consuming because the few pieces of related raw data are submerged by many pieces of raw data of the patient that are unrelated to the feature value. To address this issue, this paper outlines an automated lineage tracing approach, which adds automated drill-through capability to the automated explaining function, and provides a roadmap for future research.

Keywords: clinical decision support; database management systems; forecasting; machine learning; electronic medical records

Introduction

Machine learning, the study of computer algorithms that automatically learn from data, such as extreme gradient boosting, support vector machine, and random forest [2], has won almost all data science competitions [1]. Using machine learning predictive models for clinical decision support has great potential in improving patient outcomes and reducing health care costs [3-10]. However, most machine learning models are black boxes that do not explain their predictions. This creates a barrier to clinical adoption. To overcome this barrier, we recently developed an automated method to offer rule-style explanations of any machine learning model’s predictions on tabular data and to suggest customized interventions without reducing the model’s performance measures [11-14]. Each rule-style explanation delineates the association between a feature value pattern and an outcome value. A feature is also called an independent variable. For the prediction of future emergency department (ED) visits or inpatient stays for asthma for a patient with asthma, one example of the explanation is as follows:

The patient had 2 ED visits related to asthma in the prior 12 months

AND the patient’s average respiratory rate recorded in the prior 12 months is >25 and ≤28 breaths per minute

→the patient will likely have at least 1 ED visit or inpatient stay for asthma in the next 12 months [13,14].

An ED visit is related to asthma if the ED visit has an asthma diagnosis code. For the item in the explanation showing that the patient had 2 ED visits related to asthma in the prior 12 months, 1 intervention suggested by the automatic explanation method [12-14] is to apply control procedures that decrease the likelihood that the patient will need emergency care.

The association and intervention information provided by the automatic explanation method for machine learning predictions is useful. However, the user of the automated explaining function often requires more detailed information to better understand the patient’s situation and to aid in decision making. More specifically, consider a feature value on the left-hand side of a rule-style explanation that is computed by an aggregation function on the raw data. The user often wants to rapidly drill through to see certain parts of the related raw data producing the feature value. In the context of a relational database, these parts refer to the most relevant attributes of the most essential source tuples producing the feature value. Which attributes are most relevant and which source tuples are most essential depend on both the concrete feature type and the clinical decision support application’s need and are illustrated by several examples throughout this paper. The patterns embedded in these parts could provide additional information on the patient that was lost during the aggregation process to compute the feature value. This drill-through task is frequently difficult and time-consuming because the few pieces of related raw data are submerged by many pieces of raw data of the patient that are unrelated to the feature value. For example, as Table 1 shows, the list of encounters of a patient with asthma displayed on the standard interface of an electronic medical record system includes much information that is irrelevant to the feature value “2 of the number of ED visits related to asthma that the patient had in the prior 12 months.”

Table 1

An example list of encounters of a patient with asthma displayed on the standard interface of an electronic medical record system.^a

Visit date	Primary diagnosis^b	Visit type	Department	Provider	Facility
Dec 20, 2020	Cough (R05)	Outpatient	HMC^c family medicine clinic	John Smith	HMC
Dec 18, 2020	Dysphagia, unspecified (R13.10)	Outpatient	HMC family medicine clinic	David Wong	HMC
…	…	…	…	…	…
Oct 15, 2020	Cystitis, unspecified without hematuria (N30.90)	Inpatient	UWMC^d 8SE	Leslie Hurdle	UWMC
Oct 12, 2020 ^e	Viral infection, unspecified (B34.9)	Emergency	HMC HEDUCC ^f	Patricia Sward	HMC
Oct 09, 2020	Dizziness and giddiness (R42)	Outpatient	HMC family medicine clinic	Eve Johnson	HMC
…	…	…	…	…	…
Feb 11, 2020	Posttraumatic stress disorder, unspecified (F43.10)	Outpatient	HMC psychotherapy clinic	Amy Jiang	HMC
Feb 08, 2020	Syncope and collapse (R55)	Emergency	HMC HEDUCC	Peter Shavlik	HMC
Feb 03, 2020	Headache, unspecified (R51.9)	Outpatient	HMC family medicine clinic	Jude Lake	HMC
…	…	…	…	…	…

^aThis example list is made based on a similar list seen in real electronic medical record data at the University of Washington Medicine.

^bThis column does not show up on the standard interface. This column is included because it will be discussed in this paper.

^cHMC: Harborview Medical Center.

^dUWMC: University of Washington Medical Center.

^eFor the feature value “2 of the number of emergency department visits related to asthma that the patient had in the prior 12 months,” the related rows in the list producing the feature value are marked in italics.

^fHEDUCC: Harborview Emergency Department Urgent Care Center.

For instance, in the rule-style explanation shown above, the first item on the left-hand side is the feature value “2 of the number of ED visits related to asthma that the patient had in the prior 12 months.” Asthma may or may not be the primary diagnosis of either of these 2 visits. For this feature value, the user of the automated explaining function wants to see the relevant parts of these 2 visits (visit date, primary diagnosis, department handling the visit, admitting provider, facility where the visit occurred) in the reverse chronological order (see Table 2), like the way encounters are displayed on the standard interface of an electronic medical record system. The patterns embedded in these parts give additional information on the patient not shown by the feature value, such as the time between these 2 visits, how long ago these 2 visits occurred, the primary diagnoses in these 2 visits, and whether these 2 visits occurred at the same facility. However, finding these parts is nontrivial. As seen in real electronic medical record data at the University of Washington Medicine, Intermountain Healthcare, and Kaiser Permanente Southern California, the patient could have had over 100 encounters in the prior 12 months. Only a few of these encounters are ED visits, and even fewer of them are ED visits related to asthma. To find the ED visits of the patient in the prior 12 months, the user would need some manual effort even if aided by the search function for the electronic medical record system. To figure out which of these visits are related to asthma, a task with which the search function often cannot provide much help, the user would need much more manual effort.

Table 2

An example of the parts of the related raw data that should be displayed for a feature value.^a

Visit date	Primary diagnosis	Department	Provider	Facility
Oct 12, 2020	Viral infection, unspecified (B34.9)	HMC^b HEDUCC^c	Patricia Sward	HMC
Feb 08, 2020	Syncope and collapse (R55)	HMC HEDUCC	Peter Shavlik	HMC

^aFor the example list shown in Table 1 and the feature value “2 of the number of emergency department visits related to asthma that the patient had in the prior 12 months,” the parts that the user of the automated explaining function wants to see are in the related raw data producing the feature value.

^bHMC: Harborview Medical Center.

^cHEDUCC: Harborview Emergency Department Urgent Care Center.

In practice, numerous possible features computed by various aggregation functions on all kinds of longitudinal attributes in the electronic medical records could be used for predictive modeling and automatic explanation. Examples of such features include whether the most recent asthma diagnosis of the patient is a primary diagnosis, the patient’s average respiratory rate recorded in the prior 12 months, the total number of distinct asthma medications ordered for the patient in the prior 12 months, the total number of units of asthma relievers that were ordered for the patient in the prior 12 months and were neither systemic corticosteroids nor short-acting beta-2 agonists, the number of distinct asthma medication prescribers of the patient in the prior 12 months, and the number of no-shows by the patient in the prior 12 months [13,14]. Most of the possible features are unanticipated by the developers of the search function for the electronic medical record system beforehand. The search function supports only a few fixed types of search. For only a small portion of possible features, the search function can aid drilling through the raw data that produce a given feature value.

This creates a problem for the widespread adoption of the automatic explanation method for machine learning predictions. Frequently, this method gives multiple rule-style explanations for a patient predicted to be at high risk of incurring a poor outcome [11,12]. The user of the automated explaining function is typically a busy clinician who has no time to perform laborious manual drill-through regularly. However, to better understand the patient’s situation and to make better clinical decisions, the user often wants to drill through multiple feature values of the patient appearing in the explanations. If done manually, this is a challenging task. A patient often has extensive records with numerous variables and hundreds of pages of content accumulated over a long period of time [15]. Further, the relevant raw data producing the feature values are frequently scattered in several places in the electronic medical record system.

This study makes 2 contributions toward solving this problem:

We articulate this problem for the first time in the literature. This is done in the “Introduction” section.

To address this problem, an automated lineage tracing approach is outlined to add automated drill-through capability to the automated explaining function. This is done in the “Outline of the proposed automated lineage tracing approach” section. Further, a roadmap for future research is provided in the “Directions for future research” section.

The automated drill-through capability is intended to be offered to help the user of the automated explaining function save time, better understand the patient’s situation, and make better clinical decisions. The discussion in this paper focuses on structured electronic medical record data, a specific method commonly used to build clinical machine learning predictive models, and the automatic explanation method for machine learning predictions [11,12]. Nevertheless, the automated lineage tracing approach is not limited to them. Instead, when automatically explaining machine learning predictions and after appropriate extension, the principle of this approach can be applied to facilitate drilling through any feature value computed by an aggregation function on longitudinal structured data, regardless of whether the data came from electronic medical records, whether the feature is specified by a human expert or semiautomatically extracted from longitudinal data using the method outlined in the prior paper [16], which method is used to build the machine learning predictive model, or which automatic explanation method is used.

Running Example

To illustrate this approach, a running example is used throughout this paper: automatically explaining the predictions of future ED visits or inpatient stays for individual patients with asthma. Our prior papers [12-14,17-19] detail this use case and the features used to make predictions in it.

Base Tables

Below are the schemas of 5 tables in a relational database used in the running example:

The underlined fields mark the key to each table. The encounter table includes 1 row per encounter listing its information. The diagnosis table includes 1 row per diagnosis code of an encounter. Primary diagnoses are signified by dx_sequence_number=1. The diagnosis_code_master table includes 1 row per unique diagnosis code giving its description. The ordered_medication table includes 1 row per medication appearing in a medication order. The medication_master table includes 1 row per unique medication listing its information.
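As a rough illustration of the 5 base tables described above, the following sketch declares them in an in-memory SQLite database. Only the table names, patient_id, and dx_sequence_number are taken from the text. Every other column name, the choice of keys, and the use of SQLite are assumptions made for illustration and need not match the schemas of the actual database.

```python
import sqlite3

# Hypothetical schemas: only the table names, patient_id, and
# dx_sequence_number come from the text; all other columns and keys are assumed.
SCHEMA = """
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,      -- 1 row per encounter
    patient_id   INTEGER NOT NULL,
    visit_date   TEXT    NOT NULL,         -- ISO 8601 date, e.g. '2020-10-12'
    visit_type   TEXT    NOT NULL,         -- 'Emergency', 'Inpatient', or 'Outpatient'
    department   TEXT,
    provider     TEXT,
    facility     TEXT
);
CREATE TABLE diagnosis_code_master (
    diagnosis_code TEXT PRIMARY KEY,       -- 1 row per unique diagnosis code
    description    TEXT NOT NULL
);
CREATE TABLE diagnosis (
    encounter_id       INTEGER NOT NULL REFERENCES encounter(encounter_id),
    dx_sequence_number INTEGER NOT NULL,   -- 1 signifies the primary diagnosis
    diagnosis_code     TEXT    NOT NULL REFERENCES diagnosis_code_master(diagnosis_code),
    PRIMARY KEY (encounter_id, dx_sequence_number)
);
CREATE TABLE medication_master (
    medication_id   INTEGER PRIMARY KEY,   -- 1 row per unique medication
    medication_name TEXT NOT NULL
);
CREATE TABLE ordered_medication (
    order_id      INTEGER NOT NULL,
    medication_id INTEGER NOT NULL REFERENCES medication_master(medication_id),
    patient_id    INTEGER NOT NULL,
    order_date    TEXT    NOT NULL,
    PRIMARY KEY (order_id, medication_id)  -- 1 row per medication in an order
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"))
```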

Intermediate Result Tables

Besides the above 5 base tables, 4 intermediate result tables computed on the new data are also used in the running example: enc_features_1, enc_features_2, enc_features_3, and med_features_1. The trained machine learning predictive model is applied to the new data to make predictions on individual patients.

The intermediate result table enc_features_1 contains 3 temporal features on encounters: the number of ED visits, the number of inpatient stays, and the number of outpatient visits that the patient had in the prior 12 months. Let today_date denote today’s date. enc_features_1 is computed from the encounter base table using the following structured query language (SQL) query.
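A minimal sketch of how a table such as enc_features_1 could be computed is given here, assuming a simplified encounter table in SQLite. The column names visit_date and visit_type, the visit type labels, the output column names, and the sample rows are hypothetical, and the query is an illustration rather than the query used in the running example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        visit_date TEXT, visit_type TEXT);
INSERT INTO encounter VALUES
  (1, 7, '2020-12-20', 'Outpatient'),
  (2, 7, '2020-10-15', 'Inpatient'),
  (3, 7, '2020-10-12', 'Emergency'),
  (4, 7, '2020-02-08', 'Emergency'),
  (5, 7, '2019-06-01', 'Emergency');   -- older than 12 months: not counted
CREATE TABLE enc_features_1 (patient_id INTEGER PRIMARY KEY, num_ed_visits INTEGER,
                             num_inpatient_stays INTEGER, num_outpatient_visits INTEGER);
""")

today_date = "2020-12-31"  # stands in for today's date
# One pass over the encounter table with conditional aggregation per visit type.
conn.execute("""
INSERT INTO enc_features_1
SELECT patient_id,
       SUM(CASE WHEN visit_type = 'Emergency'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN visit_type = 'Inpatient'  THEN 1 ELSE 0 END),
       SUM(CASE WHEN visit_type = 'Outpatient' THEN 1 ELSE 0 END)
FROM encounter
WHERE visit_date > date(:today, '-12 months') AND visit_date <= :today
GROUP BY patient_id
""", {"today": today_date})

enc_features_1 = conn.execute("SELECT * FROM enc_features_1").fetchall()
```

With the sample rows, patient 7 obtains 2 ED visits, 1 inpatient stay, and 1 outpatient visit in the prior 12 months.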

The intermediate result table enc_features_2 contains 1 temporal feature on encounters: the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months. Recall that the International Classification of Diseases, Tenth Revision diagnosis codes of asthma are J45.x. enc_features_2 is computed by joining the encounter and diagnosis base tables using the following SQL query.
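The join described above can be sketched as follows, assuming simplified encounter and diagnosis tables in SQLite. The column names other than patient_id and dx_sequence_number, the output column name, and the sample rows are hypothetical. The sketch only shows how the primary diagnosis condition and the J45.x code pattern could enter such a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        visit_date TEXT, visit_type TEXT);
CREATE TABLE diagnosis (encounter_id INTEGER, dx_sequence_number INTEGER,
                        diagnosis_code TEXT);
INSERT INTO encounter VALUES
  (1, 7, '2020-11-02', 'Outpatient'),
  (2, 7, '2020-06-18', 'Outpatient'),
  (3, 7, '2020-03-05', 'Emergency');
INSERT INTO diagnosis VALUES
  (1, 1, 'J45.909'),                  -- asthma as the primary diagnosis: counted
  (2, 1, 'R05'), (2, 2, 'J45.20'),    -- asthma only as a secondary diagnosis: not counted
  (3, 1, 'J45.901');                  -- primary asthma diagnosis, but not an outpatient visit
""")

today_date = "2020-12-31"
enc_features_2 = conn.execute("""
SELECT e.patient_id, COUNT(*) AS num_outpatient_visits_primary_asthma
FROM encounter e
JOIN diagnosis d ON d.encounter_id = e.encounter_id
WHERE e.visit_type = 'Outpatient'
  AND d.dx_sequence_number = 1          -- primary diagnosis only
  AND d.diagnosis_code LIKE 'J45.%'     -- asthma codes J45.x
  AND e.visit_date > date(:today, '-12 months') AND e.visit_date <= :today
GROUP BY e.patient_id
""", {"today": today_date}).fetchall()
```

In this form, a patient with no qualifying visit produces no row at all, so a downstream step would need to treat a missing row as a count of 0.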

The intermediate result table enc_features_3 contains 2 temporal features on encounters: the number of ED visits related to asthma and the number of inpatient stays related to asthma that the patient had in the prior 12 months. enc_features_3 is computed by joining the encounter and diagnosis base tables using the following SQL query.
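One way such a query could be structured is sketched here in SQLite. The inner query first reduces the diagnosis table to the distinct encounters that carry any asthma diagnosis code, so that an encounter with several asthma codes is counted only once. The outer query then counts these encounters per patient and visit type. The column names other than patient_id and dx_sequence_number, the alias e_id, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        visit_date TEXT, visit_type TEXT);
CREATE TABLE diagnosis (encounter_id INTEGER, dx_sequence_number INTEGER,
                        diagnosis_code TEXT);
INSERT INTO encounter VALUES
  (1, 7, '2020-10-12', 'Emergency'),
  (2, 7, '2020-02-08', 'Emergency'),
  (3, 7, '2020-10-15', 'Inpatient'),
  (4, 7, '2020-07-01', 'Emergency');
INSERT INTO diagnosis VALUES
  (1, 1, 'B34.9'),  (1, 2, 'J45.901'),                  -- asthma as a secondary diagnosis
  (2, 1, 'R55'),    (2, 2, 'J45.41'), (2, 3, 'J45.901'), -- 2 asthma codes, 1 encounter
  (3, 1, 'N30.90'),                                      -- inpatient stay without asthma
  (4, 1, 'R51.9');                                       -- ED visit without asthma
""")

today_date = "2020-12-31"
enc_features_3 = conn.execute("""
SELECT e.patient_id,
       SUM(CASE WHEN e.visit_type = 'Emergency' THEN 1 ELSE 0 END) AS num_asthma_ed_visits,
       SUM(CASE WHEN e.visit_type = 'Inpatient' THEN 1 ELSE 0 END) AS num_asthma_inpatient_stays
FROM encounter e
JOIN (SELECT DISTINCT encounter_id          -- duplicate elimination on asthma encounters
      FROM diagnosis
      WHERE diagnosis_code LIKE 'J45.%') e_id
  ON e_id.encounter_id = e.encounter_id
WHERE e.visit_type IN ('Emergency', 'Inpatient')
  AND e.visit_date > date(:today, '-12 months') AND e.visit_date <= :today
GROUP BY e.patient_id
""", {"today": today_date}).fetchall()
```

For the sample rows, the result gives 2 ED visits related to asthma and 0 inpatient stays related to asthma, although asthma is not the primary diagnosis of either ED visit.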

The intermediate result table med_features_1 contains 2 temporal features on medications: the total number of medications and the total number of distinct medications ordered for the patient in the prior 12 months. med_features_1 is computed from the ordered_medication base table using the following SQL query.
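The difference between the 2 medication features can be sketched with a single aggregate query, again assuming a simplified ordered_medication table in SQLite whose columns other than patient_id are hypothetical. The sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ordered_medication (order_id INTEGER, medication_id INTEGER,
                                 patient_id INTEGER, order_date TEXT);
INSERT INTO ordered_medication VALUES
  (100, 1, 7, '2020-11-01'),
  (100, 2, 7, '2020-11-01'),
  (101, 1, 7, '2020-05-01'),   -- medication 1 ordered a second time
  (102, 3, 7, '2019-01-15');   -- older than 12 months: not counted
""")

today_date = "2020-12-31"
med_features_1 = conn.execute("""
SELECT patient_id,
       COUNT(*)                      AS num_medications,          -- bag count
       COUNT(DISTINCT medication_id) AS num_distinct_medications  -- set count
FROM ordered_medication
WHERE order_date > date(:today, '-12 months') AND order_date <= :today
GROUP BY patient_id
""", {"today": today_date}).fetchall()
```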

Relational Algebra Operators

This paper uses the following relational algebra operators with the bag semantics unless otherwise specified: join ⋈, left semijoin ⋉, selection σ, projection π, duplicate elimination δ, and grouping γ [20]. Commercial database management systems implement relations using the bag semantics.
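The distinction between the bag semantics and the set semantics can be illustrated with a toy sketch in which a relation is a Python list of tuples. The helper function names and the sample relation are hypothetical and are not part of the notation in [20].

```python
from collections import Counter

# A toy relation of (encounter_id, diagnosis_code) tuples.
diagnosis = [(1, "J45.901"), (1, "J45.41"), (2, "R05")]

def select(rows, predicate):        # selection (sigma)
    return [r for r in rows if predicate(r)]

def project(rows, indexes):         # projection (pi) under the bag semantics
    return [tuple(r[i] for i in indexes) for r in rows]

def distinct(rows):                 # duplicate elimination (delta)
    return list(dict.fromkeys(rows))

def group_count(rows, index):       # grouping (gamma) with COUNT(*)
    return dict(Counter(r[index] for r in rows))

asthma = select(diagnosis, lambda r: r[1].startswith("J45."))
bag_result = project(asthma, [0])   # duplicates are kept: encounter 1 appears twice
set_result = distinct(bag_result)   # duplicates are removed: encounter 1 appears once
codes_per_encounter = group_count(diagnosis, 0)
```

Counting ED visits related to asthma on bag_result would count encounter 1 twice, which is why a duplicate elimination operator is needed before such a count.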

Review of a Typical Method to Build a Clinical Machine Learning Predictive Model and Our Automated Method to Explain the Model’s Predictions

In this section, a typical method to build a machine learning predictive model on structured electronic medical record data as well as the automated method to explain the model’s predictions [11-14] are reviewed. In the next section, the automated lineage tracing approach based on these 2 methods is outlined.

A health care system usually has an enterprise data warehouse. It stores in a relational database a copy of the structured electronic medical record data of the health care system, often after some transformations such as pivoting [21,22] and denormalization to facilitate data analysis. For predictive modeling with automated explanation, the overall workflow is to execute database SQL queries to extract features from the electronic medical record data, to build a machine learning predictive model on the training data, to apply the model on new data to make predictions on individual patients, and then to use the automated method to explain the predictions. In the following sections, each of these steps is described sequentially.

Extracting Features From the Electronic Medical Record Data and Building the Clinical Machine Learning Predictive Model

The structured electronic medical record data contain both static attributes (eg, gender) and longitudinal attributes (eg, encounters, diagnoses). Most attributes are longitudinal. As Figure 1 shows, the following operations are performed on the training data:

The static features are computed from the static attribute values. The results are stored in 1 or more intermediate result tables. Typically, each of these intermediate result tables is computed by running a select-project-join SQL query on 1 or more base tables.

By aggregating longitudinal attribute values and sometimes also using some static attribute values, the patient cohort of interest in the training data is computed. The result is stored in 1 intermediate result table. This is typically done by running a complex SQL query on several base tables. An example patient cohort is the set of all patients with asthma who visited any of the facilities of the health care system during a specific time period.

By aggregating longitudinal attribute values, temporal features and the outcome variable are computed and stored in 1 or more intermediate result tables. Typically, each of these intermediate result tables is computed by running a select-project-join-aggregate SQL query on 1 or more base tables. For example, 1 intermediate result table is similar to enc_features_1 and contains multiple temporal features on encounters computed from the encounter base table. A second intermediate result table is similar to enc_features_2 and contains multiple temporal features on encounters computed by joining the encounter and diagnosis base tables. A third intermediate result table contains multiple temporal features on medications computed by joining the ordered_medication and medication_master base tables, such as the total number of distinct asthma medications and the total number of units of asthma medications ordered for the patient in the prior 12 months. The logical query plan for a select-project-join-aggregate query includes 1 or more select-project-join-aggregate segments [23]. Each segment consists of a sequence of join, selection, and projection operators followed by a grouping or duplicate elimination operator at its end.

Figure 1

The flow chart for building a clinical machine learning predictive model on the training data, making predictions on the new data, and using our automated method to explain the model’s predictions.

Figure 2 shows the logical query plan for a select-project-join-aggregate query. By joining the intermediate result tables containing the patient cohort of interest, the static and temporal features, and the outcome variable in the training data, a table containing the unified training data frame is obtained. For the patient cohort of interest, this table includes 1 column for the outcome variable and a separate column for each feature. Then a machine learning predictive model is trained on this table.
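The final join that produces the unified training data frame can be sketched as follows in SQLite. The table and column names cohort, static_features, outcome, gender, and label are hypothetical. The use of left outer joins and the replacement of a missing count with 0 are assumptions of this sketch, made so that patients without any qualifying raw data still appear in the data frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cohort          (patient_id INTEGER PRIMARY KEY);
CREATE TABLE static_features (patient_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE enc_features_1  (patient_id INTEGER PRIMARY KEY, num_ed_visits INTEGER);
CREATE TABLE outcome         (patient_id INTEGER PRIMARY KEY, label INTEGER);
INSERT INTO cohort          VALUES (7), (8);
INSERT INTO static_features VALUES (7, 'F'), (8, 'M');
INSERT INTO enc_features_1  VALUES (7, 2);          -- patient 8 had no encounter row
INSERT INTO outcome         VALUES (7, 1), (8, 0);
""")

# One row per patient in the cohort, one column per feature, plus the outcome.
training_data_frame = conn.execute("""
SELECT c.patient_id,
       s.gender,
       COALESCE(f.num_ed_visits, 0) AS num_ed_visits,
       o.label
FROM cohort c
LEFT JOIN static_features s ON s.patient_id = c.patient_id
LEFT JOIN enc_features_1  f ON f.patient_id = c.patient_id
LEFT JOIN outcome         o ON o.patient_id = c.patient_id
ORDER BY c.patient_id
""").fetchall()
```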

Figure 2

A logical query plan for the select-project-join-aggregate query Q₃ given in the “Intermediate result tables” section.

Applying the Machine Learning Predictive Model to New Data to Make Predictions on Individual Patients

As Figure 3 shows, similar to the procedure mentioned above, the patient cohort of interest and the static and temporal features in the new data are computed. The results are stored in several intermediate result tables. By joining these tables, a table containing the unified data frame for the new data is obtained. For the patient cohort of interest, this table includes a separate column for each feature. We then apply the machine learning predictive model to this table to make predictions on individual patients.

Figure 3

The high-level logical query plan for computing the unified data frame that contains all the features of the new data. SQL: structured query language.

Automatically Explaining the Machine Learning Model’s Predictions

While the clinical machine learning predictive model is being built, the training data are also used to create the knowledge base of the automated explaining function. We apply automated discretization [24,25] to convert continuous features to categorical features. Then class-based association rules [24,26] are mined from the unified training data frame. Each rule delineates the association between a feature value pattern and a poor outcome value c and is of the form

i₁ AND i₂ AND … AND i_t→c.

This rule shows that a patient satisfying i₁, i₂, …, and i_t tends to have an outcome value c. The values of t and c can change across rules. Each item i_k (1≤k≤t) is a (feature, value) pair showing that a feature has a specific value or a value within a specific range. One example item of the former is that the patient had 2 ED visits related to asthma in the prior 12 months. One example item of the latter is that the patient’s average respiratory rate recorded in the prior 12 months is >25 and ≤28 breaths per minute. An example rule containing both items is given in the Introduction.

For each (feature, value) pair item used to create association rules, 0 or more interventions are precompiled. The interventions precompiled for any item on a rule’s left-hand side are automatically linked to the rule.

At prediction time, to avoid reducing the machine learning predictive model’s performance measures, the model’s predictions are used with no change. The mined association rules are used to explain these predictions rather than to make predictions. More specifically, for each patient whom the model predicts to have a poor outcome value, we find and display the rules that have this value on their right-hand sides and whose left-hand sides are fulfilled by the patient. Each rule offers 1 explanation for the prediction. The interventions linked to the rule are displayed next to it as the suggested candidate interventions.
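The prediction-time lookup described above can be sketched as follows. The classes Item and Rule, the function explain, the feature names, and the second rule are hypothetical and serve only to illustrate the matching logic. The rule mining step and the predictive model itself are outside the scope of the sketch, and the model’s prediction is passed in unchanged.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    """A (feature, value) pair: the value lies in (low, high], or in [low, high] if closed."""
    feature: str
    low: float
    high: float
    closed: bool = False

    def holds(self, patient: dict) -> bool:
        v = patient.get(self.feature)
        if v is None:
            return False
        lower_ok = self.low <= v if self.closed else self.low < v
        return lower_ok and v <= self.high

@dataclass(frozen=True)
class Rule:
    items: tuple            # left-hand side: i_1 AND i_2 AND ... AND i_t
    outcome: str            # right-hand side: poor outcome value c
    interventions: tuple = ()

def explain(patient, predicted_outcome, rules):
    """Return the rules whose right-hand side equals the model's prediction
    and whose left-hand side is fulfilled by the patient."""
    return [r for r in rules
            if r.outcome == predicted_outcome
            and all(item.holds(patient) for item in r.items)]

POOR = "at least 1 ED visit or inpatient stay for asthma in the next 12 months"
rule_1 = Rule(
    items=(Item("num_asthma_ed_visits", 2, 2, closed=True),   # exactly 2 visits
           Item("avg_respiratory_rate", 25, 28)),             # >25 and <=28
    outcome=POOR,
    interventions=("apply control procedures that decrease the likelihood "
                   "that the patient will need emergency care",),
)
rule_2 = Rule(items=(Item("avg_respiratory_rate", 28, 40),), outcome=POOR)  # invented rule

patient = {"num_asthma_ed_visits": 2, "avg_respiratory_rate": 26.5}
explanations = explain(patient, POOR, [rule_1, rule_2])
```

Only rule_1 is returned for this patient, and its linked intervention would be displayed next to it.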

Our automatic explanation method for machine learning predictions has been successfully applied to multiple clinical predictive modeling problems [11,12,27,28]. It has several advantages. Among all the automatic explanation methods for machine learning predictions in the literature [29,30], our method is the only one that can automatically suggest customized interventions. The rule-style explanations given by our method are easier to comprehend than the non–rule-style explanations given by many other methods. Unlike many other automatic explanation methods that either lower the machine learning predictive model’s performance measures or work for only a specific machine learning algorithm, our automatic explanation method works for any machine learning algorithm on tabular data without lowering the model’s performance measures. Unlike several other methods that use rules computed at prediction time to offer explanations [31,32], our method uses rules mined before prediction time to offer explanations. This is essential for our method to automatically suggest customized interventions at prediction time.

Review of the Existing Automated Lineage Tracing Techniques

In this section, the existing automated lineage tracing techniques are reviewed. An overview of such techniques developed in various fields is provided. Then, a specific set of automated lineage tracing techniques most closely related to this work is reviewed.

Overview of the Existing Automated Lineage Tracing Techniques

The lineage or provenance of a given data item i refers to the source data items producing i and how i was derived [33]. The former is called where-lineage. The latter is called how-lineage. Each type of lineage can be at either the schema level or the instance level. An example of where-lineage at the schema level is the set of base tables producing a specific materialized view. An example of where-lineage at the instance level is the set of tuples in the base tables producing a given temporal feature value in a materialized view. Lineage information can be computed in either an eager way or a lazy way. In the former case, lineage information is computed and stored at the same time as the output data are produced. In the latter case, lineage information is computed when needed. This paper focuses on where-lineage that is at the instance level and computed in a lazy way.

Ikeda et al surveyed existing lineage tracing techniques in databases [33,34], e-science [35], and scientific data processing [36]. Among all of the lineage tracing techniques in the literature, the techniques Cui et al [23,37] developed are the most closely related to this work. These techniques are used to trace the lineage of a tuple in a materialized view [38] defined by a select-project-join-aggregate query in a relational database. Cui et al [39,40] described lineage tracing techniques for warehouse data computed via a directed acyclic graph of transformations, some of which could involve complex procedural code. Zhang et al [41] described lineage tracing techniques for data computed by arbitrary functions. In general, the more flexibility is allowed on the transformations or functions, the less efficiently lineage can be traced [39].

In big data systems, Ikeda et al [42,43] described lineage tracing techniques for data computed via a directed acyclic graph of map and reduce functions [44]. Amsterdamer et al [45] described lineage tracing techniques for data computed by using Pig Latin [46].

In scientific data processing, lineage tracing is often done on curated databases, which contain scientific data copied from other databases [36,47].

Schelter et al [48] described a method to trace the schema-level lineage of the data sets, features, models, and predictions produced in machine learning experiments.

Review of Cui et al’s Automated Lineage Tracing Techniques for Relational Databases

To automatically trace the lineage of a tuple t in a materialized view [38] defined by a select-project-join-aggregate query, Cui et al [23,37] proceeded as follows. First, the materialized view’s definition query is transformed into a canonical form of the logical query plan. As Figure 2 shows, the canonical form includes 1 or more select-project-join-aggregate segments. Each segment has 0 or 1 join operator, 0 or 1 selection operator, 0 or 1 projection operator, and a grouping or duplicate elimination operator in this particular order. Second, a separate intermediate materialized view is created for each intermediate select-project-join-aggregate segment of the canonical form. The root node of such a segment is not the root node of the canonical form. Third, the hierarchy of intermediate materialized views is recursively traced in a top-down way. At each level of the hierarchy, the lineage tracing query for a 1-level select-project-join-aggregate materialized view is used to compute the current traced tuples’ lineage with respect to each base table and each materialized view at the next lower level. For a 1-level select-project-join-aggregate materialized view MV = γ(π_A(σ_C(R₁ ⋈ R₂ ⋈ … ⋈ R_n))), the lineage of a tuple set T ⊆ MV with respect to the base table or the materialized view R_i (1≤i≤n) is π_{R_i}(σ_C(R₁ ⋈ R₂ ⋈ … ⋈ R_n) ⋉ T). Here, the projection operator π on R_i has the set semantics, making each selected tuple in R_i appear only once. Further, all attributes of R_i appear in the projection operator and subsequently in the lineage traced on R_i. The final traced lineage of tuple t includes the lineage traced on every base table appearing in the canonical form.
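The lineage expression for a 1-level view, that is, the projection onto R_i of the selected join result semijoined with T, can be sketched in SQLite as follows. The view mv, its feature, the column names other than patient_id and dx_sequence_number, and the sample rows are hypothetical. SQLite has no materialized views, so an ordinary view stands in for one. SELECT DISTINCT plays the role of the projection with the set semantics, and the IN subquery plays the role of the left semijoin with T.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        visit_type TEXT);
CREATE TABLE diagnosis (encounter_id INTEGER, dx_sequence_number INTEGER,
                        diagnosis_code TEXT);
INSERT INTO encounter VALUES (1, 7, 'Emergency'), (2, 7, 'Outpatient'), (3, 8, 'Emergency');
INSERT INTO diagnosis VALUES (1, 1, 'B34.9'), (1, 2, 'J45.901'), (1, 3, 'J45.41'),
                             (2, 1, 'J45.909'), (3, 1, 'J45.901');
-- 1-level view: grouping over the selected join of encounter and diagnosis.
CREATE VIEW mv AS
SELECT e.patient_id, COUNT(*) AS num_asthma_codes_at_ed
FROM encounter e JOIN diagnosis d ON d.encounter_id = e.encounter_id
WHERE e.visit_type = 'Emergency' AND d.diagnosis_code LIKE 'J45.%'
GROUP BY e.patient_id;
""")

def lineage(alias, patient_id):
    """Lineage of the view tuple of one patient with respect to
    encounter (alias 'e') or diagnosis (alias 'd')."""
    assert alias in ("e", "d")  # the alias is interpolated, so restrict it
    sql = f"""
    SELECT DISTINCT {alias}.*                       -- projection, set semantics
    FROM encounter e JOIN diagnosis d ON d.encounter_id = e.encounter_id
    WHERE e.visit_type = 'Emergency' AND d.diagnosis_code LIKE 'J45.%'
      AND e.patient_id IN (SELECT patient_id FROM mv    -- semijoin with T
                           WHERE patient_id = ?)
    ORDER BY 1, 2
    """
    return conn.execute(sql, (patient_id,)).fetchall()

view_tuple = conn.execute("SELECT * FROM mv WHERE patient_id = 7").fetchall()
encounter_lineage = lineage("e", 7)
diagnosis_lineage = lineage("d", 7)
```

Encounter 1 joins with 2 asthma diagnosis rows, yet it appears only once in encounter_lineage because the projection uses the set semantics.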

We use an example to illustrate Cui et al’s [23,37] automated lineage tracing techniques. If “create table enc_features_3” is replaced by “create materialized view enc_features_3_view” in query Q₃ given in the “Intermediate result tables” section, a query Q_{3_v} defining a materialized view enc_features_3_view is obtained. To trace the lineage of a tuple t in enc_features_3_view whose patient_id is asthma_patient_id, one proceeds as follows.

First, the canonical form of the logical query plan for query Q_{3_v} is obtained. The canonical form is the same as the logical query plan for query Q₃ shown in Figure 2.

Second, an intermediate materialized view asthma_encounter_id is created for the intermediate select-project-join-aggregate segment e_id shown in Figure 2. This is done using the following SQL query.

Figure 4 shows the resulting hierarchy of intermediate materialized views, with the materialized view enc_features_3_view at the top and the encounter and diagnosis base tables at the bottom.

Figure 4

The hierarchy of intermediate materialized views matching the canonical form of the logical query plan for the definition query of the materialized view enc_features_3_view.

Third, at the top level of the hierarchy of intermediate materialized views, the lineage of tuple t with respect to the encounter base table is computed using the following SQL query.

The following SQL query is used to compute the lineage of tuple t with respect to the intermediate materialized view asthma_encounter_id and to store the results in a temporary table temp.

Fourth, at the second level of the hierarchy of intermediate materialized views, the lineage of the tuples in the temporary table temp with respect to the diagnosis base table is computed using the following SQL query.

The final traced lineage of tuple t includes both the results of query Q₆ and the results of query Q₈.
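The top-down tracing steps above can be sketched as follows. This is only an illustration in Python with SQLite and does not reproduce the paper’s queries Q₅ to Q₈. The columns enc_type and diagnosis_code, the J45 code prefix used to flag asthma, the sample rows, and the table name temp_lineage (standing in for temp) are assumptions. The 12-month window is omitted for brevity, and an ordinary table stands in for the intermediate materialized view because SQLite has no materialized views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
                        admit_time text, enc_type text);  -- enc_type: assumed column
create table diagnosis (encounter_id integer, diagnosis_code text);
insert into encounter values
    (1, 7, '2021-01-10', 'ED'), (2, 7, '2021-02-03', 'ED'),
    (3, 7, '2021-02-20', 'outpatient'), (4, 8, '2021-01-15', 'ED');
insert into diagnosis values
    (1, 'J45.901'), (1, 'R05'), (2, 'R07.9'), (3, 'J45.20'), (4, 'J45.901');
-- Second step: a plain table stands in for the intermediate materialized view.
create table asthma_encounter_id as
    select distinct encounter_id from diagnosis where diagnosis_code like 'J45%';
create table temp_lineage (encounter_id integer);
""")
patient_id = 7  # plays the role of asthma_patient_id

# Third step, part 1: lineage of tuple t with respect to the encounter base table.
encounter_lineage = conn.execute("""
    select e.*
    from encounter e join asthma_encounter_id a on e.encounter_id = a.encounter_id
    where e.patient_id = ? and e.enc_type = 'ED'
    order by e.encounter_id""", (patient_id,)).fetchall()

# Third step, part 2: lineage of t with respect to the intermediate view,
# stored for use at the next lower level.
conn.execute("""
    insert into temp_lineage
    select distinct a.encounter_id
    from encounter e join asthma_encounter_id a on e.encounter_id = a.encounter_id
    where e.patient_id = ? and e.enc_type = 'ED'""", (patient_id,))

# Fourth step: lineage of the stored tuples with respect to the diagnosis table.
diagnosis_lineage = conn.execute("""
    select d.*
    from diagnosis d join temp_lineage t on d.encounter_id = t.encounter_id
    where d.diagnosis_code like 'J45%'
    order by d.encounter_id""").fetchall()

print(encounter_lineage)
print(diagnosis_lineage)
```

In this sketch, the final traced lineage is the union of encounter_lineage and diagnosis_lineage, and every attribute of each touched base table tuple is returned, which is the behavior that Requirement 1 later restricts.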

Outline of the Proposed Automated Lineage Tracing Approach

In this section, an automated lineage tracing approach is outlined to add automated drill-through capability to the automated explaining function. Our presentation includes 4 subsections. In the first subsection, an overview of the lineage tracing component of the automated explaining function is provided. In the second subsection, the unique requirements on automated lineage tracing are shown for automatically explaining machine learning predictions for clinical decision support. In the third subsection, the proposed automated lineage tracing techniques fulfilling these requirements are outlined. In the fourth subsection, some considerations are presented for future computer coding implementation of the proposed lineage tracing approach.

Overview of the Lineage Tracing Component

At association rule mining time, all (feature, value) pair items used to create association rules are known. Which items involve temporal features computed by aggregation functions on the raw data is also known. For each item that is related to a temporal feature of a patient and on the left-hand side of a rule, a hyperlink is added to the item in the rule. In addition, a parameterized stored procedure is written for the item in the database to retrieve lineage information. The stored procedure typically has 2 parameters: the patient_id of the patient being examined and the endpoint of the temporal aggregation period, such as today. When the stored procedure is run for the first time, an execution plan is generated. All subsequent runs will use the same execution plan to avoid runtime query optimization overhead.

At automatic explanation time, the user of the automated explaining function is allowed to do lineage tracing for any item that is on the left-hand side of a rule-style explanation and related to a temporal feature value. When the user clicks the item’s hyperlink, the stored procedure prewritten for the item is invoked to retrieve some prespecified parts of the related raw data producing the feature value. Except for the cases with 2 specific aggregation functions described later in the paper, the retrieved data instances are always displayed on a page in the reverse chronological order like that in the electronic medical records.
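A parameterized retrieval routine of this kind could look roughly like the sketch below. SQLite has no stored procedures, so a Python function wrapping a parameterized query stands in for one. The function name trace_ed_visits, the enc_type column, and the sample rows are assumptions made for illustration.

```python
import sqlite3

def trace_ed_visits(conn, patient_id, end_date):
    """Hypothetical stand-in for the stored procedure behind one hyperlink.

    patient_id: the patient being examined.
    end_date:   ISO date string ending the 12-month aggregation period.
    """
    return conn.execute(
        """select admit_time, department
           from encounter
           where patient_id = :pid and enc_type = 'ED'
             and admit_time > date(:end, '-12 months') and admit_time <= :end
           order by admit_time desc""",
        {"pid": patient_id, "end": end_date}).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
                        admit_time text, department text, enc_type text);
insert into encounter values
    (1, 7, '2021-02-03', 'Emergency', 'ED'),
    (2, 7, '2020-01-05', 'Emergency', 'ED'),      -- outside the 12-month window
    (3, 7, '2020-11-11', 'Emergency', 'ED'),
    (4, 7, '2021-01-20', 'Pulmonology', 'outpatient'),
    (5, 8, '2021-02-01', 'Emergency', 'ED');
""")
visits = trace_ed_visits(conn, 7, "2021-03-01")
print(visits)
```

Python’s sqlite3 module keeps a per-connection cache of prepared statements, which loosely parallels reusing one execution plan across runs. A production database system would handle plan reuse through its own stored procedure mechanism.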

Unique Requirements for Automated Lineage Tracing

Typically, the user of the automated explaining function is a clinician. To fit the user’s busy schedule and to aid timely decision making, the user wants the lineage tracing process for a temporal feature value to be finished quickly, preferably within 1 second. This goal is partially fulfilled by the existing lineage tracing techniques [23,37], although the lineage tracing speed that they realize can be further improved. In addition, the retrieved lineage information should be easy to scan and include the most essential content needed to facilitate decision making. This enables the user to quickly gain useful insights from the information, ideally within 1 or a few seconds. As summarized in Table 3, this second goal translates to 5 unique requirements on automated lineage tracing that are unmet by the existing lineage tracing techniques.

Table 3

The 5 unique requirements of automated lineage tracing for automatically explaining machine learning predictions for clinical decision support.

Requirement	Reason for posing the requirement
Retrieving only a small set of attributes	To prevent the user from being overwhelmed by many nonessential or irrelevant attributes
Adding some essential attributes that do not directly produce the feature value	To make the retrieved lineage information include the most essential content
Sorting the retrieved lineage information in an appropriate order	To make the retrieved lineage information easy to scan
Computing the lineage information based on the semantic meaning of the feature	To avoid including irrelevant or nonessential source tuples in the retrieved lineage information
Performing no lineage tracing for any health care system feature value computed by an aggregation function	To avoid including irrelevant data in the retrieved lineage information

Requirement 1: Retrieving Only a Small Set of Attributes

When tracing the lineage of a temporal feature value, one should retrieve from the base tables only a small set of attributes specific to the temporal feature rather than the many attributes involved in deriving all of the features used for automated explanation. This requirement is posed to prevent the user of the automated explaining function from being overwhelmed by many nonessential or irrelevant attributes.

To aid automatic explanation, we want to allow tracing the lineage of a temporal feature value in the form of a small set of attributes specific to the temporal feature (see Table 2 for an example). This cannot be well done using Cui et al’s lineage tracing techniques [23,37]. These techniques were developed to trace the lineage of a tuple including all of its attribute values in a select-project-join-aggregate materialized view in a relational database. If the retrieved lineage information ever touches a tuple in a base table, all attribute values of the tuple are included in this information. For automatic explanation, both factors would cause the retrieved lineage information to have an excessive volume, overwhelming the user of the automated explaining function.

To see this, the process of making predictions with automatic explanations is reviewed. Usually, many features are used to make predictions and to automatically explain them. All of the items on the left-hand side of a rule-style explanation come from the same tuple in the unified data frame, which contains all features of the new data. As Figure 3 shows, this unified data frame is obtained by joining many intermediate result tables. Each of them falls into 1 of the 3 categories: (1) a table containing the patient cohort of interest in the new data, (2) a table containing 1 or more static features, and (3) a table containing 1 or more temporal features. Each hyperlinked item on the left-hand side of a rule-style explanation comes from exactly 1 intermediate result table in the third category.

When the user of the automated explaining function clicks the hyperlink for an item on the left-hand side of a rule-style explanation, one could use Cui et al’s techniques [23,37] to trace the lineage of the tuple in the unified data frame, from which the item comes. For each intermediate result table mentioned above and each base table used to create it, the retrieved lineage information contains some tuples from the base table including all of their attribute values. Most of the retrieved lineage information is unnecessary for automatic explanation for 3 reasons.

Reason 1

The retrieved lineage information often includes thousands of tuples from several dozen base tables. Most of these base tables are used to compute the other feature values in the tuple in the unified data frame that are unrelated to the item, and include no information that can help the user of the automated explaining function gain useful insights related to the item. In fact, to obtain the lineage information of the item essential for automatic explanation, we need to only trace through the intermediate result table related to the item solely for the item and to examine the base tables used to create this table. The features in this table that are unrelated to the item can be ignored. There is also no need to trace through the intermediate result tables containing the features unrelated to the item. Moreover, at automatic explanation time, we know the patient_id of the patient linked to the item. The user usually does not need to know why this patient is in the patient cohort of interest in the new data. Thus, there is no need to trace through the intermediate result table showing the patient cohort.

Reason 2

A base table often has many attributes, only a few of which are essential for the user of the automated explaining function to gain useful insights related to the item. For instance, the encounter table often has >100 attributes. The lineage information shown in Table 2 covers only 4 of them: admit_time transformed to the date format, department, admitting_provider, and facility.

Reason 3

Certain items are each computed using several base tables and intermediate query results. For the user of the automated explaining function to gain useful insights related to the item, only the attributes and tuples of some of these base tables are essential. Similarly, none or only some of these intermediate query results need to be traced through.

For example, in query Q₂ given in the “Intermediate result tables” section, both the encounter and diagnosis base tables are used to compute the feature “the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months.” For a value of this feature, we need to use the information in the diagnosis table to find the related tuples in the encounter table. Nevertheless, the user would expect each encounter shown in the retrieved lineage information to be an outpatient visit with a primary diagnosis of asthma. Thus, there is no need to include any attribute or tuple from the diagnosis table in the retrieved lineage information, for example, to give the primary diagnosis of each encounter included in that information.

As a second example, in query Q₃ given in the “Intermediate result tables” section, both the encounter base table and the intermediate query result e_id are used to compute the feature “the number of ED visits related to asthma that the patient had in the prior 12 months.” For a value of this feature, the user of the automated explaining function would expect each encounter shown in the retrieved lineage information to be an ED visit related to asthma, like that shown in Table 2. Thus, there is no need to trace through e_id and to obtain the corresponding tuples in the diagnosis table showing that each encounter included in the retrieved lineage information has an asthma diagnosis code.

Requirement 2: Adding Some Essential Attributes That Do Not Directly Produce the Feature Value

For certain temporal features, when acquiring the lineage of a feature value, one should not use only the related raw data that directly produce the feature value. Instead, one needs to add to them some related attributes in the base tables, which are specific to the temporal feature and do not directly produce the feature value. We pose this requirement to make the retrieved lineage information include the most essential content needed to facilitate decision making. For example, as query Q₁ given in the “Intermediate result tables” section shows, the feature “the number of ED visits that the patient had in the prior 12 months” is computed solely from the encounter base table. For a value of this feature, we want the retrieved lineage information to be similar to that shown in Table 2 and include a primary diagnosis column. This column is computed using the diagnosis and diagnosis_code_master base tables unused in Q₁ and is formed by concatenating the diagnosis_code and dx_code_description columns of the diagnosis_code_master base table. The cases for many other temporal features on encounters are similar.

Requirement 3: Sorting the Retrieved Lineage Information in an Appropriate Order

When presenting the lineage information, the related raw data retrieved for a temporal feature value should be sorted in an order specific to the temporal feature. This requirement is posed to make the retrieved lineage information easy to scan. Usually, we want the data instances in the retrieved lineage information to be displayed in the reverse chronological order like that in the electronic medical records. However, there are 2 exceptions. First, when the temporal feature is the maximum value of an attribute of a given patient, we want the related raw data retrieved for a feature value to be displayed in the descending order of the attribute value. For example, for the feature “the highest systolic blood pressure of the patient in the prior 12 months,” we want the lineage information retrieved for a feature value to contain the systolic blood pressure of the patient in the prior 12 months sorted in the descending order. Second, when the temporal feature is the minimum value of an attribute of a given patient, we want the related raw data retrieved for a feature value to be displayed in the ascending order of the attribute value. In either of the 2 cases, a re-sort button could be added to the retrieved lineage information on display. If the user of the automated explaining function clicks this button, the data instances in the retrieved lineage information are rearranged in the reverse chronological order for display.
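The ordering rule can be expressed compactly in code. The following sketch is one possible reading of the rule, with a hypothetical vital_sign table, hypothetical column names, and made-up measurements; the resort flag models the user clicking the button that restores the reverse chronological order.

```python
import sqlite3

def order_by_clause(aggregation, value_column, time_column):
    """Pick the display order for the lineage of one temporal feature."""
    if aggregation == "max":
        return f"{value_column} desc"   # exception 1: highest value first
    if aggregation == "min":
        return f"{value_column} asc"    # exception 2: lowest value first
    return f"{time_column} desc"        # default: reverse chronological

def fetch_lineage(conn, patient_id, aggregation, resort=False):
    # resort=True falls back to the default reverse chronological order.
    order = order_by_clause(None if resort else aggregation,
                            "systolic_bp", "measure_time")
    return conn.execute(
        "select measure_time, systolic_bp from vital_sign "
        f"where patient_id = ? order by {order}", (patient_id,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table vital_sign (patient_id integer, measure_time text, systolic_bp integer);
insert into vital_sign values
    (7, '2021-01-05', 128), (7, '2020-09-14', 151), (7, '2020-12-01', 140);
""")
by_value = fetch_lineage(conn, 7, "max")
by_time = fetch_lineage(conn, 7, "max", resort=True)
print(by_value)
print(by_time)
```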

Requirement 4: Computing the Lineage Information Based on the Semantic Meaning of the Feature

The lineage information of a temporal feature value should be computed based on the semantic meaning of the feature rather than solely on the literal writing of the SQL query used to compute the feature. We pose this requirement to avoid including irrelevant or nonessential source tuples in the retrieved lineage information. For a select-project-join-aggregate materialized view containing 1 or more temporal features, Cui et al [23,37] compute the lineage of a tuple in it based solely on the literal SQL query used to define it. In certain cases, this literal approach is suboptimal for automatic explanation. Instead, we should consider the semantic meanings of the temporal features during lineage tracing. In the following, 2 such cases are described. Each case is presented as a subrequirement.

Subrequirement 4.1

When the temporal feature is the sum of a variable computed by a case statement in SQL including multiple conditions and some of them return 0, only the lineage information related to the other conditions should be retrieved. In SQL, such a temporal feature is written in a form such as sum(case when condition then value else 0 end).

As an example of this subrequirement, for the feature “the number of ED visits that the patient had in the prior 12 months,” the lineage information retrieved for a value of the feature should be the ED visits that the patient had in the prior 12 months, regardless of whether the feature is computed using SQL query Q₉ or Q₁₀ below.

The differences between Q₉ and Q₁₀ are highlighted in italics in Q₁₀. If the feature is computed using Q₉, Cui et al’s techniques [23,37] would retrieve all the encounters of the patient in the prior 12 months as the lineage information. This could easily overwhelm the user of the automated explaining function, as usually most of these encounters are not ED visits.
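The contrast between the literal and the semantic lineage can be illustrated with the sketch below. The two feature queries are written in the spirit of Q₉ and Q₁₀ but are not the paper’s queries: the enc_type column and the sample rows are assumptions, and the 12-month window is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
                        admit_time text, enc_type text);  -- enc_type: assumed column
insert into encounter values
    (1, 7, '2021-01-10', 'ED'), (2, 7, '2021-02-03', 'outpatient'),
    (3, 7, '2021-02-20', 'outpatient'), (4, 7, '2020-12-24', 'ED');
""")

# Q9-style feature query: the ED condition appears only inside the case expression.
q9_style = """select sum(case when enc_type = 'ED' then 1 else 0 end)
              from encounter where patient_id = ?"""
# Q10-style feature query: the condition with a nonzero result is also
# placed in the where clause, so non-ED encounters never qualify.
q10_style = """select sum(case when enc_type = 'ED' then 1 else 0 end)
               from encounter where patient_id = ? and enc_type = 'ED'"""
q9_value = conn.execute(q9_style, (7,)).fetchone()[0]
q10_value = conn.execute(q10_style, (7,)).fetchone()[0]

# A literal tracing query keeps every tuple passing the where clause.
literal_lineage = [r[0] for r in conn.execute(
    "select encounter_id from encounter where patient_id = ? "
    "order by encounter_id", (7,))]
semantic_lineage = [r[0] for r in conn.execute(
    "select encounter_id from encounter where patient_id = ? and enc_type = 'ED' "
    "order by encounter_id", (7,))]

print(q9_value, q10_value)        # the feature value is the same either way
print(literal_lineage, semantic_lineage)
```

Both formulations give the same feature value for this patient, but only the second one confines the traced encounters to ED visits.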

Subrequirement 4.2

When the temporal feature is the total number of distinct items, the retrieved lineage information should include only 1 representative data instance for each distinct item. For example, query Q₄ given in the “Intermediate result tables” section computes the feature “the total number of distinct medications ordered for the patient in the prior 12 months.” For a value of this feature, Cui et al’s techniques [23,37] would retrieve all medications ordered for the patient in the prior 12 months as the lineage information. This information is often overwhelming and not succinct enough for the user of the automated explaining function to quickly find the distinct medications ordered for the patient in the prior 12 months, as the same medication could be ordered for the patient multiple times in a year. To avoid this problem, one could retrieve only the most recent order of each distinct medication ordered for the patient in the prior 12 months as the lineage information. For the user, these distinct medications typically provide enough insight into the patient’s status related to the feature value.

Requirement 5: Performing No Lineage Tracing for Any Health Care System Feature Value Computed by an Aggregation Function

We do not trace the lineage of any health care system feature value computed by an aggregation function. We pose this requirement to avoid including irrelevant data in the retrieved lineage information. Like temporal features of a patient, certain health care system features [17-19] such as the number of patients with asthma of the primary care provider of a patient are computed by aggregation functions. These health care system features are each computed using multiple patients’ information rather than solely the information of the patient being examined. Since other patients’ detailed information does not help the user of the automated explaining function understand this patient’s situation, we do not trace the lineage of any value of such a feature, even if it appears on the left-hand side of a rule-style explanation.

Outline of the Proposed Techniques to Form the Lineage Tracing Query That Computes the Lineage Information

To perform automated lineage tracing for explaining machine learning predictions for clinical decision support, Cui et al’s lineage tracing techniques [23,37] are modified to fulfill the requirements mentioned above. Even without giving any detail on the computer coding implementation and the performance evaluation results, Cui et al [37] already used 49 pages to describe the details of their automated lineage tracing algorithm. The case described in this paper is more complex than Cui et al’s case [37]. In the case described in this paper, which attributes are most relevant and which source tuples are most essential for inclusion in the retrieved lineage information depend on both the concrete feature type and the clinical decision support application’s need. In comparison, no such dependency exists in Cui et al’s case [37]. Thus, it is expected that, once fully worked out, the proposed automated lineage tracing algorithm would be more sophisticated than Cui et al’s algorithm [37]. In this viewpoint paper, the goal is not to enumerate all possible feature types and to provide a detailed design or any computer coding implementation of the proposed automated lineage tracing approach. Rather, the goal is to describe the design approach for the proposed automated lineage tracing module and to provide a roadmap for future research. We achieve this goal by outlining the main steps of forming the lineage tracing query, giving 4 example temporal features, and illustrating at a high level how to form the lineage tracing query for each of these 4 features.

Overview of the Lineage Tracing Query Formation Process

Usually, each intermediate result table shown in Figure 3 has a patient_id column. It is used as the join column in the join operation to produce the unified data frame containing all features of the new data. As explained in “Reason 1” of the “Requirement 1” section, to obtain the lineage information of a temporal feature value, we need to only trace through the intermediate result table containing this value solely for this value. This intermediate result table is usually computed from some base tables by using a select-project-join-aggregate SQL query S₀. To form the lineage tracing query for a temporal feature value of a patient in the intermediate result table, one proceeds in 4 steps. First, the other temporal features, if any, are removed from S₀ to obtain a simplified query S₁. Second, if applicable, S₁ is transformed to query S₂ to fulfill subrequirement 4.1. Third, Cui et al’s techniques [23,37] are modified to address Reasons 2 and 3 given in the “Requirement 1” section. The modified techniques are used to form a preliminary lineage tracing query S₃ based on S₂ and the patient’s patient_id. Fourth, to obtain the final lineage tracing query, S₃ is transformed to fulfill Requirements 2 and 3 and subrequirement 4.2.

In the following, 4 examples are used to illustrate at a high level how to form the lineage tracing query. In each example, the user of the automated explaining function is examining a patient with asthma whose identifier is asthma_patient_id and wants to drill through a temporal feature value of this patient. We outline the main steps of forming the lineage tracing query for the feature value without giving the detailed algorithm.

Example 1: The Number of ED Visits That the Patient Had in the Prior 12 Months

As defined by query Q₁ in the “Intermediate result tables” section, the intermediate result table enc_features_1 contains 3 temporal features. One of them is the number of ED visits that the patient had in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.

First, the other 2 features are removed from query Q₁ to obtain query Q₉ given in the “Subrequirement 4.1” section.

Second, to fulfill subrequirement 4.1 on handling the sum of a variable computed by a case statement, query Q₉ is transformed to query Q₁₀ given in the “Subrequirement 4.1” section.

Third, Cui et al’s lineage tracing techniques [23,37] are used to form a draft lineage tracing query Q₁₁ based on Q₁₀ and asthma_patient_id.

The differences between Q₁₀ and Q₁₁ are highlighted in italics in Q₁₁. To address Reason 2 given in the “Requirement 1” section and retrieve from the encounter table only its attributes essential for automatic explanation, Q₁₁ is transformed to the following preliminary lineage tracing query.

The differences between Q₁₁ and Q₁₂ are highlighted in italics in Q₁₂.

Fourth, to fulfill Requirement 2, a primary diagnosis column needs to be added to the raw data that are retrieved by query Q₁₂ and that directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data need to be sorted in the reverse chronological order. To meet both demands, Q₁₂ is transformed to the following final lineage tracing query.

The differences between Q₁₂ and Q₁₃ are highlighted in italics in Q₁₃. || is the string concatenation operator in SQL.
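A final lineage tracing query of the kind described in this example might resemble the sketch below. It is not the paper’s Q₁₃. The enc_type and is_primary columns, the way the diagnosis table links encounters to codes, and all sample rows (including provider and facility names) are assumptions. Only the four encounter attributes named for Table 2 and the diagnosis_code and dx_code_description columns come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
    admit_time text, department text, admitting_provider text, facility text,
    enc_type text);                                    -- enc_type: assumed
create table diagnosis (encounter_id integer, diagnosis_code text,
    is_primary integer);                               -- is_primary: assumed
create table diagnosis_code_master (diagnosis_code text primary key,
    dx_code_description text);
insert into encounter values
    (1, 7, '2020-11-11 08:15:00', 'Emergency', 'Dr. A', 'Main Hospital', 'ED'),
    (2, 7, '2021-02-03 22:40:00', 'Emergency', 'Dr. B', 'Main Hospital', 'ED'),
    (3, 7, '2021-01-20 10:00:00', 'Pulmonology', 'Dr. C', 'Clinic', 'outpatient');
insert into diagnosis values (1, 'J45.901', 1), (1, 'R05', 0), (2, 'R07.9', 1);
insert into diagnosis_code_master values
    ('J45.901', 'Unspecified asthma with (acute) exacerbation'),
    ('R05', 'Cough'), ('R07.9', 'Chest pain, unspecified');
""")

final_query = """
select date(e.admit_time) as admit_date, e.department,
       e.admitting_provider, e.facility,
       m.diagnosis_code || ' ' || m.dx_code_description as primary_diagnosis
from encounter e
left join diagnosis d                 -- Requirement 2: added attribute
       on d.encounter_id = e.encounter_id and d.is_primary = 1
left join diagnosis_code_master m on m.diagnosis_code = d.diagnosis_code
where e.patient_id = :pid and e.enc_type = 'ED'
  and date(e.admit_time) > date(:end, '-12 months')
  and date(e.admit_time) <= :end
order by e.admit_time desc            -- Requirement 3: reverse chronological
"""
rows = conn.execute(final_query, {"pid": 7, "end": "2021-03-01"}).fetchall()
for row in rows:
    print(row)
```

The left joins are a design choice of this sketch: an ED visit without a recorded primary diagnosis still appears in the result, with an empty primary diagnosis column.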

Example 2: The Number of Outpatient Visits With a Primary Diagnosis of Asthma That the Patient Had in the Prior 12 Months

As defined by query Q₂ in the “Intermediate result tables” section, the intermediate result table enc_features_2 contains the temporal feature “the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months.” To form the lineage tracing query for a value of this feature, one proceeds as follows.

First, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the encounter table. To address Reason 3 given in the “Requirement 1” section, no attribute or tuple from the diagnosis table should be included in the retrieved lineage information. A preliminary lineage tracing query Q₁₄ is formed based on query Q₂ and asthma_patient_id by using a modified version of Cui et al’s lineage tracing techniques [23,37] that meets both demands.

The differences between Q₂ and Q₁₄ are highlighted in italics in Q₁₄.

Second, to fulfill Requirement 3 of sorting the related raw data retrieved for the feature value in the reverse chronological order, query Q₁₄ is transformed to the following final lineage tracing query.

The differences between Q₁₄ and Q₁₅ are highlighted in italics in Q₁₅.
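The idea in this example, using the diagnosis table to select encounters without returning any of its attributes or tuples, can be sketched as follows. This is not the paper’s Q₁₅: the enc_type and is_primary columns, the J45 code prefix used to flag asthma, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
    admit_time text, department text, enc_type text);   -- enc_type: assumed
create table diagnosis (encounter_id integer, diagnosis_code text,
    is_primary integer);                                -- is_primary: assumed
insert into encounter values
    (1, 7, '2021-01-20', 'Pulmonology', 'outpatient'),
    (2, 7, '2021-02-10', 'Family Medicine', 'outpatient'),
    (3, 7, '2020-06-02', 'Allergy', 'outpatient'),
    (4, 7, '2021-02-25', 'Emergency', 'ED');
insert into diagnosis values
    (1, 'J45.20', 1), (2, 'R05', 1), (2, 'J45.20', 0),
    (3, 'J45.40', 1), (4, 'J45.901', 1);
""")

final_query = """
select date(e.admit_time) as admit_date, e.department
from encounter e
where e.patient_id = :pid and e.enc_type = 'outpatient'
  and date(e.admit_time) > date(:end, '-12 months')
  and date(e.admit_time) <= :end
  -- The diagnosis table only filters; none of its columns are returned.
  and exists (select 1 from diagnosis d
              where d.encounter_id = e.encounter_id
                and d.is_primary = 1 and d.diagnosis_code like 'J45%')
order by e.admit_time desc
"""
rows = conn.execute(final_query, {"pid": 7, "end": "2021-03-01"}).fetchall()
print(rows)
```

Encounter 2 is excluded because asthma is only its secondary diagnosis, and encounter 4 is excluded because it is an ED visit.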

Example 3: The Number of ED Visits Related to Asthma That the Patient Had in the Prior 12 Months

As defined by query Q₃ in the “Intermediate result tables” section, the intermediate result table enc_features_3 contains 2 temporal features. One of them is the number of ED visits related to asthma that the patient had in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.

First, the other feature is removed from query Q₃ to obtain the following simplified query.

Second, to fulfill subrequirement 4.1 on handling the sum of a variable computed by a case statement, query Q₁₆ is transformed to the following query.

The differences between Q₁₆ and Q₁₇ are highlighted in italics in Q₁₇.

Third, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the encounter table. To address Reason 3 given in the “Requirement 1” section, the intermediate query result e_id should not be traced through to include any corresponding tuple in the diagnosis table in the retrieved lineage information. A preliminary lineage tracing query Q₁₈ is formed based on query Q₁₇ and asthma_patient_id by using a modified version of Cui et al’s lineage tracing techniques [23,37] that meets both demands.

The differences between Q₁₇ and Q₁₈ are highlighted in italics in Q₁₈.

Cui et al’s lineage tracing techniques [23,37,49] are applied to query Q₃ to create a materialized view asthma_encounter_id, which is defined by query Q₅ in the “Review of Cui et al’s automated lineage tracing techniques for relational databases” section. The asthma_encounter_id is used to rewrite the preliminary lineage tracing query Q₁₈ as follows.

The differences between Q₁₈ and Q₁₉ are highlighted in italics in Q₁₉.

Fourth, to fulfill Requirement 2, a primary diagnosis column needs to be added to the raw data that are retrieved by query Q₁₉ and that directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data need to be sorted in the reverse chronological order. To meet both demands, Q₁₉ is transformed to the following final lineage tracing query.

The differences between Q₁₉ and Q₂₀ are highlighted in italics in Q₂₀.

Example 4: The Total Number of Distinct Medications Ordered for the Patient in the Prior 12 Months

As defined by query Q₄ in the “Intermediate result tables” section, the intermediate result table med_features_1 contains 2 temporal features. One of them is the total number of distinct medications ordered for the patient in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.

First, the other feature is removed from query Q₄ to obtain the following simplified query.

Second, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the ordered_medication table. A preliminary lineage tracing query Q₂₂ is formed based on query Q₂₁ and asthma_patient_id by using a modified version of Cui et al’s lineage tracing techniques [23,37] that meets this demand.

The differences between Q₂₁ and Q₂₂ are highlighted in italics in Q₂₂.

Third, to fulfill subrequirement 4.2, one could retrieve only the most recent order of each distinct medication ordered for the patient in the prior 12 months as the lineage information. This is done by transforming query Q₂₂ to the following query.

The differences between Q₂₂ and Q₂₃ are highlighted in italics in Q₂₃.

Fourth, to fulfill Requirement 2, a medication name column is added to the raw data that are retrieved by query Q₂₃ and directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data are sorted in the reverse chronological order. Q₂₃ is transformed to the following final lineage tracing query to meet both demands.

The differences between Q₂₃ and Q₂₄ are highlighted in italics in Q₂₄.
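One way to retrieve only the most recent order of each distinct medication, add a medication name, and sort the result is sketched below. It is not the paper’s Q₂₄. Only the ordered_medication table name comes from the text; its columns, the medication_master lookup table, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table ordered_medication (order_id integer primary key, patient_id integer,
    medication_id integer, order_time text);             -- columns: assumed
create table medication_master (medication_id integer primary key,
    medication_name text);                               -- hypothetical table
insert into ordered_medication values
    (1, 7, 10, '2020-05-01'), (2, 7, 10, '2021-01-15'), (3, 7, 11, '2020-08-09'),
    (4, 7, 12, '2019-12-01'), (5, 8, 11, '2021-02-02');
insert into medication_master values
    (10, 'albuterol'), (11, 'fluticasone'), (12, 'montelukast');
""")
params = {"pid": 7, "end": "2021-03-01"}
window = """o.patient_id = :pid
            and date(o.order_time) > date(:end, '-12 months')
            and date(o.order_time) <= :end"""

# The feature value: the number of distinct medications ordered in the window.
feature_value = conn.execute(
    f"select count(distinct o.medication_id) from ordered_medication o where {window}",
    params).fetchone()[0]

# Lineage: one representative row (the most recent order) per medication.
final_query = f"""
select m.medication_name, max(date(o.order_time)) as last_order_date
from ordered_medication o
join medication_master m on m.medication_id = o.medication_id
where {window}
group by o.medication_id, m.medication_name
order by last_order_date desc
"""
rows = conn.execute(final_query, params).fetchall()
print(feature_value, rows)
```

With one row per distinct medication, the number of retrieved rows equals the feature value, which makes the displayed lineage easy to check against the explanation.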

Considerations for Future Computer Coding Implementation of the Proposed Automated Lineage Tracing Approach

Maximizing the Automation Degree of the Lineage Tracing Query Formation Process

For a select-project-join-aggregate materialized view, Cui et al [23,37] used a fully automated approach to analyze its definition query to derive a lineage tracing query for a tuple in it. In the case of automatically explaining machine learning predictions, all temporal features used for making predictions and automatic explanation are known at machine learning model building time. In general, for each temporal feature, we can form a lineage tracing query either manually or semiautomatically, but often not fully automatically, beforehand. Nevertheless, once the query is formed and put into the knowledge base of the automated explaining function, we can use the query to automatically retrieve the lineage information of a value of the feature at prediction time.

As mentioned before, automatic explanation poses several unique requirements on automated lineage tracing. Two of them make it difficult to fully automate the lineage tracing query formation process. First, Requirement 1 says that the lineage information retrieved for a temporal feature value should include only a small set of relevant attributes specific to the temporal feature. An almost unlimited number of attributes and temporal features could possibly be used for clinical machine learning. Thus, it is infeasible to precompile the set of relevant attributes for every possible temporal feature. Second, Requirement 2 says that when acquiring the lineage of a value for certain temporal features, we need to include some attributes that are specific to the temporal feature and do not directly produce the feature value. For a reason similar to the above, it is infeasible to precompile the set of such attributes for every possible such temporal feature.

Although the lineage tracing query formation process cannot be fully automated in the most general case, 2 methods can still be used to maximize the degree of automation of the process and to reduce the workload of the developers of the automated explaining function. First, for a temporal feature, an approach similar to that of Cui et al [23,37] can be used to automatically form a draft lineage tracing query. The developers of the automated explaining function revise this query as needed to obtain the final lineage tracing query. Second, the same temporal feature is often used for multiple predictive modeling tasks. One can create a library of lineage tracing queries for temporal features to facilitate query reuse across various predictive modeling tasks. This library can be formed for a data set in the Observational Medical Outcomes Partnership common data model format [50] using its linked standardized terminologies [51]. This format standardizes administrative and clinical variables from ≥10 large US health care systems [52,53]. For any data set that is put into this format, we can use this library to obtain lineage tracing queries.
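A library of prewritten lineage tracing queries could be organized along the lines of the sketch below. The dictionary LINEAGE_QUERIES, the feature keys, the function trace_feature, the enc_type column, and the sample rows are all hypothetical, and the paper’s table names are used rather than the common data model’s.

```python
import sqlite3

# Hypothetical library: feature name -> prewritten, parameterized tracing query.
LINEAGE_QUERIES = {
    "num_ed_visits_12m": """
        select date(admit_time), department from encounter
        where patient_id = :pid and enc_type = 'ED'
          and date(admit_time) > date(:end, '-12 months')
          and date(admit_time) <= :end
        order by admit_time desc""",
}

def trace_feature(conn, feature, patient_id, end_date):
    """Return lineage rows, or None when the library holds no query for the feature."""
    query = LINEAGE_QUERIES.get(feature)
    if query is None:
        return None  # e.g., health care system features (Requirement 5)
    return conn.execute(query, {"pid": patient_id, "end": end_date}).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
    admit_time text, department text, enc_type text);   -- enc_type: assumed
insert into encounter values
    (1, 7, '2021-02-03', 'Emergency', 'ED'),
    (2, 7, '2021-01-20', 'Pulmonology', 'outpatient');
""")
found = trace_feature(conn, "num_ed_visits_12m", 7, "2021-03-01")
missing = trace_feature(conn, "pcp_num_asthma_patients", 7, "2021-03-01")
print(found, missing)
```

Keying the library by feature lets several predictive modeling tasks share one tracing query, and a feature with no entry simply gets no hyperlink.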

Improving the Lineage Tracing Speed

As mentioned before, the user of the automated explaining function wants the lineage tracing process for a temporal feature value to be finished quickly, preferably within 1 second. To expedite tracing the lineage of a tuple in a materialized view defined by a select-project-join-aggregate query S, Cui et al [23,37,49] advocated creating a materialized view for each intermediate select-project-join-aggregate segment of the canonical form of the logical query plan for S. While this boosts the lineage tracing speed, the resulting speed is still not fast enough to reach a subsecond response time [23,39]. To further improve the lineage tracing speed, we can build indices [39,42] on the selection and join attributes of both the base tables and the materialized views created for the intermediate select-project-join-aggregate segments. For instance, in Example 3, we can build 1 index on the encounter_id column of the materialized view asthma_encounter_id and another index on the patient_id column of the encounter base table. We can create indices either manually or by using an automated index design tool provided by a commercial relational database system [54-56]. Typically, each intermediate result table containing 1 or more temporal features is computed on 1 or a few base tables using no more than a small number of join operations. The lineage tracing query for a temporal feature value falls into a similar case. Thus, with appropriate indices, we would expect the lineage tracing query to finish execution quickly. For base tables of moderate sizes and simple materialized views, Cui and Widom [39] showed that lineage tracing can be done within 1 second when indices exist on the keys of the base tables. For large base tables and temporal features computed through more complex procedures, we would expect that more indices are needed to reach a subsecond response time.
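The two indices mentioned for Example 3 can be tried out in SQLite as in the sketch below, which inspects the query plan to confirm that each lookup uses its index. The index names and the helper function plan are assumptions, an ordinary table stands in for the materialized view, and the plan text varies across SQLite versions.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the detail column of SQLite's query plan as one string."""
    rows = conn.execute("explain query plan " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table encounter (encounter_id integer primary key, patient_id integer,
                        admit_time text);
create table asthma_encounter_id (encounter_id integer);  -- stand-in for the view
-- Index on the selection attribute of the base table.
create index idx_encounter_patient on encounter (patient_id);
-- Index on the join attribute of the intermediate materialized view.
create index idx_asthma_encounter on asthma_encounter_id (encounter_id);
""")

encounter_plan = plan(
    conn, "select encounter_id, admit_time from encounter where patient_id = ?", (7,))
view_plan = plan(
    conn, "select encounter_id from asthma_encounter_id where encounter_id = ?", (1,))
join_plan = plan(conn, """
    select e.encounter_id, e.admit_time
    from encounter e join asthma_encounter_id a on a.encounter_id = e.encounter_id
    where e.patient_id = ?""", (7,))

print(encounter_plan)
print(view_plan)
print(join_plan)
```

Whether such indices are enough to reach a subsecond response time on large base tables remains to be measured, as the text notes.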

The above discussion focuses on the case in which the electronic medical record data are stored in a relational database and features are extracted using SQL queries. When the electronic medical record data are stored in a big data system and features are extracted using map and reduce functions [44] or Pig Latin [46], we can modify the corresponding existing lineage tracing techniques [42,43,45] in a similar way, so that lineage tracing can aid automatically explaining machine learning predictions for clinical decision support.
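To make the map and reduce setting concrete, the sketch below shows one simple way to keep track of which input records contribute to each aggregated feature value. It is an illustration only and does not reproduce the cited techniques [42,43,45]: the record layout, the field names, and the function names are hypothetical, and the whole computation runs in a single Python process rather than in a big data system.

```python
from collections import defaultdict


def map_fn(record):
    """Emit (patient_id, 1) for each asthma-related ED visit record."""
    if record["type"] == "ED" and record["diagnosis"] == "asthma":
        yield record["patient_id"], 1


def reduce_fn(key, values):
    """Aggregate the emitted values for one key into a visit count."""
    return key, sum(values)


def run_with_lineage(records):
    """Run map and reduce while remembering, for each output key, the IDs of
    the input records that fed into it."""
    grouped = defaultdict(list)
    lineage = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
            lineage[key].append(record["record_id"])
    outputs = dict(reduce_fn(key, values) for key, values in grouped.items())
    return outputs, dict(lineage)


if __name__ == "__main__":
    sample = [
        {"record_id": "r1", "patient_id": 7, "type": "ED", "diagnosis": "asthma"},
        {"record_id": "r2", "patient_id": 7, "type": "ED", "diagnosis": "fracture"},
        {"record_id": "r3", "patient_id": 7, "type": "ED", "diagnosis": "asthma"},
    ]
    counts, sources = run_with_lineage(sample)
    print(counts, sources)
```

Here the lineage of a feature value is simply the list of record IDs stored under the same key, so drilling through a count amounts to fetching those records.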

Discussion

Directions for Future Research

The above discussion describes the high-level design approach for the proposed automated lineage tracing module. To complete the detailed design of the proposed automated lineage tracing approach, implement the module in computer code, and test the module’s performance, much research is needed along the following directions:

We need to compile a list of attributes and temporal feature types most commonly used in building clinical machine learning predictive models. For these attributes and temporal feature types, we need to complete the detailed design and the computer coding implementation of the proposed automated lineage tracing approach.

We need to develop an automated approach to design the indices needed for improving the lineage tracing speed. The database research community has developed several automated index design approaches [54-56]. We can modify these approaches to fit the database querying workload posed by automated lineage tracing.

We plan to assess the execution speed of the proposed automated lineage tracing approach after implementing it in computer code.

As shown by the prior work reviewed in the “Overview of the existing automated lineage tracing techniques” section, the database research community takes it for granted that automated lineage tracing helps users better understand the data and saves time in data analysis. To the best of our knowledge, no formal study has been published to date that measures the impact of automated lineage tracing on users’ data analysis and decision-making process. After implementing the proposed automated lineage tracing module, we plan to choose several clinical predictive modeling tasks and assess, for each task, the impact of offering the module on the data analysis and decision-making process of the users of the automated explaining function. In particular, we plan to evaluate whether adding the module benefits the user and improves outcomes, for example, by saving the user’s time, making it easier for the user to understand the predictions given by the machine learning predictive model, and helping the user better understand the patient’s situation and make better clinical decisions.
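For the planned assessment of execution speed listed among the directions above, a simple timing harness could compare the lineage tracing time against the 1-second target before and after an index is built. The sketch below is a hypothetical illustration on synthetic data: the table layout, the is_asthma_ed flag, the data sizes, and the function names are assumptions, and timings on an in-memory SQLite database say nothing about performance on a real clinical data warehouse.

```python
import sqlite3
import time

TARGET_SECONDS = 1.0  # response time target mentioned in the text

# Hypothetical lineage tracing query on a synthetic table.
SQL = (
    "SELECT encounter_id FROM encounter "
    "WHERE patient_id = ? AND is_asthma_ed = 1 ORDER BY encounter_id"
)


def build_synthetic_db(num_patients=2000, encounters_per_patient=20):
    """Create an in-memory table in which every 5th encounter is flagged."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE encounter ("
        "encounter_id INTEGER PRIMARY KEY, "
        "patient_id INTEGER, "
        "is_asthma_ed INTEGER)"
    )
    rows = (
        (p, 1 if e % 5 == 0 else 0)
        for p in range(num_patients)
        for e in range(encounters_per_patient)
    )
    conn.executemany(
        "INSERT INTO encounter (patient_id, is_asthma_ed) VALUES (?, ?)", rows
    )
    return conn


def time_query(conn, sql, params, repeats=5):
    """Return (best wall-clock seconds, rows) over several runs."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows


if __name__ == "__main__":
    db = build_synthetic_db()
    before, _ = time_query(db, SQL, (42,))
    db.execute("CREATE INDEX idx_encounter_patient_id ON encounter (patient_id)")
    after, _ = time_query(db, SQL, (42,))
    print(f"without index: {before:.6f} s, with index: {after:.6f} s")
    print("within target:", after <= TARGET_SECONDS)
```

Taking the best of several runs reduces the influence of one-off delays; an evaluation on production-sized data would also need to account for caching and concurrent load.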

Limitations of the Proposed Approach

The proposed automated lineage tracing approach has several limitations:

To build clinical machine learning predictive models, we usually use temporal features that are computed by SQL queries of low or moderate complexity. However, some temporal features used to build certain predictive models may be computed by rather complex SQL queries. We may not be able to finish the lineage tracing process for a value of such a temporal feature quickly, regardless of how many indices are built to expedite this process. For example, this could happen if the SQL query uses complex procedural code, which has no property that can be used to simplify the lineage tracing process [39]. A long lineage tracing time could make the user of the automated explaining function impatient. Nevertheless, it is still faster and more convenient to do lineage tracing using the automated approach than to let the user do manual drill-through.

The proposed automated lineage tracing approach works for any feature values computed by the standard aggregation functions in SQL on longitudinal structured data. For certain deep learning predictive models built on longitudinal structured data, the previously proposed method [16] could be used to semiautomatically extract comprehensible and predictive temporal features from the models and the longitudinal structured data. The automated approach could then be applied to trace the lineage of the values of these features. For any other deep learning predictive model that is built directly on longitudinal structured data and that uses incomprehensible features hidden in the neurons of the deep neural network, the proposed automated approach cannot be used to trace the lineage of the values of these features.

An almost unlimited number of attributes and temporal features could be used for clinical machine learning. Further, some attributes are not covered by the Observational Medical Outcomes Partnership common data model. For the reasons given in the “Maximizing the automation degree of the lineage tracing query formation process” section, we can maximize the automation degree of the lineage tracing query formation process only for certain types of temporal features formed on certain attributes. For any other temporal feature, the developers of the automated explaining function could still need a nontrivial amount of time to create the corresponding lineage tracing query.

Conclusions

Automatically explaining machine learning predictions is critical to overcome the model interpretability barrier to using machine learning predictive models in clinical practice. Our previously developed automatic explanation method for machine learning predictions can be used to address this barrier, but a gap remains in fulfilling the need to rapidly drill through a feature value in an explanation that is computed by an aggregation function on the raw data. This paper articulates this gap, outlines an automated lineage tracing approach to close the gap, and provides a roadmap for future research. The automated drill-through capability is intended to help the user of the automated explaining function save time, better understand the patient’s situation, and make better clinical decisions. It would take several people multiple years to work out the detailed design and the computer coding implementation of the proposed automated lineage tracing approach. We hope this paper will interest some researchers in joining the research endeavor on this topic. Only after the detailed design and the computer coding implementation of the proposed automated lineage tracing approach are fully worked out can one deploy the automated lineage tracing module in clinical practice and measure the module’s impact on clinicians’ decision-making process. The principle of the automated lineage tracing approach generalizes to nonmedical data and to other automated methods for explaining machine learning predictions.

Abbreviations

ED: emergency department

SQL: structured query language

We thank Xiaoyi Zhang and Brian Kelly for the useful discussions. GL was partially supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health under award number R01HL142503. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflicts of Interest

None declared.

References

1. Kaggle. URL: https://www.kaggle.com [accessed 2021-04-30]

2. Steyerberg. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, 2nd ed. New York, USA: Springer; 2019.

3. Lee, Wang, Dipuro, Hou, Grover, Low, Liu, Loke. Leveraging on predictive analytics to manage clinic no show and improve accessibility of care. In: Proceedings of 2017 IEEE International Conference on Data Science and Advanced Analytics; October 19-21, 2017; Tokyo, Japan. 2017:429-438. doi: 10.1109/dsaa.2017.25

4. Dean, Jones, Ferraro, Post, Aronsky, Vines, Allen, Haug. Impact of an electronic clinical decision support tool for emergency department patients with pneumonia. Ann Emerg Med 2015;66(5):511-520. doi: 10.1016/j.annemergmed.2015.02.003. PMID: 25725592. pii: S0196-0644(15)00091-8

5. Hsu, Chen, Chung, Tan, Chen, Chiang. Clinical verification of a clinical decision support system for ventilator weaning. Biomed Eng Online 2013;12 Suppl 1:S4. doi: 10.1186/1475-925X-12-S1-S4. PMID: 24565021. pii: 1475-925X-12-S1-S4. PMCID: PMC4028887

6. Barbieri, Molina, Ponce, Tothova, Cattinelli, Ion Titapiccolo, Mari, Amato, Leipold, Wehmeyer, Stuard, Stopper, Canaud. An international observational study suggests that artificial intelligence for clinical decision support optimizes anemia management in hemodialysis patients. Kidney Int 2016;90(2):422-429. doi: 10.1016/j.kint.2016.03.036. PMID: 27262365. pii: S0085-2538(16)30132-6

7. Brier, Gaweda, Dailey, Aronoff, Jacobs. Randomized trial of model predictive control for improved anemia management. Clin J Am Soc Nephrol 2010 May;5(5):814-820. doi: 10.2215/CJN.07181009. PMID: 20185598. pii: CJN.07181009. PMCID: PMC2863987

8. Gaweda, Aronoff, Jacobs, Rai, Brier. Individualized anemia management reduces hemoglobin variability in hemodialysis patients. J Am Soc Nephrol 2014 Jan;25(1):159-166. doi: 10.1681/ASN.2013010089. PMID: 24029429. pii: ASN.2013010089. PMCID: PMC3871773

9. Gaweda, Jacobs, Aronoff, Brier. Model predictive control of erythropoietin administration in the anemia of ESRD. Am J Kidney Dis 2008 Jan;51(1):71-79. doi: 10.1053/j.ajkd.2007.10.003. PMID: 18155535. pii: S0272-6386(07)01353-4

10. Hamlet, Hobgood, Hamar, Dobbs, Rula, Pope. Impact of predictive model-directed end-of-life counseling for Medicare beneficiaries. Am J Manag Care 2010 May;16(5):379-384. PMID: 20469958. pii: 12641

11. Luo. Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction. Health Inf Sci Syst 2016;4:2. doi: 10.1186/s13755-016-0015-4. PMID: 26958341. PMCID: PMC4782293

12. Luo, Johnson, Nkoy, Stone. Automatically explaining machine learning prediction results on asthma hospital visits in asthmatic patients: secondary analysis. JMIR Med Inform 2020 Dec 31;8(12):e21965. doi: 10.2196/21965. PMID: 33382379. pii: v8i12e21965. PMCID: PMC7808890

13. Tong, Messinger, Luo. Testing the generalizability of an automated method for explaining machine learning predictions on asthma patients' asthma hospital visits to an academic health care system. IEEE Access 2020;8:195971-195979. doi: 10.1109/access.2020.3032683. PMID: 33240737. PMCID: PMC7685253

14. Luo, Nau, Crawford, Schatz, Zeiger, Koebnick. Generalizability of an automatic explanation method for machine learning prediction results on asthma-related hospital visits in patients with asthma: quantitative analysis. J Med Internet Res 2021 Apr 15;23(4):e24153. doi: 10.2196/24153. PMID: 33856359. pii: v23i4e24153

15. Halamka. Early experiences with big data at an academic medical center. Health Aff (Millwood) 2014 Jul;33(7):1132-1138. doi: 10.1377/hlthaff.2014.0031. PMID: 25006138. pii: 33/7/1132

16. Luo. A roadmap for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling. Glob Transit 2019;1:61-82. doi: 10.1016/j.glt.2018.11.001. PMID: 31032483. PMCID: PMC6482973

17. Luo, Nau, Crawford, Schatz, Zeiger, Rozema, Koebnick. Developing a predictive model for asthma-related hospital encounters in patients with asthma in a large, integrated health care system: secondary analysis. JMIR Med Inform 2020 Nov 09;8(11):e22689. doi: 10.2196/22689. PMID: 33164906. pii: v8i11e22689. PMCID: PMC7683251

18. Tong, Messinger, Wilcox, Mooney, Davidson, Suri, Luo. Forecasting future asthma hospital encounters of patients with asthma in an academic health care system: predictive model development and secondary analysis study. J Med Internet Res 2021 Apr 16;23(4):e22796. doi: 10.2196/22796. PMID: 33861206. pii: v23i4e22796

19. Luo, Stone, Nkoy, Johnson. Developing a model to predict hospital encounters for asthma in asthmatic patients: secondary analysis. JMIR Med Inform 2020 Jan 21;8(1):e16080. doi: 10.2196/16080. PMID: 31961332. pii: v8i1e16080. PMCID: PMC7001050

20. Garcia-Molina, Ullman, Widom. Database Systems: the Complete Book, 2nd ed. Upper Saddle River, NJ: Pearson; 2008.

21. Cunningham, Graefe, Galindo-Legaria. PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS. In: Proceedings of the 30th International Conference on Very Large Data Bases; August 31-September 3, 2004; Toronto, Canada. 2004:998-1009. doi: 10.1016/b978-012088469-8.50087-5

22. Lyman, Scully, Harrison JH Jr. The development of health care data warehouses to support data mining. Clin Lab Med 2008 Mar;28(1):55-71. doi: 10.1016/j.cll.2007.10.003. PMID: 18194718. pii: S0272-2712(07)00112-6

23. Cui, Widom. Practical lineage tracing in data warehouses. In: Proceedings of the 16th International Conference on Data Engineering; February 28-March 3, 2000; San Diego, CA. 2000:367-378. doi: 10.1109/icde.2000.839437

24. Liu, Hsu. Integrating classification and association rule mining. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining; August 27-31, 1998; New York City, USA. 1998:80-86.

25. Fayyad, Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence; August 28-September 3, 1993; Chambéry, France. 1993:1022-1029.

26. Thabtah. A review of associative classification mining. The Knowledge Engineering Review 2007 Mar 01;22(1):37-65. doi: 10.1017/s0269888907001026

27. Alaa, van der Schaar. Prognostication and risk factors for cystic fibrosis via automated machine learning. Sci Rep 2018 Jul 26;8(1):11242. doi: 10.1038/s41598-018-29523-2. PMID: 30050169. pii: 10.1038/s41598-018-29523-2. PMCID: PMC6062529

28. Alaa, van der Schaar. AutoPrognosis: automated clinical prognostic modeling via Bayesian optimization with structured kernel learning. In: Proceedings of 35th International Conference on Machine Learning; July 10-15, 2018; Stockholm, Sweden. 2018:139-148.

29. Molnar. Interpretable Machine Learning. Morrisville, NC: lulu.com; 2020.

30. Guidotti, Monreale, Ruggieri, Turini, Giannotti, Pedreschi. A survey of methods for explaining black box models. ACM Comput Surv 2019 Jan 23;51(5):93. doi: 10.1145/3236009

31. Rudin, Shaposhnik. Globally-consistent rule-based summary-explanations for machine learning models: application to credit-risk evaluation. In: Proceedings of INFORMS 11th Conference on Information Systems and Technology; October 19-20, 2019; Seattle, WA. 2019:1-19. doi: 10.2139/ssrn.3395422

32. Ribeiro, Singh, Guestrin. Anchors: high-precision model-agnostic explanations. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence; February 2-7, 2018; New Orleans, LA. 2018:1527-1535.

33. Ikeda, Widom. Data lineage: a survey. Stanford University Technical Report. URL: http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf [accessed 2021-04-30]

34. Cheney, Chiticariu, Tan. Provenance in Databases: Why, How, and Where. Found Trends Databases 2009;1(4):379-474. doi: 10.1561/1900000006

35. Simmhan, Plale, Gannon. A survey of data provenance in e-science. SIGMOD Rec 2005 Sep;34(3):31-36. doi: 10.1145/1084805.1084812

36. Bose, Frew. Lineage retrieval for scientific data processing: a survey. ACM Comput Surv 2005 Mar;37(1):1-28. doi: 10.1145/1057977.1057978

37. Cui, Widom, Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans Database Syst 2000 Jun;25(2):179-227. doi: 10.1145/357775.357777

38. Gupta, Mumick. Materialized Views: Techniques, Implementations, and Applications. Cambridge, MA: The MIT Press; 1999.

39. Cui, Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal The International Journal on Very Large Data Bases 2003 May 1;12(1):41-58. doi: 10.1007/s00778-002-0083-8

40. Ikeda, Sarma, Widom. Logical provenance in data-oriented workflows. In: Proceedings of the 29th IEEE International Conference on Data Engineering; April 8-12, 2013; Brisbane, Australia. 2013:877-888. doi: 10.1109/icde.2013.6544882

41. Zhang, Prabhakar. Tracing lineage beyond relational operators. In: Proceedings of the 33rd International Conference on Very Large Data Bases; September 23-27, 2007; Vienna, Austria. 2007:1116-1127.

42. Ikeda, Park, Widom. Provenance for generalized map and reduce workflows. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research; January 9-12, 2011; Asilomar, CA. 2011:273-283.

43. Park, Ikeda, Widom. RAMP: a system for capturing and tracing provenance in MapReduce workflows. Proc VLDB Endow 2011 Aug;4(12):1351-1354. doi: 10.14778/3402755.3402768

44. Dean, Ghemawat. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design and Implementation; December 6-8, 2004; San Francisco, CA. 2004:137-150.

45. Amsterdamer, Davidson, Deutch, Milo, Stoyanovich, Tannen. Putting Lipstick on Pig: enabling database-style workflow provenance. Proc VLDB Endow 2011 Dec;5(4):346-357. doi: 10.14778/2095686.2095693

46. Olston, Reed, Srivastava, Kumar, Tomkins. Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; June 10-12, 2008; Vancouver, BC, Canada. 2008:1099-1110. doi: 10.1145/1376616.1376726

47. Buneman, Chapman, Cheney. Provenance management in curated databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; June 27-29, 2006; Chicago, IL. 2006:539-550. doi: 10.1145/1142473.1142534

48. Schelter, Böse, Kirschnick, Klein, Seufert. Automatically tracking metadata and provenance of machine learning experiments. In: Proceedings of the ML Systems Workshop at NIPS 2017; December 8, 2017; Long Beach, CA. 2017:1-8.

49. Cui, Widom. Storing auxiliary data for efficient maintenance and lineage tracing of complex views. In: Proceedings of the Second Intl Workshop on Design and Management of Data Warehouses; June 5-6, 2000; Stockholm, Sweden. 2000:1-19.

50. Data standardization. Observational Health Data Sciences and Informatics. URL: https://www.ohdsi.org/data-standardization [accessed 2021-04-30]

51. Standardized vocabularies. Observational Health Data Sciences and Informatics. URL: https://www.ohdsi.org/web/wiki/doku.php?id=documentation:vocabulary:sidebar [accessed 2021-04-30]

52. Hripcsak, Duke, Shah, Reich, Huser, Schuemie, Suchard, Park, Wong ICK, Rijnbeek, van der Lei, Pratt, Norén, Stang, Madigan, Ryan. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform 2015;216:574-578. PMID: 26262116. PMCID: PMC4815923

53. Overhage, Ryan, Reich, Hartzema, Stang. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc 2012;19(1):54-60. doi: 10.1136/amiajnl-2011-000376. PMID: 22037893. pii: amiajnl-2011-000376. PMCID: PMC3240764

54. Das, Grbic, Ilic, Jovandic, Jovanovic, Narasayya, Radulovic, Stikic, Chaudhuri. Automatically indexing millions of databases in Microsoft Azure SQL database. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; June 30-July 5, 2019; Amsterdam, Netherlands. 2019:666-679. doi: 10.1145/3299869.3314035

55. Dageville, Das, Dias, Yagoub, Zaït, Ziauddin. Automatic SQL tuning in Oracle 10g. In: Proceedings of the 30th International Conference on Very Large Data Bases; August 31-September 3, 2004; Toronto, Canada. 2004:1098-1109. doi: 10.1016/b978-012088469-8.50096-6

56. Zilio, Rao, Lightstone, Lohman, Storm, Garcia-Arellano, Fadden. DB2 Design Advisor: integrated automatic physical database design. In: Proceedings of the 30th International Conference on Very Large Data Bases; August 31-September 3, 2004; Toronto, Canada. 2004:1087-1097. doi: 10.1016/b978-012088469-8.50095-4