<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="2.0">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">JMI</journal-id>
      <journal-id journal-id-type="nlm-ta">JMIR Med Inform</journal-id>
      <journal-title>JMIR Medical Informatics</journal-title>
      <issn pub-type="epub">2291-9694</issn>
      <publisher>
        <publisher-name>JMIR Publications</publisher-name>
        <publisher-loc>Toronto, Canada</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">v9i5e27778</article-id>
      <article-id pub-id-type="pmid">34042600</article-id>
      <article-id pub-id-type="doi">10.2196/27778</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Viewpoint</subject>
        </subj-group>
        <subj-group subj-group-type="article-type">
          <subject>Viewpoint</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>A Roadmap for Automating Lineage Tracing to Aid Automatically Explaining Machine Learning Predictions for Clinical Decision Support</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <name>
            <surname>Lovis</surname>
            <given-names>Christian</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Rajan</surname>
            <given-names>Vaibhav</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib id="contrib1" contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Luo</surname>
            <given-names>Gang</given-names>
          </name>
          <degrees>DPhil</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <address>
            <institution>Department of Biomedical Informatics and Medical Education</institution>
            <institution>University of Washington</institution>
            <addr-line>UW Medicine South Lake Union</addr-line>
            <addr-line>850 Republican Street, Building C, Box 358047</addr-line>
            <addr-line>Seattle, WA, 98195</addr-line>
            <country>United States</country>
            <phone>1 206 221 4596</phone>
            <fax>1 206 221 2671</fax>
            <email>gangluo@cs.wisc.edu</email>
          </address>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0001-7217-4008</ext-link>
        </contrib>
      </contrib-group>
      <aff id="aff1">
        <label>1</label>
        <institution>Department of Biomedical Informatics and Medical Education</institution>
        <institution>University of Washington</institution>
        <addr-line>Seattle, WA</addr-line>
        <country>United States</country>
      </aff>
      <author-notes>
        <corresp>Corresponding Author: Gang Luo <email>gangluo@cs.wisc.edu</email></corresp>
      </author-notes>
      <pub-date pub-type="collection">
        <month>5</month>
        <year>2021</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>27</day>
        <month>5</month>
        <year>2021</year>
      </pub-date>
      <volume>9</volume>
      <issue>5</issue>
      <elocation-id>e27778</elocation-id>
      <history>
        <date date-type="received">
          <day>6</day>
          <month>2</month>
          <year>2021</year>
        </date>
        <date date-type="rev-request">
          <day>21</day>
          <month>3</month>
          <year>2021</year>
        </date>
        <date date-type="rev-recd">
          <day>25</day>
          <month>3</month>
          <year>2021</year>
        </date>
        <date date-type="accepted">
          <day>14</day>
          <month>4</month>
          <year>2021</year>
        </date>
      </history>
      <copyright-statement>©Gang Luo. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 27.05.2021.</copyright-statement>
      <copyright-year>2021</copyright-year>
      <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
        <p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.</p>
      </license>
      <self-uri xlink:href="https://medinform.jmir.org/2021/5/e27778" xlink:type="simple"/>
      <abstract>
        <p>Using machine learning predictive models for clinical decision support has great potential in improving patient outcomes and reducing health care costs. However, most machine learning models are black boxes that do not explain their predictions, thereby forming a barrier to clinical adoption. To overcome this barrier, an automated method was recently developed to provide rule-style explanations of any machine learning model’s predictions on tabular data and to suggest customized interventions. Each explanation delineates the association between a feature value pattern and an outcome value. Although the association and intervention information is useful, the user of the automated explaining function often requires more detailed information to better understand the patient’s situation and to aid in decision making. More specifically, consider a feature value in the explanation that is computed by an aggregation function on the raw data, such as the number of emergency department visits related to asthma that the patient had in the prior 12 months. The user often wants to rapidly drill through to see certain parts of the related raw data that produce the feature value. This task is frequently difficult and time-consuming because the few pieces of related raw data are submerged by many pieces of raw data of the patient that are unrelated to the feature value. To address this issue, this paper outlines an automated lineage tracing approach, which adds automated drill-through capability to the automated explaining function, and provides a roadmap for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>clinical decision support</kwd>
        <kwd>database management systems</kwd>
        <kwd>forecasting</kwd>
        <kwd>machine learning</kwd>
        <kwd>electronic medical records</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="introduction">
      <title>Introduction</title>
      <p>Machine learning has won almost all data science competitions [<xref ref-type="bibr" rid="ref1">1</xref>] and is a topic of intense current interest. It studies computer algorithms that learn automatically from data, such as extreme gradient boosting, support vector machines, and random forests [<xref ref-type="bibr" rid="ref2">2</xref>]. Using machine learning predictive models for clinical decision support has great potential in improving patient outcomes and reducing health care costs [<xref ref-type="bibr" rid="ref3">3</xref>-<xref ref-type="bibr" rid="ref10">10</xref>]. However, most machine learning models are black boxes that do not explain their predictions, which creates a barrier to clinical adoption. To overcome this barrier, we recently developed an automated method to offer rule-style explanations of any machine learning model’s predictions on tabular data and to suggest customized interventions without reducing the model’s performance measures [<xref ref-type="bibr" rid="ref11">11</xref>-<xref ref-type="bibr" rid="ref14">14</xref>]. Each rule-style explanation delineates the association between a feature value pattern and an outcome value. A feature is also called an independent variable. For predicting future emergency department (ED) visits or inpatient stays for asthma in a patient with asthma, one example explanation is as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>The patient had 2 ED visits related to asthma in the prior 12 months</p>
          <p>AND the patient’s average respiratory rate recorded in the prior 12 months is &gt;25 and ≤28 breaths per minute</p>
          <p>→ the patient will likely have at least 1 ED visit or inpatient stay for asthma in the next 12 months [<xref ref-type="bibr" rid="ref13">13</xref>,<xref ref-type="bibr" rid="ref14">14</xref>].</p>
        </list-item>
      </list>
      <p>An ED visit is related to asthma if the ED visit has an asthma diagnosis code. For the item in the explanation showing that the patient had 2 ED visits related to asthma in the prior 12 months, 1 intervention suggested by the automatic explanation method [<xref ref-type="bibr" rid="ref12">12</xref>-<xref ref-type="bibr" rid="ref14">14</xref>] is to apply control procedures that decrease the likelihood that the patient will need emergency care.</p>
      <p>The association and intervention information provided by the automatic explanation method for machine learning predictions is useful. However, the user of the automated explaining function often requires more detailed information to better understand the patient’s situation and to aid in decision making. More specifically, consider a feature value on the left-hand side of a rule-style explanation that is computed by an aggregation function on the raw data. The user often wants to rapidly drill through to see certain parts of the related raw data producing the feature value. In the context of a relational database, these parts refer to the most relevant attributes of the most essential source tuples producing the feature value. Which attributes are most relevant and which source tuples are most essential depends on both the concrete feature type and the needs of the clinical decision support application, as illustrated by several examples throughout this paper. The patterns embedded in these parts could provide additional information on the patient that was lost when the raw data were aggregated to compute the feature value. This drill-through task is frequently difficult and time-consuming because the few pieces of related raw data are buried among many pieces of raw data of the patient that are unrelated to the feature value. For example, as <xref ref-type="table" rid="table1">Table 1</xref> shows, the list of encounters of a patient with asthma displayed on the standard interface of an electronic medical record system includes much information that is irrelevant to the feature value of 2 for the number of ED visits related to asthma that the patient had in the prior 12 months.</p>
      <table-wrap position="float" id="table1">
        <label>Table 1</label>
        <caption>
          <p>An example list of encounters of a patient with asthma displayed on the standard interface of an electronic medical record system.<sup>a</sup></p>
        </caption>
        <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
          <col width="110"/>
          <col width="270"/>
          <col width="140"/>
          <col width="240"/>
          <col width="140"/>
          <col width="100"/>
          <thead>
            <tr valign="bottom">
              <td>Visit date</td>
              <td>Primary diagnosis<sup>b</sup></td>
              <td>Visit type</td>
              <td>Department</td>
              <td>Provider</td>
              <td>Facility</td>
            </tr>
          </thead>
          <tbody>
            <tr valign="top">
              <td>Dec 20, 2020</td>
              <td>Cough (R05)</td>
              <td>Outpatient</td>
              <td>HMC<sup>c</sup> family medicine clinic</td>
              <td>John Smith</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>Dec 18, 2020</td>
              <td>Dysphagia, unspecified (R13.10)</td>
              <td>Outpatient</td>
              <td>HMC family medicine clinic</td>
              <td>David Wong</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
            </tr>
            <tr valign="top">
              <td>Oct 15, 2020</td>
              <td>Cystitis, unspecified without hematuria (N30.90)</td>
              <td>Inpatient</td>
              <td>UWMC<sup>d</sup> 8SE</td>
              <td>Leslie Hurdle</td>
              <td>UWMC</td>
            </tr>
            <tr valign="bottom">
              <td>
                <italic>Oct 12, 2020</italic>
                <sup>e</sup>
              </td>
              <td>
                <italic>Viral infection, unspecified (B34.9)</italic>
              </td>
              <td>
                <italic>Emergency</italic>
              </td>
              <td>
                <italic>HMC HEDUCC</italic>
                <sup>f</sup>
              </td>
              <td>
                <italic>Patricia Sward</italic>
              </td>
              <td>
                <italic>HMC</italic>
              </td>
            </tr>
            <tr valign="top">
              <td>Oct 09, 2020</td>
              <td>Dizziness and giddiness (R42)</td>
              <td>Outpatient</td>
              <td>HMC family medicine clinic</td>
              <td>Eve Johnson</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
            </tr>
            <tr valign="top">
              <td>Feb 11, 2020</td>
              <td>Posttraumatic stress disorder, unspecified (F43.10)</td>
              <td>Outpatient</td>
              <td>HMC psychotherapy clinic</td>
              <td>Amy Jiang</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>
                <italic>Feb 08, 2020</italic>
              </td>
              <td>
                <italic>Syncope and collapse (R55)</italic>
              </td>
              <td>
                <italic>Emergency</italic>
              </td>
              <td>
                <italic>HMC HEDUCC</italic>
              </td>
              <td>
                <italic>Peter Shavlik</italic>
              </td>
              <td>
                <italic>HMC</italic>
              </td>
            </tr>
            <tr valign="top">
              <td>Feb 03, 2020</td>
              <td>Headache, unspecified (R51.9)</td>
              <td>Outpatient</td>
              <td>HMC family medicine clinic</td>
              <td>Jude Lake</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
              <td>…</td>
            </tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <fn id="table1fn1">
            <p><sup>a</sup>This example list is based on a similar list seen in real electronic medical record data at the University of Washington Medicine.</p>
          </fn>
          <fn id="table1fn2">
            <p><sup>b</sup>This column does not appear on the standard interface. It is included here because it is discussed in this paper.</p>
          </fn>
          <fn id="table1fn3">
            <p><sup>c</sup>HMC: Harborview Medical Center.</p>
          </fn>
          <fn id="table1fn4">
            <p><sup>d</sup>UWMC: University of Washington Medical Center.</p>
          </fn>
          <fn id="table1fn5">
            <p><sup>e</sup>For the feature value of 2 for the number of emergency department visits related to asthma that the patient had in the prior 12 months, the related rows in the list producing the feature value are marked in italics.</p>
          </fn>
          <fn id="table1fn6">
            <p><sup>f</sup>HEDUCC: Harborview Emergency Department Urgent Care Center.</p>
          </fn>
        </table-wrap-foot>
      </table-wrap>
      <p>For instance, in the rule-style explanation shown above, the first item on the left-hand side is the feature value of 2 for the number of ED visits related to asthma that the patient had in the prior 12 months. Asthma may or may not be the primary diagnosis of either of these 2 visits. For this feature value, the user of the automated explaining function wants to see the relevant parts of these 2 visits (visit date, primary diagnosis, department handling the visit, admitting provider, and facility where the visit occurred) in reverse chronological order (see <xref ref-type="table" rid="table2">Table 2</xref>), in the same way that encounters are displayed on the standard interface of an electronic medical record system. The patterns embedded in these parts give additional information on the patient not shown by the feature value, such as the time between these 2 visits, how long ago these 2 visits occurred, the primary diagnoses in these 2 visits, and whether these 2 visits occurred at the same facility. However, finding these parts is nontrivial. As seen in real electronic medical record data at the University of Washington Medicine, Intermountain Healthcare, and Kaiser Permanente Southern California, a patient could have had over 100 encounters in the prior 12 months. Only a few of these encounters are ED visits, and even fewer of them are ED visits related to asthma. Finding the ED visits of the patient in the prior 12 months takes some manual effort, even with the aid of the search function of the electronic medical record system. Determining which of these visits are related to asthma, a task with which the search function often cannot help much, takes much more manual effort.</p>
      <table-wrap position="float" id="table2">
        <label>Table 2</label>
        <caption>
          <p>An example of the parts of the related raw data that should be displayed for a feature value.<sup>a</sup></p>
        </caption>
        <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
          <col width="160"/>
          <col width="350"/>
          <col width="210"/>
          <col width="160"/>
          <col width="120"/>
          <thead>
            <tr valign="top">
              <td>Visit date</td>
              <td>Primary diagnosis</td>
              <td>Department</td>
              <td>Provider</td>
              <td>Facility</td>
            </tr>
          </thead>
          <tbody>
            <tr valign="top">
              <td>Oct 12, 2020</td>
              <td>Viral infection, unspecified (B34.9)</td>
              <td>HMC<sup>b</sup> HEDUCC<sup>c</sup></td>
              <td>Patricia Sward</td>
              <td>HMC</td>
            </tr>
            <tr valign="top">
              <td>Feb 08, 2020</td>
              <td>Syncope and collapse (R55)</td>
              <td>HMC HEDUCC</td>
              <td>Peter Shavlik</td>
              <td>HMC</td>
            </tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <fn id="table2fn1">
            <p><sup>a</sup>For the example list shown in <xref ref-type="table" rid="table1">Table 1</xref> and the feature value of 2 for the number of emergency department visits related to asthma that the patient had in the prior 12 months, these are the parts of the related raw data producing the feature value that the user of the automated explaining function wants to see.</p>
          </fn>
          <fn id="table2fn2">
            <p><sup>b</sup>HMC: Harborview Medical Center.</p>
          </fn>
          <fn id="table2fn3">
            <p><sup>c</sup>HEDUCC: Harborview Emergency Department Urgent Care Center.</p>
          </fn>
        </table-wrap-foot>
      </table-wrap>
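      <p>The drill-through from a feature value to the parts of the raw data shown in <xref ref-type="table" rid="table2">Table 2</xref> can be sketched as a pair of queries over the same source tuples: one that aggregates them into the feature value and one that lists their most relevant attributes. The following Python sketch uses an in-memory SQLite database and is only an illustration, not the implementation of the approach outlined in this paper. The table layout, the column names other than <italic>dx_sequence_number</italic>, the patient identifier, the value of <italic>today_date</italic>, and the secondary asthma diagnosis codes are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import sqlite3

TODAY_DATE = "2020-12-21"  # assumed value of today_date
PATIENT_ID = 1             # hypothetical patient identifier

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id   INTEGER,
    visit_date   TEXT,     -- ISO 8601 date, so string order is date order
    visit_type   TEXT,
    department   TEXT,
    provider     TEXT,
    facility     TEXT
);
CREATE TABLE diagnosis (
    encounter_id       INTEGER,
    dx_code            TEXT,
    dx_sequence_number INTEGER   -- 1 signifies the primary diagnosis
);
INSERT INTO encounter VALUES
    (1, 1, '2020-12-20', 'Outpatient', 'HMC family medicine clinic', 'John Smith', 'HMC'),
    (2, 1, '2020-10-15', 'Inpatient',  'UWMC 8SE', 'Leslie Hurdle', 'UWMC'),
    (3, 1, '2020-10-12', 'Emergency',  'HMC HEDUCC', 'Patricia Sward', 'HMC'),
    (4, 1, '2020-02-08', 'Emergency',  'HMC HEDUCC', 'Peter Shavlik', 'HMC'),
    (5, 1, '2020-02-03', 'Outpatient', 'HMC family medicine clinic', 'Jude Lake', 'HMC');
-- The secondary asthma codes (J45.x) below are invented for illustration.
INSERT INTO diagnosis VALUES
    (1, 'R05', 1),
    (2, 'N30.90', 1),
    (3, 'B34.9', 1), (3, 'J45.901', 2),
    (4, 'R55', 1),   (4, 'J45.909', 2),
    (5, 'R51.9', 1);
""")

# Source tuples of the feature: the patient's ED visits in the prior
# 12 months that carry at least one asthma diagnosis code.
SOURCE_FILTER = """
    e.patient_id = :patient_id
    AND e.visit_type = 'Emergency'
    AND e.visit_date >= date(:today, '-12 months')
    AND e.visit_date < :today
    AND EXISTS (SELECT 1 FROM diagnosis a
                WHERE a.encounter_id = e.encounter_id
                  AND a.dx_code LIKE 'J45.%')
"""
params = {"patient_id": PATIENT_ID, "today": TODAY_DATE}

# Aggregation: the feature value shown in the explanation.
feature_value = conn.execute(
    "SELECT COUNT(*) FROM encounter e WHERE" + SOURCE_FILTER, params
).fetchone()[0]

# Drill-through: the most relevant attributes of the same source tuples,
# in reverse chronological order.
drill_through_rows = conn.execute(
    "SELECT e.visit_date, p.dx_code, e.department, e.provider, e.facility "
    "FROM encounter e "
    "JOIN diagnosis p ON p.encounter_id = e.encounter_id "
    "AND p.dx_sequence_number = 1 "
    "WHERE" + SOURCE_FILTER + " ORDER BY e.visit_date DESC",
    params,
).fetchall()

print(feature_value)
for row in drill_through_rows:
    print(row)
```
]]></preformat>
      <p>In this sketch, the aggregation query and the drill-through query share one filter, so the listed rows are by construction the rows that were counted. The sketch hard-codes the filter for a single feature, whereas the number of possible features is large.</p>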
      <p>In practice, numerous possible features computed by various aggregation functions on all kinds of longitudinal attributes in the electronic medical records could be used for predictive modeling and automatic explanation. Examples of such features include whether the most recent asthma diagnosis of the patient is a primary diagnosis, the patient’s average respiratory rate recorded in the prior 12 months, the total number of distinct asthma medications ordered for the patient in the prior 12 months, the total number of units of asthma relievers that were ordered for the patient in the prior 12 months and were neither systemic corticosteroids nor short-acting beta-2 agonists, the number of distinct asthma medication prescribers of the patient in the prior 12 months, and the number of no-shows by the patient in the prior 12 months [<xref ref-type="bibr" rid="ref13">13</xref>,<xref ref-type="bibr" rid="ref14">14</xref>]. The developers of the search function for the electronic medical record system did not anticipate most of the possible features. The search function supports only a few fixed types of search. Hence, the search function can aid in drilling through the raw data that produce a given feature value for only a small portion of the possible features.</p>
      <p>This creates a problem for the widespread adoption of the automatic explanation method for machine learning predictions. Frequently, this method gives multiple rule-style explanations for a patient predicted to be at high risk of incurring a poor outcome [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref12">12</xref>]. The user of the automated explaining function is typically a busy clinician who has no time to perform laborious manual drill-through on a regular basis. However, to better understand the patient’s situation and to make better clinical decisions, the user often wants to drill through multiple feature values of the patient appearing in the explanations. If done manually, this is a challenging task. A patient often has extensive records with numerous variables and hundreds of pages of content accumulated over a long period of time [<xref ref-type="bibr" rid="ref15">15</xref>]. Further, the relevant raw data producing the feature values are frequently scattered in several places in the electronic medical record system.</p>
      <p>This study makes 2 contributions toward solving this problem:</p>
      <list list-type="order">
        <list-item>
          <p>We articulate this problem for the first time in the literature. This is done in the “Introduction” section.</p>
        </list-item>
        <list-item>
          <p>To address this problem, an automated lineage tracing approach is outlined to add automated drill-through capability to the automated explaining function. This is done in the “Outline of the proposed automated lineage tracing approach” section. Further, a roadmap for future research is provided in the “Directions for future research” section.</p>
        </list-item>
      </list>
      <p>The automated drill-through capability is intended to help the user of the automated explaining function save time, better understand the patient’s situation, and make better clinical decisions. The discussion in this paper focuses on structured electronic medical record data, a specific method commonly used to build clinical machine learning predictive models, and the automatic explanation method for machine learning predictions [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref12">12</xref>]. Nevertheless, the automated lineage tracing approach is not limited to these settings. After appropriate extension, the principle of this approach can be applied when automatically explaining machine learning predictions to facilitate drilling through any feature value computed by an aggregation function on longitudinal structured data, regardless of whether the data came from electronic medical records, whether the feature is specified by a human expert or semiautomatically extracted from longitudinal data using the method outlined in the prior paper [<xref ref-type="bibr" rid="ref16">16</xref>], which method is used to build the machine learning predictive model, or which automatic explanation method is used.</p>
    </sec>
    <sec>
      <title>Running Example</title>
      <p>To illustrate this approach, a running example is used throughout this paper: automatically explaining the predictions of future ED visits or inpatient stays for individual patients with asthma. Our prior papers [<xref ref-type="bibr" rid="ref12">12</xref>-<xref ref-type="bibr" rid="ref14">14</xref>,<xref ref-type="bibr" rid="ref17">17</xref>-<xref ref-type="bibr" rid="ref19">19</xref>] detail this use case and the features used to make predictions in it.</p>
      <sec>
        <title>Base Tables</title>
        <p>Below are the schemas of 5 tables in a relational database used in the running example:</p>
        <graphic xlink:href="medinform_v9i5e27778_fig5.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p>The underlined fields mark the key to each table. The <italic>encounter</italic> table includes 1 row per encounter listing its information. The <italic>diagnosis</italic> table includes 1 row per diagnosis code of an encounter. Primary diagnoses are signified by <italic>dx_sequence_number</italic>=1. The <italic>diagnosis_code_master</italic> table includes 1 row per unique diagnosis code giving its description. The <italic>ordered_medication</italic> table includes 1 row per medication appearing in a medication order. The <italic>medication_master</italic> table includes 1 row per unique medication listing its information.</p>
      </sec>
      <sec>
        <title>Intermediate Result Tables</title>
        <p>Besides the above 5 base tables, 4 intermediate result tables computed on the new data are also used in the running example: <italic>enc_features_1</italic>, <italic>enc_features_2</italic>, <italic>enc_features_3</italic>, and <italic>med_features_1</italic>. Here, the new data are the data to which the trained machine learning predictive model is applied to make predictions on individual patients.</p>
        <p>The intermediate result table <italic>enc_features_1</italic> contains 3 temporal features on encounters: the number of ED visits, the number of inpatient stays, and the number of outpatient visits that the patient had in the prior 12 months. Let <italic>today_date</italic> denote today’s date. <italic>enc_features_1</italic> is computed from the <italic>encounter</italic> base table using the following structured query language (SQL) query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig6.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
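        <p>As a minimal sketch of how a query of this kind might look, the following Python snippet computes the 3 features with SQLite. It is not the query used in this paper. The column names, the visit type labels, the sample rows, and the value of <italic>today_date</italic> are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import sqlite3

TODAY_DATE = "2020-12-21"  # assumed value of today_date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id   INTEGER,
    visit_date   TEXT,   -- ISO 8601 date
    visit_type   TEXT    -- 'Emergency', 'Inpatient', or 'Outpatient'
);
INSERT INTO encounter VALUES
    (1, 1, '2020-12-20', 'Outpatient'),
    (2, 1, '2020-10-15', 'Inpatient'),
    (3, 1, '2020-10-12', 'Emergency'),
    (4, 1, '2020-02-08', 'Emergency'),
    (5, 1, '2019-11-01', 'Outpatient'),  -- outside the 12-month window
    (6, 2, '2020-06-01', 'Outpatient');
""")

# One output row per patient with 3 counts over the prior 12 months.
ENC_FEATURES_1_SQL = """
SELECT patient_id,
       SUM(CASE WHEN visit_type = 'Emergency'  THEN 1 ELSE 0 END) AS num_ed_visits,
       SUM(CASE WHEN visit_type = 'Inpatient'  THEN 1 ELSE 0 END) AS num_inpatient_stays,
       SUM(CASE WHEN visit_type = 'Outpatient' THEN 1 ELSE 0 END) AS num_outpatient_visits
FROM encounter
WHERE visit_date >= date(:today, '-12 months')
  AND visit_date < :today
GROUP BY patient_id
ORDER BY patient_id
"""

enc_features_1 = conn.execute(ENC_FEATURES_1_SQL, {"today": TODAY_DATE}).fetchall()
print(enc_features_1)
```
]]></preformat>
        <p>In this sketch, a patient with no encounter in the time window produces no row at all. Reporting zero counts for such patients would require an outer join from a patient cohort table, which is omitted here.</p>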
        <p>The intermediate result table <italic>enc_features_2</italic> contains 1 temporal feature on encounters: the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months. The International Classification of Diseases, Tenth Revision diagnosis codes for asthma are J45.x. <italic>enc_features_2</italic> is computed by joining the <italic>encounter</italic> and <italic>diagnosis</italic> base tables using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig7.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p>The intermediate result table <italic>enc_features_3</italic> contains 2 temporal features on encounters: the number of ED visits related to asthma and the number of inpatient stays related to asthma that the patient had in the prior 12 months. <italic>enc_features_3</italic> is computed by joining the <italic>encounter</italic> and <italic>diagnosis</italic> base tables using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig8.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
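        <p>When the 2 base tables are joined, an encounter with several asthma diagnosis codes yields several joined rows, so counting joined rows can overstate the number of visits. The following Python sketch, which uses SQLite and is not the query used in this paper, contrasts a naive count of joined rows with a count of distinct encounters. The column names other than <italic>dx_sequence_number</italic>, the sample rows, and the value of <italic>today_date</italic> are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import sqlite3

TODAY_DATE = "2020-12-21"  # assumed value of today_date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id   INTEGER,
    visit_date   TEXT,
    visit_type   TEXT
);
CREATE TABLE diagnosis (
    encounter_id       INTEGER,
    dx_code            TEXT,
    dx_sequence_number INTEGER
);
INSERT INTO encounter VALUES
    (1, 1, '2020-10-12', 'Emergency'),
    (2, 1, '2020-02-08', 'Emergency'),
    (3, 1, '2020-10-15', 'Inpatient'),
    (4, 1, '2020-07-01', 'Emergency');
INSERT INTO diagnosis VALUES
    (1, 'B34.9', 1), (1, 'J45.901', 2), (1, 'J45.41', 3),  -- 2 asthma codes on 1 visit
    (2, 'R55', 1),   (2, 'J45.909', 2),
    (3, 'N30.90', 1),                                      -- inpatient stay, no asthma code
    (4, 'R51.9', 1);                                       -- ED visit, no asthma code
""")

FROM_WHERE = """
FROM encounter e
JOIN diagnosis d ON d.encounter_id = e.encounter_id
WHERE d.dx_code LIKE 'J45.%'
  AND e.visit_date >= date(:today, '-12 months')
  AND e.visit_date < :today
GROUP BY e.patient_id
"""
params = {"today": TODAY_DATE}

# Naive version: counts joined rows, so encounter 1 is counted twice.
naive = conn.execute(
    "SELECT e.patient_id, "
    "SUM(CASE WHEN e.visit_type = 'Emergency' THEN 1 ELSE 0 END)" + FROM_WHERE,
    params,
).fetchall()

# Duplicate-safe version: counts each qualifying encounter once.
enc_features_3 = conn.execute(
    "SELECT e.patient_id, "
    "COUNT(DISTINCT CASE WHEN e.visit_type = 'Emergency' THEN e.encounter_id END), "
    "COUNT(DISTINCT CASE WHEN e.visit_type = 'Inpatient' THEN e.encounter_id END)"
    + FROM_WHERE,
    params,
).fetchall()

print(naive)
print(enc_features_3)
```
]]></preformat>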
        <p>The intermediate result table <italic>med_features_1</italic> contains 2 temporal features on medications: the total number of medications and the total number of distinct medications ordered for the patient in the prior 12 months. <italic>med_features_1</italic> is computed from the <italic>ordered_medication</italic> base table using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig9.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
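        <p>The difference between the 2 features can be illustrated with a small example. The following Python sketch uses SQLite and is not the query used in this paper. The column names of <italic>ordered_medication</italic>, the sample rows, and the value of <italic>today_date</italic> are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import sqlite3

TODAY_DATE = "2020-12-21"  # assumed value of today_date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ordered_medication (
    order_id      INTEGER,
    medication_id INTEGER,
    patient_id    INTEGER,
    order_date    TEXT,   -- ISO 8601 date
    PRIMARY KEY (order_id, medication_id)  -- 1 row per medication in an order
);
INSERT INTO ordered_medication VALUES
    (100, 7, 1, '2020-03-01'),
    (100, 9, 1, '2020-03-01'),
    (101, 7, 1, '2020-09-15'),  -- medication 7 ordered a second time
    (102, 7, 1, '2019-06-01'),  -- outside the 12-month window
    (103, 9, 2, '2020-11-11');
""")

MED_FEATURES_1_SQL = """
SELECT patient_id,
       COUNT(*)                      AS num_medications,
       COUNT(DISTINCT medication_id) AS num_distinct_medications
FROM ordered_medication
WHERE order_date >= date(:today, '-12 months')
  AND order_date < :today
GROUP BY patient_id
ORDER BY patient_id
"""

med_features_1 = conn.execute(MED_FEATURES_1_SQL, {"today": TODAY_DATE}).fetchall()
print(med_features_1)
```
]]></preformat>
        <p>For patient 1 in this sketch, 3 ordered medications fall in the time window, but only 2 of them are distinct because medication 7 was ordered twice.</p>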
      </sec>
      <sec>
        <title>Relational Algebra Operators</title>
        <p>This paper uses the following relational algebra operators with bag semantics unless otherwise specified: join <inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/>, left semijoin <inline-graphic xlink:href="medinform_v9i5e27778_fig11.png" xlink:type="simple" mimetype="image"/>, selection σ, projection π, duplicate elimination δ, and grouping γ [<xref ref-type="bibr" rid="ref20">20</xref>]. Commercial database management systems implement relations using bag semantics.</p>
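        <p>Under bag semantics, a relation may contain the same tuple more than once, and projection does not remove duplicates. The following Python sketch illustrates this for π, δ, and γ by modeling a bag as a list of tuples. The helper functions and the sample rows are hypothetical and serve only to illustrate the semantics.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# A bag of joined rows: encounter 1 has 2 asthma diagnosis codes.
joined = [
    {"patient_id": 1, "encounter_id": 1, "dx_code": "J45.901"},
    {"patient_id": 1, "encounter_id": 1, "dx_code": "J45.41"},
    {"patient_id": 1, "encounter_id": 2, "dx_code": "J45.909"},
]


def project(bag, attrs):
    """Bag projection (pi): one output tuple per input row, duplicates kept."""
    return [tuple(row[a] for a in attrs) for row in bag]


def eliminate_duplicates(bag):
    """Duplicate elimination (delta): keep the first copy of each tuple."""
    return list(dict.fromkeys(bag))


def group_count(bag, key_index):
    """Grouping (gamma) with a count aggregate: tuples per group key."""
    return dict(Counter(t[key_index] for t in bag))


projected = project(joined, ["patient_id", "encounter_id"])
deduplicated = eliminate_duplicates(projected)

print(projected)                     # the tuple (1, 1) appears twice
print(deduplicated)                  # each tuple appears once
print(group_count(projected, 0))     # counts joined rows per patient
print(group_count(deduplicated, 0))  # counts distinct encounters per patient
```
]]></preformat>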
      </sec>
    </sec>
    <sec>
      <title>Review of a Typical Method to Build a Clinical Machine Learning Predictive Model and Our Automated Method to Explain the Model’s Predictions</title>
      <p>This section reviews a typical method to build a machine learning predictive model on structured electronic medical record data and the automated method to explain the model’s predictions [<xref ref-type="bibr" rid="ref11">11</xref>-<xref ref-type="bibr" rid="ref14">14</xref>]. The next section outlines the automated lineage tracing approach, which is based on these 2 methods.</p>
      <p>A health care system usually has an enterprise data warehouse, which stores a copy of the health care system’s structured electronic medical record data in a relational database, often after transformations such as pivoting [<xref ref-type="bibr" rid="ref21">21</xref>,<xref ref-type="bibr" rid="ref22">22</xref>] and denormalization that facilitate data analysis. For predictive modeling with automated explanation, the overall workflow is to execute SQL queries to extract features from the electronic medical record data, build a machine learning predictive model on the training data, apply the model to new data to make predictions on individual patients, and then use the automated method to explain the predictions. The following sections describe each of these steps in turn.</p>
      <sec>
        <title>Extracting Features From the Electronic Medical Record Data and Building the Clinical Machine Learning Predictive Model</title>
        <p>The structured electronic medical record data contain both static attributes (eg, gender) and longitudinal attributes (eg, encounters, diagnoses). Most attributes are longitudinal. As <xref rid="figure1" ref-type="fig">Figure 1</xref> shows, the following operations are performed on the training data:</p>
        <list list-type="order">
          <list-item>
            <p>The static features are computed from the static attribute values. The results are stored in 1 or more intermediate result tables. Typically, each of these intermediate result tables is computed by running a select-project-join SQL query on 1 or more base tables.</p>
          </list-item>
          <list-item>
            <p>By aggregating longitudinal attribute values and sometimes also using some static attribute values, the patient cohort of interest in the training data is computed. The result is stored in 1 intermediate result table. This is typically done by running a complex SQL query on several base tables. An example patient cohort is the set of all patients with asthma who visited any of the facilities of the health care system during a specific time period.</p>
          </list-item>
          <list-item>
            <p>By aggregating longitudinal attribute values, temporal features and the outcome variable are computed and stored in 1 or more intermediate result tables. Typically, each of these intermediate result tables is computed by running a select-project-join-aggregate SQL query on 1 or more base tables. For example, 1 intermediate result table is similar to <italic>enc_features_1</italic> and contains multiple temporal features on encounters computed from the <italic>encounter</italic> base table. A second intermediate result table is similar to <italic>enc_features_2</italic> and contains multiple temporal features on encounters computed by joining the <italic>encounter</italic> and <italic>diagnosis</italic> base tables. A third intermediate result table contains multiple temporal features on medications computed by joining the <italic>ordered_medication</italic> and <italic>medication_master</italic> base tables, such as the total number of distinct asthma medications and the total number of units of asthma medications ordered for the patient in the prior 12 months. The logical query plan for a select-project-join-aggregate query includes 1 or more select-project-join-aggregate segments [<xref ref-type="bibr" rid="ref23">23</xref>]. Each segment ends with a grouping or duplicate elimination operator that follows a series of join, selection, and projection operators.</p>
          </list-item>
        </list>
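        <p>The third operation can be illustrated with a minimal sketch. It is not one of the queries used in this paper. It assumes a simplified, hypothetical schema in which the <italic>encounter</italic> table has <italic>encounter_id</italic> and <italic>encounter_type</italic> columns, dates are stored as ISO 8601 text, and asthma is identified by ICD-10 codes starting with J45. The table name <italic>enc_features_sketch</italic> and the function name are also illustrative. The inner subquery plays the role of a segment ending in duplicate elimination, and the outer query plays the role of a segment ending in grouping.</p>
        <preformat><![CDATA[
```python
import sqlite3

# Hypothetical, heavily simplified schema (real base tables have many more columns).
SCHEMA = """
CREATE TABLE encounter (
    encounter_id   INTEGER PRIMARY KEY,
    patient_id     INTEGER,
    admit_time     TEXT,   -- ISO 8601 text, e.g. '2020-11-03'
    encounter_type TEXT    -- assumed coding: 'ED', 'outpatient', ...
);
CREATE TABLE diagnosis (
    encounter_id   INTEGER,
    diagnosis_code TEXT    -- assumed ICD-10; asthma codes start with 'J45'
);
CREATE TABLE enc_features_sketch (
    patient_id           INTEGER PRIMARY KEY,
    num_asthma_ed_visits INTEGER
);
"""


def build_enc_features(conn, index_date):
    """Fill enc_features_sketch with one temporal feature per patient:
    the number of asthma-related ED visits in the 12 months before index_date."""
    conn.execute("DELETE FROM enc_features_sketch")
    conn.execute(
        """
        INSERT INTO enc_features_sketch (patient_id, num_asthma_ed_visits)
        SELECT e.patient_id, COUNT(*)               -- outer segment: grouping
        FROM encounter AS e
        WHERE e.encounter_type = 'ED'
          AND e.admit_time >= date(:index_date, '-12 months')
          AND e.admit_time < :index_date
          AND e.encounter_id IN (                   -- inner segment: duplicate elimination
              SELECT DISTINCT d.encounter_id
              FROM diagnosis AS d
              WHERE d.diagnosis_code LIKE 'J45%'
          )
        GROUP BY e.patient_id
        """,
        {"index_date": index_date},
    )
```
]]></preformat>
        <p>Because the inner subquery eliminates duplicates, an encounter carrying several asthma diagnosis codes is counted only once in this sketch.</p>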
        <fig id="figure1" position="float">
          <label>Figure 1</label>
          <caption>
            <p>The flow chart for building a clinical machine learning predictive model on the training data, making predictions on the new data, and using our automated method to explain the model’s predictions.</p>
          </caption>
          <graphic xlink:href="medinform_v9i5e27778_fig1.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
        <p><xref rid="figure2" ref-type="fig">Figure 2</xref> shows the logical query plan for a select-project-join-aggregate query. By joining the intermediate result tables containing the patient cohort of interest, the static and temporal features, and the outcome variable in the training data, a table containing the unified training data frame is obtained. For the patient cohort of interest, this table includes 1 column for the outcome variable and a separate column for each feature. Then a machine learning predictive model is trained on this table.</p>
        <fig id="figure2" position="float">
          <label>Figure 2</label>
          <caption>
            <p>A logical query plan for the select-project-join-aggregate query <italic>Q<sub>3</sub></italic> given in the “Intermediate result tables” section.</p>
          </caption>
          <graphic xlink:href="medinform_v9i5e27778_fig2.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Applying the Machine Learning Predictive Model to New Data to Make Predictions on Individual Patients</title>
        <p>As <xref rid="figure3" ref-type="fig">Figure 3</xref> shows, similar to the procedure mentioned above, the patient cohort of interest and the static and temporal features in the new data are computed. The results are stored in several intermediate result tables. By joining these tables, a table containing the unified data frame for the new data is obtained. For the patient cohort of interest, this table includes a separate column for each feature. We then apply the machine learning predictive model to this table to make predictions on individual patients.</p>
        <fig id="figure3" position="float">
          <label>Figure 3</label>
          <caption>
            <p>The high-level logical query plan for computing the unified data frame that contains all the features of the new data. SQL: structured query language.</p>
          </caption>
          <graphic xlink:href="medinform_v9i5e27778_fig3.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Automatically Explaining the Machine Learning Model’s Predictions</title>
        <p>While the clinical machine learning predictive model is being built, the training data are also used to create the knowledge base of the automated explaining function. Automated discretization [<xref ref-type="bibr" rid="ref24">24</xref>,<xref ref-type="bibr" rid="ref25">25</xref>] is first applied to convert continuous features to categorical features. Then class-based association rules [<xref ref-type="bibr" rid="ref24">24</xref>,<xref ref-type="bibr" rid="ref26">26</xref>] are mined from the unified training data frame. Each rule delineates the association between a feature value pattern and a poor outcome value <italic>c</italic> and is of the form</p>
        <disp-formula><italic>i<sub>1</sub></italic> AND <italic>i<sub>2</sub></italic> AND … AND <italic>i<sub>t</sub></italic>→<italic>c</italic>.</disp-formula>
        <p>This rule shows that a patient satisfying <italic>i<sub>1</sub></italic>, <italic>i<sub>2</sub></italic>, …, and <italic>i<sub>t</sub></italic> tends to have an outcome value <italic>c</italic>. The values of <italic>t</italic> and <italic>c</italic> can change across rules. Each item <italic>i<sub>k</sub></italic> (1≤<italic>k</italic>≤<italic>t</italic>) is a (feature, value) pair showing that a feature has a specific value or a value within a specific range. One example item of the former is that the patient had 2 ED visits related to asthma in the prior 12 months. One example item of the latter is that the patient’s average respiratory rate recorded in the prior 12 months is &#62;25 and ≤28 breaths per minute. An example rule containing both items is given in the Introduction.</p>
        <p>For each (feature, value) pair item used to create association rules, 0 or more interventions are precompiled. The interventions precompiled for any item on a rule’s left-hand side are automatically linked to the rule.</p>
        <p>At prediction time, to avoid reducing the machine learning predictive model’s performance measures, the model’s predictions are used with no change. The mined association rules are used to explain these predictions rather than to make predictions. More specifically, for each patient whom the model predicts to have a poor outcome value, we find and display the rules that have this value on their right-hand sides and whose left-hand sides are fulfilled by the patient. Each rule offers 1 explanation for the prediction. The interventions linked to the rule are displayed next to it as the suggested candidate interventions.</p>
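        <p>The rule lookup described above can be sketched as follows. This is a minimal illustration rather than the implementation used in our prior work. The class names, the feature names, and the representation of an item as either an exact value or a range that excludes its lower bound and includes its upper bound are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Item:
    """A (feature, value) pair: an exact value, or a range low < value <= high."""
    feature: str
    equals: Optional[float] = None
    low: Optional[float] = None
    high: Optional[float] = None

    def is_satisfied_by(self, patient: dict) -> bool:
        value = patient.get(self.feature)
        if value is None:
            return False
        if self.equals is not None:
            return value == self.equals
        # Range items are assumed to define both bounds.
        return self.low < value <= self.high


@dataclass(frozen=True)
class Rule:
    items: Tuple[Item, ...]             # left-hand side
    outcome: str                        # right-hand side, the outcome value c
    interventions: Tuple[str, ...] = () # interventions linked to the rule


def explain(patient: dict, predicted_outcome: str, rules):
    """Return the rules whose right-hand side equals the model's prediction
    and whose left-hand side is fully satisfied by the patient."""
    return [
        rule
        for rule in rules
        if rule.outcome == predicted_outcome
        and all(item.is_satisfied_by(patient) for item in rule.items)
    ]
```
]]></preformat>
        <p>The prediction itself is passed in unchanged from the predictive model. The rules only select explanations and never alter the predicted value.</p>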
        <p>Our automatic explanation method for machine learning predictions has been successfully applied to multiple clinical predictive modeling problems [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref12">12</xref>,<xref ref-type="bibr" rid="ref27">27</xref>,<xref ref-type="bibr" rid="ref28">28</xref>]. It has several advantages. Among all the automatic explanation methods for machine learning predictions in the literature [<xref ref-type="bibr" rid="ref29">29</xref>,<xref ref-type="bibr" rid="ref30">30</xref>], our method is the only one that can automatically suggest customized interventions. The rule-style explanations given by our method are easier to comprehend than the non–rule-style explanations given by many other methods. Unlike many other automatic explanation methods that either lower the machine learning predictive model’s performance measures or work for only a specific machine learning algorithm, our automatic explanation method works for any machine learning algorithm on tabular data without lowering the model’s performance measures. Unlike several other methods that use rules computed at prediction time to offer explanations [<xref ref-type="bibr" rid="ref31">31</xref>,<xref ref-type="bibr" rid="ref32">32</xref>], our method uses rules mined before prediction time to offer explanations. This is essential for our method to automatically suggest customized interventions at prediction time.</p>
      </sec>
    </sec>
    <sec>
      <title>Review of the Existing Automated Lineage Tracing Techniques</title>
      <p>In this section, the existing automated lineage tracing techniques are reviewed. An overview of such techniques developed in various fields is provided. Then, a specific set of automated lineage tracing techniques most closely related to this work is reviewed.</p>
      <sec>
        <title>Overview of the Existing Automated Lineage Tracing Techniques</title>
        <p>The lineage or provenance of a given data item <italic>i</italic> refers to the source data items producing <italic>i</italic> and how <italic>i</italic> was derived [<xref ref-type="bibr" rid="ref33">33</xref>]. The former is called where-lineage. The latter is called how-lineage. Each type of lineage can be at either the schema level or the instance level. An example of where-lineage at the schema level is the set of base tables producing a specific materialized view. An example of where-lineage at the instance level is the set of tuples in the base tables producing a given temporal feature value in a materialized view. Lineage information can be computed in either an eager way or a lazy way. In the former case, lineage information is computed and stored at the same time as the output data are produced. In the latter case, lineage information is computed only when it is needed. This paper focuses on where-lineage that is at the instance level and computed in a lazy way.</p>
        <p>Ikeda et al surveyed existing lineage tracing techniques in databases [<xref ref-type="bibr" rid="ref33">33</xref>,<xref ref-type="bibr" rid="ref34">34</xref>], e-science [<xref ref-type="bibr" rid="ref35">35</xref>], and scientific data processing [<xref ref-type="bibr" rid="ref36">36</xref>]. Among all of the lineage tracing techniques in the literature, the techniques Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] developed are the most closely related to this work. These techniques are used to trace the lineage of a tuple in a materialized view [<xref ref-type="bibr" rid="ref38">38</xref>] defined by a select-project-join-aggregate query in a relational database. Cui et al [<xref ref-type="bibr" rid="ref39">39</xref>,<xref ref-type="bibr" rid="ref40">40</xref>] described lineage tracing techniques for warehouse data computed via a directed acyclic graph of transformations, some of which could involve complex procedural code. Zhang et al [<xref ref-type="bibr" rid="ref41">41</xref>] described lineage tracing techniques for data computed by arbitrary functions. In general, the more flexibility is allowed on the transformations or functions, the less efficiently lineage can be traced [<xref ref-type="bibr" rid="ref39">39</xref>].</p>
        <p>In big data systems, Ikeda et al [<xref ref-type="bibr" rid="ref42">42</xref>,<xref ref-type="bibr" rid="ref43">43</xref>] described lineage tracing techniques for data computed via a directed acyclic graph of map and reduce functions [<xref ref-type="bibr" rid="ref44">44</xref>]. Amsterdamer et al [<xref ref-type="bibr" rid="ref45">45</xref>] described lineage tracing techniques for data computed by using Pig Latin [<xref ref-type="bibr" rid="ref46">46</xref>].</p>
        <p>In scientific data processing, lineage tracing is often done on curated databases, which contain scientific data copied from other databases [<xref ref-type="bibr" rid="ref36">36</xref>,<xref ref-type="bibr" rid="ref47">47</xref>].</p>
        <p>Schelter et al [<xref ref-type="bibr" rid="ref48">48</xref>] described a method to trace the schema-level lineage of the data sets, features, models, and predictions produced in machine learning experiments.</p>
      </sec>
      <sec>
        <title>Review of Cui et al’s Automated Lineage Tracing Techniques for Relational Databases</title>
        <p>To automatically trace the lineage of a tuple <italic>t</italic> in a materialized view [<xref ref-type="bibr" rid="ref38">38</xref>] defined by a select-project-join-aggregate query, Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] proceeded as follows. First, the materialized view’s definition query is transformed into a canonical form of the logical query plan. As <xref rid="figure2" ref-type="fig">Figure 2</xref> shows, the canonical form includes 1 or more select-project-join-aggregate segments. Each segment has 0 or 1 join operator, 0 or 1 selection operator, 0 or 1 projection operator, and a grouping or duplicate elimination operator in this particular order. Second, a separate intermediate materialized view is created for each intermediate select-project-join-aggregate segment of the canonical form. The root node of such a segment is not the root node of the canonical form. Third, we recursively trace through the hierarchy of intermediate materialized views in a top-down way. At each level of the hierarchy, the lineage tracing query for a 1-level select-project-join-aggregate materialized view is used to compute the current traced tuples’ lineage with respect to each base table and each materialized view at the next lower level. 
For a 1-level select-project-join-aggregate materialized view <italic>MV</italic> = <italic>γ</italic>(<italic>π<sub>A</sub></italic>(<italic>σ<sub>C</sub></italic>(<italic>R<sub>1</sub></italic><inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/><italic>R<sub>2</sub></italic><inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/>…<inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/> <italic>R<sub>n</sub></italic>))), the lineage of a tuple set <italic>T</italic>⊆<italic>MV</italic> with respect to the base table or the materialized view <italic>R<sub>i</sub></italic> (1≤<italic>i</italic>≤<italic>n</italic>) is <italic>π<sub>Ri</sub></italic>(<italic>σ<sub>C</sub></italic>(<italic>R<sub>1</sub></italic> <inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/> <italic>R<sub>2</sub></italic> <inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/>…<inline-graphic xlink:href="medinform_v9i5e27778_fig10.png" xlink:type="simple" mimetype="image"/><italic>R<sub>n</sub></italic>)<inline-graphic xlink:href="medinform_v9i5e27778_fig11.png" xlink:type="simple" mimetype="image"/> <italic>T</italic>). Here, the projection operator <italic>π</italic> on <italic>R<sub>i</sub></italic> has the set semantics, making each selected tuple in <italic>R<sub>i</sub></italic> appear only once. Further, all attributes of <italic>R<sub>i</sub></italic> appear in the projection operator and subsequently in the lineage traced on <italic>R<sub>i</sub></italic>. The final traced lineage of tuple <italic>t</italic> includes the lineage traced on every base table appearing in the canonical form.</p>
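        <p>The 1-level lineage tracing query can be illustrated with a small sketch. It assumes a hypothetical two-table schema, a materialized view grouped by <italic>patient_id</italic>, and a reading of the operator before <italic>T</italic> as a semijoin on the grouping attribute. The traced tuple set <italic>T</italic> is represented by the <italic>patient_id</italic> values of its tuples. For each source table, the sketch returns every attribute of the contributing tuples, each tuple appearing once.</p>
        <preformat><![CDATA[
```python
import sqlite3

# Hypothetical schema; the view being traced is assumed to be
#   SELECT patient_id, COUNT(DISTINCT e.encounter_id) ... GROUP BY patient_id
# over the join and selection used below.
SCHEMA = """
CREATE TABLE encounter (
    encounter_id   INTEGER PRIMARY KEY,
    patient_id     INTEGER,
    encounter_type TEXT
);
CREATE TABLE diagnosis (
    encounter_id   INTEGER,
    diagnosis_code TEXT
);
"""


def trace_one_level(conn, traced_patient_ids):
    """Return, per source table, the tuples that produced the traced view tuples."""
    ids = list(traced_patient_ids)
    placeholders = ",".join("?" for _ in ids)
    body = f"""
        FROM encounter AS e
        JOIN diagnosis AS d ON d.encounter_id = e.encounter_id   -- R1 joined with R2
        WHERE e.encounter_type = 'ED'                            -- selection condition C
          AND d.diagnosis_code LIKE 'J45%'
          AND e.patient_id IN ({placeholders})                   -- semijoin with T
    """
    lineage = {}
    for alias, table in (("e", "encounter"), ("d", "diagnosis")):
        # DISTINCT gives the projection set semantics; alias.* keeps all attributes.
        sql = f"SELECT DISTINCT {alias}.*\n{body}\nORDER BY 1, 2"
        lineage[table] = conn.execute(sql, ids).fetchall()
    return lineage
```
]]></preformat>
        <p>For a multilevel view, the same function shape would be applied recursively, with the tuples traced on an intermediate materialized view becoming the traced tuple set at the next lower level.</p>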
        <p>We use an example to illustrate Cui et al’s [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] automated lineage tracing techniques. If “create table enc_features_3” is replaced by “create materialized view enc_features_3_view” in query <italic>Q<sub>3</sub></italic> given in the “Intermediate result tables” section, a query <italic>Q<sub>3_v</sub></italic> defining a materialized view <italic>enc_features_3_view</italic> is obtained. To trace the lineage of a tuple <italic>t</italic> in <italic>enc_features_3_view</italic> whose <italic>patient_id</italic> is <italic>asthma_patient_id</italic>, one proceeds as follows.</p>
        <p>First, the canonical form of the logical query plan for query <italic>Q<sub>3_v</sub></italic> is obtained. The canonical form is the same as the logical query plan for query <italic>Q<sub>3</sub></italic> shown in <xref rid="figure2" ref-type="fig">Figure 2</xref>.</p>
        <p>Second, an intermediate materialized view <italic>asthma_encounter_id</italic> is created for the intermediate select-project-join-aggregate segment <italic>e_id</italic> shown in <xref rid="figure2" ref-type="fig">Figure 2</xref>. This is done using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig12.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p><xref rid="figure4" ref-type="fig">Figure 4</xref> shows the resulting hierarchy of intermediate materialized views, with the materialized view <italic>enc_features_3_view</italic> at the top and the <italic>encounter</italic> and <italic>diagnosis</italic> base tables at the bottom.</p>
        <fig id="figure4" position="float">
          <label>Figure 4</label>
          <caption>
            <p>The hierarchy of intermediate materialized views matching the canonical form of the logical query plan for the definition query of the materialized view <italic>enc_features_3_view</italic>.</p>
          </caption>
          <graphic xlink:href="medinform_v9i5e27778_fig4.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
        <p>Third, at the top level of the hierarchy of intermediate materialized views, the lineage of tuple <italic>t</italic> with respect to the <italic>encounter</italic> base table is computed using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig13.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p>The following SQL query is used to compute the lineage of tuple <italic>t</italic> with respect to the intermediate materialized view <italic>asthma_encounter_id</italic> and to store the results in a temporary table <italic>temp</italic>.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig14.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p>Fourth, at the second level of the hierarchy of intermediate materialized views, the lineage of the tuples in the temporary table <italic>temp</italic> with respect to the <italic>diagnosis</italic> base table is computed using the following SQL query.</p>
        <graphic xlink:href="medinform_v9i5e27778_fig15.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        <p>The final traced lineage of tuple <italic>t</italic> includes both the results of query <italic>Q<sub>6</sub></italic> and the results of query <italic>Q<sub>8</sub></italic>.</p>
      </sec>
    </sec>
    <sec>
      <title>Outline of the Proposed Automated Lineage Tracing Approach</title>
      <p>In this section, an automated lineage tracing approach is outlined to add automated drill-through capability to the automated explaining function. Our presentation includes 4 subsections. In the first subsection, an overview of the lineage tracing component of the automated explaining function is provided. In the second subsection, the unique requirements on automated lineage tracing are shown for automatically explaining machine learning predictions for clinical decision support. In the third subsection, the proposed automated lineage tracing techniques fulfilling these requirements are outlined. In the fourth subsection, some considerations are presented for a future software implementation of the proposed lineage tracing approach.</p>
      <sec>
        <title>Overview of the Lineage Tracing Component</title>
        <p>At association rule mining time, all (feature, value) pair items used to create association rules are known. Which items involve temporal features computed by aggregation functions on the raw data is also known. For each item that is related to a temporal feature of a patient and on the left-hand side of a rule, a hyperlink is added to the item in the rule. In addition, a parameterized stored procedure is written for the item in the database to retrieve lineage information. The stored procedure typically has 2 parameters: the <italic>patient_id</italic> of the patient being examined and the endpoint of the temporal aggregation period, such as today. When the stored procedure is run for the first time, an execution plan is generated. All subsequent runs will use the same execution plan to avoid runtime query optimization overhead.</p>
        <p>At automatic explanation time, the user of the automated explaining function is allowed to do lineage tracing for any item that is on the left-hand side of a rule-style explanation and related to a temporal feature value. When the user clicks the item’s hyperlink, the stored procedure prewritten for the item is invoked to retrieve some prespecified parts of the related raw data producing the feature value. Except for the cases with 2 specific aggregation functions described later in the paper, the retrieved data instances are always displayed on a page in reverse chronological order, as in the electronic medical records.</p>
      </sec>
      <sec>
        <title>Unique Requirements for Automated Lineage Tracing</title>
        <p>Typically, the user of the automated explaining function is a clinician. To fit the user’s busy schedule and to aid timely decision making, the user wants the lineage tracing process for a temporal feature value to be finished quickly, preferably within 1 second. This goal is partially fulfilled by the existing lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>], although the lineage tracing speed they achieve can be improved further. In addition, the retrieved lineage information should be easy to scan and include the most essential content needed to facilitate decision making. This enables the user to quickly gain useful insights from the information, ideally within 1 or a few seconds. As summarized in <xref ref-type="table" rid="table3">Table 3</xref>, these goals translate to 5 unique requirements on automated lineage tracing that are unmet by the existing lineage tracing techniques.</p>
        <table-wrap position="float" id="table3">
          <label>Table 3</label>
          <caption>
            <p>The 5 unique requirements of automated lineage tracing for automatically explaining machine learning predictions for clinical decision support.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="550"/>
            <col width="450"/>
            <thead>
              <tr valign="top">
                <td>Requirement</td>
                <td>Reason for posing the requirement</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Retrieving only a small set of attributes</td>
                <td>To prevent the user from being overwhelmed by many nonessential or irrelevant attributes</td>
              </tr>
              <tr valign="top">
                <td>Adding some essential attributes that do not directly produce the feature value</td>
                <td>To make the retrieved lineage information include the most essential content</td>
              </tr>
              <tr valign="top">
                <td>Sorting the retrieved lineage information in an appropriate order</td>
                <td>To make the retrieved lineage information easy to scan</td>
              </tr>
              <tr valign="top">
                <td>Computing the lineage information based on the semantic meaning of the feature</td>
                <td>To avoid including irrelevant or nonessential source tuples in the retrieved lineage information</td>
              </tr>
              <tr valign="top">
                <td>Performing no lineage tracing for any health care system feature value computed by an aggregation function</td>
                <td>To avoid including irrelevant data in the retrieved lineage information</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <sec>
          <title>Requirement 1: Retrieving Only a Small Set of Attributes</title>
          <p>When tracing the lineage of a temporal feature value, one should retrieve from the base tables only a small set of attributes specific to the temporal feature rather than the many attributes involved in deriving all of the features used for automated explanation. This requirement is posed to prevent the user of the automated explaining function from being overwhelmed by many nonessential or irrelevant attributes.</p>
          <p>To aid automatic explanation, we want to allow tracing the lineage of a temporal feature value in the form of a small set of attributes specific to the temporal feature (see <xref ref-type="table" rid="table2">Table 2</xref> for an example). This cannot be done well using Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>]. These techniques were developed to trace the lineage of a tuple, including all of its attribute values, in a select-project-join-aggregate materialized view in a relational database. Moreover, if the retrieved lineage information ever touches a tuple in a base table, all attribute values of the tuple are included in this information. For automatic explanation, both factors would cause the retrieved lineage information to have an excessive volume, overwhelming the user of the automated explaining function.</p>
          <p>To see this, the process of making predictions with automatic explanations is reviewed. Usually, many features are used to make predictions and to automatically explain them. All of the items on the left-hand side of a rule-style explanation come from the same tuple in the unified data frame, which contains all features of the new data. As <xref rid="figure3" ref-type="fig">Figure 3</xref> shows, this unified data frame is obtained by joining many intermediate result tables. Each of them falls into 1 of the 3 categories: (1) a table containing the patient cohort of interest in the new data, (2) a table containing 1 or more static features, and (3) a table containing 1 or more temporal features. Each hyperlinked item on the left-hand side of a rule-style explanation comes from exactly 1 intermediate result table in the third category.</p>
          <p>When the user of the automated explaining function clicks the hyperlink for an item on the left-hand side of a rule-style explanation, one could use Cui et al’s techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] to trace the lineage of the tuple in the unified data frame, from which the item comes. For each intermediate result table mentioned above and each base table used to create it, the retrieved lineage information contains some tuples from the base table including all of their attribute values. Most of the retrieved lineage information is unnecessary for automatic explanation for 3 reasons.</p>
          <sec>
            <title>Reason 1</title>
            <p>The retrieved lineage information often includes thousands of tuples from several dozen base tables. Most of these base tables are used to compute the other feature values in the tuple in the unified data frame that are unrelated to the item, and include no information that can help the user of the automated explaining function gain useful insights related to the item. In fact, to obtain the lineage information of the item essential for automatic explanation, we only need to trace, solely for the item, through the intermediate result table related to the item and to examine the base tables used to create this table. The features in this table that are unrelated to the item can be ignored. There is also no need to trace through the intermediate result tables containing the features unrelated to the item. Moreover, at automatic explanation time, we know the <italic>patient_id</italic> of the patient linked to the item. The user usually does not need to know why this patient is in the patient cohort of interest in the new data. Thus, there is no need to trace through the intermediate result table showing the patient cohort.</p>
          </sec>
          <sec>
            <title>Reason 2</title>
            <p>A base table often has many attributes, only a few of which are essential for the user of the automated explaining function to gain useful insights related to the item. For instance, the <italic>encounter</italic> table often has &#62;100 attributes. The lineage information shown in <xref ref-type="table" rid="table2">Table 2</xref> covers only 4 of them: <italic>admit_time</italic> transformed to the date format, <italic>department</italic>, <italic>admitting_provider</italic>, and <italic>facility</italic>.</p>
          </sec>
          <sec>
            <title>Reason 3</title>
            <p>Certain items are each computed using several base tables and intermediate query results. For the user of the automated explaining function to gain useful insights related to such an item, only the attributes and tuples of some of these base tables are essential. Likewise, none or only some of these intermediate query results need to be traced through.</p>
            <p>For example, in query <italic>Q<sub>2</sub></italic> given in the “Intermediate result tables” section, both the <italic>encounter</italic> and <italic>diagnosis</italic> base tables are used to compute the feature “the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months.” For a value of this feature, we need to use the information in the <italic>diagnosis</italic> table to find the related tuples in the <italic>encounter</italic> table. Nevertheless, the user would expect each encounter shown in the retrieved lineage information to be an outpatient visit with a primary diagnosis of asthma. Thus, there is no need to include any attribute or tuple from the <italic>diagnosis</italic> table in the retrieved lineage information, for example, to give the primary diagnosis of each encounter included in that information.</p>
            <p>As a second example, in query <italic>Q<sub>3</sub></italic> given in the “Intermediate result tables” section, both the <italic>encounter</italic> base table and the intermediate query result <italic>e_id</italic> are used to compute the feature “the number of ED visits related to asthma that the patient had in the prior 12 months.” For a value of this feature, the user of the automated explaining function would expect each encounter shown in the retrieved lineage information to be an ED visit related to asthma, like that shown in <xref ref-type="table" rid="table2">Table 2</xref>. Thus, there is no need to trace through <italic>e_id</italic> and to obtain the corresponding tuples in the <italic>diagnosis</italic> table showing that each encounter included in the retrieved lineage information has an asthma diagnosis code.</p>
          </sec>
        </sec>
        <sec>
          <title>Requirement 2: Adding Some Essential Attributes That Do Not Directly Produce the Feature Value</title>
          <p>For certain temporal features, when acquiring the lineage of a feature value, one should not use only the related raw data that directly produce the feature value. Instead, one needs to add to them some related attributes in the base tables, which are specific to the temporal feature and do not directly produce the feature value. We pose this requirement to make the retrieved lineage information include the most essential content needed to facilitate decision making. For example, as query <italic>Q<sub>1</sub></italic> given in the “Intermediate result tables” section shows, the feature “the number of ED visits that the patient had in the prior 12 months” is computed solely from the <italic>encounter</italic> base table. For a value of this feature, we want the retrieved lineage information to be similar to that shown in <xref ref-type="table" rid="table2">Table 2</xref> and include a primary diagnosis column. This column is computed using the <italic>diagnosis</italic> and <italic>diagnosis_code_master</italic> base tables unused in <italic>Q<sub>1</sub></italic> and is formed by concatenating the <italic>diagnosis_code</italic> and <italic>dx_code_description</italic> columns of the <italic>diagnosis_code_master</italic> base table. The cases for many other temporal features on encounters are similar.</p>
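          <p>Requirements 1 and 2 can be illustrated together with a minimal sketch of a parameterized lineage query for the feature “the number of ED visits that the patient had in the prior 12 months.” The query is not the one used in this paper. The columns <italic>encounter_id</italic>, <italic>encounter_type</italic>, and <italic>is_primary</italic>, the text-based date storage, and the function name are assumptions made for the example, and each encounter is assumed to have at most 1 primary diagnosis. Only 4 attributes of the <italic>encounter</italic> table are retrieved, and a primary diagnosis column is added from tables that the feature computation itself does not use.</p>
          <preformat><![CDATA[
```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY, patient_id INTEGER, admit_time TEXT,
    department TEXT, admitting_provider TEXT, facility TEXT, encounter_type TEXT
);
CREATE TABLE diagnosis (
    encounter_id INTEGER, diagnosis_code TEXT, is_primary INTEGER
);
CREATE TABLE diagnosis_code_master (
    diagnosis_code TEXT PRIMARY KEY, dx_code_description TEXT
);
"""

LINEAGE_SQL = """
SELECT date(e.admit_time) AS admit_date,          -- Requirement 1: few attributes
       e.department,
       e.admitting_provider,
       e.facility,
       m.diagnosis_code || ' ' || m.dx_code_description
           AS primary_diagnosis                   -- Requirement 2: added attribute
FROM encounter AS e
LEFT JOIN diagnosis AS d
       ON d.encounter_id = e.encounter_id AND d.is_primary = 1
LEFT JOIN diagnosis_code_master AS m
       ON m.diagnosis_code = d.diagnosis_code
WHERE e.patient_id = :patient_id
  AND e.encounter_type = 'ED'
  AND e.admit_time >= date(:period_end, '-12 months')
  AND e.admit_time < :period_end
ORDER BY e.admit_time DESC                        -- reverse chronological order
"""


def trace_ed_visit_lineage(conn, patient_id, period_end):
    """Stand-in for a stored procedure with 2 parameters: the patient_id and
    the endpoint of the temporal aggregation period."""
    params = {"patient_id": patient_id, "period_end": period_end}
    return conn.execute(LINEAGE_SQL, params).fetchall()
```
]]></preformat>
          <p>The left joins keep an ED visit in the result even when no primary diagnosis is recorded for it, so the number of rows returned still matches the feature value.</p>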
        </sec>
        <sec>
          <title>Requirement 3: Sorting the Retrieved Lineage Information in an Appropriate Order</title>
          <p>When the lineage information is presented, the related raw data retrieved for a temporal feature value should be sorted in an order specific to the temporal feature. This requirement is posed to make the retrieved lineage information easy to scan. Usually, we want the data instances in the retrieved lineage information to be displayed in reverse chronological order, as in electronic medical records. However, there are 2 exceptions. First, when the temporal feature is the maximum value of an attribute of a given patient, we want the related raw data retrieved for a feature value to be displayed in descending order of the attribute value. For example, for the feature “the highest systolic blood pressure of the patient in the prior 12 months,” we want the lineage information retrieved for a feature value to contain the systolic blood pressure values of the patient in the prior 12 months sorted in descending order. Second, when the temporal feature is the minimum value of an attribute of a given patient, we want the related raw data retrieved for a feature value to be displayed in ascending order of the attribute value. In either of these 2 cases, a re-sort button could be added to the displayed lineage information. If the user of the automated explaining function clicks this button, the data instances in the retrieved lineage information are rearranged in reverse chronological order for display.</p>
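          <p>The ordering rule can be sketched in Python as a small helper that picks the sort clause from the kind of temporal feature. The helper <italic>order_by_clause</italic>, the <italic>vital_sign</italic> table, and its columns are hypothetical names introduced only for this illustration, and the column names are assumed to come from a trusted knowledge base rather than from user input.</p>
          <preformat><![CDATA[
```python
import sqlite3

def order_by_clause(feature_kind, value_column, time_column):
    """Pick the display order of lineage rows from the kind of temporal feature."""
    if feature_kind == "max":
        return f"ORDER BY {value_column} DESC"   # exception 1: maximum value
    if feature_kind == "min":
        return f"ORDER BY {value_column} ASC"    # exception 2: minimum value
    return f"ORDER BY {time_column} DESC"        # default: reverse chronological

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vital_sign (patient_id INTEGER, measured_on TEXT, systolic_bp INTEGER);
INSERT INTO vital_sign VALUES (7, '2020-02-10', 128), (7, '2020-07-04', 151),
                              (7, '2020-12-19', 135);
""")

def lineage_rows(feature_kind):
    """Retrieve one patient's blood pressure readings in the chosen order."""
    sql = ("SELECT measured_on, systolic_bp FROM vital_sign WHERE patient_id = ? "
           + order_by_clause(feature_kind, "systolic_bp", "measured_on"))
    return conn.execute(sql, (7,)).fetchall()

by_value = lineage_rows("max")     # initial display for "highest systolic BP"
re_sorted = lineage_rows("other")  # what a re-sort button could switch to
```
]]></preformat>
          <p>Here <italic>by_value</italic> lists the readings from highest to lowest, and <italic>re_sorted</italic> lists the same readings from most recent to oldest.</p>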
        </sec>
        <sec>
          <title>Requirement 4: Computing the Lineage Information Based on the Semantic Meaning of the Feature</title>
          <p>The lineage information of a temporal feature value should be computed based on the semantic meaning of the feature rather than solely on the literal text of the SQL query used to compute the feature. We pose this requirement to avoid including irrelevant or nonessential source tuples in the retrieved lineage information. For a select-project-join-aggregate materialized view containing 1 or more temporal features, Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] compute the lineage of a tuple in it based solely on the literal SQL query used to define it. In certain cases, this literal approach is suboptimal for automatic explanation. Instead, we should consider the semantic meanings of the temporal features during lineage tracing. In the following, 2 such cases are described. Each case is presented as a subrequirement.</p>
          <sec>
            <title>Subrequirement 4.1</title>
            <p>When the temporal feature is the sum of a variable computed by a case statement in SQL including multiple conditions and some of them return 0, only the lineage information related to the other conditions should be retrieved. In SQL, such a temporal feature is written in the form of</p>
            <graphic xlink:href="medinform_v9i5e27778_fig31.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
            <p>As an example of this subrequirement, for the feature “the number of ED visits that the patient had in the prior 12 months,” the lineage information retrieved for a value of the feature should be the ED visits that the patient had in the prior 12 months, regardless of whether the feature is computed using SQL query <italic>Q<sub>9</sub></italic> or <italic>Q<sub>10</sub></italic> below.</p>
            <graphic xlink:href="medinform_v9i5e27778_fig16.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
            <p>The differences between <italic>Q<sub>9</sub></italic> and <italic>Q<sub>10</sub></italic> are highlighted in italics in <italic>Q<sub>10</sub></italic>. If the feature is computed using <italic>Q<sub>9</sub></italic>, Cui et al’s techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] would retrieve all the encounters of the patient in the prior 12 months as the lineage information. This could easily overwhelm the user of the automated explaining function, as usually most of these encounters are not ED visits.</p>
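            <p>The contrast can be illustrated with a small Python and SQLite sketch. The two query strings below are assumed readings of the case-statement form and of its rewritten form, not reproductions of the figures, and the <italic>encounter_type</italic> and <italic>admit_date</italic> columns and the sample rows are hypothetical. The 12-month window is omitted for brevity.</p>
            <preformat><![CDATA[
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        encounter_type TEXT, admit_date TEXT);
INSERT INTO encounter VALUES (1, 7, 'ED', '2020-11-02'),
                             (2, 7, 'outpatient', '2020-09-01'),
                             (3, 7, 'outpatient', '2020-08-12'),
                             (4, 7, 'ED', '2020-06-15'),
                             (5, 7, 'inpatient', '2020-06-16');
""")

# Feature written with a case statement whose ELSE branch returns 0.
CASE_FORM = """
SELECT patient_id,
       SUM(CASE WHEN encounter_type = 'ED' THEN 1 ELSE 0 END) AS num_ed_visits
FROM encounter GROUP BY patient_id
"""
# Same feature with the nonzero condition moved into the WHERE clause.
WHERE_FORM = """
SELECT patient_id, SUM(1) AS num_ed_visits
FROM encounter WHERE encounter_type = 'ED' GROUP BY patient_id
"""

# Source tuples that pass each form's WHERE clause for one patient. A tracer
# driven only by the query text would return these tuples as the lineage.
literal_lineage = conn.execute(
    "SELECT * FROM encounter WHERE patient_id = ?", (7,)).fetchall()
semantic_lineage = conn.execute(
    "SELECT * FROM encounter WHERE patient_id = ? AND encounter_type = 'ED'",
    (7,)).fetchall()
```
]]></preformat>
            <p>Both forms give the same feature value for this patient, but the first yields all 5 encounters as lineage whereas the second yields only the 2 ED visits. One caveat of this particular rewrite is that a patient with no ED visit produces no output row in the second form, so a feature value of 0 would need separate handling.</p>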
          </sec>
          <sec>
            <title>Subrequirement 4.2</title>
            <p>When the temporal feature is the total number of distinct items, the retrieved lineage information should include only 1 representative data instance for each distinct item. For example, query <italic>Q<sub>4</sub></italic> given in the “Intermediate result tables” section computes the feature “the total number of distinct medications ordered for the patient in the prior 12 months.” For a value of this feature, Cui et al’s techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] would retrieve all medications ordered for the patient in the prior 12 months as the lineage information. This information is often overwhelming and not succinct enough for the user of the automated explaining function to quickly find the distinct medications ordered for the patient in the prior 12 months, as the same medication could be ordered for the patient multiple times in a year. To avoid this problem, one could retrieve only the most recent order of each distinct medication ordered for the patient in the prior 12 months as the lineage information. For the user, these distinct medications typically provide enough insight into the patient’s status related to the feature value.</p>
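            <p>One possible way to retrieve a single representative order per distinct medication is sketched below in Python with SQLite. The <italic>ordered_medication</italic> table name comes from the text. Its columns (<italic>medication_id</italic> and <italic>order_date</italic>), the medication identifiers, and the sample rows are assumptions for illustration.</p>
            <preformat><![CDATA[
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ordered_medication (patient_id INTEGER, medication_id TEXT,
                                 order_date TEXT);
INSERT INTO ordered_medication VALUES
  (7, 'MED_A', '2020-03-01'), (7, 'MED_A', '2020-09-10'),
  (7, 'MED_A', '2020-12-05'), (7, 'MED_B', '2020-05-20'),
  (7, 'MED_B', '2019-11-30'), (7, 'MED_C', '2020-10-02');
""")

# The feature: number of distinct medications ordered in the prior 12 months.
FEATURE = """
SELECT COUNT(DISTINCT medication_id) FROM ordered_medication
WHERE patient_id = ? AND order_date > date(?, '-12 months')
"""
# Lineage: only the most recent order of each distinct medication.
MOST_RECENT_PER_MEDICATION = """
SELECT medication_id, MAX(order_date) AS most_recent_order_date
FROM ordered_medication
WHERE patient_id = ? AND order_date > date(?, '-12 months')
GROUP BY medication_id
ORDER BY most_recent_order_date DESC
"""

params = (7, "2021-01-01")  # patient identifier and assumed prediction date
feature_value = conn.execute(FEATURE, params).fetchone()[0]
lineage = conn.execute(MOST_RECENT_PER_MEDICATION, params).fetchall()
```
]]></preformat>
            <p>With this choice, the number of lineage rows equals the feature value, so the user can match each row to one counted medication.</p>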
          </sec>
        </sec>
        <sec>
          <title>Requirement 5: Performing No Lineage Tracing for Any Health Care System Feature Value Computed by an Aggregation Function</title>
          <p>We do not trace the lineage of any health care system feature value computed by an aggregation function. We pose this requirement to avoid including irrelevant data in the retrieved lineage information. Like temporal features of a patient, certain health care system features [<xref ref-type="bibr" rid="ref17">17</xref>-<xref ref-type="bibr" rid="ref19">19</xref>], such as the number of patients with asthma cared for by a patient’s primary care provider, are computed by aggregation functions. Each of these health care system features is computed using multiple patients’ information rather than solely the information of the patient being examined. Since other patients’ detailed information does not help the user of the automated explaining function understand this patient’s situation, we do not trace the lineage of any value of such a feature, even if the feature appears on the left-hand side of a rule-style explanation.</p>
        </sec>
      </sec>
      <sec>
        <title>Outline of the Proposed Techniques to Form the Lineage Tracing Query That Computes the Lineage Information</title>
        <p>To perform automated lineage tracing for explaining machine learning predictions for clinical decision support, Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] are modified to fulfill the requirements mentioned above. Even without giving any detail on the computer coding implementation and the performance evaluation results, Cui et al [<xref ref-type="bibr" rid="ref37">37</xref>] already used 49 pages to describe the details of their automated lineage tracing algorithm. The case described in this paper is more complex than Cui et al’s case [<xref ref-type="bibr" rid="ref37">37</xref>]. In the case described in this paper, which attributes are most relevant and which source tuples are most essential for inclusion in the retrieved lineage information depend on both the concrete feature type and the clinical decision support application’s need. In comparison, no such dependency exists in Cui et al’s case [<xref ref-type="bibr" rid="ref37">37</xref>]. Thus, it is expected that, once fully worked out, the proposed automated lineage tracing algorithm would be more sophisticated than Cui et al’s algorithm [<xref ref-type="bibr" rid="ref37">37</xref>]. In this viewpoint paper, the goal is not to enumerate all possible feature types and to provide a detailed design or any computer coding implementation of the proposed automated lineage tracing approach. Rather, the goal is to describe the design approach for the proposed automated lineage tracing module and to provide a roadmap for future research. We achieve this goal by outlining the main steps of forming the lineage tracing query, giving 4 example temporal features, and illustrating at a high level how to form the lineage tracing query for each of these 4 features.</p>
        <sec>
          <title>Overview of the Lineage Tracing Query Formation Process</title>
          <p>Usually, each intermediate result table shown in <xref rid="figure3" ref-type="fig">Figure 3</xref> has a <italic>patient_id</italic> column. It is used as the join column in the join operation to produce the unified data frame containing all features of the new data. As explained in “Reason 1” of the “Requirement 1” section, to obtain the lineage information of a temporal feature value, we need to trace through only the intermediate result table containing this value, and only for this value. This intermediate result table is usually computed from some base tables by using a select-project-join-aggregate SQL query <italic>S<sub>0</sub></italic>. To form the lineage tracing query for a temporal feature value of a patient in the intermediate result table, one proceeds in 4 steps. First, the other temporal features, if any, are removed from <italic>S<sub>0</sub></italic> to obtain a simplified query <italic>S<sub>1</sub></italic>. Second, if applicable, <italic>S<sub>1</sub></italic> is transformed to query <italic>S<sub>2</sub></italic> to fulfill subrequirement 4.1. Third, Cui et al’s techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] are modified to address Reasons 2 and 3 given in the “Requirement 1” section. The modified techniques are used to form a preliminary lineage tracing query <italic>S<sub>3</sub></italic> based on <italic>S<sub>2</sub></italic> and the patient’s <italic>patient_id</italic>. Fourth, to obtain the final lineage tracing query, <italic>S<sub>3</sub></italic> is transformed to fulfill Requirements 2 and 3 and subrequirement 4.2.</p>
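          <p>To make the 4 steps concrete, the following Python sketch composes a lineage tracing query from a structured description of a single feature. This is only an illustration under strong simplifying assumptions: the <italic>FeatureSpec</italic> class, the <italic>form_lineage_query</italic> function, and the column names are hypothetical, the feature is assumed to use 1 base table, and the joins needed for Requirement 2 and the handling of subrequirement 4.2 are left out.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeatureSpec:
    """Hypothetical description of one temporal feature after step 1, that is,
    with the other features of the intermediate result table removed."""
    base_table: str
    case_condition: Optional[str]  # condition of SUM(CASE WHEN ... THEN 1 ELSE 0 END)
    essential_columns: List[str]   # attributes worth showing to the user
    order_by: str                  # display order

def form_lineage_query(spec: FeatureSpec) -> str:
    """Build a parameterized lineage tracing query for one patient."""
    predicates = ["patient_id = ?"]
    # Step 2: move the nonzero case condition into the WHERE clause.
    if spec.case_condition:
        predicates.append(spec.case_condition)
    # Step 3: restrict to one patient and to the essential attributes.
    sql = (f"SELECT {', '.join(spec.essential_columns)} FROM {spec.base_table} "
           f"WHERE {' AND '.join(predicates)}")
    # Step 4: impose the display order (extra attributes would be joined here).
    return f"{sql} ORDER BY {spec.order_by}"

ed_visits = FeatureSpec("encounter", "encounter_type = 'ED'",
                        ["admit_date", "encounter_type"], "admit_date DESC")
query = form_lineage_query(ed_visits)
```
]]></preformat>
          <p>The resulting <italic>query</italic> string takes the patient identifier as its only parameter and could be stored for later use at prediction time.</p>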
          <p>In the following, 4 examples are used to illustrate at a high level how to form the lineage tracing query. In each example, the user of the automated explaining function is examining a patient with asthma whose identifier is <italic>asthma_patient_id</italic> and wants to drill through a temporal feature value of this patient. We outline the main steps of forming the lineage tracing query for the feature value without giving the detailed algorithm.</p>
        </sec>
        <sec>
          <title>Example 1: The Number of ED Visits That the Patient Had in the Prior 12 Months</title>
          <p>As defined by query <italic>Q<sub>1</sub></italic> in the “Intermediate result tables” section, the intermediate result table <italic>enc_features_1</italic> contains 3 temporal features. One of them is the number of ED visits that the patient had in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.</p>
          <p>First, the other 2 features are removed from query <italic>Q<sub>1</sub></italic> to obtain query <italic>Q<sub>9</sub></italic> given in the “Subrequirement 4.1” section.</p>
          <p>Second, to fulfill subrequirement 4.1 on handling the sum of a variable computed by a case statement, query <italic>Q<sub>9</sub></italic> is transformed to query <italic>Q<sub>10</sub></italic> given in the “Subrequirement 4.1” section.</p>
          <p>Third, Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] are used to form a draft lineage tracing query <italic>Q<sub>11</sub></italic> based on <italic>Q<sub>10</sub></italic> and <italic>asthma_patient_id</italic>.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig17.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>10</sub></italic> and <italic>Q<sub>11</sub></italic> are highlighted in italics in <italic>Q<sub>11</sub></italic>. To address Reason 2 given in the “Requirement 1” section and retrieve from the <italic>encounter</italic> table only its attributes essential for automatic explanation, <italic>Q<sub>11</sub></italic> is transformed to the following preliminary lineage tracing query <italic>Q<sub>12</sub></italic>.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig18.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>11</sub></italic> and <italic>Q<sub>12</sub></italic> are highlighted in italics in <italic>Q<sub>12</sub></italic>.</p>
          <p>Fourth, to fulfill Requirement 2, a primary diagnosis column needs to be added to the raw data that are retrieved by query <italic>Q<sub>12</sub></italic> and that directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data need to be sorted in the reverse chronological order. To meet both demands, <italic>Q<sub>12</sub></italic> is transformed to the following final lineage tracing query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig19.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>12</sub></italic> and <italic>Q<sub>13</sub></italic> are highlighted in italics in <italic>Q<sub>13</sub></italic>. &#124;&#124; is the string concatenation operator in SQL.</p>
        </sec>
        <sec>
          <title>Example 2: The Number of Outpatient Visits With a Primary Diagnosis of Asthma That the Patient Had in the Prior 12 Months</title>
          <p>As defined by query <italic>Q<sub>2</sub></italic> in the “Intermediate result tables” section, the intermediate result table <italic>enc_features_2</italic> contains the temporal feature “the number of outpatient visits with a primary diagnosis of asthma that the patient had in the prior 12 months.” To form the lineage tracing query for a value of this feature, one proceeds as follows.</p>
          <p>First, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the <italic>encounter</italic> table. To address Reason 3 given in the “Requirement 1” section, no attribute or tuple from the <italic>diagnosis</italic> table should be included in the retrieved lineage information. A preliminary lineage tracing query <italic>Q<sub>14</sub></italic> is formed based on query <italic>Q<sub>2</sub></italic> and <italic>asthma_patient_id</italic> by using a modified version of Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] that meets both demands.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig20.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>2</sub></italic> and <italic>Q<sub>14</sub></italic> are highlighted in italics in <italic>Q<sub>14</sub></italic>.</p>
          <p>Second, to fulfill Requirement 3 of sorting the related raw data retrieved for the feature value in the reverse chronological order, query <italic>Q<sub>14</sub></italic> is transformed to the following final lineage tracing query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig21.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>14</sub></italic> and <italic>Q<sub>15</sub></italic> are highlighted in italics in <italic>Q<sub>15</sub></italic>.</p>
        </sec>
        <sec>
          <title>Example 3: The Number of ED Visits Related to Asthma That the Patient Had in the Prior 12 Months</title>
          <p>As defined by query <italic>Q<sub>3</sub></italic> in the “Intermediate result tables” section, the intermediate result table <italic>enc_features_3</italic> contains 2 temporal features. One of them is the number of ED visits related to asthma that the patient had in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.</p>
          <p>First, the other feature is removed from query <italic>Q<sub>3</sub></italic> to obtain the following simplified query <italic>Q<sub>16</sub></italic>.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig22.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>Second, to fulfill subrequirement 4.1 on handling the sum of a variable computed by a case statement, query <italic>Q<sub>16</sub></italic> is transformed to the following query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig23.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>16</sub></italic> and <italic>Q<sub>17</sub></italic> are highlighted in italics in <italic>Q<sub>17</sub></italic>.</p>
          <p>Third, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the <italic>encounter</italic> table. To address Reason 3 given in the “Requirement 1” section, the intermediate query result <italic>e_id</italic> should not be traced through to include any corresponding tuple in the <italic>diagnosis</italic> table in the retrieved lineage information. A preliminary lineage tracing query <italic>Q<sub>18</sub></italic> is formed based on query <italic>Q<sub>17</sub></italic> and <italic>asthma_patient_id</italic> by using a modified version of Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] that meets both demands.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig24.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>17</sub></italic> and <italic>Q<sub>18</sub></italic> are highlighted in italics in <italic>Q<sub>18</sub></italic>.</p>
          <p>Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>,<xref ref-type="bibr" rid="ref49">49</xref>] are applied to query <italic>Q<sub>3</sub></italic> to create a materialized view <italic>asthma_encounter_id</italic>, which is defined by query <italic>Q<sub>5</sub></italic> in the “Review of Cui et al’s automated lineage tracing techniques for relational databases” section. The <italic>asthma_encounter_id</italic> is used to rewrite the preliminary lineage tracing query <italic>Q<sub>18</sub></italic> as follows.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig25.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>18</sub></italic> and <italic>Q<sub>19</sub></italic> are highlighted in italics in <italic>Q<sub>19</sub></italic>.</p>
          <p>Fourth, to fulfill Requirement 2, a primary diagnosis column needs to be added to the raw data that are retrieved by query <italic>Q<sub>19</sub></italic> and that directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data need to be sorted in the reverse chronological order. To meet both demands, <italic>Q<sub>19</sub></italic> is transformed to the following final lineage tracing query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig26.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>19</sub></italic> and <italic>Q<sub>20</sub></italic> are highlighted in italics in <italic>Q<sub>20</sub></italic>.</p>
        </sec>
        <sec>
          <title>Example 4: The Total Number of Distinct Medications Ordered for the Patient in the Prior 12 Months</title>
          <p>As defined by query <italic>Q<sub>4</sub></italic> in the “Intermediate result tables” section, the intermediate result table <italic>med_features_1</italic> contains 2 temporal features. One of them is the total number of distinct medications ordered for the patient in the prior 12 months. To form the lineage tracing query for a value of this feature, one proceeds as follows.</p>
          <p>First, the other feature is removed from query <italic>Q<sub>4</sub></italic> to obtain the following simplified query <italic>Q<sub>21</sub></italic>.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig27.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>Second, to address Reason 2 given in the “Requirement 1” section, only the attributes essential for automatic explanation should be included from the <italic>ordered_medication</italic> table. A preliminary lineage tracing query <italic>Q<sub>22</sub></italic> is formed based on query <italic>Q<sub>21</sub></italic> and <italic>asthma_patient_id</italic> by using a modified version of Cui et al’s lineage tracing techniques [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] that meets this demand.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig28.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>21</sub></italic> and <italic>Q<sub>22</sub></italic> are highlighted in italics in <italic>Q<sub>22</sub></italic>.</p>
          <p>Third, to fulfill subrequirement 4.2, one could retrieve only the most recent order of each distinct medication ordered for the patient in the prior 12 months as the lineage information. This is done by transforming query <italic>Q<sub>22</sub></italic> to the following query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig29.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>22</sub></italic> and <italic>Q<sub>23</sub></italic> are highlighted in italics in <italic>Q<sub>23</sub></italic>.</p>
          <p>Fourth, to fulfill Requirement 2, a medication name column is added to the raw data that are retrieved by query <italic>Q<sub>23</sub></italic> and that directly produce the feature value being examined. To fulfill Requirement 3, the retrieved raw data are sorted in the reverse chronological order. To meet both demands, <italic>Q<sub>23</sub></italic> is transformed to the following final lineage tracing query.</p>
          <graphic xlink:href="medinform_v9i5e27778_fig30.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
          <p>The differences between <italic>Q<sub>23</sub></italic> and <italic>Q<sub>24</sub></italic> are highlighted in italics in <italic>Q<sub>24</sub></italic>.</p>
        </sec>
      </sec>
      <sec>
        <title>Considerations for Future Computer Coding Implementation of the Proposed Automated Lineage Tracing Approach</title>
        <sec>
          <title>Maximizing the Automation Degree of the Lineage Tracing Query Formation Process</title>
          <p>For a select-project-join-aggregate materialized view, Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] used a fully automated approach to analyze its definition query to derive a lineage tracing query for a tuple in it. In the case of automatically explaining machine learning predictions, all temporal features used for making predictions and automatic explanation are known at machine learning model building time. In general, for each temporal feature, we can form a lineage tracing query either manually or semiautomatically, but often not fully automatically, beforehand. Nevertheless, once the query is formed and put into the knowledge base of the automated explaining function, we can use the query to automatically retrieve the lineage information of a value of the feature at prediction time.</p>
          <p>As mentioned before, automatic explanation poses several unique requirements on automated lineage tracing. Two of them make it difficult to fully automate the lineage tracing query formation process. First, Requirement 1 says that the lineage information retrieved for a temporal feature value should include only a small set of relevant attributes specific to the temporal feature. An almost unlimited number of attributes and temporal features could be used for clinical machine learning. Thus, it is infeasible to precompile the set of relevant attributes for every possible temporal feature. Second, Requirement 2 says that when acquiring the lineage of a value for certain temporal features, we need to include some attributes that are specific to the temporal feature and do not directly produce the feature value. For a similar reason, it is infeasible to precompile the set of such attributes for every possible temporal feature of this kind.</p>
          <p>Although the lineage tracing query formation process cannot be fully automated in the most general case, 2 methods can still be used to maximize the degree of automation of the process and to reduce the workload of the developers of the automated explaining function. First, for a temporal feature, an approach similar to that of Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>] can be used to automatically form a draft lineage tracing query. The developers of the automated explaining function then revise this query as needed to obtain the final lineage tracing query. Second, the same temporal feature is often used for multiple predictive modeling tasks. One can create a library of lineage tracing queries for temporal features to facilitate query reuse across various predictive modeling tasks. This library can be formed for a data set in the Observational Medical Outcomes Partnership common data model format [<xref ref-type="bibr" rid="ref50">50</xref>] using its linked standardized terminologies [<xref ref-type="bibr" rid="ref51">51</xref>]. This format standardizes administrative and clinical variables from ≥10 large US health care systems [<xref ref-type="bibr" rid="ref52">52</xref>,<xref ref-type="bibr" rid="ref53">53</xref>]. For any data set that is put into this format, we can use this library to obtain lineage tracing queries.</p>
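          <p>A library of this kind could be as simple as a mapping from feature names to prebuilt queries, as in the Python sketch below. The feature names, the <italic>trace_lineage</italic> function, and the table layout are hypothetical and follow the simplified <italic>encounter</italic> table used in this paper rather than the common data model. An entry of <italic>None</italic> is used here as one way to mark a feature that is deliberately not traced under Requirement 5.</p>
          <preformat><![CDATA[
```python
import sqlite3

# Hypothetical knowledge base: feature name -> prebuilt lineage tracing query.
# None marks a feature whose lineage is deliberately not traced, such as a
# health care system feature computed by an aggregation function.
LINEAGE_QUERY_LIBRARY = {
    "num_ed_visits_prior_12_months": """
        SELECT admit_date, encounter_type FROM encounter
        WHERE patient_id = :patient_id AND encounter_type = 'ED'
          AND admit_date > date(:as_of, '-12 months')
        ORDER BY admit_date DESC""",
    "num_asthma_patients_of_pcp": None,
}

def trace_lineage(conn, feature_name, patient_id, as_of):
    """Run the stored query for a feature; return None when none is stored."""
    query = LINEAGE_QUERY_LIBRARY.get(feature_name)
    if query is None:
        return None
    params = {"patient_id": patient_id, "as_of": as_of}
    return conn.execute(query, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        encounter_type TEXT, admit_date TEXT);
INSERT INTO encounter VALUES (1, 7, 'ED', '2020-11-02'),
                             (2, 7, 'ED', '2020-06-15'),
                             (3, 7, 'outpatient', '2020-09-01');
""")

ed_lineage = trace_lineage(conn, "num_ed_visits_prior_12_months", 7, "2021-01-01")
pcp_lineage = trace_lineage(conn, "num_asthma_patients_of_pcp", 7, "2021-01-01")
```
]]></preformat>
          <p>Because the queries are formed beforehand, the only work left at prediction time is a dictionary lookup and 1 parameterized query execution.</p>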
        </sec>
        <sec>
          <title>Improving the Lineage Tracing Speed</title>
          <p>As mentioned before, the user of the automated explaining function wants the lineage tracing process for a temporal feature value to be finished quickly, preferably within 1 second. To expedite tracing the lineage of a tuple in a materialized view defined by a select-project-join-aggregate query <italic>S</italic>, Cui et al [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref37">37</xref>,<xref ref-type="bibr" rid="ref49">49</xref>] advocated creating a materialized view for each intermediate select-project-join-aggregate segment of the canonical form of the logical query plan for <italic>S</italic>. While this boosts the lineage tracing speed, the resulting speed is still not fast enough to reach a subsecond response time [<xref ref-type="bibr" rid="ref23">23</xref>,<xref ref-type="bibr" rid="ref39">39</xref>]. To further improve the lineage tracing speed, we can build indices [<xref ref-type="bibr" rid="ref39">39</xref>,<xref ref-type="bibr" rid="ref42">42</xref>] on the selection and join attributes of both the base tables and the materialized views created for the intermediate select-project-join-aggregate segments. For instance, in Example 3, we can build 1 index on the <italic>encounter_id</italic> column of the materialized view <italic>asthma_encounter_id</italic> and another index on the <italic>patient_id</italic> column of the <italic>encounter</italic> base table. We can create indices either manually or by using an automated index design tool provided by a commercial relational database system [<xref ref-type="bibr" rid="ref54">54</xref>-<xref ref-type="bibr" rid="ref56">56</xref>]. Typically, each intermediate result table containing 1 or more temporal features is computed on 1 or a few base tables using no more than a small number of join operations. The lineage tracing query for a temporal feature value falls into a similar case. 
Thus, with appropriate indices, we would expect the lineage tracing query to finish execution quickly. For base tables of moderate sizes and simple materialized views, Cui and Widom [<xref ref-type="bibr" rid="ref39">39</xref>] showed that lineage tracing can be done within 1 second when indices exist on the keys of the base tables. For large base tables and temporal features computed through more complex procedures, we would expect that more indices are needed to reach a subsecond response time.</p>
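          <p>The effect of such indices can be checked with a small Python and SQLite sketch. SQLite has no materialized views, so a plain table stands in for <italic>asthma_encounter_id</italic>. The index names, the <italic>encounter_type</italic> and <italic>admit_date</italic> columns, and the sample rows are assumptions, and the behavior of a production database system may differ.</p>
          <preformat><![CDATA[
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, patient_id INTEGER,
                        encounter_type TEXT, admit_date TEXT);
-- A plain table stands in for the materialized view.
CREATE TABLE asthma_encounter_id (encounter_id INTEGER);
INSERT INTO encounter VALUES (1, 7, 'ED', '2020-11-02'),
                             (2, 7, 'ED', '2020-06-15'),
                             (3, 8, 'ED', '2020-09-01');
INSERT INTO asthma_encounter_id VALUES (1), (3);
""")

PATIENT_FILTER = "SELECT admit_date FROM encounter WHERE patient_id = ?"

def plan(sql):
    """Return the planner's description of how SQLite would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (7,)).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(PATIENT_FILTER)  # no index yet: full table scan
conn.executescript("""
CREATE INDEX idx_encounter_patient_id ON encounter (patient_id);
CREATE INDEX idx_asthma_encounter_id ON asthma_encounter_id (encounter_id);
""")
plan_after = plan(PATIENT_FILTER)   # the planner can now search the index

# Lineage of asthma-related ED visits for one patient, joining through the
# stand-in for the materialized view.
asthma_ed_lineage = conn.execute("""
SELECT e.admit_date FROM encounter e
JOIN asthma_encounter_id a ON a.encounter_id = e.encounter_id
WHERE e.patient_id = ? AND e.encounter_type = 'ED'
""", (7,)).fetchall()
```
]]></preformat>
          <p>Comparing <italic>plan_before</italic> with <italic>plan_after</italic> shows whether the planner switches from scanning the whole <italic>encounter</italic> table to searching the index on <italic>patient_id</italic>.</p>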
          <p>The above discussion focuses on the case in which the electronic medical record data are stored in a relational database and features are extracted using SQL queries. When the electronic medical record data are stored in a big data system and features are extracted using map and reduce functions [<xref ref-type="bibr" rid="ref44">44</xref>] or Pig Latin [<xref ref-type="bibr" rid="ref46">46</xref>], we can modify the corresponding existing lineage tracing techniques [<xref ref-type="bibr" rid="ref42">42</xref>,<xref ref-type="bibr" rid="ref43">43</xref>,<xref ref-type="bibr" rid="ref45">45</xref>] in a similar way to enable lineage tracing to aid automatically explaining machine learning predictions for clinical decision support.</p>
        </sec>
      </sec>
    </sec>
    <sec sec-type="discussion">
      <title>Discussion</title>
      <sec>
        <title>Directions for Future Research</title>
        <p>The above discussion describes the high-level design approach for the proposed automated lineage tracing module. To complete the detailed design of the proposed automated lineage tracing approach, implement the module in computer code, and test the module’s performance, much research is needed along the following directions:</p>
        <list list-type="order">
          <list-item>
            <p>We need to compile a list of attributes and temporal feature types most commonly used in building clinical machine learning predictive models. For these attributes and temporal feature types, we need to complete the detailed design and the computer coding implementation of the proposed automated lineage tracing approach.</p>
          </list-item>
          <list-item>
            <p>We need to come up with an automated approach to design indices needed for improving the lineage tracing speed. The database research community has developed several automated index design approaches [<xref ref-type="bibr" rid="ref54">54</xref>-<xref ref-type="bibr" rid="ref56">56</xref>]. We can modify these approaches to fit the database querying workload posed by automated lineage tracing.</p>
          </list-item>
          <list-item>
            <p>We plan to assess the execution speed of the proposed automated lineage tracing approach after implementing it in computer code.</p>
          </list-item>
          <list-item>
            <p>As shown by the prior work on automated lineage tracing reviewed in the “Overview of the existing automated lineage tracing techniques” section, the database research community takes it for granted that automated lineage tracing could help users better understand the data and save time in doing data analysis. To the best of our knowledge, no formal study to date has been published on measuring the impact of automated lineage tracing on users’ data analysis and decision-making process. After implementing the proposed automated lineage tracing module, we plan to choose several clinical predictive modeling tasks and assess, for each task, the impact of offering the module on the data analysis and decision-making process of the users of the automated explaining function. In particular, we plan to evaluate whether the addition of the module benefits the user and improves outcomes, for example, by saving the user’s time, making it easier for the user to understand the predictions given by the machine learning predictive model, and helping the user better understand the patient’s situation and make better clinical decisions.</p>
          </list-item>
        </list>
      </sec>
      <sec>
        <title>Limitations of the Proposed Approach</title>
        <p>The proposed automated lineage tracing approach has several limitations:</p>
        <list list-type="order">
          <list-item>
            <p>To build clinical machine learning predictive models, we usually use temporal features that are computed by SQL queries of low or moderate complexity. However, some temporal features used to build certain predictive models may be computed by rather complex SQL queries. For a value of such a temporal feature, we may not be able to finish the lineage tracing process quickly, regardless of how many indices are built to expedite this process. For example, this could happen if the SQL query uses complex procedural code that has no property that can be used to simplify the lineage tracing process [<xref ref-type="bibr" rid="ref39">39</xref>]. A long lineage tracing time could make the user of the automated explaining function impatient. Nevertheless, it is still faster and more convenient to do lineage tracing using the automated approach than to have the user do manual drill-through.</p>
          </list-item>
          <list-item>
            <p>The proposed automated lineage tracing approach works for any feature value computed by the standard aggregation functions in SQL on longitudinal structured data. For certain deep learning predictive models built on longitudinal structured data, the previously proposed method [<xref ref-type="bibr" rid="ref16">16</xref>] could be used to semiautomatically extract comprehensible and predictive temporal features from the models and the longitudinal structured data, after which the automated approach could be applied to trace the lineage of the values of these features. For any other deep learning predictive model that is built directly on longitudinal structured data and that uses incomprehensible features hidden in the neurons of the deep neural network, the proposed automated approach cannot be used to trace the lineage of the values of these features.</p>
          </list-item>
          <list-item>
            <p>An almost unlimited number of attributes and temporal features could be used for clinical machine learning. Further, some attributes are not covered by the Observational Medical Outcomes Partnership common data model. For the reasons given in the “Maximizing the automation degree of the lineage tracing query formation process” section, we could maximize the automation degree of the lineage tracing query formation process for only certain types of temporal features formed on certain attributes. For any other temporal feature, the developers of the automated explaining function may still need a nontrivial amount of time to create the corresponding lineage tracing query.</p>
          </list-item>
        </list>
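        <p>As a minimal sketch of what tracing the lineage of an aggregated feature value involves, consider a hypothetical temporal feature computed by the standard SQL aggregation function COUNT. The table ed_visit, its columns and rows, the index, and the function names below are assumptions made purely for illustration and are not part of the proposed design. The sketch uses Python with an in-memory SQLite database.</p>
        <preformat>
```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ed_visit (patient_id INTEGER, visit_date TEXT, diagnosis TEXT)"
)
conn.executemany(
    "INSERT INTO ed_visit VALUES (?, ?, ?)",
    [
        (1, "2020-02-10", "asthma"),
        (1, "2020-07-04", "asthma"),
        (1, "2018-05-01", "asthma"),    # outside the time window
        (1, "2020-09-15", "fracture"),  # different diagnosis
        (2, "2020-03-03", "asthma"),    # different patient
    ],
)

# Hypothetical index intended to speed up both queries below.
conn.execute(
    "CREATE INDEX idx_ed_visit ON ed_visit (patient_id, diagnosis, visit_date)"
)

# Selection conditions shared by the feature query and its lineage tracing query.
CONDITIONS = (
    "patient_id = ? AND diagnosis = 'asthma' "
    "AND visit_date BETWEEN '2020-01-01' AND '2020-12-31'"
)


def feature_value(patient_id):
    """Temporal feature: number of asthma ED visits in 2020 for one patient."""
    sql = "SELECT COUNT(*) FROM ed_visit WHERE " + CONDITIONS
    return conn.execute(sql, (patient_id,)).fetchone()[0]


def trace_lineage(patient_id):
    """Lineage: the raw rows that the COUNT(*) above aggregated."""
    sql = (
        "SELECT visit_date, diagnosis FROM ed_visit WHERE "
        + CONDITIONS
        + " ORDER BY visit_date"
    )
    return conn.execute(sql, (patient_id,)).fetchall()


value = feature_value(1)
rows = trace_lineage(1)
print(value, rows)
```
        </preformat>
        <p>In this sketch, the lineage tracing query reuses the selection conditions of the feature query and returns the contributing raw rows instead of the aggregate. A feature computed by complex procedural code would not decompose this simply, which is the difficulty described in the first limitation.</p>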
      </sec>
      <sec>
        <title>Conclusions</title>
        <p>Automatically explaining machine learning predictions is critical to overcoming the model interpretability barrier to using machine learning predictive models in clinical practice. Our previously developed automatic explanation method for machine learning predictions can be used to address this barrier, but a gap remains in meeting the need to rapidly drill through a feature value in an explanation that is computed by an aggregation function on the raw data. This paper articulates this gap, outlines an automated lineage tracing approach to close the gap, and provides a roadmap for future research. The automated drill-through capability is intended to help the user of the automated explaining function save time, better understand the patient’s situation, and make better clinical decisions. It would take several people multiple years to work out the detailed design and the code implementation of the proposed automated lineage tracing approach. We hope this paper will interest some researchers in joining the research endeavor on this topic. Only after the detailed design and the code implementation of the proposed automated lineage tracing approach are fully worked out can one deploy the automated lineage tracing module in clinical practice and measure the module’s impact on clinicians’ decision-making process. The principle of the automated lineage tracing approach generalizes to nonmedical data and to other automated methods for explaining machine learning predictions.</p>
      </sec>
    </sec>
  </body>
  <back>
    <app-group/>
    <glossary>
      <title>Abbreviations</title>
      <def-list>
        <def-item>
          <term id="abb1">ED</term>
          <def>
            <p>emergency department</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb2">SQL</term>
          <def>
            <p>structured query language</p>
          </def>
        </def-item>
      </def-list>
    </glossary>
    <ack>
      <p>We thank Xiaoyi Zhang and Brian Kelly for the useful discussions. GL was partially supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health under award number R01HL142503. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
    </ack>
    <fn-group>
      <fn fn-type="conflict">
        <p>None declared.</p>
      </fn>
    </fn-group>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <nlm-citation citation-type="web">
          <source>Kaggle</source>
          <access-date>2021-04-30</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.kaggle.com">https://www.kaggle.com</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Steyerberg</surname>
              <given-names>EW</given-names>
            </name>
          </person-group>
          <source>Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, 2nd ed</source>
          <year>2019</year>
          <publisher-loc>New York, USA</publisher-loc>
          <publisher-name>Springer</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Dipuro</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Hou</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Grover</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Low</surname>
              <given-names>LL</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Loke</surname>
              <given-names>CY</given-names>
            </name>
          </person-group>
          <article-title>Leveraging on predictive analytics to manage clinic no show and improve accessibility of care</article-title>
          <year>2017</year>
          <conf-name>Proceedings of 2017 IEEE International Conference on Data Science and Advanced Analytics</conf-name>
          <conf-date>October 19-21, 2017</conf-date>
          <conf-loc>Tokyo, Japan</conf-loc>
          <fpage>429</fpage>
          <lpage>438</lpage>
          <pub-id pub-id-type="doi">10.1109/dsaa.2017.25</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Dean</surname>
              <given-names>NC</given-names>
            </name>
            <name name-style="western">
              <surname>Jones</surname>
              <given-names>BE</given-names>
            </name>
            <name name-style="western">
              <surname>Jones</surname>
              <given-names>JP</given-names>
            </name>
            <name name-style="western">
              <surname>Ferraro</surname>
              <given-names>JP</given-names>
            </name>
            <name name-style="western">
              <surname>Post</surname>
              <given-names>HB</given-names>
            </name>
            <name name-style="western">
              <surname>Aronsky</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Vines</surname>
              <given-names>CG</given-names>
            </name>
            <name name-style="western">
              <surname>Allen</surname>
              <given-names>TL</given-names>
            </name>
            <name name-style="western">
              <surname>Haug</surname>
              <given-names>PJ</given-names>
            </name>
          </person-group>
          <article-title>Impact of an electronic clinical decision support tool for emergency department patients with pneumonia</article-title>
          <source>Ann Emerg Med</source>
          <year>2015</year>
          <volume>66</volume>
          <issue>5</issue>
          <fpage>511</fpage>
          <lpage>520</lpage>
          <pub-id pub-id-type="doi">10.1016/j.annemergmed.2015.02.003</pub-id>
          <pub-id pub-id-type="medline">25725592</pub-id>
          <pub-id pub-id-type="pii">S0196-0644(15)00091-8</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hsu</surname>
              <given-names>JC</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>YF</given-names>
            </name>
            <name name-style="western">
              <surname>Chung</surname>
              <given-names>WS</given-names>
            </name>
            <name name-style="western">
              <surname>Tan</surname>
              <given-names>TH</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Chiang</surname>
              <given-names>JY</given-names>
            </name>
          </person-group>
          <article-title>Clinical verification of a clinical decision support system for ventilator weaning</article-title>
          <source>Biomed Eng Online</source>
          <year>2013</year>
          <volume>12 Suppl 1</volume>
          <fpage>S4</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.biomedcentral.com/1475-925X/12/S1/S4"/>
          </comment>
          <pub-id pub-id-type="doi">10.1186/1475-925X-12-S1-S4</pub-id>
          <pub-id pub-id-type="medline">24565021</pub-id>
          <pub-id pub-id-type="pii">1475-925X-12-S1-S4</pub-id>
          <pub-id pub-id-type="pmcid">PMC4028887</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Barbieri</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Molina</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ponce</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Tothova</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Cattinelli</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Ion Titapiccolo</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Mari</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Amato</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Leipold</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Wehmeyer</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Stuard</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Stopper</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Canaud</surname>
              <given-names>B</given-names>
            </name>
          </person-group>
          <article-title>An international observational study suggests that artificial intelligence for clinical decision support optimizes anemia management in hemodialysis patients</article-title>
          <source>Kidney Int</source>
          <year>2016</year>
          <volume>90</volume>
          <issue>2</issue>
          <fpage>422</fpage>
          <lpage>429</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://linkinghub.elsevier.com/retrieve/pii/S0085-2538(16)30132-6"/>
          </comment>
          <pub-id pub-id-type="doi">10.1016/j.kint.2016.03.036</pub-id>
          <pub-id pub-id-type="medline">27262365</pub-id>
          <pub-id pub-id-type="pii">S0085-2538(16)30132-6</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Brier</surname>
              <given-names>ME</given-names>
            </name>
            <name name-style="western">
              <surname>Gaweda</surname>
              <given-names>AE</given-names>
            </name>
            <name name-style="western">
              <surname>Dailey</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Aronoff</surname>
              <given-names>GR</given-names>
            </name>
            <name name-style="western">
              <surname>Jacobs</surname>
              <given-names>AA</given-names>
            </name>
          </person-group>
          <article-title>Randomized trial of model predictive control for improved anemia management</article-title>
          <source>Clin J Am Soc Nephrol</source>
          <year>2010</year>
          <month>05</month>
          <volume>5</volume>
          <issue>5</issue>
          <fpage>814</fpage>
          <lpage>820</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://cjasn.asnjournals.org/cgi/pmidlookup?view=long&#38;pmid=20185598"/>
          </comment>
          <pub-id pub-id-type="doi">10.2215/CJN.07181009</pub-id>
          <pub-id pub-id-type="medline">20185598</pub-id>
          <pub-id pub-id-type="pii">CJN.07181009</pub-id>
          <pub-id pub-id-type="pmcid">PMC2863987</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gaweda</surname>
              <given-names>AE</given-names>
            </name>
            <name name-style="western">
              <surname>Aronoff</surname>
              <given-names>GR</given-names>
            </name>
            <name name-style="western">
              <surname>Jacobs</surname>
              <given-names>AA</given-names>
            </name>
            <name name-style="western">
              <surname>Rai</surname>
              <given-names>SN</given-names>
            </name>
            <name name-style="western">
              <surname>Brier</surname>
              <given-names>ME</given-names>
            </name>
          </person-group>
          <article-title>Individualized anemia management reduces hemoglobin variability in hemodialysis patients</article-title>
          <source>J Am Soc Nephrol</source>
          <year>2014</year>
          <month>01</month>
          <volume>25</volume>
          <issue>1</issue>
          <fpage>159</fpage>
          <lpage>166</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://jasn.asnjournals.org/cgi/pmidlookup?view=long&#38;pmid=24029429"/>
          </comment>
          <pub-id pub-id-type="doi">10.1681/ASN.2013010089</pub-id>
          <pub-id pub-id-type="medline">24029429</pub-id>
          <pub-id pub-id-type="pii">ASN.2013010089</pub-id>
          <pub-id pub-id-type="pmcid">PMC3871773</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref9">
        <label>9</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gaweda</surname>
              <given-names>AE</given-names>
            </name>
            <name name-style="western">
              <surname>Jacobs</surname>
              <given-names>AA</given-names>
            </name>
            <name name-style="western">
              <surname>Aronoff</surname>
              <given-names>GR</given-names>
            </name>
            <name name-style="western">
              <surname>Brier</surname>
              <given-names>ME</given-names>
            </name>
          </person-group>
          <article-title>Model predictive control of erythropoietin administration in the anemia of ESRD</article-title>
          <source>Am J Kidney Dis</source>
          <year>2008</year>
          <month>01</month>
          <volume>51</volume>
          <issue>1</issue>
          <fpage>71</fpage>
          <lpage>79</lpage>
          <pub-id pub-id-type="doi">10.1053/j.ajkd.2007.10.003</pub-id>
          <pub-id pub-id-type="medline">18155535</pub-id>
          <pub-id pub-id-type="pii">S0272-6386(07)01353-4</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref10">
        <label>10</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hamlet</surname>
              <given-names>KS</given-names>
            </name>
            <name name-style="western">
              <surname>Hobgood</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Hamar</surname>
              <given-names>GB</given-names>
            </name>
            <name name-style="western">
              <surname>Dobbs</surname>
              <given-names>AC</given-names>
            </name>
            <name name-style="western">
              <surname>Rula</surname>
              <given-names>EY</given-names>
            </name>
            <name name-style="western">
              <surname>Pope</surname>
              <given-names>JE</given-names>
            </name>
          </person-group>
          <article-title>Impact of predictive model-directed end-of-life counseling for Medicare beneficiaries</article-title>
          <source>Am J Manag Care</source>
          <year>2010</year>
          <month>05</month>
          <volume>16</volume>
          <issue>5</issue>
          <fpage>379</fpage>
          <lpage>384</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.ajmc.com/pubMed.php?pii=12641"/>
          </comment>
          <pub-id pub-id-type="medline">20469958</pub-id>
          <pub-id pub-id-type="pii">12641</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref11">
        <label>11</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction</article-title>
          <source>Health Inf Sci Syst</source>
          <year>2016</year>
          <volume>4</volume>
          <fpage>2</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/26958341"/>
          </comment>
          <pub-id pub-id-type="doi">10.1186/s13755-016-0015-4</pub-id>
          <pub-id pub-id-type="medline">26958341</pub-id>
          <pub-id pub-id-type="pii">15</pub-id>
          <pub-id pub-id-type="pmcid">PMC4782293</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Johnson</surname>
              <given-names>MD</given-names>
            </name>
            <name name-style="western">
              <surname>Nkoy</surname>
              <given-names>FL</given-names>
            </name>
            <name name-style="western">
              <surname>He</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Stone</surname>
              <given-names>BL</given-names>
            </name>
          </person-group>
          <article-title>Automatically explaining machine learning prediction results on asthma hospital visits in asthmatic patients: secondary analysis</article-title>
          <source>JMIR Med Inform</source>
          <year>2020</year>
          <month>12</month>
          <day>31</day>
          <volume>8</volume>
          <issue>12</issue>
          <fpage>e21965</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://medinform.jmir.org/2020/12/e21965/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/21965</pub-id>
          <pub-id pub-id-type="medline">33382379</pub-id>
          <pub-id pub-id-type="pii">v8i12e21965</pub-id>
          <pub-id pub-id-type="pmcid">PMC7808890</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Tong</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Messinger</surname>
              <given-names>AI</given-names>
            </name>
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>Testing the generalizability of an automated method for explaining machine learning predictions on asthma patients' asthma hospital visits to an academic health care system</article-title>
          <source>IEEE Access</source>
          <year>2020</year>
          <volume>8</volume>
          <fpage>195971</fpage>
          <lpage>195979</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/33240737"/>
          </comment>
          <pub-id pub-id-type="doi">10.1109/access.2020.3032683</pub-id>
          <pub-id pub-id-type="medline">33240737</pub-id>
          <pub-id pub-id-type="pmcid">PMC7685253</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Nau</surname>
              <given-names>CL</given-names>
            </name>
            <name name-style="western">
              <surname>Crawford</surname>
              <given-names>WW</given-names>
            </name>
            <name name-style="western">
              <surname>Schatz</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Zeiger</surname>
              <given-names>RS</given-names>
            </name>
            <name name-style="western">
              <surname>Koebnick</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>Generalizability of an automatic explanation method for machine learning prediction results on asthma-related hospital visits in patients with asthma: quantitative analysis</article-title>
          <source>J Med Internet Res</source>
          <year>2021</year>
          <month>04</month>
          <day>15</day>
          <volume>23</volume>
          <issue>4</issue>
          <fpage>e24153</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.jmir.org/2021/4/e24153/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/24153</pub-id>
          <pub-id pub-id-type="medline">33856359</pub-id>
          <pub-id pub-id-type="pii">v23i4e24153</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Halamka</surname>
              <given-names>JD</given-names>
            </name>
          </person-group>
          <article-title>Early experiences with big data at an academic medical center</article-title>
          <source>Health Aff (Millwood)</source>
          <year>2014</year>
          <month>07</month>
          <volume>33</volume>
          <issue>7</issue>
          <fpage>1132</fpage>
          <lpage>1138</lpage>
          <pub-id pub-id-type="doi">10.1377/hlthaff.2014.0031</pub-id>
          <pub-id pub-id-type="medline">25006138</pub-id>
          <pub-id pub-id-type="pii">33/7/1132</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>A roadmap for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling</article-title>
          <source>Glob Transit</source>
          <year>2019</year>
          <volume>1</volume>
          <fpage>61</fpage>
          <lpage>82</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/31032483"/>
          </comment>
          <pub-id pub-id-type="doi">10.1016/j.glt.2018.11.001</pub-id>
          <pub-id pub-id-type="medline">31032483</pub-id>
          <pub-id pub-id-type="pmcid">PMC6482973</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Nau</surname>
              <given-names>CL</given-names>
            </name>
            <name name-style="western">
              <surname>Crawford</surname>
              <given-names>WW</given-names>
            </name>
            <name name-style="western">
              <surname>Schatz</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Zeiger</surname>
              <given-names>RS</given-names>
            </name>
            <name name-style="western">
              <surname>Rozema</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Koebnick</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>Developing a predictive model for asthma-related hospital encounters in patients with asthma in a large, integrated health care system: secondary analysis</article-title>
          <source>JMIR Med Inform</source>
          <year>2020</year>
          <month>11</month>
          <day>09</day>
          <volume>8</volume>
          <issue>11</issue>
          <fpage>e22689</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://medinform.jmir.org/2020/11/e22689/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/22689</pub-id>
          <pub-id pub-id-type="medline">33164906</pub-id>
          <pub-id pub-id-type="pii">v8i11e22689</pub-id>
          <pub-id pub-id-type="pmcid">PMC7683251</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Tong</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Messinger</surname>
              <given-names>AI</given-names>
            </name>
            <name name-style="western">
              <surname>Wilcox</surname>
              <given-names>AB</given-names>
            </name>
            <name name-style="western">
              <surname>Mooney</surname>
              <given-names>SD</given-names>
            </name>
            <name name-style="western">
              <surname>Davidson</surname>
              <given-names>GH</given-names>
            </name>
            <name name-style="western">
              <surname>Suri</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
          </person-group>
          <article-title>Forecasting future asthma hospital encounters of patients with asthma in an academic health care system: predictive model development and secondary analysis study</article-title>
          <source>J Med Internet Res</source>
          <year>2021</year>
          <month>04</month>
          <day>16</day>
          <volume>23</volume>
          <issue>4</issue>
          <fpage>e22796</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.jmir.org/2021/4/e22796/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/22796</pub-id>
          <pub-id pub-id-type="medline">33861206</pub-id>
          <pub-id pub-id-type="pii">v23i4e22796</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Luo</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>He</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Stone</surname>
              <given-names>BL</given-names>
            </name>
            <name name-style="western">
              <surname>Nkoy</surname>
              <given-names>FL</given-names>
            </name>
            <name name-style="western">
              <surname>Johnson</surname>
              <given-names>MD</given-names>
            </name>
          </person-group>
          <article-title>Developing a model to predict hospital encounters for asthma in asthmatic patients: secondary analysis</article-title>
          <source>JMIR Med Inform</source>
          <year>2020</year>
          <month>01</month>
          <day>21</day>
          <volume>8</volume>
          <issue>1</issue>
          <fpage>e16080</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://medinform.jmir.org/2020/1/e16080/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/16080</pub-id>
          <pub-id pub-id-type="medline">31961332</pub-id>
          <pub-id pub-id-type="pii">v8i1e16080</pub-id>
          <pub-id pub-id-type="pmcid">PMC7001050</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Garcia-Molina</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Ullman</surname>
              <given-names>JD</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <source>Database Systems: the Complete Book, 2nd ed</source>
          <year>2008</year>
          <publisher-loc>Upper Saddle River, NJ</publisher-loc>
          <publisher-name>Pearson</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cunningham</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Graefe</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Galindo-Legaria</surname>
              <given-names>CA</given-names>
            </name>
          </person-group>
          <article-title>PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS</article-title>
          <year>2004</year>
          <conf-name>Proceedings of the 30th International Conference on Very Large Data Bases</conf-name>
          <conf-date>August 31-September 3, 2004</conf-date>
          <conf-loc>Toronto, Canada</conf-loc>
          <fpage>998</fpage>
          <lpage>1009</lpage>
          <pub-id pub-id-type="doi">10.1016/b978-012088469-8.50087-5</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lyman</surname>
              <given-names>JA</given-names>
            </name>
            <name name-style="western">
              <surname>Scully</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Harrison</surname>
              <given-names>JH Jr</given-names>
            </name>
          </person-group>
          <article-title>The development of health care data warehouses to support data mining</article-title>
          <source>Clin Lab Med</source>
          <year>2008</year>
          <month>03</month>
          <volume>28</volume>
          <issue>1</issue>
          <fpage>55</fpage>
          <lpage>71</lpage>
          <pub-id pub-id-type="doi">10.1016/j.cll.2007.10.003</pub-id>
          <pub-id pub-id-type="medline">18194718</pub-id>
          <pub-id pub-id-type="pii">S0272-2712(07)00112-6</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cui</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Practical lineage tracing in data warehouses</article-title>
          <year>2000</year>
          <conf-name>Proceedings of the 16th International Conference on Data Engineering</conf-name>
          <conf-date>February 28-March 3, 2000</conf-date>
          <conf-loc>San Diego, CA</conf-loc>
          <fpage>367</fpage>
          <lpage>378</lpage>
          <pub-id pub-id-type="doi">10.1109/icde.2000.839437</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Hsu</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Ma</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>Integrating classification and association rule mining</article-title>
          <year>1998</year>
          <conf-name>Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining</conf-name>
          <conf-date>August 27-31, 1998</conf-date>
          <conf-loc>New York, NY</conf-loc>
          <fpage>80</fpage>
          <lpage>86</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Fayyad</surname>
              <given-names>UM</given-names>
            </name>
            <name name-style="western">
              <surname>Irani</surname>
              <given-names>KB</given-names>
            </name>
          </person-group>
          <article-title>Multi-interval discretization of continuous-valued attributes for classification learning</article-title>
          <year>1993</year>
          <conf-name>Proceedings of the 13th International Joint Conference on Artificial Intelligence</conf-name>
          <conf-date>August 28-September 3, 1993</conf-date>
          <conf-loc>Chambéry, France</conf-loc>
          <fpage>1022</fpage>
          <lpage>1029</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Thabtah</surname>
              <given-names>FA</given-names>
            </name>
          </person-group>
          <article-title>A review of associative classification mining</article-title>
          <source>The Knowledge Engineering Review</source>
          <year>2007</year>
          <month>03</month>
          <day>01</day>
          <volume>22</volume>
          <issue>1</issue>
          <fpage>37</fpage>
          <lpage>65</lpage>
          <pub-id pub-id-type="doi">10.1017/s0269888907001026</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Alaa</surname>
              <given-names>AM</given-names>
            </name>
            <name name-style="western">
              <surname>van der Schaar</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>Prognostication and risk factors for cystic fibrosis via automated machine learning</article-title>
          <source>Sci Rep</source>
          <year>2018</year>
          <month>07</month>
          <day>26</day>
          <volume>8</volume>
          <issue>1</issue>
          <fpage>11242</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://doi.org/10.1038/s41598-018-29523-2"/>
          </comment>
          <pub-id pub-id-type="doi">10.1038/s41598-018-29523-2</pub-id>
          <pub-id pub-id-type="medline">30050169</pub-id>
          <pub-id pub-id-type="pii">10.1038/s41598-018-29523-2</pub-id>
          <pub-id pub-id-type="pmcid">PMC6062529</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Alaa</surname>
              <given-names>AM</given-names>
            </name>
            <name name-style="western">
              <surname>van der Schaar</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>AutoPrognosis: automated clinical prognostic modeling via Bayesian optimization with structured kernel learning</article-title>
          <year>2018</year>
          <conf-name>Proceedings of the 35th International Conference on Machine Learning</conf-name>
          <conf-date>July 10-15, 2018</conf-date>
          <conf-loc>Stockholm, Sweden</conf-loc>
          <fpage>139</fpage>
          <lpage>148</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Molnar</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <source>Interpretable Machine Learning</source>
          <year>2020</year>
          <publisher-loc>Morrisville, NC</publisher-loc>
          <publisher-name>lulu.com</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Guidotti</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Monreale</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Ruggieri</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Turini</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Giannotti</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Pedreschi</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>A survey of methods for explaining black box models</article-title>
          <source>ACM Comput Surv</source>
          <year>2019</year>
          <month>01</month>
          <day>23</day>
          <volume>51</volume>
          <issue>5</issue>
          <fpage>93</fpage>
          <pub-id pub-id-type="doi">10.1145/3236009</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Rudin</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Shaposhnik</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>Globally-consistent rule-based summary-explanations for machine learning models: application to credit-risk evaluation</article-title>
          <year>2019</year>
          <conf-name>Proceedings of INFORMS 11th Conference on Information Systems and Technology</conf-name>
          <conf-date>October 19-20, 2019</conf-date>
          <conf-loc>Seattle, WA</conf-loc>
          <fpage>1</fpage>
          <lpage>19</lpage>
          <pub-id pub-id-type="doi">10.2139/ssrn.3395422</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ribeiro</surname>
              <given-names>MT</given-names>
            </name>
            <name name-style="western">
              <surname>Singh</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Guestrin</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>Anchors: high-precision model-agnostic explanations</article-title>
          <year>2018</year>
          <conf-name>Proceedings of the 32nd AAAI Conference on Artificial Intelligence</conf-name>
          <conf-date>February 2-7, 2018</conf-date>
          <conf-loc>New Orleans, LA</conf-loc>
          <fpage>1527</fpage>
          <lpage>1535</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <nlm-citation citation-type="web">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ikeda</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Data lineage: a survey</article-title>
          <source>Stanford University Technical Report</source>
          <access-date>2021-04-30</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf">http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cheney</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Chiticariu</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Tan</surname>
              <given-names>WC</given-names>
            </name>
          </person-group>
          <article-title>Provenance in databases: why, how, and where</article-title>
          <source>Found Trends Databases</source>
          <year>2009</year>
          <volume>1</volume>
          <issue>4</issue>
          <fpage>379</fpage>
          <lpage>474</lpage>
          <pub-id pub-id-type="doi">10.1561/1900000006</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Simmhan</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Plale</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Gannon</surname>
              <given-names>D</given-names>
            </name>
          </person-group>
          <article-title>A survey of data provenance in e-science</article-title>
          <source>SIGMOD Rec</source>
          <year>2005</year>
          <month>09</month>
          <volume>34</volume>
          <issue>3</issue>
          <fpage>31</fpage>
          <lpage>36</lpage>
          <pub-id pub-id-type="doi">10.1145/1084805.1084812</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Bose</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Frew</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Lineage retrieval for scientific data processing: a survey</article-title>
          <source>ACM Comput Surv</source>
          <year>2005</year>
          <month>03</month>
          <volume>37</volume>
          <issue>1</issue>
          <fpage>1</fpage>
          <lpage>28</lpage>
          <pub-id pub-id-type="doi">10.1145/1057977.1057978</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cui</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Wiener</surname>
              <given-names>JL</given-names>
            </name>
          </person-group>
          <article-title>Tracing the lineage of view data in a warehousing environment</article-title>
          <source>ACM Trans Database Syst</source>
          <year>2000</year>
          <month>06</month>
          <volume>25</volume>
          <issue>2</issue>
          <fpage>179</fpage>
          <lpage>227</lpage>
          <pub-id pub-id-type="doi">10.1145/357775.357777</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <nlm-citation citation-type="book">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Gupta</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Mumick</surname>
              <given-names>IS</given-names>
            </name>
          </person-group>
          <source>Materialized Views: Techniques, Implementations, and Applications</source>
          <year>1999</year>
          <publisher-loc>Cambridge, MA</publisher-loc>
          <publisher-name>The MIT Press</publisher-name>
        </nlm-citation>
      </ref>
      <ref id="ref39">
        <label>39</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cui</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Lineage tracing for general data warehouse transformations</article-title>
          <source>The VLDB Journal</source>
          <year>2003</year>
          <month>05</month>
          <day>01</day>
          <volume>12</volume>
          <issue>1</issue>
          <fpage>41</fpage>
          <lpage>58</lpage>
          <pub-id pub-id-type="doi">10.1007/s00778-002-0083-8</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref40">
        <label>40</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ikeda</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Sarma</surname>
              <given-names>AD</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Logical provenance in data-oriented workflows</article-title>
          <year>2013</year>
          <conf-name>Proceedings of the 29th IEEE International Conference on Data Engineering</conf-name>
          <conf-date>April 8-12, 2013</conf-date>
          <conf-loc>Brisbane, Australia</conf-loc>
          <fpage>877</fpage>
          <lpage>888</lpage>
          <pub-id pub-id-type="doi">10.1109/icde.2013.6544882</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref41">
        <label>41</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Prabhakar</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Tracing lineage beyond relational operators</article-title>
          <year>2007</year>
          <conf-name>Proceedings of the 33rd International Conference on Very Large Data Bases</conf-name>
          <conf-date>September 23-27, 2007</conf-date>
          <conf-loc>Vienna, Austria</conf-loc>
          <fpage>1116</fpage>
          <lpage>1127</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref42">
        <label>42</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ikeda</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Park</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Provenance for generalized map and reduce workflows</article-title>
          <year>2011</year>
          <conf-name>Proceedings of the 5th Biennial Conference on Innovative Data Systems Research</conf-name>
          <conf-date>January 9-12, 2011</conf-date>
          <conf-loc>Asilomar, CA</conf-loc>
          <fpage>273</fpage>
          <lpage>283</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref43">
        <label>43</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Park</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Ikeda</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>RAMP: a system for capturing and tracing provenance in MapReduce workflows</article-title>
          <source>Proc VLDB Endow</source>
          <year>2011</year>
          <month>08</month>
          <volume>4</volume>
          <issue>12</issue>
          <fpage>1351</fpage>
          <lpage>1354</lpage>
          <pub-id pub-id-type="doi">10.14778/3402755.3402768</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref44">
        <label>44</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Dean</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Ghemawat</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          <year>2004</year>
          <conf-name>Proceedings of the 6th Symposium on Operating System Design and Implementation</conf-name>
          <conf-date>December 6-8, 2004</conf-date>
          <conf-loc>San Francisco, CA</conf-loc>
          <fpage>137</fpage>
          <lpage>150</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref45">
        <label>45</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Amsterdamer</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Davidson</surname>
              <given-names>SB</given-names>
            </name>
            <name name-style="western">
              <surname>Deutch</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Milo</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Stoyanovich</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tannen</surname>
              <given-names>V</given-names>
            </name>
          </person-group>
          <article-title>Putting Lipstick on Pig: enabling database-style workflow provenance</article-title>
          <source>Proc VLDB Endow</source>
          <year>2011</year>
          <month>12</month>
          <volume>5</volume>
          <issue>4</issue>
          <fpage>346</fpage>
          <lpage>357</lpage>
          <pub-id pub-id-type="doi">10.14778/2095686.2095693</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref46">
        <label>46</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Olston</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Reed</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Srivastava</surname>
              <given-names>U</given-names>
            </name>
            <name name-style="western">
              <surname>Kumar</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Tomkins</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Pig Latin: a not-so-foreign language for data processing</article-title>
          <year>2008</year>
          <conf-name>Proceedings of the ACM SIGMOD International Conference on Management of Data</conf-name>
          <conf-date>June 10-12, 2008</conf-date>
          <conf-loc>Vancouver, BC, Canada</conf-loc>
          <fpage>1099</fpage>
          <lpage>1110</lpage>
          <pub-id pub-id-type="doi">10.1145/1376616.1376726</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref47">
        <label>47</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Buneman</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Chapman</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Cheney</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Provenance management in curated databases</article-title>
          <year>2006</year>
          <conf-name>Proceedings of the ACM SIGMOD International Conference on Management of Data</conf-name>
          <conf-date>June 27-29, 2006</conf-date>
          <conf-loc>Chicago, IL</conf-loc>
          <fpage>539</fpage>
          <lpage>550</lpage>
          <pub-id pub-id-type="doi">10.1145/1142473.1142534</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref48">
        <label>48</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Schelter</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Böse</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Kirschnick</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Klein</surname>
              <given-names>T</given-names>
            </name>
            <name name-style="western">
              <surname>Seufert</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Automatically tracking metadata and provenance of machine learning experiments</article-title>
          <year>2017</year>
          <conf-name>Proceedings of the ML Systems Workshop at NIPS 2017</conf-name>
          <conf-date>December 8, 2017</conf-date>
          <conf-loc>Long Beach, CA</conf-loc>
          <fpage>1</fpage>
          <lpage>8</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref49">
        <label>49</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cui</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Widom</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Storing auxiliary data for efficient maintenance and lineage tracing of complex views</article-title>
          <year>2000</year>
          <conf-name>Proceedings of the Second International Workshop on Design and Management of Data Warehouses</conf-name>
          <conf-date>June 5-6, 2000</conf-date>
          <conf-loc>Stockholm, Sweden</conf-loc>
          <fpage>1</fpage>
          <lpage>19</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref50">
        <label>50</label>
        <nlm-citation citation-type="web">
          <article-title>Data standardization</article-title>
          <source>Observational Health Data Sciences and Informatics</source>
          <access-date>2021-04-30</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.ohdsi.org/data-standardization">https://www.ohdsi.org/data-standardization</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref51">
        <label>51</label>
        <nlm-citation citation-type="web">
          <article-title>Standardized vocabularies</article-title>
          <source>Observational Health Data Sciences and Informatics</source>
          <access-date>2021-04-30</access-date>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.ohdsi.org/web/wiki/doku.php?id=documentation:vocabulary:sidebar">https://www.ohdsi.org/web/wiki/doku.php?id=documentation:vocabulary:sidebar</ext-link>
          </comment>
        </nlm-citation>
      </ref>
      <ref id="ref52">
        <label>52</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hripcsak</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Duke</surname>
              <given-names>JD</given-names>
            </name>
            <name name-style="western">
              <surname>Shah</surname>
              <given-names>NH</given-names>
            </name>
            <name name-style="western">
              <surname>Reich</surname>
              <given-names>CG</given-names>
            </name>
            <name name-style="western">
              <surname>Huser</surname>
              <given-names>V</given-names>
            </name>
            <name name-style="western">
              <surname>Schuemie</surname>
              <given-names>MJ</given-names>
            </name>
            <name name-style="western">
              <surname>Suchard</surname>
              <given-names>MA</given-names>
            </name>
            <name name-style="western">
              <surname>Park</surname>
              <given-names>RW</given-names>
            </name>
            <name name-style="western">
              <surname>Wong</surname>
              <given-names>ICK</given-names>
            </name>
            <name name-style="western">
              <surname>Rijnbeek</surname>
              <given-names>PR</given-names>
            </name>
            <name name-style="western">
              <surname>van der Lei</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Pratt</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Norén</surname>
              <given-names>GN</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Stang</surname>
              <given-names>PE</given-names>
            </name>
            <name name-style="western">
              <surname>Madigan</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Ryan</surname>
              <given-names>PB</given-names>
            </name>
          </person-group>
          <article-title>Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers</article-title>
          <source>Stud Health Technol Inform</source>
          <year>2015</year>
          <volume>216</volume>
          <fpage>574</fpage>
          <lpage>578</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/26262116"/>
          </comment>
          <pub-id pub-id-type="medline">26262116</pub-id>
          <pub-id pub-id-type="pmcid">PMC4815923</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref53">
        <label>53</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Overhage</surname>
              <given-names>JM</given-names>
            </name>
            <name name-style="western">
              <surname>Ryan</surname>
              <given-names>PB</given-names>
            </name>
            <name name-style="western">
              <surname>Reich</surname>
              <given-names>CG</given-names>
            </name>
            <name name-style="western">
              <surname>Hartzema</surname>
              <given-names>AG</given-names>
            </name>
            <name name-style="western">
              <surname>Stang</surname>
              <given-names>PE</given-names>
            </name>
          </person-group>
          <article-title>Validation of a common data model for active safety surveillance research</article-title>
          <source>J Am Med Inform Assoc</source>
          <year>2012</year>
          <volume>19</volume>
          <issue>1</issue>
          <fpage>54</fpage>
          <lpage>60</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/22037893"/>
          </comment>
          <pub-id pub-id-type="doi">10.1136/amiajnl-2011-000376</pub-id>
          <pub-id pub-id-type="medline">22037893</pub-id>
          <pub-id pub-id-type="pii">amiajnl-2011-000376</pub-id>
          <pub-id pub-id-type="pmcid">PMC3240764</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref54">
        <label>54</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Das</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Grbic</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ilic</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Jovandic</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Jovanovic</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Narasayya</surname>
              <given-names>VR</given-names>
            </name>
            <name name-style="western">
              <surname>Radulovic</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Stikic</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Chaudhuri</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>Automatically indexing millions of databases in Microsoft Azure SQL database</article-title>
          <year>2019</year>
          <conf-name>Proceedings of the ACM SIGMOD International Conference on Management of Data</conf-name>
          <conf-date>June 30-July 5, 2019</conf-date>
          <conf-loc>Amsterdam, Netherlands</conf-loc>
          <fpage>666</fpage>
          <lpage>679</lpage>
          <pub-id pub-id-type="doi">10.1145/3299869.3314035</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref55">
        <label>55</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Dageville</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Das</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Dias</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Yagoub</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Zaït</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Ziauddin</surname>
              <given-names>M</given-names>
            </name>
          </person-group>
          <article-title>Automatic SQL tuning in Oracle 10g</article-title>
          <year>2004</year>
          <conf-name>Proceedings of the 30th International Conference on Very Large Data Bases</conf-name>
          <conf-date>August 31-September 3, 2004</conf-date>
          <conf-loc>Toronto, Canada</conf-loc>
          <fpage>1098</fpage>
          <lpage>1109</lpage>
          <pub-id pub-id-type="doi">10.1016/b978-012088469-8.50096-6</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref56">
        <label>56</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zilio</surname>
              <given-names>DC</given-names>
            </name>
            <name name-style="western">
              <surname>Rao</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Lightstone</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Lohman</surname>
              <given-names>GM</given-names>
            </name>
            <name name-style="western">
              <surname>Storm</surname>
              <given-names>AJ</given-names>
            </name>
            <name name-style="western">
              <surname>Garcia-Arellano</surname>
              <given-names>C</given-names>
            </name>
            <name name-style="western">
              <surname>Fadden</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>DB2 Design Advisor: integrated automatic physical database design</article-title>
          <year>2004</year>
          <conf-name>Proceedings of the 30th International Conference on Very Large Data Bases</conf-name>
          <conf-date>August 31-September 3, 2004</conf-date>
          <conf-loc>Toronto, Canada</conf-loc>
          <fpage>1087</fpage>
          <lpage>1097</lpage>
          <pub-id pub-id-type="doi">10.1016/b978-012088469-8.50095-4</pub-id>
        </nlm-citation>
      </ref>
    </ref-list>
  </back>
</article>
