
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.

There is an urgent need for global analytic frameworks that can perform analyses across multiple institutions in a privacy-preserving federated environment without privacy leakage. A few studies on federated medical analysis have been conducted recently, each focusing on a particular algorithm. However, none of them has addressed similar patient matching, which is useful for applications such as cohort construction for cross-institution observational studies, disease surveillance, and clinical trial recruitment.

The aim of this study was to present a privacy-preserving platform in a federated setting for patient similarity learning across institutions. Without sharing patient-level information, our model can find similar patients from one hospital to another.

We proposed a federated patient hashing framework and developed a novel algorithm to learn context-specific hash codes that represent patients across institutions. The similarities between patients can be efficiently computed using the resulting hash codes of the corresponding patients. To prevent security attacks based on reverse engineering of the model, we applied homomorphic encryption to patient similarity search in the federated setting.

We used sequential medical events extracted from the Multiparameter Intelligent Monitoring in Intensive Care-III database to evaluate the proposed algorithm in predicting the incidence of five diseases independently. Our algorithm achieved averaged areas under the curve of 0.9154 and 0.8012 with balanced and imbalanced data, respectively.

The proposed algorithm can help search similar patients across institutions effectively to support federated data analysis in a privacy-preserving manner.

Electronic health records (EHRs) are becoming ubiquitous across almost all medical institutions. They provide insight into diagnoses [

Previously, many biomedical studies were conducted within a single institution with limited EHR data because of the lack of a federated data analysis framework and institutional privacy concerns about data sharing. However, such an approach has many limitations. For example, it has been demonstrated that genome-wide association studies on EHR data from a single institution often failed to discover known biomarkers because of the limited sample size [

Patient similarity learning aims to develop computational algorithms for defining and locating clinically similar patients to a query patient under a specific clinical context [

Cohort construction: Cross-institution observational studies are challenging but necessary, as many studies require a large and specific patient cohort that does not exist within a single institution. To conduct such a study, an efficient similarity search needs to be performed across institutions to identify the patient cohort of interest.

Disease surveillance: The Centers for Disease Control and Prevention monitors thousands of hospitals for potential epidemics. When a suspicious case is reported, there is a need to find similar cases across geographies.

Clinical trial recruitment: Pharmaceutical companies often need to spend a significant amount of time and resources to identify targeted patients across many different clinical institutions. Ideally, they would like to be able to perform patient similarity search across all clinical institutions to identify where the relevant patients are. Then, they can quickly focus on recruiting patients from the right clinical institutions.

Patient similarity learning involves two computational phases: (1) patient representation learning, which learns a context-specific representation of patients based on their EHR data (for example, patients may be given different representations in heart disease management versus cancer management), and (2) patient similarity search, which finds similar patients based on their corresponding representations. In a federated environment where multiple institutions exist, patient similarity learning has many unique challenges: (1) how to design an efficient but flexible patient representation that enables fast similarity search, (2) how to learn patient representations from heterogeneous data sources, and (3) how to preserve privacy while still allowing the computation of patient representations and the search for similar patients across institutions.

The main objective of this paper was to develop a privacy-preserving analytic platform for patient similarity learning in a distributed manner. We propose to learn context-specific binary hash codes to represent patients across institutions. The similarities between patients can then be efficiently computed as the Hamming distance between the hash codes of the corresponding patients; the Hamming distance is defined as the number of positions at which two binary codes differ. As patient data are heterogeneous, coming from multiple sources such as diagnoses, medications, and lab results, we propose a multi-hash approach that learns a separate hash function for each data source. The patient similarity is then aggregated over the hash codes from all data sources. To avoid the potential security risk posed by malicious users, we also adopt homomorphic encryption [


In general, hashing is an approach to transforming a data item into a low-dimensional representation, or equivalently a short code consisting of a sequence of bits (

Hashing technologies can be applied in many applications such as Bloom filter [

However, one challenge in our scenario is that the patient features are highly heterogeneous, that is, the features characterizing the patients are of different types. In this case, it may not be effective to represent each patient as a single vector (simple concatenation will not work because different features are of different types and have different value ranges). There are some existing multi-modal hashing methods [

Example of feature construction. Prescription, lab test, and diagnosis are denoted by p, l, and d, respectively.

Example of hashing.

Symbols used in this paper are listed in

Suppose there are M sites, where each site H^{i} owns a patient population P^{i}. We use p^{i}_{j} to represent the j-th patient in H^{i}. Then, our problem is, given a query patient, how to retrieve similar patients from those M sites.

Without loss of generality, we assume there are K feature types for each patient p^{i}_{j}, and we use x^{i}_{jk} (1≤k≤K) to denote the k-th feature type of p^{i}_{j}. The goal is to derive an effective computational framework for patient matching in a federated environment, and the key idea is to learn a good hash function that can transform the patient features into binary hash codes. A uni-hash table approach learns a single hash function f: R^{d}→{-1,+1}^{b}, where d is the total feature dimensionality and b is the code length. Because the feature types are heterogeneous, we instead learn a hash function f_{k}: R^{d_{k}}→{-1,+1}^{b_{k}} for every patient feature type k, where d_{k} is the dimensionality of the k-th feature type and b_{k} is the number of bits of the learned hash codes for the k-th feature type. Here, f_{k}(X^{i}_{k})∈{-1,+1}^{b_{k}ⅹN_{i}}, where X^{i}_{k}∈R^{d_{k}ⅹN_{i}} is the numerical data transformed from the original data of the k-th feature type in H^{i}, and the hash function f_{k} incorporates function coefficients for the k-th feature type, W_{k}∈R^{d_{k}ⅹb_{k}}; N_{i} is the population size of H^{i}. We use Y_{k}^{i}=sign(W_{k}^{T}X^{i}_{k}) to denote the hash codes of H^{i} for the k-th feature type.

Each column of Y_{k}^{i}, y^{i}_{uk}∈{-1,+1}^{b_{k}}, is the hash code of the corresponding patient feature x^{i}_{uk}. Then, the similarity between x^{i}_{uk} and x^{i}_{vk} can be evaluated as the inner product of y^{i}_{uk} and y^{i}_{vk}, as shown in equation 1:

s^{i}_{kuv} = (1/b_{k}) (y^{i}_{uk})^{T} (y^{i}_{vk}) (1)

Thus, the overall similarity can be computed as the average of the per-feature-type similarities, as shown in equation 2:

s^{i}_{uv} = (1/K) ∑_{k} s^{i}_{kuv} = (1/K) ∑_{k} (1/b_{k}) (y^{i}_{uk})^{T} (y^{i}_{vk}) (2)
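As a concrete illustration of equations 1 and 2, the following NumPy sketch hashes toy patient data with per-feature-type coefficient matrices and averages the per-type code similarities. All sizes (K, d_k, b_k, N) and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K=3 feature types for one site, N=5 patients.
d = [8, 6, 10]   # d_k: dimensionality of feature type k
b = [4, 4, 6]    # b_k: number of hash bits for feature type k
N = 5

X = [rng.normal(size=(dk, N)) for dk in d]               # X^i_k: transformed data
W = [rng.normal(size=(dk, bk)) for dk, bk in zip(d, b)]  # W_k: hash coefficients

# Y^i_k = sign(W_k^T X^i_k): per-feature-type +/-1 hash codes
Y = [np.sign(Wk.T @ Xk) for Wk, Xk in zip(W, X)]

def similarity(u, v):
    """Equations 1-2: average over feature types of (1/b_k) (y_uk)^T (y_vk)."""
    per_type = [(Yk[:, u] @ Yk[:, v]) / bk for Yk, bk in zip(Y, b)]
    return sum(per_type) / len(per_type)
```

Each per-type inner product of ±1 codes lies in [-1, 1] after dividing by b_k, so the averaged similarity does too, and a patient's similarity to itself is exactly 1.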

Here, we suggest a general framework for learning {W_{k}}^{K}_{k=1}, which is the most important component. The framework basically constructs an objective function in terms of {W_{k}}^{K}_{k=1}, as shown in equation 3, where λ_{S}, λ_{U}, and λ_{W} are regularizers of L_{S}({W_{k}}^{K}_{k=1}), L_{U}({W_{k}}^{K}_{k=1}), and Ω({W_{k}}^{K}_{k=1}), respectively, and then minimizes (or maximizes) it:

O({W_{k}}^{K}_{k=1}) = Ψ({W_{k}}^{K}_{k=1}) + λ_{S}L_{S}({W_{k}}^{K}_{k=1}) + λ_{U}L_{U}({W_{k}}^{K}_{k=1}) + λ_{W}Ω({W_{k}}^{K}_{k=1}) (3)

Ψ({W_{k}}^{K}_{k=1}) is a reconfiguration error term between the low-dimensional representation of the original data and the hash codes; it is the main term of the objective function and generates the hash codes from the original data, as shown in equation 4, where ||·||_{F} is the Frobenius norm, W_{k}^{T}X_{k}^{i} is the low-dimensional representation of the original data, and Y_{k}^{i}=sign(W^{T}_{k}X^{i}_{k}):

Ψ({W_{k}}^{K}_{k=1}) = ∑_{i}∑_{k}||W_{k}^{T}X_{k}^{i} - Y_{k}^{i}||^{2}_{F} (4)

The objective function can incorporate regularizers in addition to the main term to obtain better solutions of {W_{k}}^{K}_{k=1} by (1) introducing additional information to improve either unsupervised or supervised learning if desired, (2) solving an ill-posed problem, and (3) preventing overfitting. Possible regularizers are listed as follows:

L_{S}({W_{k}}^{K}_{k=1}) is a supervised loss term that measures the quantization loss during the hashing process when supervision information is available for the patients. Here, the supervision information could be the labels of the patients, such as the diseases the patients have. For example, if both p^{i}_{u} and p^{i}_{v} have the same disease, then their relationship r^{i}_{uv}=1; otherwise, r^{i}_{uv}=-1. Then, we can set L_{S}({W_{k}}^{K}_{k=1}) as shown in equation 5:

L_{S}({W_{k}}^{K}_{k=1}) = ∑_{i}∑_{k}∑_{u,v} -s^{i}_{kuv}r^{i}_{uv} (5)

H^{i}: the i-th site (hospital)

P^{i}: patient population in H^{i}

N_{i}: patient population size of H^{i}

P^{i}_{k}: patient population of the k-th feature type in H^{i}

p^{i}_{j}: the j-th patient in H^{i}, p^{i}_{j}∈P^{i}

x^{i}_{jk}: the j-th patient in P^{i}_{k}, the k-th feature type of p^{i}_{j}

f_{k}: hash function for the k-th feature type

d_{k}: dimensionality of the k-th feature type

b_{k}: the number of bits of the learned hash codes for the k-th feature type

W_{k}: function coefficients of the hash function for the k-th feature type

w_{ik}: the i-th column of W_{k}

X^{i}_{k}: numerical data transformed from P^{i}_{k}

sign(Z^{i}_{k}): signed Z^{i}_{k}, where Z^{i}_{k}=W^{T}_{k}X^{i}_{k}

Y^{i}_{k}: hash codes for P^{i}_{k} (=sign(Z^{i}_{k}))

y^{i}_{jk}: the j-th column of Y^{i}_{k}, the hash codes of x^{i}_{jk}

Ψ({W_{k}}^{K}_{k=1}): reconfiguration error term for {W_{k}}^{K}_{k=1}

L_{S}({W_{k}}^{K}_{k=1}): supervised loss term for {W_{k}}^{K}_{k=1}

L_{U}({W_{k}}^{K}_{k=1}): unsupervised loss term for {W_{k}}^{K}_{k=1}

Ω({W_{k}}^{K}_{k=1}): term related to {W_{k}}^{K}_{k=1} itself

λ_{S}: regularizer of L_{S}({W_{k}}^{K}_{k=1})

λ_{U}: regularizer of L_{U}({W_{k}}^{K}_{k=1})

λ_{W}: regularizer of Ω({W_{k}}^{K}_{k=1})

s^{i}_{kuv}: similarity between y^{i}_{uk} and y^{i}_{vk}

R^{i}: pairwise relationship of P^{i} for labeled information

S^{i}_{k}: pairwise similarity of P^{i}_{k}

r^{i}_{uv}: relationship between x^{i}_{uk} and x^{i}_{vk} for labeled information

S^{i}_{kuv}: similarity between x^{i}_{uk} and x^{i}_{vk}

g_{L}(Z^{i}_{k}): surrogate of sign(Z^{i}_{k})

The whole process of patient matching in a federated environment. The user sends a patient matching request to the service center, which delegates it to patient data resources at several clinical sites. Because of privacy concerns, the center does not have access to the raw patient data. All patients within different sites first need to be hashed, and the center only has the patients' signatures after hashing. The hash functions are shared across different sites.

The process of calculating patient similarity with a multi-hash approach.

The possible choices of supervised loss term could be any loss function

Note that L_{U}({W_{k}}^{K}_{k=1}) is an unsupervised term that exploits the intrinsic data distribution and enforces the resultant hash codes to comply with that distribution. For example, we can request similar patients to have similar hash codes on each feature type. This can be achieved by minimizing the regularizer below, as shown in equation 6, where S^{i}_{kuv} is a similarity between x^{i}_{uk} and x^{i}_{vk} based on, for example, a Gaussian function for continuous-valued features or a cosine function after Term Frequency-Inverse Document Frequency normalization on bag-of-code features (eg, diagnosis codes or procedure codes):

L_{U}({W_{k}}^{K}_{k=1}) = ∑_{i}∑_{k}∑_{u,v} S^{i}_{kuv}||y^{i}_{uk} - y^{i}_{vk}||^{2}_{F} (6)

Ω({W_{k}}^{K}_{k=1}) is a term related to {W_{k}}^{K}_{k=1} themselves, which is independent of the patient features. Examples of Ω({W_{k}}^{K}_{k=1}) include (1) the Frobenius norm regularizer ∑^{K}_{k=1}||W_{k}||^{2}_{F}, which can be used for improving the numerical stability of the solution process, and (2) the orthogonality regularizer ∑^{K}_{k=1}∑_{i≠j}||w^{T}_{ik}w_{jk}||^{2}, where w_{ik} is the i-th column of W_{k}, which can encourage the diversity of the learned hash codes and thus improve their representation effectiveness.
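Both example regularizers can be evaluated directly. The sketch below, with hypothetical W_k sizes, computes the Frobenius norm term and the orthogonality term over the columns w_ik.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical coefficient matrices W_k (d_k x b_k) for K=2 feature types.
W = [rng.normal(size=(6, 3)), rng.normal(size=(5, 4))]

# (1) Frobenius norm regularizer: sum_k ||W_k||_F^2
frob = sum(np.linalg.norm(Wk, "fro") ** 2 for Wk in W)

def ortho(Wk):
    """(2) Orthogonality regularizer sum_{i != j} (w_ik^T w_jk)^2 for one W_k:
    penalizes correlated columns, ie, redundant hash bits."""
    G = Wk.T @ Wk                   # Gram matrix of the columns w_ik
    off = G - np.diag(np.diag(G))   # zero out the i == j terms
    return float(np.sum(off ** 2))

ortho_total = sum(ortho(Wk) for Wk in W)
```

A W_k with exactly orthogonal columns (eg, the identity) makes the second regularizer vanish, which is why minimizing it pushes hash bits toward being uncorrelated.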

We treat {W_{k}}^{K}_{k=1} as variable blocks that are alternately updated, one W_{k} (1≤k≤K) at a time. In addition, the model can update {W_{k}}^{K}_{k=1} as soon as new patient data are received at a site.

Without loss of generality, let us instantiate the objective function with the regularizer

Example of transformation of patient vectors into hash codes and computation of similarity between hash codes.

When solving the instantiated objective function, two possible problems arise because of the sign function: the objective is nonsmooth, and optimizing over binary codes is discrete. We therefore replace sign(·) with a smooth surrogate g_{L}(·) applied to relaxed variables Z^{i}_{k}, which yields equation 7, where R^{i}∈R^{N_{i}ⅹN_{i}} is the pairwise relationship of P^{i} for labeled information:

∑_{i}∑_{k}||W_{k}^{T}X^{i}_{k} - g_{L}(Z^{i}_{k})||^{2}_{F} + λ_{S}∑_{i}∑_{k}tr(g_{L}(Z^{i}_{k})R^{i}g_{L}(Z^{i}_{k})^{T}) + λ_{W}∑_{i}∑_{k}||Z^{i}_{k}||^{2}_{F} (7)

If both p^{i}_{u} and p^{i}_{v} have the same disease, then their relationship r^{i}_{uv}=1; otherwise, r^{i}_{uv}=-1. g_{L}(·) is the surrogate function, as shown in equation 8, where ∘ is the Hadamard (elementwise) product:

g_{L}(Z^{i}_{k}) = (Z^{i}_{k}∘Z^{i}_{k})^{-1/2}∘Z^{i}_{k} (8)
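Equation 8 transcribes directly into NumPy. The eps smoothing term below is our addition to avoid division by zero at Z=0; it is not part of the equation itself.

```python
import numpy as np

def g_L(Z, eps=0.0):
    """Surrogate of sign(Z): (Z o Z)^(-1/2) o Z, with 'o' the Hadamard
    (elementwise) product. For nonzero entries this equals Z/|Z|, ie, the
    elementwise sign, but it is differentiable away from zero. The eps
    term is a numerical guard we added; equation 8 itself has eps = 0."""
    return Z * (Z * Z + eps) ** -0.5

Z = np.array([[0.3, -1.7], [2.0, -0.1]])
# For these nonzero entries, g_L reproduces the sign function exactly.
codes = g_L(Z)
```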

The detailed process to derive the final objective function is given in the multimedia appendix. {W_{k}}^{K}_{k=1} and {Z^{i}_{k}}^{K,M}_{k,i=1} can be solved one by one iteratively as variable blocks. We first update W_{k} for each of the K feature types, fixing Z^{i}_{l} for all i and l, as shown in equation 9:

W^{new}_{k} = W_{k} - (∂^{2}O/∂W_{k}^{2})^{-1}∂O/∂W_{k} (9)

Then, similarly, we update Z^{i}_{k} for each combination of (i, k), as shown in equation 10:

Z^{i,new}_{k} = Z^{i}_{k} - (∂^{2}O/∂(Z^{i}_{k})^{2})^{-1}∂O/∂Z^{i}_{k} (10)

The derivation process for the first and second derivatives of the objective function is given in the multimedia appendix. We repeat updating {W_{k}}^{K}_{k=1} and {Z^{i}_{k}}^{K,M}_{k,i=1} iteratively until convergence.
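The Newton-style updates in equations 9 and 10 require inverting Hessians. As a lightweight stand-in, the sketch below runs the same alternating block scheme with plain gradient steps (the variant suggested in the Discussion for scalability) on a single site and a single feature type, minimizing the unsupervised part ||W^T X - g_L(Z)||^2_F + λ_W||Z||^2_F. The sizes, step size, and smoothing constant are all hypothetical; this is not the paper's exact solver.

```python
import numpy as np

rng = np.random.default_rng(2)
d, b_k, N = 8, 4, 30               # hypothetical sizes for one feature type
eps, lam, lr = 1e-2, 1e-3, 1e-3    # smoothing, lambda_W, step size (assumed)

X = rng.normal(size=(d, N))        # one site's data for one feature type
W = 0.1 * rng.normal(size=(d, b_k))
Z = W.T @ X                        # relaxed (continuous) codes

def g(Z):                          # smoothed surrogate of sign (cf. equation 8)
    return Z * (Z * Z + eps) ** -0.5

def objective(W, Z):
    R = W.T @ X - g(Z)
    return float(np.sum(R * R) + lam * np.sum(Z * Z))

start = objective(W, Z)
for _ in range(200):
    R = W.T @ X - g(Z)
    W = W - lr * (2 * X @ R.T)                 # block update of W (cf. equation 9)
    R = W.T @ X - g(Z)
    dg = eps * (Z * Z + eps) ** -1.5           # elementwise derivative of g
    Z = Z - lr * (-2 * dg * R + 2 * lam * Z)   # block update of Z (cf. equation 10)

Y = np.sign(Z)   # final binary hash codes after the relaxation
```

In the full algorithm, the same block structure runs across all K feature types and M sites, with each site contributing its local derivatives.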

The time complexity at each iteration depends on the feature type and the site. When updating W_{k} for each of the K feature types, fixing Z^{i}_{l} for all i and l, the complexity is O(d_{k}^{3}) because each site has to invert the d_{k}ⅹd_{k} Hessian matrix. When updating Z^{i}_{k} for each combination of (i, k), the complexity is O(b_{k}^{3}N_{i}^{3}) because H^{i} has to invert the b_{k}N_{i}ⅹb_{k}N_{i} Hessian matrix. Therefore, the parameters that have a significant effect on time complexity are the original and projection dimensions for each feature type and the population size of each site. Other parameters, such as the number of sites M, affect the overall complexity only linearly.

To find similar patients across sites, the hash codes of each site H^{i} (ie, {Y^{i}_{k}}^{K}_{k=1}) would originally have to be exchanged across institutions. However, when all other sites receive {Y^{i}_{k}}^{K}_{k=1} from H^{i} for similarity search, the patient-level information of H^{i} may be exposed because those sites have both {W_{k}}^{K}_{k=1} and {Y^{i}_{k}}^{K}_{k=1}, as well as their relationship in equation 4.

Therefore, we propose a way to search for similar patients among different sites that avoids revealing Y^{i}_{k} while still being able to compute similarities based on Y^{i}_{k}. Specifically, we introduce homomorphic encryption, a form of encryption in which a specific algebraic operation performed on the plaintext is equivalent to another algebraic operation performed on the ciphertext, so that an encrypted result, when decrypted, matches the result of the same operation performed on the plaintext. Unlike traditional encryption schemes, which do not allow any computations to be performed on the ciphertext without first decrypting it, homomorphic encryption allows computations to be performed without decrypting the data. The results of the computations remain encrypted and can only be read and interpreted by someone with access to the decryption key. Therefore, homomorphic encryption is appropriate in our case, in which other sites and the server may behave maliciously; it enables cross-site comparison of health care statistics while protecting the privacy of each site. The procedure of homomorphic encryption in this paper is summarized as follows: the crypto service provider generates and distributes the encryption keys, each site encrypts its hash codes with the public key, and the similarities (Hamming distances) are then computed over the encrypted hash codes.

We note that homomorphic encryption provides an extra layer of privacy protection especially during patient similarity search.
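To make the mechanism concrete, here is a self-contained toy of additively homomorphic (Paillier-style) encryption computing the inner product of two ±1 hash codes, and hence their Hamming distance, without the querying site ever seeing the other site's code. The primes, code length, and bit values are hypothetical demo parameters and are far too small to be secure; the paper's experiments use a 128-bit key.

```python
import math
import random

# -- Toy Paillier keypair (demo-sized primes; NOT secure) --
P, Q = 104729, 1299709
N_KEY, N_SQ = P * Q, (P * Q) ** 2
LAM = math.lcm(P - 1, Q - 1)        # private key
MU = pow(LAM, -1, N_KEY)
G = N_KEY + 1                       # standard generator choice g = n + 1

def encrypt(m):
    """E(m) = g^m * r^n mod n^2; a negative m is encoded modulo n."""
    r = random.randrange(2, N_KEY)
    while math.gcd(r, N_KEY) != 1:
        r = random.randrange(2, N_KEY)
    return (pow(G, m % N_KEY, N_SQ) * pow(r, N_KEY, N_SQ)) % N_SQ

def decrypt(c):
    m = ((pow(c, LAM, N_SQ) - 1) // N_KEY) * MU % N_KEY
    return m - N_KEY if m > N_KEY // 2 else m   # map back to a signed value

# Hash codes (+/-1) of one patient at site A and one at site B (b = 8 bits).
a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, -1, 1, 1, -1, -1, 1, -1]

# Site A shares only encrypted bits. Site B computes
# prod_i E(a_i)^(b_i) = E(sum_i a_i*b_i) without decrypting anything.
enc_a = [encrypt(ai) for ai in a]
enc_inner = 1
for c, bi in zip(enc_a, b):
    factor = c if bi == 1 else pow(c, -1, N_SQ)  # exponent -1 = modular inverse
    enc_inner = (enc_inner * factor) % N_SQ

inner = decrypt(enc_inner)           # the key holder recovers a . b
hamming = (len(a) - inner) // 2      # Hamming distance from a +/-1 inner product
```

For ±1 codes of length b, the identity distance = (b - a·b)/2 links the encrypted inner product to the Hamming distance used for ranking similar patients.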

There are several participants in our framework.

Data custodians (DCs) represent institutions or hospitals who have access to patient data and would like to collaborate in learning about similar patients.

Crypto service provider (CSP) generates public and private keys. The public key is provided to the data custodians to safeguard the intermediary statistics.

Cloud server (CS) computes over summary statistics from individual data custodians to obtain a global patient similarity model.

Our goal is that a DC does not learn patient-level information from other DCs during the process. We also want to ensure that the CS cannot infer patient-level information from the data. We assume the CSP is trustworthy and provides the encryption keys (public and private). In the threat model, we assume the CS to be semi-honest, that is, it honestly follows the protocol but is curious about patients' private information while executing it. We make the following basic assumptions: (1) DCs and the CS do not collude, (2) the CS and the CSP do not collude, and (3) DCs always receive correct keys from the CSP. To evaluate the security of our system, it is assumed that the security of the system is compromised if patient-level data, or intermediary statistics from which patient-level data can be inferred, are leaked. The CSP is only involved in generating public and private keys and transferring those keys to the DCs, and no access to unintended fine-grained local information is involved in this process.

The leakage is related to the computation of {W_{k}}^{K}_{k=1}, and the possible scenarios according to the participants are as follows:

Leakage to CSP in each computation: CSP does not participate in computation at all. Therefore, there is no leakage.

Leakage to a DC in each computation: each DC cannot indirectly learn patient data from other DCs only with {W_{k}}^{K}_{k=1} and its local information {X^{i}_{k}}^{K}_{k=1} and {Z^{i}_{k}}^{K}_{k=1}. If all DCs except one collude, it is still infeasible for the colluding DCs to reconstruct X^{i}_{k} of that one DC because the first and second derivatives of the objective with respect to W_{k} have a nonlinear relationship with X^{i}_{k}. Specifically, it is not possible to recover a certain matrix given only the information of its covariance matrix because there are insufficient equations. The colluding DCs also do not have information (first and second derivatives) about Z^{i}_{k}.

Leakage to the CS in each computation: the CS cannot infer patient data from {W_{k}}^{K}_{k=1}. Even though the CS receives local information for the first and second derivatives of {W_{k}}^{K}_{k=1}, it is infeasible for the CS to recover {X^{i}_{k}}^{K}_{k=1} for the same reason as in the collusion among DCs. In finding similar patients, the hash codes of each site {Y^{i}_{k}}^{K}_{k=1} would originally have to be exchanged across institutions, but the use of homomorphic encryption prevents the direct exchange of hash codes {Y^{i}_{k}}^{K}_{k=1} between DCs, and thus, there is no leakage.

Example of potential privacy leakage in patient similarity search across sites.

Privacy-preserving patient similarity search by homomorphic encryption; green key: encryption (public) key, blue key: decryption (private) key.

We conducted experiments to validate our proposed method on real data. For comparison, we considered two reference systems in addition to ours, an open system and a closed system, which differ in the connections among sites used when learning {W_{k}}^{K}_{k=1}. Then, we predicted the incidence of a certain disease and compared the standard performance metrics.

A sequence is composed of the lab tests, prescriptions, diagnoses, conditions, and symptoms that were recorded for a patient over multiple hospital admissions. We only extracted common lab tests, prescriptions, diagnoses, conditions, and symptoms (prefixed with “

We used the Multiparameter Intelligent Monitoring in Intensive Care-III (MIMIC-III) database, which contains health-related data associated with 46,520 patients and 58,976 admissions to the intensive care unit of Beth Israel Deaconess Medical Center from 2001 to 2012. The database consists of detailed information about patients, including demographics such as gender, age, and race; admissions; lab test results; prescription records; procedures; and discharge ICD diagnoses. On the basis of this database, we randomly selected several common diseases (ie, diseases with a relatively large number of positives) as target diseases to verify that our method performs well in general, not only for a specific disease. Then, we extracted temporal sequences and constructed the following six feature vectors: demographics X^{i}_{1}∈R^{d_{1}ⅹN_{i}}, lab results X^{i}_{2}∈R^{d_{2}ⅹN_{i}}, diagnoses X^{i}_{3}∈R^{d_{3}ⅹN_{i}}, prescriptions X^{i}_{4}∈R^{d_{4}ⅹN_{i}}, conditions X^{i}_{5}∈R^{d_{5}ⅹN_{i}}, and symptoms X^{i}_{6}∈R^{d_{6}ⅹN_{i}}. A time decay constant was used when aggregating the sequential events into these feature vectors.
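The exact decay form is not spelled out here, so the following is a hypothetical sketch of how sequential events could be aggregated into one feature-vector column with exponential time decay; the decay constant gamma, the event codes, and the vocabulary are all illustrative, not taken from the paper.

```python
import math
from collections import Counter

gamma = 0.1   # assumed time decay constant (illustrative value)

# (event code, days before the prediction point); codes are made up,
# with prefixes d/p/l for diagnosis, prescription, and lab test.
events = [("d_428", 120), ("p_insulin", 90), ("l_glucose", 90),
          ("d_428", 10), ("l_glucose", 5)]

vocab = ["d_428", "p_insulin", "l_glucose"]   # common events across sites

def featurize(events):
    """Sum exp(-gamma * age) per event code: recent events weigh more."""
    weights = Counter()
    for code, days_ago in events:
        weights[code] += math.exp(-gamma * days_ago)
    return [weights[c] for c in vocab]

x = featurize(events)   # one column of an X^i_k matrix
```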

To test the three-site scenario, we made the datasets balanced and horizontally partitioned each dataset into three, assuming data are evenly partitioned among sites (ie, X^{1}_{k}∈R^{d_{k}ⅹ125}, X^{2}_{k}∈R^{d_{k}ⅹ125}, and X^{3}_{k}∈R^{d_{k}ⅹ125} for every k), with the regularization parameters set to 10^{-3} in common. In addition, for the multi-hash approach, we reduced the original dimensions for each feature type to ten (ie, b_{k}=10), except for the first feature type (b_{1}=2), and for the uni-hash approach we reduced the total dimensionality to the sum of the projection dimensions in the multi-hash approach (ie, b=52).

Example of constructing temporal sequence with target disease in red and its vector representation.

Description of five datasets from Multiparameter Intelligent Monitoring in Intensive Care-III (MIMIC-III) database.

Disease | Data size (negative/positive) | Dimensions (d_{1},…,d_{6}) |

Disorders of lipoid metabolism | 4546/2990 | (12,204,814,1338,262,170) |

Hypertensive chronic kidney disease | 5652/1884 | (12,204,822,1338,266,169) |

Cardiac dysrhythmias | 3878/3658 | (12,204,817,1338,263,169) |

Heart failure | 4167/3369 | (12,204,819,1338,265,169) |

Acute renal failure | 4182/3354 | (12,204,809,1338,268,170) |

Averaged area under the curve (AUC) with SD of κ-NN (κ=3) for five diseases with balanced data.

Disease | Multi-hash | Uni-hash | Baseline | ||||

Our system, Averaged AUC (SD) | Open system, Averaged AUC (SD) | Closed system, Averaged AUC (SD) | Open system, Averaged AUC (SD) | Closed system, Averaged AUC (SD) | Open system, Averaged AUC (SD) | Closed system, Averaged AUC (SD) | |

Disorders of lipoid metabolism | 0.9330 (0.0086) | 0.9343 (0.0125) | 0.9002 (0.0285) | 0.9159 (0.0255) | 0.8486 (0.0271) | 0.8079 (0.0222) | 0.7945 (0.0308) |

Hypertensive chronic kidney disease | 0.9078 (0.0346) | 0.9283 (0.0432) | 0.8538 (0.0421) | 0.9270 (0.0350) | 0.8501 (0.0305) | 0.7823 (0.0261) | 0.7762 (0.0262) |

Cardiac dysrhythmias | 0.9135 (0.0287) | 0.9368 (0.0492) | 0.8833 (0.0397) | 0.9072 (0.0414) | 0.8236 (0.0328) | 0.7695 (0.0151) | 0.7340 (0.0343) |

Heart failure | 0.9058 (0.0282) | 0.9351 (0.0326) | 0.8798 (0.0414) | 0.9089 (0.0376) | 0.8471 (0.0248) | 0.7986 (0.0292) | 0.7733 (0.0421) |

Acute renal failure | 0.9169 (0.0397) | 0.9477 (0.0374) | 0.8637 (0.0320) | 0.8821 (0.0403) | 0.7929 (0.0378) | 0.7434 (0.0380) | 0.7289 (0.0341) |

Averaged area under the curve (AUC) of κ-NN (κ=3) for heart failure based on hamming distance from our, open and closed systems with multi-hash approach.

Additionally,

However, in real life, different sites have different specialties and different distributions of patient data. To see how our platform works with random and skewed distributions, we varied the ratio of negative and positive classes by site. We assumed that the three sites have 10%, 30%, and 50% positive class, respectively, for the five diseases. Note that all other settings, including the number of sites and patients for each site, the projection dimensions, and the parameters, were kept the same as before so that only the change originating from the class imbalance is tested; for experimental convenience, we omitted the uni-hash approach, which is expected to show a trend relative to the multi-hash approach similar to that observed with balanced data.

Averaged area under the curve (AUC) of κ-NN (κ=3) for heart failure based on hamming distance from our system with multi-hash approach and open and closed systems with uni-hash approach.

Averaged area under the curve (AUC) of κ-NN with different κ (κ=1,3,9) for five diseases from our system.

Averaged area under the curve (AUC) with SD of κ-NN (κ=3) for five diseases with imbalanced data.

Disease | Multi-hash | Baseline | |||

Our system, AUC (SD) | Open system, AUC (SD) | Closed system, AUC (SD) | Open system, AUC (SD) | Closed system, AUC (SD) | |

Disorders of lipoid metabolism | 0.8056 (0.0386) | 0.8309 (0.0412) | 0.7629 (0.0295) | 0.7525 (0.0212) | 0.7104 (0.0187) |

Hypertensive chronic kidney disease | 0.7637 (0.0367) | 0.7924 (0.0209) | 0.7275 (0.0266) | 0.7296 (0.0215) | 0.7141 (0.0207) |

Cardiac dysrhythmias | 0.7840 (0.0301) | 0.7937 (0.0228) | 0.7659 (0.0223) | 0.7638 (0.0198) | 0.7385 (0.0188) |

Heart failure | 0.8287 (0.0283) | 0.8832 (0.0278) | 0.7459 (0.0331) | 0.7735 (0.0206) | 0.6778 (0.0213) |

Acute renal failure | 0.8239 (0.0326) | 0.8704 (0.0335) | 0.7558 (0.0263) | 0.7304 (0.0218) | 0.7415 (0.0225) |

Averaged execution time of each basic cryptographic operation for five diseases.

Operation | Time (seconds) | ||||

Disorders of lipoid metabolism | Hypertensive chronic kidney disease | Cardiac dysrhythmias | Heart failure | Acute renal failure | |

Homomorphic encryption | 1.9 | 2.2 | 2.2 | 2.3 | 2.2 |

Initialization | 5.2 | 6.3 | 5.8 | 6.5 | 6.0 |

Comparison | 994.2 | 1243.9 | 1067.1 | 1131.7 | 1066.5 |

Homomorphic decryption | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |

Most of the results can be interpreted in the same context as the balanced-data experiments, although overall performance decreases because {W_{k}}^{K}_{k=1} must be learned from information with skewed distributions. However, it is encouraging that sensitivity is obtained more stably with the multi-hash approach than with the baseline. Sensitivity is an important measure in medical analysis because failing to diagnose a disease that has already developed is much more dangerous than the opposite case. The fact that F1 is significantly larger is consistent with this. Therefore, considering all the results, we believe that our system is a useful alternative.

Next, we performed secure data aggregation and comparison among different sites in a federated setting, by which each site is able to retrieve Hamming distances under certain criteria in a privacy-preserving manner. In our experiments with balanced data, each row has 52 bits (hash code), and a 128-bit encryption key is used for homomorphic encryption. We measured the execution time of the key cryptographic operations on a workstation with an Intel 2.5 GHz CPU, where all results are averaged over five-fold CV of the total time for six cases (three test sets by two training sets). The execution time of each basic cryptographic operation has been profiled and is shown in

We confirmed that the calculated similarities across sites are the same whether raw {Y^{i}_{k}}^{K}_{k=1} are exchanged directly with each other (ie, without homomorphic encryption) or encrypted {Y^{i}_{k}}^{K}_{k=1} are exchanged (ie, with homomorphic encryption). Therefore, the results obtained after homomorphic encryption were exactly the same as the results in

There are several limitations in the proposed framework. When learning hash functions, we assume that each site records the common feature events that are needed. However, different sites, for example, hospitals, may have different event types, and the notation system for each event type may not be standardized, except for diagnoses, symptoms, and conditions, which are based on ICD-9. Even with this limitation of common feature events, we believe that our methodology can still be useful for cooperating hospitals eager to find similar patients across sites at the point of care. We are planning to develop a new and more practical approach to relax this assumption.

Basically, our system works best when all the participants have similar distributions. However, we have confirmed through the class imbalance experiment that our system still works reasonably well with different distributions, albeit at the cost of some performance degradation. We will address the more general imbalanced data problem in future work.

Next, even though we gain computational benefits by adopting a multi-hash approach compared with a uni-hash approach, and the computational complexity is not prohibitive in practice, a technical challenge still remains in scalable hash function learning when the sample size and the feature dimensionality are large. This is because the complexity of inverting the Hessian matrices in our algorithm is driven by the sample size and the feature dimensionality; this inversion is expensive in both time and memory. We can mitigate this problem by using parallelization or graphics processing units, or by utilizing a gradient descent method that replaces the inversion of the Hessian matrix with a constant or with a variable that changes with the iteration number.

We demonstrated the feasibility of privacy-preserving similarity search, and the experiments were conducted on a single machine (with different processes) to serve as a proof of concept. In practice, we need to deploy the algorithm in multiple computers, and that is a trivial task. We will execute this algorithm using secure multiparty computation such as in the Secure Multi-pArty Computation Grid LOgistic REgression [

We have also listed several limitations to consider for more elaborate future work. When constructing temporal sequences, we assume for simplicity that the sequence events are sampled at the same frequency, which means the temporal effect has not been fully represented in this work. We roughly determined the projection dimension and decay factor parameters, which might not be optimal. In our experiment, we used 3-digit ICD codes as a proof of concept, but the granularity of the ICD codes will affect the performance in real applications, especially if the interest is in rare conditions.

We proposed a federated patient hashing framework and developed a privacy-preserving patient similarity learning algorithm. This technique makes it possible to learn hash codes for each patient that reflect information from different sites without sharing patient-level data. Using the MIMIC-III database, we conducted experiments to demonstrate the accuracy and usability of the proposed algorithm. By utilizing the multi-hash approach, our algorithm obtained more usable and practical results than the uni-hash approach. To avoid privacy leakage in patient similarity search, we also applied homomorphic encryption, which makes it possible to calculate the Hamming distance without transmitting raw hash codes. As a result, we obtained identical results without any privacy leakage.

Privacy-preserving patient representation learning in a federated setting.

Prediction performance of balanced class datasets.

Prediction performance of imbalanced class datasets.


AUC: area under the curve

CS: cloud server

CSP: crypto service provider

CV: cross validation

DC: data custodian

EHR: electronic health record

ICD-9: International Classification of Diseases, 9th revision

MIMIC-III: Multiparameter Intelligent Monitoring in Intensive Care-III

TPR: true positive rate

This work was supported by NHGRI grants R00HG008175, R01HG008802, and R01HG007078, NIGMS R01GM114612, NLM grants R00LM011392, R21LM012060, and NHLBI grant U54HL108460. The work of FW is supported by NSF IIS-1650723 and IIS-1716432.

None declared.