# Patient Clustering Improves Efficiency of Federated Machine Learning to predict mortality and hospital stay time using distributed Electronic Medical Records (Reading Notes)

Published:

Electronic Medical Records (EMRs) data is often used in the development of machine learning algorithms to predict disease incidence, patient response to treatment, and other medical events. But so far, most of the algorithms are centralized, rarely considering non-identically independent distributed (non-IID) data, and rarely considering the privacy sensitivity of EMRs can complicate the learning process of data.

## Background

Electronic Medical Records (EMRs) data is often used in the development of machine learning algorithms to predict disease incidence, patient response to treatment, and other medical events. But so far, most of the algorithms are centralized, rarely considering non-identically independent distributed (non-IID) data, and rarely considering the privacy sensitivity of EMRs can complicate the learning process of data.

## Introduction

To address the issue of decentralized data affecting the machine learning process, the authors introduced the community-based federated machine learning (CBFL) algorithm and evaluated it on IID’s ICU EMRs data.

The CBFL algorithm clusters the distributed data into clinically meaningful communities, obtains similar diagnoses and geographic locations, and learns a model for each community. Throughout the learning process, data is stored in the hospital’s local data, while local calculations are aggregated on the server.

The experimental results show that the CBFL algorithm outperforms the baseline FL algorithm, which is reflected in: Area Under the Receiver Operating Characteristic Curve (ROC AUC), Area Under the Precision-Recall Curve (PR AUC) and the communication cost between the hospital and the server. three aspects.

Next, we will introduce the application of CBFL development and evaluation to demonstrate the application of decentralized clustering and federated machine learning in ICU EMRs prediction.

## Data Source

The CBFL was developed based on the eICU Collaborative Research Database, which contains high-quality intensive care data from 200859 patients from 208 hospitals across the United States. The research mainly involves three dimensions:

• Medicine for patients within 48 hours of the first ICU
• Unit discharge status, specifying when the patient leaves the ICU (mortality, 0 means survival, 1 means death)
• Unit flow offset, recorded from admission to discharge (ICU hospital stay, average 3858 minutes)

In addition, the study selected an additional 50 hospitals, each of which randomly selected 560 patients as samples to form the final 280,000 sample data sets.

## CBFL Procedures

The whole algorithm is divided into three steps:

• In the first step, in the compile training phase, we give each client an initial noise-reducing compiler, fautoencoder, with a weight of ${w_0}^c$. After $E_1$ iterations, we pass the gradient descent weights of each end to the server to get the average value;

• In the second step, the trained feature is used to extract the feature value $X_c$ on each client and return the average feature value. Then, we initialize $K$ clusters, and use K-means clustering to divide these eigenvalues into these clusters and return to the $f_k$-means model;

• In the third step, the server distributes $f_k$-means and fencoder to each client. The client uses $f_k$-means to classify the features extracted by fencoder into $K$ class. Thereafter, the server initializes a total of $K$ models from $f_1$ to $f_k$, $f_i$ only in each The client is divided into the $i$-class feature training, and the wc mean of each client is iterated to $f_i$ until $f_1$ to $f_k$ converge.

When a new sample needs to be predicted, use fencoder to extract its features, send it to $f_k$-means to determine its classification $i$, and use $f_i$ to predict it.

### 1. Community Analysis

Patient clustering is a key step in the algorithm: Since grouping patients with similar characteristics together, community-based learning will be easier than learning the entire model on all patients. To illustrate the common characteristics of patients in the same community, we divided 28,000 patients into 5 communities and conducted an enrichment analysis of the diagnosis. The table below lists the number of patients and the number of diagnosed patients.

### 2. Mortality and Stay Time Prediction

The experiment carried out specific research on the survival rate and ICU hospitalization time. The cluster parameters $K=5, 10, 15, 50$ were selected for testing. In addition, the experiment also compared changes in ROC AUC images from training and test data sets from the same hospital, as well as training and test data sets from different hospitals.

Mortality prediction (same hospitals in training and test sets): In our study, ROC AUC refers to the probability that CBFL scores for randomly selected death patients is higher than that of randomly selected surviving patients, while PR AUC indicates Average accuracy between 0 and 1.

The figure below shows the ROC curve for FL and the ROC curve comparison of the CBFL algorithm under different conditions.

The above evaluation results show that CBFL is superior to FL in mortality and hospitalization time prediction tasks in a short communication cycle. Communities tend to accommodate patients with similar diagnoses and geographic locations, making individual community models easier to learn on average than one model.

## Conclusion

The CBFL algorithm is also not perfect, and there are limitations in some respects. For example, suppose we have K community models that are trained locally. Compared to traditional FL, CBFL has more K-1 parameters between the client and the server. And such additional communication load will increase as the number of training samples and community K increases. The experimental results show that CBFL performs best in 5 or up to 10 communities, but there is no guarantee that the application of CBFL on other biomedical data sets will be optimal with less K. Future research directions may include optimizing communication load by designing more effective community-based learning programs, combining more dimensions beyond drug characteristics to further improve prediction accuracy, and developing better clustering methods to capture patient characteristics. and many more.

Last but not least, although this study focuses on machine learning on ICU EMRs, CBFL can be extended to other bioinformatics applications, such as medical image recognition or medical planning decisions across multiple medical shafts. Large, distributed, and privacy-sensitive data.

Tags: