Niklaus (Yi) Liu’s Personal Website (feed generated by Jekyll, 2019-08-23)
https://niklausliu.github.io/feed.xml

Yi Liu was born in Hangzhou, Zhejiang, China, in 1997. He is about to receive an undergraduate degree in computer science from Heilongjiang University. He was a visiting student at the Jun Zhao Lab of Nanyang Technological University in January 2019, and he is now a research assistant at the James Yu Lab of the Southern University of Science and Technology. He currently holds pre-doctoral offers from Cornell University and the University of Maryland.

Application of federated XGBoost in outlier detection (2019-08-12)
https://niklausliu.github.io/posts/2019/07/FL-9

<p>Federated learning technology is rapidly evolving, and combining it with other machine learning methods has become a popular approach to privacy protection. <strong>XGBoost</strong> is one of the most powerful machine learning algorithms available. In June 2019, the paper <strong>“Secureboost: A lossless federated learning framework”</strong> proposed the <strong>SecureBoost federated learning framework</strong>, which applies the boosting ensemble method to vertical federated learning. In another recent paper, <strong>“The Tradeoff Between Privacy and Accuracy in Anomaly Detection Using Federated XGBoost”</strong>, the authors combine XGBoost with horizontal federated learning for credit card transaction anomaly detection, achieving good detection performance and identifying the privacy-protection configuration that yields the best training results.</p>
<p>🔗<a href="https://arxiv.org/abs/1907.07157">Paper Link</a></p>
<h2 id="introduction">Introduction</h2>
<p>Today, many large Internet companies have built large-scale information technology infrastructure to provide customers with multiple services. However, a large amount of data transmission leads to privacy leakage and increased transmission costs: on the one hand, data transmission between different enterprises is likely to cause privacy leakage; on the other hand, data transmission will greatly increase communication costs. In this context, <strong>federated learning</strong> technology came into being. <strong>Federated learning</strong> does not transmit raw data, but instead transmits a pre-trained learning model from the client to the server, effectively protecting user privacy.</p>
<p>With the spread of federated learning techniques, combining federated learning with other machine learning methods, such as logistic regression and tree models, has become a popular direction. In June 2019, the paper “Secureboost: A lossless federated learning framework” applied the boosting ensemble method to vertical federated learning. The present paper extends federated learning further, aiming to combine horizontal federated learning with XGBoost and apply the result to detecting abnormal bank credit card transactions.</p>
<p>Executing the XGBoost algorithm within a federated learning framework means computing, on each local data node <script type="math/tex">i</script>, the parameters <script type="math/tex">g_i</script> and <script type="math/tex">h_i</script> used to update the overall model, transferring those parameters to a central server, and letting the server select an optimal split before passing the updated model back to each local node. This approach works well for vertical federated learning. However, it does not carry over to the horizontal setting considered in “The Tradeoff Between Privacy and Accuracy in Anomaly Detection Using Federated XGBoost”: in horizontal federated learning, each local data node has a different data distribution, and the training samples are extremely imbalanced, so the effect of the sample distribution on the results is more pronounced. If the parameters of each node are simply aggregated as described above, ignoring the differing data distributions across nodes, the resulting model may be unreasonable.</p>
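<p>As a concrete sketch of this update cycle, the snippet below computes each node’s <script type="math/tex">g_i</script> and <script type="math/tex">h_i</script> under a binary logistic loss (our assumption; the function names are illustrative, not from the paper) and evaluates the standard XGBoost split-gain formula on the aggregated sums, as the central server would:</p>

```python
import numpy as np

def local_grad_hess(y_true, raw_margin):
    """Per-sample gradient and Hessian of the binary logistic loss,
    computed locally on one data node (the choice of loss is an assumption)."""
    p = 1.0 / (1.0 + np.exp(-raw_margin))   # sigmoid of the raw score
    return p - y_true, p * (1.0 - p)        # (g_i, h_i)

def split_gain(g_l, h_l, g_r, h_r, lam=1.0):
    """Standard XGBoost split gain the server evaluates on aggregated sums."""
    score = lambda g, h: g * g / (h + lam)
    return 0.5 * (score(g_l, h_l) + score(g_r, h_r)
                  - score(g_l + g_r, h_l + h_r))

# Two nodes compute their statistics locally and send only sums upward.
y1, m1 = np.array([1, 0, 1]), np.array([0.2, -0.1, 0.4])
y2, m2 = np.array([0, 0]), np.array([0.3, -0.5])
g1, h1 = local_grad_hess(y1, m1)
g2, h2 = local_grad_hess(y2, m2)
gain = split_gain(g1.sum(), h1.sum(), g2.sum(), h2.sum())
```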
<h2 id="method">Method</h2>
<p>To solve the above problems, this paper proposes transmitting user information in a privacy-preserving and efficient manner, mainly through the following two steps.</p>
<h3 id="data-aggregating">Data aggregating</h3>
<p>The first step is to aggregate the raw data, mapping each batch of similar samples into a single virtual data sample. Consider the first virtual sample <script type="math/tex">I_1</script>: we sum <script type="math/tex">g_i</script> and <script type="math/tex">h_i</script> over all <script type="math/tex">i \in I_1</script> to obtain the update parameters <script type="math/tex">g_{I_1}</script> and <script type="math/tex">h_{I_1}</script> for that virtual sample, and the other virtual samples are processed in the same way. The update parameters corresponding to each virtual sample are then transmitted to the central server to update the global model, and the central server returns the updated model. Note that the updated model is returned to each raw data node, not to the virtual data nodes. Because this process aggregates data, it could still leak information; to address this, the authors protect the data with <strong>modified k-anonymity</strong>.</p>
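<p>A minimal sketch of the aggregation step follows. The similarity grouping here (sorting on a 1-D projection and cutting into equal batches) is a crude stand-in for whatever clustering the authors actually use; only the per-batch sums of <script type="math/tex">g</script> and <script type="math/tex">h</script> ever leave the node:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # raw features on one local node
g = rng.normal(size=100)                 # per-sample gradients g_i
h = rng.uniform(0.1, 0.25, size=100)     # per-sample Hessians h_i

k = 10                                   # number of virtual samples
# Crude similarity grouping: sort samples along a 1-D projection and cut
# into k equal-sized batches (a stand-in for the paper's actual mapping).
order = np.argsort(X @ np.ones(X.shape[1]))
batches = np.array_split(order, k)

# Only these per-batch sums are transmitted, so each virtual sample hides
# a k-anonymity-style group of raw samples behind one statistic.
g_virtual = np.array([g[idx].sum() for idx in batches])
h_virtual = np.array([h[idx].sum() for idx in batches])
```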
<p><strong>K-anonymity</strong> publishes reduced-precision data through generalization and suppression, so that each record shares exactly the same attribute values with at least k-1 other records in the data table, reducing the risk of privacy leakage. In this setting, since the local nodes send model parameters rather than raw records, the authors use an improved <strong>modified k-anonymity</strong>: each node transmits only its aggregated model parameters to the central server instead of the original data.</p>
<p><img src="/images/FL-9-1.jpg" width="500" height="300" title="A set of data samples are mapped to virtual data samples" class="align-center" /></p>
<h3 id="sparse-federal-update">Sparse federated update</h3>
<p>In practice, transmitting all of the data is inefficient because of its sheer volume, and not every sample contributes to improving the model, so the data needs to be filtered. Although XGBoost is good at outlier prediction, many samples are still misclassified. This paper offers a new perspective: training samples that were misclassified in previous rounds should receive more attention in subsequent learning and be transmitted to the server for federated updates. First, misclassified samples are more valuable than correctly classified ones, because they help the model improve itself. Second, since the data in anomaly detection is extremely imbalanced, this focus helps the XGBoost algorithm mitigate the skewness problem. Finally, if correctly classified samples are not filtered out, they influence the splitting and construction of the tree model during the federated update, which can hinder model improvement.</p>
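<p>The filtering rule can be sketched as follows: only samples the current model misclassifies have their gradient statistics transmitted. The threshold and variable names are illustrative, not from the paper:</p>

```python
import numpy as np

def misclassified_mask(y_true, p_pred, threshold=0.5):
    """Select the samples the current model gets wrong; only their
    statistics are transmitted in the sparse federated update."""
    y_hat = (p_pred >= threshold).astype(int)
    return y_hat != y_true

y = np.array([0, 0, 1, 1, 0])
p = np.array([0.1, 0.7, 0.9, 0.3, 0.2])   # current model's probabilities
mask = misclassified_mask(y, p)

g = p - y                 # logistic-loss gradients for all local samples
g_to_send = g[mask]       # only the misclassified samples' gradients leave
```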
<h2 id="experiment">Experiment</h2>
<p>The article applies the federated XGBoost algorithm to a credit card dataset to test its effectiveness at detecting anomalous transactions. The dataset contains 284,807 samples with 30 features, of which only 492 are abnormal, just 0.172% of the total (the extremely uneven sample distribution mentioned above). The experiments train and compare plain federated XGBoost, GBDT, random forest, federated XGBoost with data aggregation, and federated XGBoost with sparse gradient updates. The main conclusions are as follows.</p>
<ul>
<li>(1) Model users need to balance privacy protection against training quality; the optimal number of virtual data sets for this trade-off is 405. The fewer (and therefore larger) the virtual data sets, the better the data privacy protection, but there is a trade-off between privacy protection and model training. In the figure below, the horizontal axis is the number of virtual data sets and the vertical axis is the F1-Score: as the number of virtual data sets increases, privacy protection weakens while the F1-Score, and hence training quality, rises. By observing this dynamic, the authors select 405, marked by line A, as the optimal number of virtual data sets, where the F1-Score is high and data privacy is still well protected. Choosing a value slightly below 405 strengthens privacy protection but visibly degrades training; choosing a value slightly above 405 weakens privacy protection while improving training only marginally, so 405 is the optimal number of virtual data sets. Line A in the figure marks the F1-Score with 405 clusters, 0.895105, and line B marks the F1-Score without data aggregation, 0.901408.</li>
</ul>
<p><img src="/images/FL-9-2.jpg" width="500" height="300" title="The effect of the number of virtual sample sets on F1-Score" class="align-center" /></p>
<ul>
<li>(2) Once an appropriate data-aggregation scale is chosen, federated XGBoost with data aggregation and federated XGBoost with sparse gradient updates both significantly improve training quality. Because the sample distribution is extremely uneven, accuracy is an unreasonable measure of training quality, so this paper compares algorithms by AUC and F1-Score instead. On F1-Score, federated XGBoost with sparse federated updates (with or without data aggregation) is significantly better than random forest, GBDT, and plain federated XGBoost. The AUC values likewise show that federated XGBoost with data aggregation and sparse federated updates outperforms the other algorithms mentioned above.</li>
</ul>
<p><img src="/images/FL-9-3.jpg" width="500" height="300" title="Comparison of the training effects of the federated XGBoost and other algorithms proposed in the paper" class="align-center" /></p>
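<p>The inadequacy of accuracy on data this skewed is easy to demonstrate. The toy example below uses roughly the paper’s 0.172% fraud rate and a degenerate classifier that always predicts “normal”:</p>

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from scratch: harmonic mean of precision and recall."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 1,000 transactions with 2 frauds -- roughly the paper's imbalance.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
always_normal = np.zeros(1000, dtype=int)   # trivially predicts "normal"

accuracy = (always_normal == y_true).mean()   # looks excellent
f1 = f1_score(y_true, always_normal)          # reveals the failure
```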
<h2 id="innovation-and-future-prospects">Innovation and future prospects</h2>
<p>The main innovations of the paper are:</p>
<ul>
<li>XGBoost for horizontal federated learning scenarios.</li>
<li>The use of data aggregation methods enhances privacy protection.</li>
<li>The use of sparse gradient update improves transmission efficiency and training effectiveness.</li>
</ul>
<p>In the future, the authors may explore additional privacy-protection methods and refine the empirical details to achieve more significant training results.</p>

Leveraging blockchain to make machine learning models more accessible (2019-07-18)
https://niklausliu.github.io/posts/2019/07/FL-8

<p>Significant advances are being made in artificial intelligence, but accessing and taking advantage of the machine learning systems making these developments possible can be challenging, especially for those with limited resources. These systems tend to be highly centralized, their predictions are often sold on a per-query basis, and the datasets required to train them are generally proprietary and expensive to produce. Additionally, published models run the risk of becoming outdated if new data isn’t regularly provided to retrain them.</p>
<p>🔗<a href="https://www.microsoft.com/en-us/research/blog/leveraging-blockchain-to-make-machine-learning-models-more-accessible/">Source from Microsoft AI blog.</a></p>
<p>🔗<a href="https://github.com/niklausliu/niklausliu.github.io/raw/master/files/Federated%20Learning%20%26%20Blockchain.pdf">Slide about my idea.</a></p>
<p>🔗<a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/07/1907.07247.pdf">Paper</a></p>
<p>🔗<a href="https://github.com/microsoft/0xDeCA10B">Data</a></p>
<p>We envision a slightly different paradigm, one in which people will be able to easily and cost-effectively run machine learning models with technology they already have, such as browsers and apps on their phones and other devices. In the spirit of democratizing AI, we’re introducing Decentralized & Collaborative AI on Blockchain.</p>
<p>Through this new framework, participants can collaboratively and continually train and maintain models, as well as build datasets, on public blockchains, where models are generally free to use for evaluating predictions. The framework is ideal for AI-assisted scenarios people encounter daily, such as interacting with personal assistants, playing games, or using recommender systems. An open-source implementation for the <a href="https://www.ethereum.org/">Ethereum blockchain</a> is available on <a href="https://github.com/microsoft/0xDeCA10B">GitHub</a>, and the authors’ paper “<a href="https://www.microsoft.com/en-us/research/publication/decentralized-collaborative-ai-on-blockchain/">Decentralized & Collaborative AI on Blockchain</a>” will be presented at the second <a href="http://www.blockchain-ieee.org/index.php">IEEE International Conference on Blockchain</a>, July 14–17.</p>
<p><img src="/images/FL-7-5.png" width="500" height="300" title="Federated AI." class="align-center" /></p>
<h1 id="why-blockchain">Why blockchain?</h1>
<p>Leveraging blockchain technology allows us to do two things that are integral to the success of the framework: offer participants a level of trust and security and reliably execute an incentive-based system to encourage participants to contribute data that will help improve a model’s performance.</p>
<p>With current web services, even if code is open source, people can’t be 100 percent sure of what they’re interacting with, and running the models generally requires specialized cloud services. In our solution, we put these public models into <a href="https://www.investopedia.com/terms/s/smart-contracts.asp">smart contracts</a>, code on a blockchain that helps ensure the specifications of agreed upon terms are upheld. In our framework, models can be updated on-chain, meaning within the blockchain environment, for a small transaction fee or used for inference off-chain, locally on the individual’s device, with no transaction costs.</p>
<p>Smart contracts are unmodifiable and evaluated by many machines, helping to ensure the model does what it specifies it will do. The immutable nature and permanent record of smart contracts also allows us to reliably compute and deliver rewards for good data contributions. Trust is important when processing payments, especially in a system like ours that seeks to encourage positive participation via incentives (more to come on that later). Additionally, blockchains such as Ethereum have <a href="https://www.ethernodes.org/network/1">thousands of decentralized machines all over the world</a>, making it less likely a smart contract will become completely unavailable or taken offline.</p>
<p><img src="/images/FL-7-1.jpg" width="500" height="300" title="Ethereum nodes are located around the world. " class="align-center" /></p>
<p>Ethereum nodes are located around the world. Locations are as of July 4, 2019. Source: <a href="https://www.ethernodes.org">https://www.ethernodes.org</a></p>
<h1 id="deploying-and-updating-models">Deploying and updating models</h1>
<p>Hosting a model on a public blockchain requires an initial one-time fee for deployment, usually a few dollars, based on the computational cost to the blockchain network. From that point, anyone contributing data to train the model, whether that be the individual who deployed it or another participant, will have to pay a small fee, usually a few cents, again proportional to the amount of computation being done.</p>
<p>Using our framework, we set up a <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptron</a> model capable of classifying the sentiment, positive or negative, of a movie review. As of July 2019, it costs about USD0.25 to update the model on Ethereum. We have plans to extend our framework so most data contributors won’t have to pay this fee. For example, contributors could get reimbursed during a reward stage or a third party could submit the data and pay the fee on their behalf when the data comes from usage of the third party’s technology, such as a game.</p>
<p>To reduce computational costs, we use models that are very efficient to train with such as a Perceptron or a <a href="https://en.wikipedia.org/wiki/Nearest_centroid_classifier">Nearest Centroid Classifier</a>. We can also use these models along with high-dimensional representations computed off-chain. More complicated models could be integrated using API calls from the smart contract to machine learning services, but ideally, models would be kept completely public in a smart contract.</p>
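<p>A minimal Perceptron illustrates why such models suit on-chain training: each contributed sample costs a constant, tiny amount of computation, and the update only fires when the model errs. This is a generic sketch, not the framework’s actual contract code:</p>

```python
import numpy as np

class Perceptron:
    """Minimal Perceptron; its update is cheap enough to stand in for
    an on-chain training step (constant work per contributed sample)."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.b = 0.0

    def predict(self, x):
        return 1 if x @ self.w + self.b > 0 else 0

    def update(self, x, y):
        # Only runs (and would only incur a fee) when the model errs.
        if self.predict(x) != y:
            direction = 1 if y == 1 else -1
            self.w += direction * x
            self.b += direction

model = Perceptron(3)
model.update(np.array([1.0, 0.0, 1.0]), 1)   # misclassified -> weights move
```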
<p><img src="/images/FL-7-2.png" width="500" height="300" title="Decentralized & Collaborative AI on Blockchain framework. " class="align-center" /></p>
<p>Adding data to a model in the Decentralized & Collaborative AI on Blockchain framework consists of three steps: (1) The incentive mechanism, designed to encourage the contribution of “good” data, validates the transaction, for instance, requiring a “stake” or monetary deposit. (2) The data handler stores data and metadata onto the blockchain. (3) The machine learning model is updated.</p>
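<p>The three steps above can be sketched as plain Python objects. The class names are hypothetical; the real framework implements these pieces as Ethereum smart contracts:</p>

```python
class IncentiveMechanism:
    """Step 1: validate a contribution, e.g. by requiring a stake."""
    def __init__(self, required_deposit):
        self.required_deposit = required_deposit

    def validate(self, deposit):
        return deposit >= self.required_deposit

class DataHandler:
    """Step 2: store data and metadata (on-chain in the real system)."""
    def __init__(self):
        self.records = []

    def store(self, x, y, contributor):
        self.records.append((x, y, contributor))

class CollaborativeModel:
    """Ties the three steps together for one add-data transaction."""
    def __init__(self, incentive, handler, model):
        self.incentive, self.handler, self.model = incentive, handler, model

    def add_data(self, x, y, contributor, deposit):
        if not self.incentive.validate(deposit):
            return False                    # stake too small: reject
        self.handler.store(x, y, contributor)
        self.model.update(x, y)             # Step 3: update the model
        return True

class CountingModel:
    """Stand-in model that just counts updates."""
    def __init__(self):
        self.updates = 0

    def update(self, x, y):
        self.updates += 1

cm = CollaborativeModel(IncentiveMechanism(required_deposit=1.0),
                        DataHandler(), CountingModel())
accepted = cm.add_data([1.0], 1, contributor="alice", deposit=2.0)
rejected = cm.add_data([0.0], 0, contributor="bob", deposit=0.5)
```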
<h1 id="incentive-mechanisms">Incentive mechanisms</h1>
<p>Blockchains easily let us share evolving model parameters. Newly created information such as new words, new movie titles, and new pictures can be used to update existing models hosted regardless of a specific person or organization’s ability to update and host the model themselves. To encourage people to contribute new data that will help maintain the model’s performance, we propose several incentive mechanisms: gamified, prediction market–based, and ongoing self-assessment.</p>
<p><strong>Gamified:</strong> Like on <a href="https://stackexchange.com/">Stack Exchange</a> sites, data contributors can earn points and badges when other contributors validate their contributions. This proposal relies solely on the willingness of contributors to collaborate for a common good—the betterment of the model.</p>
<p><strong>Prediction market–based:</strong> Contributors get rewarded if their contribution improves the performance of the model when evaluated using a specific test set. This proposal builds on existing work using <a href="https://en.wikipedia.org/wiki/Prediction_market">prediction market</a> frameworks to collaboratively train and evaluate models, including “<a href="https://arxiv.org/abs/1111.2664">A Collaborative Mechanism for Crowdsourcing Prediction Problems</a>” and “<a href="https://papers.nips.cc/paper/5995-a-market-framework-for-eliciting-private-data.pdf">A Market Framework for Eliciting Private Data.</a>”</p>
<p>The prediction market–based incentive in our framework has three phases:</p>
<ul>
<li>A <strong>commitment phase</strong> in which a provider stakes a bounty to be awarded to contributors and shares enough of the test set to prove the test set is valid</li>
<li>A <strong>participation phase</strong> in which participants submit training data samples with a small deposit of funds to cover the possibility their data is incorrect</li>
<li>A <strong>reward phase</strong> in which the provider reveals the rest of the test set and a smart contract validates it matches the proof provided in the commitment phase</li>
</ul>
<p>Participants are rewarded based on how much their contribution helped the model improve. If the model did worse on the test set, then participants who contributed “bad” data lose their deposit.</p>
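<p>The commitment and reward phases follow a standard commit-reveal pattern, sketched below. The hashing and encoding choices are our assumptions, not the framework’s actual scheme:</p>

```python
import hashlib
import json

def commit(test_set, salt):
    """Commitment phase: publish only a salted hash of the held-back
    test set, proving later reveals match what was promised."""
    payload = json.dumps({"data": test_set, "salt": salt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def reveal_matches(test_set, salt, commitment):
    """Reward phase: the contract recomputes the hash from the revealed
    test set and checks it against the earlier commitment."""
    return commit(test_set, salt) == commitment

hidden_test_set = [[0, 1], [1, 0], [1, 1]]
commitment = commit(hidden_test_set, salt="s3cret")

ok = reveal_matches(hidden_test_set, "s3cret", commitment)   # honest reveal
tampered = reveal_matches([[0, 0]], "s3cret", commitment)    # swapped set
```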
<p>Here is how the process looks when running in a simulation:</p>
<p><img src="/images/FL-7-3.gif" width="500" height="300" title="Simulation. " class="align-center" /></p>
<p>For this simulation, a Perceptron was trained on the <a href="https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification">IMDB reviews dataset for sentiment classification</a>. The participants contributing “good” data profit, the participants contributing “bad” data lose their funds, and the model’s accuracy improves. This simulation represents the addition of about 8,000 training samples.</p>
<p><strong>Ongoing self-assessment</strong>: Participants effectively validate and pay each other for good data contributions. In such scenarios, an existing model already trained with some data is deployed. A contributor wishing to update the model submits data with features <script type="math/tex">x</script>, label <script type="math/tex">y</script>, and a deposit. After some predetermined time has passed, if the current model still agrees with the classification, then the person gets their deposit back. We now assume the data has been validated as “good,” and that contributor earns a point. If a contributor adds “bad” data—that is, data that cannot be validated as “good”—then the contributor’s deposit is forfeited and split among contributors who’ve earned points for “good” contributions. Such a reward system would help deter the malicious contribution of “bad” data.</p>
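<p>A hypothetical sketch of that settlement logic: a deposit is refunded, and a point awarded, if the current model still agrees with the contributed label after the waiting period; otherwise the deposit is split among contributors holding points. All names are illustrative:</p>

```python
def settle(contribution, model_label_now, good_points):
    """Settle one expired contribution: (x, y, contributor, deposit)."""
    x, y, contributor, deposit = contribution
    if model_label_now == y:
        # Model still agrees: refund the deposit and award a point.
        good_points[contributor] = good_points.get(contributor, 0) + 1
        return {contributor: deposit}
    # Model disagrees: forfeit the deposit, split it by earned points.
    total = sum(good_points.values())
    if total == 0:
        return {}
    return {c: deposit * p / total for c, p in good_points.items()}

points = {"alice": 3, "bob": 1}
payouts = settle(([1.0], 1, "mallory", 4.0), model_label_now=0,
                 good_points=points)       # "bad" data: deposit split 3:1
refund = settle(([0.5], 0, "carol", 2.0), model_label_now=0,
                good_points=points)        # "good" data: deposit returned
```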
<p><img src="/images/FL-7-4.gif" width="500" height="300" title="Simulation. " class="align-center" /></p>
<p>For this simulation, a Perceptron was again trained on the <a href="https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification">IMDB reviews dataset for sentiment classification</a>. This figure demonstrates balance percentages and model accuracy in a simulation where an adversarial “bad contributor” is willing to spend more than an honest “good contributor.” Despite the malicious efforts, the accuracy can still be maintained, and the honest contributor profits. This simulates the addition of about 25,000 training samples.</p>
<h1 id="from-the-small-and-efficient-to-the-complex">From the small and efficient to the complex</h1>
<p>The Decentralized & Collaborative AI on Blockchain framework is about sharing models, making valuable resources more accessible to all, and—just as importantly—creating large public datasets that can be used to train models inside and outside the blockchain environment.</p>
<p>Currently, this framework is mainly designed for small models that can be efficiently updated. As blockchain technology advances, we anticipate that more applications for collaboration between people and machine learning models will become available, and we hope to see future research in scaling to more complex models along with new incentive mechanisms.</p>

Interpret Federated Learning with Shapley Values (2019-06-20)
https://niklausliu.github.io/posts/2019/06/FL-7

<p>In this paper the authors investigate model interpretation methods for federated learning, specifically the measurement of feature importance in vertical federated learning, where the feature space of the data is divided between two parties, a host and a guest. When the host party interprets a single prediction of a vertical federated learning model, the interpretation results, namely the feature importances, are very likely to reveal protected data from the guest party. The authors propose a method to balance model interpretability and data privacy in vertical federated learning by using Shapley values to reveal detailed feature importance for host features and a single unified importance value for the federated guest features. Their experiments indicate robust and informative results for interpreting federated learning models.</p>
<h2 id="shapley-value">Shapley value</h2>
<p>We know that many machine learning models are black-box or semi-black-box models. If a model is used for perception tasks such as speech recognition and image recognition, people may not be too concerned about its interpretability. Many people remember the adversarial example of the panda that is recognized as a gibbon, but such work is arguably more about model reliability than interpretability.</p>
<p><img src="/images/FL-7-1.png" width="500" height="300" title="Adversarial examples" class="align-center" /></p>
<p>Model interpretability is especially important for machine learning models in non-perceptual, critical applications such as loan risk estimation, insurance underwriting forecasting, and fraud identification. An open-source project on GitHub describes a very interesting model-agnostic interpretation method that uses <strong>Shapley values</strong> to calculate the importance of model features; it is also the key idea behind the <a href="https://github.com/slundberg/shap">SHAP</a> toolkit. SHAP has already seen much practical use by colleagues building financial and insurance machine learning models.</p>
<p>We assume that a model <script type="math/tex">f(x)</script> has been trained to map the feature vector <script type="math/tex">x</script> to a predicted value <script type="math/tex">y</script>, where <script type="math/tex">x</script> contains n different features, <script type="math/tex">x = \{x_1, x_2, ..., x_n\}</script>. We want to know what role each feature plays in this model. This involves two levels: (1) we can be interested in each individual prediction, trying to explain each particular decision in terms of feature importance; (2) we can also be interested in the model as a whole, wanting to know how important each feature is across the large number of decisions the model makes. For the latter, feature importance over the entire model, toolkits such as xgboost already include similar functionality; SHAP focuses more on the former.</p>
<p>The <strong>Shapley value</strong> is derived from <strong>game theory</strong>, and the authors apply it neatly to model interpretability. Given a model <script type="math/tex">f</script> and a particular decision with specific values for all features <script type="math/tex">\{x_1, x_2, ..., x_n\}</script>, we treat each feature as a switch and construct a series of simulated feature vectors: each time, some of the features are “turned on” and the rest “turned off”, and the deviation of the resulting prediction from the original one is calculated. Trying all possible on/off combinations yields a series of such deviations. To compute the importance of a feature <script type="math/tex">x_i</script>, we take every combination of the features other than <script type="math/tex">x_i</script> and average the change in the prediction between <script type="math/tex">x_i</script> being on and being off.</p>
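<p>For a small number of features, the procedure above can be computed exactly by brute-force enumeration of coalitions. In the sketch below the toy value function is additive, so each feature’s Shapley value recovers exactly its own contribution; <code>value(S)</code> stands for the model’s prediction with only the features in <code>S</code> switched on:</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values by enumerating every on/off coalition.
    value(S) is the prediction with only the features in index set S on."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Classic Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Additive toy model: each switched-on feature adds its own contribution,
# so the Shapley value of each feature equals that contribution.
contrib = [2.0, -1.0, 0.5]
v = lambda S: sum(contrib[j] for j in S)
phi = shapley_values(v, 3)   # ≈ [2.0, -1.0, 0.5]
```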
<h2 id="explaining-vertical-federated-learning">Explaining vertical federated learning</h2>
<p>For a vertical federated learning model built by parties A and B, imagine that we want to explain the model from A’s perspective. We naturally hope that the importance of all the features A possesses can be displayed as completely and accurately as possible. Because A cannot access B’s feature values, we combine all of B’s features into a single composite feature, which we call the <strong>federated feature</strong>, and then calculate an importance value for it.</p>
<p>The author no longer enumerates combinations over all of A’s and B’s individual features, but instead over all of A’s features plus B’s single <strong>federated feature</strong>, and then calculates the Shapley values of these features. The goal is to determine A’s feature weights accurately while still giving an overall picture of the weight of B’s features.</p>
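<p>This grouped variant can be sketched by making each “player” in the Shapley computation a set of feature indices, with all of B’s features forming one federated group. The implementation is illustrative, not the authors’ actual code:</p>

```python
from itertools import combinations
from math import factorial

def grouped_shapley(value, groups):
    """Shapley values where each player is a group of feature indices:
    A's features are singleton groups, while B's features form one
    federated group reported as a single importance value."""
    n = len(groups)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_i = set(groups[i]).union(*(groups[j] for j in S))
                without_i = set().union(*(groups[j] for j in S))
                phi[i] += w * (value(with_i) - value(without_i))
    return phi

contrib = [1.0, 0.5, 2.0, -0.5]   # features 2 and 3 belong to party B
v = lambda S: sum(contrib[j] for j in S)
groups = [{0}, {1}, {2, 3}]       # B's two features act as one player
phi_groups = grouped_shapley(v, groups)   # ≈ [1.0, 0.5, 1.5]
```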
<p>The author experiments on a public dataset used to predict whether a person is rich. Features include age, gender, ethnicity, country, educational information, investment income over the past year, job rating, position, and weekly working hours. The Shapley values for all features are first computed as a <strong>ground truth</strong> reference. Then vertical federated learning is simulated: the three work-related features are combined into B’s <strong>federated feature</strong>, its weight is calculated, and the result is compared with the ground truth. Finally, five features, the investment results plus the work-related features, are combined into B’s federated feature, and the comparison is repeated with the federated feature occupying a larger share of the feature space.</p>
<p>🔗The code is available at <a href="https://github.com/crownpku/federated_shap">https://github.com/crownpku/federated_shap</a></p>
<p>🔗<a href="https://arxiv.org/abs/1905.04519">Paper</a></p>

Federated Learning for Medical Imaging (2019-06-02)
https://niklausliu.github.io/posts/2019/06/FL-6

<p>Nearly 153 exabytes of healthcare-related data were generated in 2013; this number will increase by 48% annually to reach 2,314 exabytes in 2020 [1], [2], [3]. While machine learning can benefit from this “big data” to generate state-of-the-art models, most healthcare data is hard to obtain due to legal, privacy, technical, and data-ownership challenges, especially among international institutions where HIPAA and GDPR concerns need to be addressed [3], [4].</p>
<p>🔗<a href="https://www.intel.ai/federated-learning-for-medical-imaging/">Source Intel AI Blog</a></p>
<p>Federated learning, <a href="https://ai.googleblog.com/2017/04/federated-learning-collaborative.html">introduced by Google in 2017</a>, is a distributed machine learning approach that enables multi-institutional collaboration on deep learning projects without sharing patient data. In 2018, Intel began a collaboration with the Center for Biomedical Image Computing and Analytics (<a href="https://www.med.upenn.edu/cbica/">CBICA</a>) at the University of Pennsylvania to show the first proof-of-concept application of federated learning to real-world medical imaging [5] (Figure 1). Our initial study demonstrated that federated learning could train a deep learning model (U-Net, [10]) to 99% of the accuracy of the same model trained with the traditional data-sharing method (Figures 2 and 3). In September, we presented our results at the Medical Image Computing and Computer Assisted Intervention (<a href="https://www.miccai.org/">MICCAI</a>) conference in Granada, Spain. We recently published our results in Springer’s Lecture Notes in Computer Science [5].</p>
<p><img src="/images/FL-6-1.png" width="500" height="300" title="Federated Learning Architecture using Intel hardware." class="align-center" /></p>
<p>Figure 1: Federated Learning Architecture using Intel hardware. The encrypted model is sent to the individual institutions (Data Owners A-C) which decrypt within a secure enclave in hardware and then train on the local data. Only the model updates are shared with the central model aggregator. This provides protection to both the model and the data. The raw data never leaves the institutions, which not only adds privacy but also prevents large data transfers on the network.</p>
<p>Currently, the University of Pennsylvania and 19 other institutions worldwide are leading the first real-world medical use case of federated learning. Intel will provide support to the project by leveraging the capabilities of our <a href="https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-platform.html">Intel® Xeon® Scalable processors</a> and <a href="https://software.intel.com/en-us/sgx">Intel® Software Guard Extensions (Intel® SGX)</a>. We will show how Intel technology can enhance the security of federated learning by protecting both the data and the model being trained. Our hope is that Intel can provide researchers with the technology to create solutions for federated learning that will enable generalizable, state-of-the-art healthcare models while increasing the protection of sensitive patient data. <a href="https://www.intel.ai/health-and-life-sciences/">Learn more</a> about our AI initiatives in health & life sciences and <a href="https://twitter.com/intelai">follow us</a> to get the latest AI news from Intel.</p>
<p><img src="/images/FL-6-2.png" width="500" height="300" title="Comparing Federated Learning to data sharing." class="align-center" /></p>
<p>Figure 2: Comparing Federated Learning to data sharing. Training a convolutional neural network (U-Net, [10]) with Federated Learning achieves 99% of the accuracy without sharing patient data [cf. [5]].</p>
<p><img src="/images/FL-6-3.jpg" width="500" height="300" title="U-Net Model results." class="align-center" /></p>
<p>Figure 3: U-Net Model results. The final model identifies Glioma brain tumors from MRI scans with 99% of the accuracy as a model that was trained by sharing the raw MRI data, as provided by the BraTS initiative [6-9].</p>
<p>Intel is partnering with the University of Pennsylvania and 19 other medical research institutions on development of a secure federated learning platform, which will enable collaborators to train a shared machine learning model for healthcare without exchanging confidential patient data.</p>
<p>References</p>
<p>[1] Corbin K. How CIOs Can Prepare for Healthcare “Data Tsunami” [Internet]. CIO. 2014 [cited 8 FEB 2019].</p>
<p>[2] Fenton SH, Low S, Abrams KJ, Butler-Henderson K. Health Information Management: Changing with Time. IMIA Yearbook of Medical Informatics 2017.</p>
<p>[3] Stanford Medicine. 2017 Health Trends Report: Harnessing the Power of Data in Health. Accessed online 8 FEB 2019.</p>
<p>[4] Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train to a medical image deep learning system to achieve necessary high accuracy? ICLR 2016.</p>
<p>[5] Sheller MJ, Reina GA, Edwards B, Martin J, Bakas S. Multi-institutional Deep Learning Modeling Without Sharing Patient Data: A Feasibility Study on Brain Tumor Segmentation. Lecture Notes in Computer Science book series (Volume 11383). 2019.</p>
<p>[6] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, Freymann JB, Farahani K, Davatzikos C. “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features”, Nature Scientific Data, 4:170117 (2017) DOI: 10.1038/sdata.2017.117.</p>
<p>[7] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.KLXWJJ1Q.</p>
<p>[8] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.GJQ7R0EF.</p>
<p>[9] Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, Lanczi L, Gerstner E, Weber MA, Arbel T, Avants BB, Ayache N, Buendia P, Collins DL, Cordier N, Corso JJ, Criminisi A, Das T, Delingette H, Demiralp Ç, Durst CR, Dojat M, Doyle S, Festa J, Forbes F, Geremia E, Glocker B, Golland P, Guo X, Hamamci A, Iftekharuddin KM, Jena R, John NM, Konukoglu E, Lashkari D, Mariz JA, Meier R, Pereira S, Precup D, Price SJ, Raviv TR, Reza SM, Ryan M, Sarikaya D, Schwartz L, Shin HC, Shotton J, Silva CA, Sousa N, Subbanna NK, Szekely G, Taylor TJ, Thomas OM, Tustison NJ, Unal G, Vasseur F, Wintermark M, Ye DH, Zhao L, Zhao B, Zikic D, Prastawa M, Reyes M, Van Leemput K. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694.</p>
<p>[10] Ronneberger O, Fischer P, Brox T. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv:1505.04597v1 [cs.CV] 18 May 2015.</p>Niklaus(Yi) Liu97liuyi@gmail.comNearly 153 exabytes of healthcare-related data were generated in 2013; this number will increase by 48% annually to reach 2,314 exabytes in 2020 [1], [2], [3]. While machine learning can benefit from this “big data” to generate state-of-the-art models, most healthcare data is hard to obtain due to legal, privacy, technical, and data-ownership challenges, especially among international institutions where HIPAA and GDPR concerns need to be addressed [3], [4].Federated Learning:Bringing Machine Learning to the edge with Kotlin and Android (Reading Notes)2019-05-25T00:00:00-07:002019-05-25T00:00:00-07:00https://niklausliu.github.io/posts/2019/05/FL-5<p>With the promulgation of the General Data Protection Regulation, users are becoming more aware of their data values and privacy concerns. While anonymous technology can greatly solve the problem of privacy security, the way in which all data is sent to the central processor to train the machine learning model is always the cause of data security concerns.</p>
<p>Selected from the <a href="https://proandroiddev.com/federated-learning-e79e054c33ef">Medium Blog</a><br />
Author: Jose Corbacho</p>
<h2 id="code">Code</h2>
<p>🔗Android Application:<a href="https://github.com/mccorby/PhotoLabeller">https://github.com/mccorby/PhotoLabeller</a><br />
🔗The Server:<a href="https://github.com/mccorby/PhotoLabellerServer">https://github.com/mccorby/PhotoLabellerServer</a></p>
<h2 id="formation">Formation</h2>
<p>The project consists of three main components:</p>
<ul>
<li>The server, written in Kotlin, uses <a href="https://deeplearning4j.org/">DL4J</a> to generate a model based on the Cifar-10 dataset.</li>
<li>An Android app, also written in Kotlin with DL4J, that uses this model to classify camera images.</li>
<li>The <strong>federated learning environment</strong>, which lets the Android app train the model on local data and lets the server update the shared model from these edge updates.</li>
</ul>
<h2 id="model">Model</h2>
<p>The model is based on the Cifar-10 dataset, which classifies images from ten different categories.</p>
<p><img src="/images/FL-5-1.jpg" width="700" height="500" title="Cifar-10 dataset" class="align-center" /></p>
<p>The model chosen is a shallow convolutional neural network with one CNN layer and one dense layer. With 50 neurons and 10,000 training samples it already achieves good performance. The code for the server-side training model is located in the model module of the <strong>PhotoLabellerServer</strong> project.</p>
<p><img src="/images/FL-5-2.jpg" width="300" height="100" title="The CNN with a dense layer" class="align-center" /></p>
<p>When connected to a server holding the latest version of the shared model, the app can perform basic categorization of photos the user takes with the camera, using the model embedded in the app itself.</p>
<p><img src="/images/FL-5-3.jpg" width="500" height="300" title="The image classifier in action" class="align-center" /></p>
<p>The app is built from modules, including Android-specific classes and classes related to the Deeplearning4j trainer. The base module contains the interactors and domain objects. The trainer component handles prediction and training with DL4J; the app calls its prediction function to obtain an image classification.</p>
<p><img src="/images/FL-5-4.jpg" width="700" height="500" title="Key code" class="align-center" /></p>
<h2 id="federated-learning">Federated Learning</h2>
<p>Federated learning inverts the way machine learning models are updated by letting edge devices participate in training. Instead of sending data from the client to a centralized location, federated learning sends the model parameters to the participating devices in an encrypted manner. The model is then retrained on local data, and the user's data never leaves the device, whether a mobile phone, laptop, IoT gadget, or anything else. The server opens a <strong>training round</strong> during which clients can send their parameter updates back to the server.</p>
<p><img src="/images/FL-5-5.jpg" width="700" height="500" title="Federated Learning System" class="align-center" /></p>
<h2 id="client---edge-training">Client - edge training</h2>
<p>The Android app decides when to participate in training the shared model. Once local training completes, the updated model parameters are sent to the server.</p>
<p><img src="/images/FL-5-6.jpg" width="700" height="500" title="Key code" class="align-center" /></p>
<h2 id="server---aggregation-and-update-model">Server - aggregation and update model</h2>
<p>Once the training round is over, the server updates the shared model via the <a href="https://arxiv.org/abs/1602.05629">Federated Averaging algorithm</a>, as shown in the following figure:</p>
<p><img src="/images/FL-5-7.jpg" width="700" height="500" title="Key code" class="align-center" /></p>
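<p>The weighted averaging step can be sketched in a few lines; this is a minimal numpy illustration of the Federated Averaging idea, not the project's Kotlin/DL4J code.</p>

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Server-side Federated Averaging step: each client's parameters
    (a list of per-layer arrays) are weighted by that client's number of
    local training samples, then summed into the new shared model."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[l] * (n / total) for w, n in zip(client_weights, client_sizes))
        for l in range(n_layers)
    ]
```

<p>A client that trained on three times as many photos as another contributes three times as much to the new shared model.</p>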
<p>Any image processing on an Android device is computationally demanding, and training on images even more so, which means on-device transfer learning in an Android application needs to be fast.
Most devices are able to complete training before memory is exhausted: with roughly 450k parameters in total, the model is small enough for the app.</p>
<h2 id="background">Background</h2>
<p><strong>Electronic Medical Records (EMRs)</strong> data is often used to develop machine learning algorithms that predict disease incidence, patient response to treatment, and other medical events. So far, however, most algorithms are centralized, rarely considering <strong>non-independent and identically distributed (non-IID) data</strong>, and rarely considering that the privacy sensitivity of EMRs can complicate the learning process.</p>
<h2 id="introduction">Introduction</h2>
<p>To address the issue of decentralized data affecting the machine learning process, the authors introduced the <strong>community-based federated machine learning (CBFL) algorithm</strong> and evaluated it on non-IID ICU EMRs data.</p>
<p>The CBFL algorithm clusters the distributed data into clinically meaningful communities that capture similar diagnoses and geographic locations, and learns one model per community. Throughout the learning process, data stays in each hospital's local storage, while only local computation results are aggregated on the server.</p>
<p>The experimental results show that the <strong>CBFL algorithm</strong> outperforms the baseline <strong>FL algorithm</strong> in three respects: Area Under the Receiver Operating Characteristic Curve (ROC AUC), Area Under the Precision-Recall Curve (PR AUC), and the communication cost between the hospitals and the server.</p>
<p>Next, we walk through the development and evaluation of CBFL to demonstrate how decentralized clustering and federated machine learning apply to ICU EMRs prediction.</p>
<h2 id="data-source">Data Source</h2>
<p>The CBFL was developed based on the eICU Collaborative Research Database, which contains high-quality intensive care data on 200,859 patients from 208 hospitals across the United States. The research mainly involves three dimensions:</p>
<ul>
<li>Medications administered to patients within the first 48 hours of the ICU stay</li>
<li>Unit discharge status, specifying the patient's state on leaving the ICU (mortality: 0 means survival, 1 means death)</li>
<li>Unit discharge offset, recorded from admission to discharge (ICU length of stay, averaging 3,858 minutes)</li>
</ul>
<p>In addition, the study selected 50 hospitals and randomly sampled 560 patients from each, forming a final dataset of 28,000 samples.</p>
<h2 id="cbfl-procedures">CBFL Procedures</h2>
<p><img src="/images/FL-4-1.jpg" width="700" height="500" title="FL Algorithm" /></p>
<p>The whole algorithm is divided into three steps:</p>
<ul>
<li>
<p>In the first step, the autoencoder training phase, each client receives an initial denoising autoencoder with weights <script type="math/tex">{w_0}^c</script>. After <script type="math/tex">E_1</script> local iterations, each client sends its gradient-descent-updated weights to the server, which averages them;</p>
</li>
<li>
<p>In the second step, the trained encoder extracts the feature values <script type="math/tex">X_c</script> on each client, and the averaged feature values are returned. The server then initializes <script type="math/tex">K</script> clusters and uses <strong>K-means clustering</strong> to divide these feature values among the clusters, yielding the <strong>k-means model <script type="math/tex">f_{kmeans}</script></strong>;</p>
</li>
<li>
<p>In the third step, the server distributes <script type="math/tex">f_{kmeans}</script> and the encoder to each client. Each client uses <script type="math/tex">f_{kmeans}</script> to assign the features extracted by the encoder to one of the <script type="math/tex">K</script> communities. The server then initializes <script type="math/tex">K</script> models <script type="math/tex">f_1</script> to <script type="math/tex">f_K</script>; each <script type="math/tex">f_i</script> is trained only on each client's features assigned to community <script type="math/tex">i</script>, the averaged client weights being folded into <script type="math/tex">f_i</script> at each iteration, until <script type="math/tex">f_1</script> through <script type="math/tex">f_K</script> converge.</p>
</li>
</ul>
<p>When a new sample needs to be predicted, the encoder extracts its features, <script type="math/tex">f_{kmeans}</script> determines its community <script type="math/tex">i</script>, and <script type="math/tex">f_i</script> produces the prediction.</p>
<p><img src="/images/FL-4-2.jpg" width="700" height="500" title="FL Procesures" /></p>
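<p>The three steps above can be condensed into a small sketch. This is a toy Python illustration, not the paper's code: the autoencoder is omitted (features are assumed already encoded), and a per-community label mean stands in for each community model <script type="math/tex">f_i</script>.</p>

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means over the (already encoded) feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def assign_community(x, centers):
    """Send a feature vector to its nearest community center."""
    return int(np.argmin(((centers - x) ** 2).sum(-1)))

def train_community_models(X, y, k):
    """One model per community; here a per-community label mean is a
    stand-in for the real community model f_i."""
    centers = kmeans(X, k)
    assign = np.array([assign_community(x, centers) for x in X])
    models = {j: float(y[assign == j].mean()) for j in set(assign.tolist())}
    return centers, models

def predict(x, centers, models):
    """Encode (omitted here), pick the community, apply its model."""
    return models[assign_community(x, centers)]
```

<p>Grouping similar patients first means each community model only has to fit a homogeneous subpopulation, which is the intuition behind CBFL's advantage over a single global model.</p>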
<h3 id="1-community-analysis">1. Community Analysis</h3>
<p>Patient clustering is a key step in the algorithm: because patients with similar characteristics are grouped together, community-based learning is easier than learning a single model over all patients. To illustrate the common characteristics of patients in the same community, the 28,000 patients were divided into 5 communities and an enrichment analysis of the diagnoses was conducted. The table below lists the number of patients and the number of diagnosed patients per community.</p>
<p><img src="/images/FL-4-3.jpg" width="700" height="500" title="Community Analysis" /></p>
<h3 id="2-mortality-and-stay-time-prediction">2. Mortality and Stay Time Prediction</h3>
<p>The experiments focused on mortality and ICU length-of-stay prediction. The cluster parameters <script type="math/tex">K=5, 10, 15, 50</script> were tested. In addition, the experiments compared the ROC AUC curves obtained when the training and test sets come from the same hospitals with those obtained when they come from different hospitals.</p>
<p>Mortality prediction (same hospitals in training and test sets): in this study, ROC AUC is the probability that CBFL scores a randomly selected deceased patient higher than a randomly selected surviving patient, while PR AUC is the average precision, a value between 0 and 1.</p>
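<p>That probabilistic reading of ROC AUC can be checked directly: it equals the fraction of (death, survival) pairs in which the deceased patient receives the higher score, with ties counting one half. A minimal sketch:</p>

```python
def roc_auc_pairwise(scores, labels):
    """ROC AUC computed from its probabilistic definition: the chance a
    randomly chosen positive (death) outranks a randomly chosen
    negative (survival); ties contribute one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

<p>This pairwise form is equivalent to the area under the ROC curve, which is why an AUC of 0.5 corresponds to random scoring and 1.0 to perfect ranking.</p>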
<p>The figure below shows the ROC curve for FL and the ROC curve comparison of the CBFL algorithm under different conditions.</p>
<p><img src="/images/FL-4-4.jpg" width="700" height="500" title="Result" /></p>
<p>The evaluation results above show that CBFL outperforms FL on the mortality and length-of-stay prediction tasks within a short communication cycle. Communities tend to contain patients with similar diagnoses and geographic locations, making each community model easier to learn, on average, than a single model over all patients.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The CBFL algorithm is not perfect either, and has limitations in some respects. <strong>For example, with K community models trained locally, CBFL must communicate K−1 more models' worth of parameters between client and server than traditional FL, and this additional communication load grows with the number of training samples and with K.</strong> The experimental results show that <strong>CBFL performs best with 5 to 10 communities</strong>, but there is no guarantee that a small K will also be optimal when CBFL is applied to other biomedical datasets. Future research directions may include optimizing the communication load by designing more efficient community-based learning schemes, combining more dimensions beyond drug features to further improve prediction accuracy, and developing better clustering methods to capture patient characteristics.</p>
<p>Last but not least, although this study focuses on machine learning over ICU EMRs, CBFL can be extended to other bioinformatics applications involving large, distributed, and privacy-sensitive data, such as medical image recognition or medical planning decisions across multiple medical sites.</p>
<p>🔗<a href="https://arxiv.org/abs/1903.09296">Paper Link</a></p>Niklaus(Yi) Liu97liuyi@gmail.comElectronic Medical Records (EMRs) data is often used in the development of machine learning algorithms to predict disease incidence, patient response to treatment, and other medical events. But so far, most of the algorithms are centralized, rarely considering non-identically independent distributed (non-IID) data, and rarely considering the privacy sensitivity of EMRs can complicate the learning process of data.Towards Federated Learning at Scale:System Design (Reading Notes)2019-04-25T00:00:00-07:002019-04-25T00:00:00-07:00https://niklausliu.github.io/posts/2019/04/FL-3<p>Now, Google has implemented the first product-level <strong>Federated Learning System</strong> and published the paper “Towards Federated Learning at Scale: System Design.” The paper further introduces the system design of federated learning and describes the design philosophy and existing challenges of this system. Moreover, Google put forward his solution.</p>
<p>DeepMind research scientist Andrew Trask said on Twitter: “This is one of the most exciting papers of 2019. Google has announced how they can implement scalable federated learning solutions on tens of millions of mobile phones.”</p>
<p><img src="/images/FL-3-1.jpg" width="700" height="500" title="Andrew Trask said on Twitter" /></p>
<p>The system described in this project is the first <strong>Production-level</strong> federated learning system implementation, with a primary focus on running the <strong>Federated Averaging Algorithm</strong> on mobile phones. The goal is to extend Google’s systems from federated learning to <strong>Federated Computing</strong>: the system is not limited to machine learning calculations in TensorFlow, but supports <strong>MapReduce-like</strong> workloads in general. One application area is <strong>Federated Analytics</strong>, which allows monitoring statistics over <strong>large-scale fleets of devices</strong> without logging raw device data to the cloud.</p>
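<p>The federated analytics idea can be illustrated with a toy sketch (the function names are hypothetical): each device reports only an aggregate summary, and the server combines the summaries without ever seeing raw records.</p>

```python
def device_summary(values):
    """A device reports only an aggregate (count, sum), never the raw
    per-user records behind it."""
    return len(values), sum(values)

def fleet_mean(summaries):
    """The server combines per-device summaries into a fleet-wide mean
    without any raw device data reaching the cloud."""
    n = sum(count for count, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / n
```

<p>The same pattern generalizes to any statistic that decomposes into per-device aggregates (counts, sums, histograms).</p>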
<h2 id="federated-learning-procedures">Federated Learning Procedures</h2>
<p>As shown in the following figure, the federated learning process can be divided into three phases, namely Selection, Configuration, and Reporting.</p>
<p><img src="/images/FL-3-2.jpg" width="700" height="500" title="Federated learning procedures" /></p>
<p><strong>In the Selection phase:</strong> A device that meets certain conditions sends the server a request indicating that it can participate in this round of training. After receiving the requests, the server selects a subset of the devices to participate. Devices not selected for this round are asked to re-request after a period of time, with the server taking into account factors such as the number of participating devices and the timeout period. The round succeeds only if enough devices check in before the timeout.</p>
<p><strong>In the Configuration phase:</strong> On the server side, configuration mainly concerns how the model updates will be aggregated; on the device side, the server sends each device the specific FL task and the current FL checkpoint.</p>
<p><strong>In the Reporting phase:</strong> The server waits for each device to return a complete result. When a device returns its result, the server aggregates it using the aggregation algorithm and then tells the device when to make its next request. If enough devices return results before the timeout, the round succeeds; otherwise it fails.</p>
<p>Throughout the system, <strong>Pace Steering</strong> manages device connectivity. For small-scale FL training, Pace Steering guarantees that enough devices participate in each round. For large-scale FL training, it randomizes the devices' request times, avoiding the problems caused by a large number of devices requesting at the same time.</p>
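<p>The three phases can be simulated in a few lines. This is a toy sketch of the round structure, not Google's implementation: selection picks a subset of the checked-in devices, and the round succeeds only if enough reports arrive before the timeout.</p>

```python
import random

def run_round(device_updates, goal_count, min_reports, seed=0):
    """One Selection/Configuration/Reporting round: select up to
    goal_count devices, collect reports (None = device dropped out), and
    succeed only if at least min_reports arrive before the timeout."""
    rng = random.Random(seed)
    ids = sorted(device_updates)
    selected = rng.sample(ids, min(goal_count, len(ids)))        # Selection
    reports = [device_updates[d] for d in selected               # Reporting
               if device_updates[d] is not None]
    if len(reports) < min_reports:
        return None                      # round failed; devices retry later
    return sum(reports) / len(reports)   # plain average as the aggregate
```

<p>In production the aggregate would be the Federated Averaging update and the dropped devices would be re-scheduled by Pace Steering; the sketch only shows the success/failure logic of a round.</p>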
<h2 id="algorithm">Algorithm</h2>
<p>A key question in the design of federated learning infrastructure is whether to build on asynchronous or synchronous training algorithms. Much earlier deep learning work adopted asynchronous training, but recently there has been a trend toward large-scale synchronous training. Methods for enhancing privacy in federated learning, such as differential privacy (McMahan et al., 2018), also require some notion of synchronization over a fixed set of devices, so that the server side of the learning algorithm consumes only a simple aggregation of the updates from a large number of users.</p>
<p>Therefore, Google researchers chose synchronous training, running large-scale SGD and the Federated Averaging algorithm, the main algorithms in production operation. The algorithm code is shown below:</p>
<p><img src="/images/FL-3-3.jpg" width="700" height="500" title="FL algorithm" /></p>
<p><script type="math/tex">K</script> denotes the set of all nodes, <script type="math/tex">B</script> the local minibatch size, <script type="math/tex">E</script> the number of local training epochs, and <script type="math/tex">C</script> the fraction of nodes selected each round. The algorithm has a server side and a client side. On the server side: the server first selects specific nodes to participate in the current round, then transmits the current model to each node; after the nodes return their trained models, the server takes a weighted average of each parameter. On the client side: the node first splits its local data into batches of size <script type="math/tex">B</script>, then trains on these batches using gradient descent, and finally returns the updated model to the server. In the later experiments there is a comparison baseline, FederatedSGD: when <script type="math/tex">E=1</script> and <script type="math/tex">B=\infty</script> (a single full-batch step per round), the algorithm reduces to Federated SGD.</p>
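<p>The client-side step can be sketched for a least-squares model; this is an illustrative Python version of local training with minibatch size <script type="math/tex">B</script> and <script type="math/tex">E</script> epochs, not the paper's TensorFlow code.</p>

```python
import numpy as np

def client_update(w, X, y, B, E, lr=0.1):
    """One client's local FedAvg step: E epochs of minibatch gradient
    descent with batch size B on a least-squares loss, starting from the
    server's current weights w and returning the updated weights."""
    w = w.astype(float).copy()
    n = len(X)
    for _ in range(E):
        for start in range(0, n, B):
            xb, yb = X[start:start + B], y[start:start + B]
            grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)   # d/dw of MSE
            w -= lr * grad
    return w
```

<p>With E=1 and B covering all local data in one batch, this collapses to the FederatedSGD baseline: one full-batch gradient step per round.</p>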
<p>The system described in the paper uses <strong>TensorFlow</strong> to train deep neural networks on data stored on mobile phones. The <strong>Federated Averaging algorithm</strong> combines the locally trained weights in the cloud to construct a global model, which is pushed back to the phones for inference. The implementation of Secure Aggregation ensures that no individual phone's update can be inspected on the server.</p>
<p><img src="/images/FL-3-4.jpg" width="700" height="500" title="Equipment architecture diagram" /></p>
<p>The paper points out that the system solves many practical problems: device availability that depends in complex ways on the local data distribution (such as time-zone dependence), unreliable device connections and interrupted execution, lock-step scheduling across devices with differing availability, and limited storage and compute resources. These issues are addressed at the communication protocol, device, and server levels.</p>
<p><img src="/images/FL-3-5.jpg" width="700" height="500" title="Components in the federated learning server architecture" /></p>
<p>The data of the federated learning system on the device is more relevant and more sensitive to privacy than the data present on the server. Currently, federated learning is mostly used to supervise learning tasks, usually using tags that are inferred from user activity (such as clicks or typing).</p>
<h2 id="experiment">Experiment</h2>
<p>To evaluate the effectiveness of the federated learning system, the data set used by the project can be divided into two parts, a common data set and a real-world data set.</p>
<p>The generic datasets are MNIST and the Shakespeare corpus; the authors split each into an IID and a non-IID partition. They built a multi-layer perceptron and a CNN to train on MNIST, and an LSTM to train on the Shakespeare corpus. They found that when <script type="math/tex">B \ne \infty</script>, increasing parallelism (the number of nodes training per round) effectively reduces the number of communication rounds, increasing the amount of local training per node reduces it further, and the <strong>FedAvg</strong> algorithm outperforms the <strong>FedSGD</strong> algorithm. However, increasing the number of local training epochs too far does not yield better results.</p>
<p>The real-world datasets are the CIFAR dataset and a social network dataset. The authors built a CNN model to train on CIFAR and an LSTM to train on the social network data. <strong>FedAvg</strong> needs fewer rounds to reach the same accuracy, and its final accuracy is better than the baseline <strong>FedSGD</strong>.</p>
<h2 id="application">Application</h2>
<h3 id="on-device-item-ranking">On-device item ranking</h3>
<p>A common use of machine learning models in mobile applications is selecting and ranking items from an on-device inventory. For example, an app can expose a search feature for information retrieval or in-app navigation. Ranking the search results on the device eliminates costly server calls (costly due to latency, bandwidth limits, or power consumption), and any potentially private information about the search query and the user's choices stays on the device. Each user interaction with the ranking feature yields a labeled data point, since the user's preference among the fully ranked list of items can be observed.</p>
<h3 id="content-suggestions-for-on-device-keyboards">Content suggestions for on-device keyboards</h3>
<p>The federated learning system can add value for users by recommending content relevant to what they type. Federated learning can train the machine learning models that trigger the suggestion feature and rank the items that can be suggested in the current context. Google’s Gboard mobile keyboard team is already using this federated learning system for exactly this approach.</p>
<h3 id="next-word-prediction">Next word prediction</h3>
<p>Gboard also uses the federated learning platform to train recurrent neural networks (RNNs) for next-word prediction. The model has about 1.4 million parameters. After 5 days of training, processing 600 million sentences from 1.5 million users, it converged after 3,000 rounds of federated learning (about 2-3 minutes per round). The model increases the maximum recall rate of the baseline n-gram model from 13.0% to 16.4%, and its performance is comparable to that of the 120-step server-trained RNN. In live comparison experiments, the performance of the federated learning model is better than that of both the n-gram and server-trained RNN models.</p>
<h2 id="future-research-direction">Future research direction</h2>
<h3 id="state-analysis">State analysis</h3>
<p>System crashes and other problems can occur during federated training. Since most training activity happens on devices that the server cannot control or inspect, the server cannot determine what went wrong when a crash or other problem is discovered. The system therefore needs cooperative analysis between the devices and the server: during each round of training, devices log their activity and health parameters. These data are generally not privacy-sensitive, so they can be uploaded to the cloud, and the server collects analogous data of its own, such as how many devices connected or were rejected. By analyzing these data, engineers can understand what happened, what went wrong, and how to fix it.</p>
<h3 id="secure-aggregation">Secure aggregation</h3>
<p>In order to further protect the privacy of each device, a secure aggregation protocol can be adopted. Secure aggregation runs during the reporting phase of federated learning and proceeds in several phases. The first is the preparation phase, in which each device generates the information to be shared; if a device drops out at this stage, its result is excluded from the final aggregate. The second is the commit phase, in which each device encrypts its own result and uploads the encrypted result to the server. The final phase is the finalization phase, in which each device transmits its decoding information to the server, and the server uses it to decode and aggregate the results.</p>
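<p>The cancellation idea at the heart of secure aggregation can be sketched in a few lines (a toy version with integer updates and a shared seed; the real protocol derives pairwise masks via key agreement and tolerates dropouts):</p>

```python
import random

def masked_reports(values, seed=0):
    """Each device hides its value with pairwise random masks.

    For every device pair (i, j) with i < j, both derive the same
    random mask; i adds it and j subtracts it, so all masks cancel
    when the server sums the reports.
    """
    n = len(values)
    rng = random.Random(seed)
    masks = {(i, j): rng.randint(-10**6, 10**6)
             for i in range(n) for j in range(i + 1, n)}
    reports = []
    for i, v in enumerate(values):
        masked = v
        for j in range(n):
            if i < j:
                masked += masks[(i, j)]
            elif j < i:
                masked -= masks[(j, i)]
        reports.append(masked)
    return reports

device_updates = [3, 7, 5, 1]
reports = masked_reports(device_updates)
# Individual reports look random, but the aggregate is exact.
print(sum(reports))   # 16, equal to sum(device_updates)
```

The server learns only the sum, which is the quantity federated averaging needs.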
<p>Secure aggregation increases the computational cost on the server side, which limits the number of devices that can participate in training. To address this, the authors use a hierarchy of aggregators: each intermediate aggregator runs secure aggregation over the devices it is responsible for, producing an intermediate value, and the master aggregator then aggregates those intermediate values.</p>
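<p>The hierarchy can be sketched with scalar updates standing in for model parameters (function names are illustrative, not from the paper):</p>

```python
def aggregate(updates):
    """Average a list of model updates (here, plain floats)."""
    return sum(updates) / len(updates)

def hierarchical_average(device_updates, group_size):
    """Two-level aggregation: intermediate aggregators each average
    their own group of devices, then the master aggregator averages
    the per-group means, weighted by group size."""
    groups = [device_updates[i:i + group_size]
              for i in range(0, len(device_updates), group_size)]
    partials = [(aggregate(g), len(g)) for g in groups]
    total = sum(n for _, n in partials)
    return sum(mean * n for mean, n in partials) / total

updates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(hierarchical_average(updates, 2))   # same value as aggregate(updates)
```

Because the per-group averages are weighted by group size, the two-level result equals the flat average, but no single aggregator touches every device.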
<h3 id="developer-tools-and-workflows">Developer tools and workflows</h3>
<p>Federated learning faces challenges that traditional model training does not. First, since the server cannot inspect each node’s training data, specific tools are needed to perform pre-training and simulation on proxy data during modeling and initialization. Second, federated learning models cannot be run interactively; they must be pre-compiled and deployed to the server, and the resource consumption and scalability of the final model must be tested in advance. For these reasons, the authors designed a set of Python interface tools and workflows to help developers with these problems.</p>
<p>In the model design and simulation phase, engineers can use tools and library files to design federated learning tasks, build models and pre-train, and simulate the entire training process to generate parameters that can be used as initialization parameters for formal training.</p>
<p>In the plan generation phase, each federated learning task is associated with a federated learning plan, generated automatically from the model and its configuration. Each plan is divided into two parts, one for the server and one for the device, and the libraries the authors designed help engineers separate the two parts automatically.</p>
<p>In the deployment phase, a federated learning plan can be deployed to the server only when certain conditions are met, and versioning is another challenge in federated learning. To overcome it, the tooling helps developers generate versions of a plan that are compatible with different TensorFlow runtimes, mainly by rewriting the computation graph.</p>
<p>Finally, in the evaluation phase, as described earlier, each node will record some additional information to assist in the analysis, and engineers can use the provided analysis tools to analyze the data.</p>
<h3 id="communication-efficiency">Communication efficiency</h3>
<p>Given the limitations of each device, communication efficiency may become a bottleneck for federated learning, and several techniques exist to improve it. There are two main approaches, both implemented by modifying the transmitted update: the first transmits a structured update, and the second transmits a sketched (compressed) update.</p>
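<p>The second kind can be illustrated with a toy sketch that subsamples coordinates and quantizes the survivors (the keep fraction and level count are illustrative; sketched updates in the literature combine subsampling, quantization and random rotations):</p>

```python
import random

def sketch_update(update, keep_fraction=0.25, levels=16, seed=0):
    """Compress a model update before upload: keep a random subset of
    coordinates, then quantize each kept value to a small number of
    levels between the update's min and max."""
    rng = random.Random(seed)
    k = max(1, int(len(update) * keep_fraction))
    kept = sorted(rng.sample(range(len(update)), k))
    lo, hi = min(update), max(update)
    step = (hi - lo) / (levels - 1) or 1.0
    # index -> quantized value; far fewer bits than the full update
    return {i: lo + round((update[i] - lo) / step) * step for i in kept}

update = [0.01 * i for i in range(100)]
sketch = sketch_update(update)
print(len(sketch), "of", len(update), "coordinates uploaded")
```

The server reconstructs a sparse, low-precision estimate of each update; averaging over many devices washes out the per-device compression error.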
<h2 id="conclusion">Conclusion</h2>
<p>In this work, Google’s researchers describe the main components of the system and the challenges they face, identify which issues remain unresolved, and express the hope that these efforts will inform further systematic research.</p>
<p>🔗<a href="https://arxiv.org/abs/1902.01046">Paper Link</a></p>
<h1 id="federated-learning-system">Federated Learning System (2019-04-15)</h1>
<h2 id="architecture-for-a-federated-learning-system">Architecture for a Federated Learning System</h2>
<p>In this section, we use the vertically federated learning as an example to introduce the architecture of the federated learning system and to explain the detailed process of how it works. <br />
First, let’s take the scenario of two data owners (i.e., companies A and B) as an example to introduce the architecture of the federated learning system; it can be extended to scenarios with multiple data owners. Suppose that companies A and B want to jointly train a machine learning model, and each of their business systems has its own data. In addition, company B also holds the labels that the model needs to predict. For data privacy and security reasons, A and B cannot exchange data directly. In this situation, the model can be built using the federated learning system, which consists of two parts, as shown in Figure 1a.</p>
<p><strong>Part 1: Encrypted entity alignment.</strong> Since the user groups of the two companies are not the same, the system uses the encryption-based user ID alignment technology to confirm the common users of both parties without A and B exposing their respective data, and the system does not expose users that do not overlap with each other.</p>
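<p>The alignment idea can be sketched with salted hashing as a simplified stand-in for the encryption-based protocol (a real system would use a private set intersection protocol so that non-overlapping IDs stay hidden even from a party able to guess candidate IDs):</p>

```python
import hashlib

def blind(ids, salt):
    """Hash each user ID with a shared secret salt; the parties then
    compare blinded IDs rather than raw ones."""
    return {hashlib.sha256((salt + uid).encode()).hexdigest(): uid
            for uid in ids}

salt = "shared-secret"                   # agreed out of band
a_ids = blind({"u1", "u2", "u3"}, salt)  # company A's users
b_ids = blind({"u2", "u3", "u4"}, salt)  # company B's users

# Only blinded values cross the boundary; each party recovers the
# common users from the intersection of blinded IDs.
common = [a_ids[h] for h in a_ids.keys() & b_ids.keys()]
print(sorted(common))   # ['u2', 'u3']
```

Only the intersection becomes known to both sides; neither party learns which of its non-overlapping users the other holds.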
<p><strong>Part 2 : Encrypted model training.</strong> After determining the common entities, we can use these common entities’ data to train the machine learning model. In order to ensure the confidentiality of the data during the training process, it is necessary to use a third-party collaborator C for encryption. Taking the linear regression model as an example, the training process can be divided into the following four steps (as shown in Figure 1b):</p>
<ul>
<li>Step ①: collaborator C creates encryption key pairs and sends the public key to A and B;</li>
<li>Step ②: A and B encrypt and exchange the intermediate results needed for their gradient and loss calculations;</li>
<li>Step ③: A and B compute their encrypted gradients respectively, B also computes the encrypted loss, and both send the encrypted values to C;</li>
<li>Step ④: C decrypts the gradients and loss and sends them back to A and B, who update their model parameters accordingly.</li>
</ul>
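<p>The four steps can be sketched for a single linear-regression sample, with one-time additive pads standing in for the homomorphic encryption pairs (a toy illustration: here only the masked residual goes to C, whereas the full protocol sends masked gradients and loss; the names <code>enc</code>, <code>dec</code>, <code>pad</code> are illustrative):</p>

```python
import random

rng = random.Random(42)

# Step 1: collaborator C issues "keys" -- here one-time random pads,
# a toy stand-in for the additively homomorphic encryption pairs.
pad = {"A": rng.random(), "B": rng.random()}

def enc(value, party):            # "encrypt": add the party's pad
    return value + pad[party]

def dec(value):                   # only C knows the pads and can remove them
    return value - sum(pad.values())

# Local data: A holds feature x_a; B holds feature x_b and the label y.
x_a, theta_a = 2.0, 0.5
x_b, theta_b, y = 1.0, 0.25, 1.5

# Step 2: A and B exchange masked partial predictions.
u_a = enc(theta_a * x_a, "A")     # sent from A to B
u_b = enc(theta_b * x_b, "B")     # sent from B to A

# Step 3: the parties form the masked residual and send it to C.
residual_masked = u_a + u_b - y

# Step 4: C decrypts; A and B update their own parameters locally.
residual = dec(residual_masked)   # prediction - y = 1.25 - 1.5
lr = 0.1
theta_a -= lr * residual * x_a    # 0.5  -> 0.55
theta_b -= lr * residual * x_b    # 0.25 -> 0.275
```

Each party only ever sees masked intermediate values and its own features; C sees aggregate quantities but no raw data.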
<p>The above steps are iterated until the loss function converges, completing the training process. During entity alignment and model training, the data of A and B are kept locally, and the data interaction during training does not leak privacy. Thus, with the help of federated learning, the two parties cooperatively train a common model.</p>
<p><strong>Part 3: Incentive Mechanism.</strong> A major characteristic of federated learning is that it addresses why different organizations would jointly build a model. After the model is built, its performance is manifested in actual applications and recorded in a permanent data-recording mechanism (such as a blockchain). Organizations that provide more data are rewarded more, because the model’s effectiveness depends on each data provider’s contribution to the system. The benefits of these models are distributed to the parties through federated mechanisms, which continues to motivate more organizations to join the data federation.</p>
<p>The implementation of the above three parts not only addresses privacy protection and effectiveness in collaborative modeling among multiple organizations, but also considers how to reward organizations that contribute more data and how to implement those incentives through a consensus mechanism. Therefore, federated learning is a “closed-loop” learning mechanism.</p>
<p><img src="/images/FL-2-1.png" width="700" height="500" title="FL System" /></p>
<h1 id="federated-learning">Federated Learning (2019-03-16)</h1>
<p>The federated learning framework intends to let industries use data across organizations effectively and accurately while meeting privacy, security and regulatory requirements, and to build more flexible and powerful models that enable business cooperation by using data collectively without exchanging it directly.</p>
<h1 id="1-introduction">1. Introduction</h1>
<h2 id="11-background">1.1 Background</h2>
<p>The success of AI relies on the availability of big data. Deep learning systems that recognize images require tens of millions of training images to reach top performance. This is true not only in computer vision, but also in speech recognition, question-answering chatbots, and the large-scale recommendation and prediction systems that power e-commerce. A typical example is <strong>AlphaGo</strong> in 2016, which used a total of 300,000 games as training data and achieved excellent results. With AlphaGo’s success, people naturally hope that big-data-driven AI like AlphaGo will soon be realized in all aspects of life. The real situation, however, is disappointing: with the exception of a few industries, most fields have only limited or poor-quality data, making the realization of AI technology difficult. The majority of organizations and applications have only <strong>small data</strong>, as data collection is often costly, if not impossible. That is the case for many medical applications such as diagnosis, drug design, and health care. Many of these datasets are scattered across different organizations, departments, and businesses.</p>
<p>These data sets look like <strong>isolated islands</strong> on a vast ocean, and we may refer to this as the <strong>small-data problem</strong>. At the same time, it is hard to break the barriers between data sources. In general, the data required by AI spans multiple fields. For example, in an AI-driven product recommendation service, the seller has information about the product and the user’s purchase history, but no data on the user’s purchasing power or payment habits. In most industries, data exists in the form of isolated islands. Due to industry competition, privacy and security concerns, and complicated administrative procedures, even data integration between different departments of the same company faces heavy resistance. Integrating the data scattered across the country and across institutions is almost impossible, or prohibitively costly.</p>
<h2 id="12-the-gdpr-and-new-challenge-of-ai">1.2 The GDPR and New Challenge of AI</h2>
<p>On the other hand, with the advance of big data, the emphasis on data privacy and security has become a worldwide trend. Every leak of public data causes great concern in the media and the public; the recent data breach of Facebook, for example, triggered a wide range of protests. At the same time, countries are strengthening the protection of data security and privacy. Take the <strong>General Data Protection Regulation (GDPR)</strong>[1], enforced by the European Union since May 25, 2018, as an example. The GDPR aims to protect users’ personal privacy and data security. It requires businesses to use clear and plain language in their user agreements and grants users the “right to be forgotten”: users can have their personal data deleted or withdrawn. The GDPR bans nearly all autonomous collection, transfer, and use of user data. This means it is no longer acceptable to simply collect data from various sources and integrate it in one location without user permission. Likewise, many operations that were routine in the big-data domain, such as merging user data from several parties to build an AI model without any user agreement, are considered illegal under the new regulatory framework. The GDPR brings a fundamental shift in the protection of data and privacy, reshaping how businesses operate; companies face serious monetary fines for violating the regulation.</p>
<p>In the field of AI, the traditional data processing model often involves one party collecting data and transferring it to another party for processing, cleaning and modeling, with the resulting model finally sold to a third party. As regulation and monitoring become stricter, such a pipeline may break the law if the collector or the user is left unclear about the specific use of the model. Our data already sits in isolated islands. A straightforward solution would be to collect all the data in one place for processing, but this is now illegal, because the law does not allow businesses to arbitrarily consolidate data. How to legally solve the problem of isolated data islands is a major challenge for AI scholars and practitioners, because the big-data dilemma could otherwise lead to the next AI winter.</p>
<h2 id="13-federated-learning-a-feasible-solution">1.3 Federated Learning: a Feasible Solution</h2>
<p>It is thus argued that for AI to be a genuinely successful and transformative technology, efforts are needed on two fronts: the small-data problem and the data-privacy problem. Traditional methods for resolving this big-data dilemma have run into bottlenecks. Simply exchanging data between two companies is not allowed by many regulations, including the GDPR. First, users own their original data, and a company cannot exchange it without their approval. Second, the use of models cannot be changed beyond what users have approved.</p>
<p>Therefore, many past attempts at exchanging data, such as data exchanges, also require drastic changes to become compliant. At the same time, the data owned by commercial companies often has huge potential value, so two organizations, or even two departments of the same organization, must weigh their interests before exchanging data. Under this premise, departments often choose not to consolidate data with one another, and data remains in isolated islands even within the same company.</p>
<p>The federated learning framework intends to let industries use data across organizations effectively and accurately while meeting privacy, security and regulatory requirements, and to build more flexible and powerful models that enable business cooperation by using data collectively without exchanging it directly.</p>
<p>Federated learning is a system that:</p>
<ul>
<li>Data remains distributed across the data owners, with no privacy leakage and no compliance violation.</li>
<li>Multiple data parties build a virtual shared model under a data federation system, gaining mutual benefit from it.</li>
<li>Under such a federated mechanism, the identity and status of each participant are the same.</li>
<li>The virtual model has the same, or nearly the same, performance as a model built by putting all the data together.</li>
</ul>
<p>Federated learning permits learning while the multiple data sets stay put: no exchange of raw data is needed to protect privacy and secrecy, providing a feasible solution to the data isolation problem.</p>
<h1 id="2-federated-learning">2. Federated Learning</h1>
<h2 id="21-definition-of-federated-learning">2.1 Definition of Federated Learning</h2>
<p>Define N data owners <script type="math/tex">F_i, i=1 \dots N</script>, all of whom wish to train a machine learning model by consolidating their respective data <script type="math/tex">D_i</script>. A conventional method is to put all the data together and use <script type="math/tex">D=\{D_i, i=1 \dots N\}</script> to train a model <script type="math/tex">M_{SUM}</script>. However, this solution is often impossible to implement because of legal issues such as privacy and data security. To solve this problem, we propose federated learning: a learning process in which the data owners collaboratively train a model <script type="math/tex">M_{FED}</script> while no owner <script type="math/tex">F_i</script> exposes its data <script type="math/tex">D_i</script> to the others. In addition, the performance <script type="math/tex">V_{FED}</script> of <script type="math/tex">M_{FED}</script> should be very close to the performance <script type="math/tex">V_{SUM}</script> of <script type="math/tex">M_{SUM}</script>. That is,</p>
<script type="math/tex; mode=display">|V_{FED}-V_{SUM}| \leqslant \delta</script>
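<p>As a toy check of this condition, consider a “model” simple enough to be fit from aggregate statistics, namely estimating a mean. The federated estimate then matches the pooled one exactly, so the bound holds for any positive δ (a minimal sketch, not the general case):</p>

```python
# Two "owners" each hold part of the data; neither shares raw samples.
d1 = [2.0, 4.0, 6.0]
d2 = [1.0, 3.0]

# M_SUM: the model (here, a mean estimate) trained on pooled data.
pooled = sum(d1 + d2) / len(d1 + d2)

# M_FED: each owner reports only its (local sum, count);
# the coordinator combines the statistics without seeing raw data.
stats = [(sum(d1), len(d1)), (sum(d2), len(d2))]
federated = sum(s for s, _ in stats) / sum(n for _, n in stats)

assert abs(federated - pooled) <= 1e-9   # |V_FED - V_SUM| <= delta
```

For richer models the federated result is only approximately equal to the pooled one, which is exactly what the δ in the definition allows.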
<h2 id="22--categorization-of-federated-learning">2.2 Categorization of Federated Learning</h2>
<p>The above definition of federated learning does not discuss how to design and implement it concretely. In practice, the island data has different distribution characteristics, and a corresponding federated learning framework can be proposed for each. Below, we classify federated learning based on the feature and sample-ID distribution characteristics of the island data.</p>
<p>Considering that there are multiple data owners, the data set <script type="math/tex">D_i</script> held by each owner can be represented as a matrix: each row corresponds to a user and each column to a user feature. Some data sets may also contain labels; to build a predictive model of user behavior, label data is required. We denote the user features by X and the labels by Y. For example, in the financial field the label Y to predict is the user’s credit; in marketing it is the user’s purchase intent; in education it is the student’s degree of mastery. The features X together with the labels Y constitute the complete training data set (X, Y). In reality, however, the users of the various data sets are often not identical, and neither are the user features. Specifically, taking federated learning with two data owners as an example, the data distribution falls into the following three cases:</p>
<ul>
<li>The overlap of features (X1, X2, …) is large, whereas the overlap of users (U1, U2, …) is small;</li>
<li>The overlap of users (U1, U2, …) is large, whereas the overlap of features (X1, X2, …) is small;</li>
<li>The overlap of users (U1, U2, …) and the overlap of features (X1, X2, …) are both small.</li>
</ul>
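<p>The three cases above can be turned into a small dispatch rule. A minimal sketch (the overlap measure and the 0.5 threshold are illustrative choices, not from the text):</p>

```python
def choose_framework(users_a, users_b, feats_a, feats_b, threshold=0.5):
    """Pick a federated learning setting from the overlap of user IDs
    and feature names."""
    def overlap(s, t):
        return len(s & t) / min(len(s), len(t))
    user_ov = overlap(users_a, users_b)
    feat_ov = overlap(feats_a, feats_b)
    if feat_ov >= threshold and user_ov < threshold:
        return "horizontal federated learning"
    if user_ov >= threshold and feat_ov < threshold:
        return "vertical federated learning"
    if user_ov < threshold and feat_ov < threshold:
        return "federated transfer learning"
    return "data can largely be joined directly"

# Two regional banks: same features, mostly different users.
print(choose_framework({"u1", "u2"}, {"u3", "u4"},
                       {"age", "income"}, {"age", "income"}))
# -> horizontal federated learning
```

The three branches correspond one-to-one to the three bullet cases above.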
<p>In order to provide solutions for the above three scenarios, we classify federated learning into horizontally federated learning, vertically federated learning and federated transfer learning (shown in Figure 1). <a href="https://www.fedai.org.cn/">Ref WeBank</a></p>
<p><img src="/images/FL-1-1.png" width="700" height="500" title="Categorization of Federated Learning" /></p>
<h2 id="221-horizontally-federated-learning">2.2.1 Horizontally Federated Learning</h2>
<p>In scenarios where two data sets share the same feature space but differ in samples, the federated learning system is called horizontal federated learning. For example, two regional banks may have very different user groups drawn from their respective regions, so the intersection of their users is very small. However, their businesses are very similar, so the recorded user features are the same. In this case, a horizontal federated learning model can be built. In 2017, Google proposed a horizontal federated learning solution for Android phone model updates[6-7]: a single Android user continually updates the model parameters locally and uploads them to the cloud, thus jointly training the centralized model together with other data owners. A secure aggregation scheme that protects the privacy of aggregated user updates within this federated learning framework was also introduced.</p>
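<p>The federated averaging idea behind this solution can be sketched with a one-parameter squared-error model standing in for the real on-device network (all names, data and constants are illustrative):</p>

```python
def local_step(weights, data, lr=0.1):
    """One epoch of SGD on a device's private data for a scalar
    mean-squared-error model."""
    w = weights
    for x in data:
        w -= lr * 2 * (w - x)   # gradient of (w - x)^2
    return w

def fedavg_round(global_w, device_datasets):
    """Each device trains locally and uploads only its new weights;
    the server averages them, weighted by local dataset size."""
    results = [(local_step(global_w, d), len(d)) for d in device_datasets]
    total = sum(n for _, n in results)
    return sum(w * n for w, n in results) / total

devices = [[1.0, 1.2], [0.8], [1.1, 0.9, 1.0]]
w = 0.0
for _ in range(50):
    w = fedavg_round(w, devices)
print(round(w, 2))   # settles near the overall data mean of 1.0
```

Only the weights cross the network; the per-device datasets never leave their owners, which is exactly the property the Android solution relies on.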
<h2 id="222-vertically-federated-learning">2.2.2 Vertically Federated Learning</h2>
<p>Vertically federated learning applies to cases where two data sets share the same users but differ in feature space. For example, consider two different companies in the same city: one is a bank, the other an e-commerce company. Their user bases are likely to contain most of the residents of the area, so their set of common users is large. However, since the bank records the user’s income, expenditure, and credit rating while the e-commerce company retains the user’s browsing and purchase history, their user features are very different. Vertically federated learning aggregates these different features in an encrypted state and computes the training loss and gradients in a privacy-preserving manner, so as to build a model from both parties’ data collaboratively. At present, machine learning models such as logistic regression, tree-structured models, and neural-network-based models have all been shown to fit into this federated system.</p>
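<p>The feature split can be sketched as each party computing only its share of a linear score; a minimal example (the features, weights and in-the-clear combination are illustrative — the real system exchanges these shares under encryption):</p>

```python
import math

# Vertically split features for the same user: the bank holds
# financial features, the e-commerce firm holds behavioral ones.
bank_x, bank_theta = [0.5, 1.2], [0.3, -0.1]   # e.g. income, credit score
shop_x, shop_theta = [2.0], [0.4]              # e.g. purchase frequency

def partial_score(x, theta):
    """Each party computes only its share of the linear score;
    the raw features never leave the party."""
    return sum(xi * ti for xi, ti in zip(x, theta))

score = partial_score(bank_x, bank_theta) + partial_score(shop_x, shop_theta)
prob = 1 / (1 + math.exp(-score))   # logistic prediction from joint features
```

The joint prediction uses all features of the common user even though no party ever sees the other’s columns.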
<h2 id="223-federated-transfer-learning">2.2.3 Federated Transfer Learning</h2>
<p>Federated transfer learning applies to scenarios where the two data sets differ not only in samples but also in feature space. In this case, transfer learning[9] techniques can be applied to overcome the lack of data or labels. Consider two institutions: a bank located in China and an e-commerce company located in the United States. Due to geographical restrictions, the user groups of the two institutions have a small intersection; due to the different businesses, only a small part of their data features overlap. In this case, to carry out effective federated learning, transfer learning must be introduced to address the small data size and weak supervision, thereby improving the performance of the model.</p>
<h1 id="3-conclusions-and-prospects">3. Conclusions and Prospects</h1>
<p>In recent years, the isolation of data and the emphasis on data privacy have become the next challenges for artificial intelligence, but federated learning brings new hope. It makes it possible to establish a united model for multiple enterprises while local data stays protected, so that enterprises can win together with data security as the premise. This article has introduced the basic concept, architecture and techniques of federated learning, and discussed its potential in various applications. It is expected that in the near future, federated learning will break the barriers between industries and establish a community where data and knowledge can be shared safely, with the benefits fairly distributed according to the contribution of each participant. The bonus of artificial intelligence will finally be brought to every corner of our lives.</p>
<h1 id="reference">Reference</h1>
<p>[1] Dwork C. Differential privacy: A survey of results[C]//International Conference on Theory and Applications of Models of Computation. Springer, Berlin, Heidelberg, 2008: 1-19.<br />
[2] Sweeney L. k-anonymity: A model for protecting privacy[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(05): 557-570.<br />
[3] Li N, Li T, Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and l-diversity[C]//Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. IEEE, 2007: 106-115.<br />
[4] Ho Q, Cipar J, Cui H, et al. More effective distributed ml via a stale synchronous parallel parameter server[C]//Advances in neural information processing systems. 2013: 1223-1231.<br />
[5] Sheth A P, Larson J A. Federated database systems for managing distributed, heterogeneous, and autonomous databases[J]. ACM Computing Surveys (CSUR), 1990, 22(3): 183-236. <br />
[6] Konečný J, McMahan H B, Yu F X, et al. Federated learning: Strategies for improving communication efficiency[J]. arXiv preprint arXiv:1610.05492, 2016.<br />
[7] McMahan H B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data[J]. arXiv preprint arXiv:1602.05629, 2016. <br />
[8] Hardy S, Henecka W, Ivey-Law H, et al. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption[J]. arXiv preprint arXiv:1711.10677, 2017. <br />
[9] Pan S J, Yang Q. A survey on transfer learning[J]. IEEE Transactions on knowledge and data engineering, 2010, 22(10): 1345-1359. <br />
[10] Hesamifard E, Takabi H, Ghasemi M. CryptoDL: Deep Neural Networks over Encrypted Data[J]. arXiv preprint arXiv:1711.05189, 2017.<br />
[11] <a href="https://www.eugdpr.org">https://www.eugdpr.org</a><br />
[12] <a href="http://www.xinhuanet.com/politics/2016-11/07/c_1119867015.htm">http://www.xinhuanet.com/politics/2016-11/07/c_1119867015.htm</a> <br />
[13] <a href="http://www.npc.gov.cn/npc/xinwen/2017-03/15/content_2018907.htm">http://www.npc.gov.cn/npc/xinwen/2017-03/15/content_2018907.htm</a></p>