A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.



Application of federated XGBoost in outlier detection

8 minute read


Federated learning technology is rapidly evolving, and combining it with other machine learning methods has become a popular approach to privacy protection. XGBoost is the “big killer” of machine learning algorithms. In June 2019, the paper “Secureboost: A lossless federated learning framework” proposed the Secureboost federated learning framework, which applied the boosting ensemble learning method to vertical federated learning. In another recent paper, “The Tradeoff Between Privacy and Accuracy in Anomaly Detection Using Federated XGBoost”, the authors proposed a method that combines XGBoost with horizontal federated learning. They used it for credit card transaction anomaly detection, obtained a good detection effect, and identified the privacy protection scheme that achieves the best training results. Read more

Leveraging blockchain to make machine learning models more accessible

8 minute read


Significant advances are being made in artificial intelligence, but accessing and taking advantage of the machine learning systems making these developments possible can be challenging, especially for those with limited resources. These systems tend to be highly centralized, their predictions are often sold on a per-query basis, and the datasets required to train them are generally proprietary and expensive to create on their own. Additionally, published models run the risk of becoming outdated if new data isn’t regularly provided to retrain them. Read more

Interpret Federated Learning with Shapley Values

5 minute read


In this paper, the authors investigate model interpretation methods for Federated Learning, specifically the measurement of feature importance in vertical Federated Learning, where the feature space of the data is divided between two parties, namely host and guest. When the host party interprets a single prediction of a vertical Federated Learning model, the interpretation results, namely the feature importance, are very likely to reveal the guest party's protected data. The authors propose a method that balances model interpretability and data privacy in vertical Federated Learning by using Shapley values to reveal detailed feature importance for host features and a unified importance value for federated guest features. The authors’ experiments indicate robust and informative results for interpreting Federated Learning models. Read more

Federated Learning for Medical Imaging

5 minute read


Nearly 153 exabytes of healthcare-related data were generated in 2013; this number will increase by 48% annually to reach 2,314 exabytes in 2020 [1], [2], [3]. While machine learning can benefit from this “big data” to generate state-of-the-art models, most healthcare data is hard to obtain due to legal, privacy, technical, and data-ownership challenges, especially among international institutions where HIPAA and GDPR concerns need to be addressed [3], [4]. Read more

Federated Learning: Bringing Machine Learning to the edge with Kotlin and Android (Reading Notes)

3 minute read


With the promulgation of the General Data Protection Regulation, users are becoming more aware of the value of their data and of privacy concerns. While anonymization technology can go a long way toward solving the problem of privacy and security, sending all data to a central processor to train the machine learning model remains a cause of data security concerns. Read more

Patient Clustering Improves Efficiency of Federated Machine Learning to predict mortality and hospital stay time using distributed Electronic Medical Records (Reading Notes)

6 minute read


Electronic Medical Records (EMRs) data is often used in the development of machine learning algorithms to predict disease incidence, patient response to treatment, and other medical events. But so far, most of the algorithms are centralized, rarely considering non-independent and identically distributed (non-IID) data, and rarely considering that the privacy sensitivity of EMRs can complicate the learning process. Read more

Towards Federated Learning at Scale: System Design (Reading Notes)

13 minute read


Google has now implemented the first product-level Federated Learning system and published the paper “Towards Federated Learning at Scale: System Design.” The paper introduces the system design of federated learning in more detail and describes the design philosophy and existing challenges of this system. Google also puts forward its solutions. Read more

Federated Learning System

3 minute read


We use vertical federated learning as an example to introduce the architecture of a federated learning system and to explain in detail how it works. Read more

Federated Learning

14 minute read


The federated learning framework aims to let industries use data across organizations effectively and accurately while meeting privacy, security, and regulatory requirements. It also aims to build more flexible and powerful models that enable business cooperation by using data collectively without exchanging it directly. Read more



Decision Tree Model in the Diagnosis of Breast Cancer

Published in 2017 International Conference on Computer Technology, Electronics and Communication (ICCTEC), 2017

Breast cancer is the second leading cause of cancer death in women. At the same time, it is one of the most curable cancers if it is diagnosed early. More and more researchers have confirmed that the decision tree model is well suited to accurate diagnosis. This paper presents a diagnostic method for breast cancer based on the decision tree model combined with feature selection. Experiments were conducted on different training/test splits of the Wisconsin Breast Cancer Data Set (WBCD), a dataset commonly used by researchers who diagnose breast cancer with machine learning methods. In order to reduce the complexity of the decision tree model, this paper proposes to delete some highly relevant features of … After data correlation and independence tests, it finally chose tumor thickness, cell shape consistency, single epithelial cell size and mitosis as the feature subset of the decision tree model. Experimental results show that the classification accuracy (94.3%) significantly outperforms the state-of-the-art method with respect to a variety of metrics. Read more

Download here

Big Data Platform Architecture under The Background of Financial Technology

Published in Proceedings of the 2018 International Conference on Big Data Engineering and Technology, 2018

With the rise of financial technology, finance and technology are becoming deeply integrated, and scientific and technological means have become an important driving force for innovating financial products, improving financial efficiency, and reducing financial transaction costs. In this context, new technology platforms are reshaping the financial industry along dimensions such as business philosophy, business model, technical means, sales, and internal management. This paper innovates on existing big data platform architecture technology by adding spatio-temporal data elements and, through a practical analysis of the insurance industry, puts forward a meaningful product circle and customer circle. Read more

Download here

Dominant Dataset Selection Algorithms for Time-Series Data Based on Linear Transformation

Published in IEEE Internet of Things Journal, 2019

With the explosive growth of time-series data, its scale already exceeds conventional computation and storage capabilities in many applications. On the other hand, the information carried by time-series data has high redundancy due to the strong correlation between time-series data. In this paper, we propose new dominant dataset selection algorithms that extract a small dataset which can still represent the kernel information carried by the time-series data with an error rate less than ε, where ε can be arbitrarily small. We prove that the selection problem of the dominant dataset is an NP-complete problem. The affine transformation model is introduced to define the linear transformation function, ensuring that the selection function of the dominant dataset has constant time complexity O(1). Furthermore, a scanning selection algorithm with time complexity O(n²) and a greedy selection algorithm with time complexity O(n³) are proposed to extract the dominant dataset based on the linear correlation between time-series data. The proposed algorithms are evaluated on the real electric power consumption data of a city in China. The experimental results show that the proposed algorithms not only reduce the size of the kernel dataset but also ensure time-series data integrity in terms of accuracy and efficiency. Read more

Download here


Published in 计算机与数字工程, 2019

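Taxation is one of the sources of national fiscal revenue. At present, however, tax inspection suffers from a number of problems: data collection is incomplete, data transmission and storage technologies need improvement, and information cannot be shared, which leads to high inspection costs and low efficiency. This paper proposes a framework for a big data application platform for smart tax inspection based on spatio-temporal information. It aims to combine technologies and services such as Internet Plus, big data mining, and data visualization on this platform, making tax inspection low-cost and efficient, unifying basic data across time and space, integrating resources across industries and departments, and enabling information sharing. Read more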

Download here

PPGAN: Privacy-preserving Generative Adversarial Network (under review)

Published in The 25th IEEE International Conference on Parallel and Distributed Systems (ICPADS) (CCF-C, Core-B Conference), 2019

Generative Adversarial Network (GAN) and its variants serve as a perfect representation of the data generation model, providing researchers with a large amount of high-quality generated data. They illustrate a promising direction for research with limited data availability. When a GAN learns the semantic-rich data distribution from a dataset, the density of the generated distribution tends to concentrate on the training data. Because the gradient parameters of the deep neural network contain the data distribution of the training samples, the network can easily memorize the training samples. When a GAN is applied to private or sensitive data, for instance patient medical records, private information may be leaked. To address this issue, we propose a Privacy-preserving Generative Adversarial Network (PPGAN) model, in which we achieve differential privacy in GANs by adding well-designed noise to the gradient during the model learning procedure. In addition, we introduce the Moments Accountant strategy in the PPGAN training process to improve the stability and compatibility of the model by controlling privacy loss. We also give a mathematical proof of differential privacy for the discriminator. Through extensive case studies on benchmark datasets, we demonstrate that PPGAN can generate high-quality synthetic data while keeping the data usable under a reasonable privacy budget. Read more

Download here




Undergraduate course, School of Data Science of Technology, Heilongjiang University, 2018

I am an instructor for the AI-Team course, which is designed for undergraduate students. The course mainly covers machine learning algorithms and the TensorFlow framework. Thanks to @Tsinghua University Big Data College for providing the courseware and source code. Read more