Federated Learning and Differential Privacy: Software tools analysis, the Sherpa.ai FL framework and methodological guidelines for preserving data privacy

The high demand for artificial intelligence services at the edges that also preserve data privacy has pushed research on novel machine learning paradigms that fit these requirements. Federated learning aims to protect data privacy through distributed learning methods that keep the data in its storage silos. Likewise, differential privacy aims to improve the protection of data privacy by measuring the privacy loss in the communication among the elements of federated learning. The prospective match between federated learning and differential privacy and the challenges of data privacy protection has led to the release of several software tools that support their functionalities, but these tools lack a unified vision of the two techniques and a methodological workflow that supports their usage. Hence, we present the Sherpa.ai Federated Learning framework, which is built upon a holistic view of federated learning and differential privacy. It results from both the study of how to adapt the machine learning paradigm to federated learning, and the definition of methodological guidelines for developing artificial intelligence services based on federated learning and differential privacy. We show how to follow the methodological guidelines with the Sherpa.ai Federated Learning framework by means of a classification and a regression use case.

Keywords: federated learning, differential privacy, software framework, Sherpa.ai Federated Learning framework

Email addresses: rbnuria@ugr.es (Nuria Rodríguez-Barroso), g.stipcich@sherpa.ai (Goran Stipcich), dajilo@ugr.es (Daniel Jiménez-López), jantonioruiz@ugr.es (José Antonio Ruiz-Millán), emcamara@decsai.ugr.es (Eugenio Martínez-Cámara), g.gonzalez@sherpa.ai (Gerardo González-Seco), luzon@ugr.es (M. Victoria Luzón), ma.veganzones@sherpa.ai (Miguel Ángel Veganzones), herrera@decsai.ugr.es (Francisco Herrera)

arXiv:2007.00914v2 [cs.LG] 6 Oct 2020

1. Introduction

Recent advances in fundamental and applied research in artificial intelligence (AI) have aroused interest in industry and among end users. This interest goes beyond the traditional centralised setting of AI, and nowadays there is a high demand for AI services at the edges.

One of the main pillars of AI is data, whose greater availability has boosted the progress of AI in recent years. However, data is a sensitive element, especially when it describes users’ personal features, such as clinical or financial data. The sensitive nature of personal data has raised end users’ awareness of data privacy protection, promoting the publication of legal frameworks [1] and recommendations for developing AI services that preserve data privacy [2].

In this context, the progress of AI applications is based on (1) using data generated or stored at the edges, (2) working with large amounts of data from a wide range of sources, and (3) protecting data privacy in order to comply with the legal restrictions and to pay attention to end users’ concerns. Some use cases of AI with these dependencies are:

• When data contains sensitive information, such as email accounts, personalised recommendations or health records, applications should employ privacy-preserving techniques to learn from a population of users whilst keeping the sensitive information on each user’s device [3].
• When information is located in data silos. For instance, the healthcare industry is usually reluctant to disclose its records, keeping them as sequestered data [4]. Nevertheless, joint learning from the data silos of different health institutions would improve the robustness of the resulting models.
• Due to data privacy legislation, banks [5] and telecom [6] companies cannot share individual records.
However, they would benefit from models that learn from several entities’ data.

The standard machine learning paradigm does not match the previous dependencies, as it learns from a centralised data source. Likewise, distributed machine learning does not meet the challenge of preserving data privacy, because data is shared among several computational elements. Moreover, distributed machine learning cannot cope with the challenges associated with decentralised data processing, such as the ability to work with a large number of clients with non-homogeneous data distributions [7].

Federated learning (FL) is a nascent machine learning paradigm where many clients, in the sense of electronic devices or entire organisations, jointly train a model under the orchestration of a central server, while keeping the training data decentralised [8]. Roughly speaking, data is not shared with the central server; instead, it is kept on the devices where it is stored or generated. Accordingly, FL addresses the challenges of developing AI services on data scattered across a large number of clients with non-homogeneous data distributions.

Maintaining the data in its corresponding storage silos does not completely assure privacy preservation, since several adversarial attacks can still be damaging [9]. Data obfuscation and anonymisation techniques, such as blindly trusting artificial intelligence black-box models (e.g. convolutional neural networks) or randomly sampling data from the clients’ models, have been proven inadequate to preserve privacy [10, 11]. Moreover, the complete obfuscation of the data greatly reduces its value, thus a balance between privacy and utility is needed. Differential privacy (DP) is proposed as a data access technique which aims to maintain personal data privacy while maximising its utility [12].
The characteristics of FL and DP, and by extension their combination, make them candidates to address the challenges of distributed AI services that preserve data privacy. The research and progress of FL and DP need the support of software tools that ease the design of privacy-preserving AI services without requiring development from scratch. Consequently, in recent years several software tools with FL and DP functionalities have been released with this aim. We perform a comparative analysis of the FL and DP software tools released so far, and we conclude that their lack of a holistic view of FL and DP hinders the development of unified FL and DP AI services, as well as progress on the challenges of AI services at the edges that preserve data privacy.

Therefore, we present the Sherpa.ai Federated Learning framework (https://developers.sherpa.ai/privacy-technology/, https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework), an open-source unified FL and DP framework for AI. Sherpa.ai FL aims to bridge the gap between fundamental and applied research. Moreover, it will facilitate open research and development of new solutions built upon FL and DP for the challenges posed by AI at the edges and data privacy protection. A flexible approach to a wide range of problems is assured by its modular design, which takes into account all the key elements and functionalities of FL and DP, which consist of:

1. Data. Different data sets can be processed.
2. Learning model. Several core machine learning algorithms are incorporated.
3. Aggregation operator. Different operators for fusing the parameters of the clients’ learning models are embodied.
4. Clients. It is where the learning models are run.
5. Federated server. The clients can be orchestrated by different communication strategies.
6. Communication among clients and server. Different solutions are encompassed to reduce the communication iterations, to protect the learning from adversarial attacks, and to obfuscate the parameters with DP techniques.
7. DP mechanisms. The fundamental DP mechanisms, such as the Laplace mechanism, as well as the composition of DP mechanisms are incorporated.

The progress of AI is not only supported by the release of software tools; it also needs fundamental guidelines defining how to put together the different software tools’ attributes to reach the intended learning goal while at the same time matching the problem restrictions. Accordingly, since FL is a machine learning paradigm, we first study the principles of machine learning and how to make them fit the FL requirements. We see that most machine learning methods can be directly adapted to a FL setting, but some of them require ad-hoc amendments. As a result of this study, we define the experimental workflow of FL in terms of methodological guidelines for preserving data privacy in the development of AI services at the edges. These methodological guidelines are grounded in the machine learning workflow, and they have guided the design and development of Sherpa.ai FL, therefore they can be followed with Sherpa.ai FL.

We show how to follow these methodological guidelines with Sherpa.ai FL through two examples encompassing a classification and a regression use case, namely:

1. Classification. We use the EMNIST Digits dataset to describe how to conduct a classification task with Sherpa.ai FL. We also compare the federated classification with its centralised counterpart. Both approaches achieve similar results.
2. Regression. We describe how to perform a regression experiment using the California Housing dataset. We compare the FL experiment with its centralised version. In addition, it is shown how to assess and limit the privacy loss using DP.

The main contributions of this paper are:

1. To analyse the most recently released FL and DP software tools, revealing the lack of a unified view of FL and DP that hinders the possibility of addressing the challenges of AI at the edges with data privacy.
2. To present Sherpa.ai FL, an open-source unified FL and DP framework for AI.
3. To study the adaptation of machine learning models to the principles of FL and, accordingly, to define the methodological guidelines which can be followed with Sherpa.ai FL for developing AI services that preserve data privacy with FL and DP.

The rest of the paper is organised as follows: the next section formally defines FL and DP as well as their key elements. Section 3 analyses the main features of FL and DP frameworks. Section 4 introduces Sherpa.ai FL, including its software architecture and functionalities. Section 5 explains the adaptation of the machine learning paradigm to FL, taking into account the adaptation of core algorithms and the methodological guidelines of an experimental workflow. Section 6 shows some illustrative examples consisting of a classification and a regression problem. Finally, the concluding remarks and future work are reported in Section 7.

2. Federated Learning and Differential Privacy

The development of a framework for FL and DP requires a thorough understanding of what FL is and what its key elements are. Accordingly, we formally define FL in Section 2.1, and we detail each key element of a FL scheme in Section 2.2. Similarly, DP is defined in Section 2.3, and its key elements are described in Section 2.4.

2.1. The definition of Federated Learning

FL is a distributed machine learning paradigm that consists of a network of nodes where we distinguish two types of nodes: (1) Data owner nodes, $\{C_1, \ldots, C_n\}$, that possess a collection of data, $\{D_1, \ldots, D_n\}$, and (2) Aggregation nodes, $\{G_1, \ldots, G_k\}$, aiming at learning a model from data owners.
The deployment of these two types of nodes defines, at least, two kinds of federated architectures according to Yang et al. [13], namely:

1. Peer-to-peer: It is the architecture in which all the nodes are both Data owner and Aggregation nodes. This scheme does not require a coordinator. The main advantages are the elevated security and data privacy, while the main disadvantage is the computation cost. This FL architecture is illustrated in Figure 1.
2. Client-server: It consists of a coordinator Aggregation node named server and a set of Data owner nodes named clients. In this architecture, the client does not share its local data, ensuring its privacy. We represent the client-server scheme in Figure 2.

Figure 1: Representation of peer-to-peer FL architecture. (Diagram labels: Data Owner A, Data Owner B, Aggregation Node A, Aggregation Node B, model update, updated model.)

Figure 2: Representation of client-server FL architecture. (Diagram labels: Server (Aggregation node); Client A, Client B and Client C (Data Owner nodes); model update, updated model.)

In the literature we find different ways to refer to the clients in a FL architecture, namely: nodes, agents or clients. In this paper, we prefer the term clients.

Since the peer-to-peer model is a generalisation of the client-server model, we consider the latter for the formal definition of FL. In this architecture, each of the clients $C_i$ has a local learning model $LLM_i$ represented by the parameters $\theta_i$. FL aims at learning the global learning model GLM, represented by $\theta$, using the data scattered across clients through an iterative learning process in which each iteration is known as a round of learning. For that purpose, in each round of learning $t$, each client trains the LLM over its local training data $D_i^t$, updating its local parameters $\theta_i^t$.
Subsequently, the global parameters $\theta^t$ are computed by aggregating the local parameters $\{\theta_1^t, \ldots, \theta_n^t\}$ using a specific federated aggregation operator $\Delta$:

$$\theta^t = \Delta(\theta_1^t, \theta_2^t, \ldots, \theta_n^t) \tag{1}$$

After the aggregation of the parameters in the GLM, the LLMs are updated with the aggregated parameters:

$$\theta_i^{t+1} \leftarrow \theta^t, \quad \forall i \in \{1, \ldots, n\} \tag{2}$$

The communication between server and clients can be synchronous or asynchronous. In the first option, the server waits for the clients’ updates, aggregates all the local parameters and sends them to each client. In the second option, the server merges the local parameters with the GLM as soon as it receives them, using a weighted scheme based on the age difference among the models.

We repeat this iterative process for as many rounds of learning as needed. Thus, the final value of $\theta$ will sum up the clients’ underlying knowledge. In particular, the learning goal is typically to minimise the following objective function:

$$\min_{\theta} F(\theta), \quad \text{with } F(\theta) := \sum_{i=1}^{n} w_i F_i(\theta) \tag{3}$$

where $n$ is the number of clients, $F_i$ is the local objective function for the $i$-th client, which is the common objective function of the problem fitted to each client’s data, $w_i \geq 0$ and $\sum_{i=1}^{n} w_i = 1$.

2.2. Key elements of Federated Learning

The development of a FL environment requires the right combination of a set of necessary key elements. Since FL is a specific configuration of a machine learning environment, FL shares some key elements with it, namely: (1) data and (2) the learning model. However, the particularities of FL make additional key elements necessary, such as: (1) federated aggregation operators, (2) clients, (3) federated server and (4) communication among the federated server and the clients. The adaptation of the key elements common to FL and machine learning, and the FL-specific ones, are described below.

Data. Data plays a central role in FL as in machine learning.
The distribution of data becomes crucial in FL since it is distributed among the different clients. Regarding the splitting of the data among clients, there are two possibilities depending on the data distribution:

• IID (Independent and Identically Distributed) scenario: when the data distribution in each client corresponds to the population data distribution. In other words, the data in each client is independent and identically distributed, as well as representative of the population data distribution.
• Non-IID (non Independent and Identically Distributed) scenario: when the data distribution in each client is not independent or identically distributed from the population data distribution.

In a real FL scenario, each client only stores the data generated on the client itself, ensuring the non-IID property of the global data. Hence, the non-IID scenario is the most likely one and it represents a real challenge for FL.

Learning model. The learning model is the shared structure between the server and the clients, where each client trains a local model using its own data, while the global model on the server is never trained, but instead is obtained by aggregating the clients’ model parameters. Thus multiple models are trained without explicit data sharing, a configuration that is essentially different from the classical (centralised) learning paradigm.

Federated aggregation operators. The aggregation operator is in charge of aggregating the parameters in the server. It has to: (1) assure a proper fusion of the local learning models in order to optimise the objective function in Equation 3; (2) reduce the number of communication rounds among the clients and the federated server and (3) be robust against clients with poor data quality or malicious clients. Some of the most commonly used federated aggregation operators in the literature are:

• Federated Averaging (FedAvg) [14]. It is based on keeping a shared global model that is periodically updated by averaging models that have been trained locally on clients. The training process is arranged by a central server which hosts the shared global model. However, the actual optimisation is done locally on clients.
• CO-OP [15]. It proposes an asynchronous approach, which merges any received client model with the global model. Instead of directly averaging the models, the merging between a local model and the global model is carried out using a weighting scheme based on a measure of the difference in the age of the models. This is motivated by the fact that in an asynchronous framework, some clients will be trained on obsolete data while others will be trained on more up-to-date data.

Clients. Each client of a federated scenario represents a node of the distributed scheme. Typical clients in FL could be smartphones, IoT devices or connected vehicles. Each client owns its specific training dataset and its local model. Their principal aim is to train local models on their own private data and share the trained model parameters with the federated server, where the fusion of the parameters is performed.

Federated server. The federated server orchestrates the iterative learning of FL, which is composed of several rounds of learning. The server participates in: (1) receiving the trained parameters of the local models, (2) aggregating the trained parameters of each client model using federated aggregation operators and (3) updating every learning model with the aggregated parameters. The learning process involved in both training the local models and updating them is known as a round of learning. The global model, which is stored in the federated server, represents the final model after the learning process. Therefore it is used for predicting, testing or any subsequent evaluation.

Communication among the federated server and the clients.
Communication between clients and server is the most delicate element of a FL scheme. On the one hand, efficient communication is a crucial requirement due to the long communication times caused by limited network speed and availability. For that reason, FL should minimise communications and maximise their efficiency by means of, for example, reducing the number of rounds of learning. On the other hand, the interchange of model parameters between the server and the clients constitutes a vulnerability of the federated server scheme, since the original data may be reconstructed from the model parameters through model-inversion adversarial attacks [16], resulting in a great risk of private data leakage. For this reason, DP techniques [12] are commonly used in order to share model parameters [17].

2.3. The definition of Differential Privacy

DP is the property of an algorithm whose input is typically a database, and whose encoded response allows relatively accurate answers to potential queries to be obtained [12, 18]. The motivation for DP stems from the necessity of ensuring the privacy of individuals whose sensitive details are part of a database, while at the same time being able to gain accurate knowledge about the whole population when learning from the database.

DP is not a binary concept, i.e., the guarantee or not of an individual’s data privacy. Instead, DP establishes a formal measure of privacy loss, allowing for comparison between different approaches. Thus, DP rigorously bounds the possible harm to an individual whose sensitive information belongs to the database by fixing a budget for privacy loss.

The formal definition of DP requires a few preliminary notions. Namely, we define the probability simplex over a discrete set $B$, denoted $\Delta(B)$, as the set of real-valued vectors whose $|B|$ components sum up to one and are non-negative:

$$\Delta(B) := \left\{ x \in \mathbb{R}^{|B|} : \sum_{i=1}^{|B|} x_i = 1, \; x_i \geq 0, \; i = 1, \ldots, |B| \right\} \tag{4}$$

A randomised algorithm $\mathcal{M} : A \to B$, with $B$ a discrete set, is defined as a mechanism which is associated with a mapping $M : A \to \Delta(B)$ such that, with input $a \in A$, the mechanism produces $\mathcal{M}(a) = b$ with probability $(M(a))_b = P(b \mid a)$, for each $b \in B$. The probability is taken over the randomness employed by the mechanism $\mathcal{M}$.

In general, databases are collections of records from a universe $\mathcal{X}$. It is convenient to express databases $x$ by their histogram $x \in \mathbb{N}^{|\mathcal{X}|}$, where each component $x_i$ stands for the number of elements in the database of type $i$ in $\mathcal{X}$. This interpretation naturally leads to the definition of the distance between databases: two databases $x, y$ are said to be $n$-neighbouring if they differ by $n$ entries, that is, $\|x - y\|_1 = n$, where $\|\cdot\|_1$ denotes the $\ell_1$ norm. In particular, if the databases only differ in a single data element ($n = 1$), the databases are simply referred to as neighbouring.

At this stage, DP can be formally introduced. A randomised algorithm (mechanism) $\mathcal{M}$ with domain $\mathbb{N}^{|\mathcal{X}|}$ preserves $\varepsilon$-DP for $\varepsilon > 0$ if for all neighbouring databases $x, y \in \mathbb{N}^{|\mathcal{X}|}$ and all $S \subseteq \mathrm{Range}(\mathcal{M})$ it holds that:

$$P[\mathcal{M}(x) \in S] \leq \exp(\varepsilon) \, P[\mathcal{M}(y) \in S] \tag{5}$$

If, on the other hand, for $0 < \delta < 1$ it holds that:

$$P[\mathcal{M}(x) \in S] \leq \exp(\varepsilon) \, P[\mathcal{M}(y) \in S] + \delta \tag{6}$$

then the mechanism possesses the weaker property of $(\varepsilon, \delta)$-DP. The probability is taken over the randomness employed by the mechanism $\mathcal{M}$.

In essence, Equation 5 tells us that for every run of the randomisation mechanism $\mathcal{M}(x)$, it is almost equally likely to observe the same output for every neighbouring database $y$, and how close these probabilities are is governed by $\varepsilon$. Equation 6 is weaker since it allows us to exceed $\varepsilon$ with probability $\delta$. In other words, DP specifies a “privacy budget” given by $\varepsilon$ and $\delta$. The way in which it is spent is given by the concept of privacy loss.
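To make Equation 5 concrete, the following minimal sketch checks the inequality numerically for a mechanism with a finite output set. It is an illustration under stated assumptions, not part of Sherpa.ai FL: the function names `smallest_epsilon` and `satisfies_eps_dp`, and the example distributions, are hypothetical. It also inspects a single pair of neighbouring databases, whereas the definition requires the bound for every neighbouring pair.

```python
import math


def smallest_epsilon(p_x, p_y):
    """Return the smallest eps for which Equation 5 holds in both directions.

    p_x and p_y map each possible output to its probability under two
    neighbouring databases x and y (hypothetical inputs). For a finite
    output set it is enough to compare single outputs, because a ratio of
    two sums never exceeds the largest ratio between corresponding terms.
    """
    if set(p_x) != set(p_y):
        raise ValueError("both distributions must share the same outputs")
    eps = 0.0
    for out in p_x:
        px, py = p_x[out], p_y[out]
        if px == 0.0 and py == 0.0:
            continue  # output never observed under either database
        if px == 0.0 or py == 0.0:
            return math.inf  # one database rules this output out entirely
        eps = max(eps, abs(math.log(px / py)))
    return eps


def satisfies_eps_dp(p_x, p_y, eps, tol=1e-12):
    """Check Equation 5 for this pair of output distributions."""
    return smallest_epsilon(p_x, p_y) <= eps + tol


# Assumed example: a mechanism that reports a private bit truthfully with
# probability 3/4, evaluated on two databases that differ in that bit.
p_x = {"yes": 0.75, "no": 0.25}
p_y = {"yes": 0.25, "no": 0.75}

eps_min = smallest_epsilon(p_x, p_y)
print(eps_min)                          # close to ln 3
print(satisfies_eps_dp(p_x, p_y, 1.0))  # budget below ln 3
print(satisfies_eps_dp(p_x, p_y, 1.2))  # budget above ln 3
```

For this pair of distributions the tightest bound is ln 3, so a budget of 1.0 is not enough while a budget of 1.2 is.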
We define the privacy loss incurred in observing the output $m$ when employing the randomised algorithm $\mathcal{M}$ on two neighbouring databases $x, y$:

$$L_{\mathcal{M}(x) \| \mathcal{M}(y)} := \ln \frac{P[\mathcal{M}(x) = m]}{P[\mathcal{M}(y) = m]} \tag{7}$$

Since the privacy loss can be both positive and negative, we consider its absolute value in the following interpretation. The privacy loss allows us to reinterpret both $\varepsilon$ and $\delta$ in a more intuitive way:

• $\varepsilon$ limits the quantity of privacy loss permitted, that is, our privacy budget.
• $\delta$ is the probability of exceeding the privacy budget given by $\varepsilon$, so that we can ensure that with probability $1 - \delta$, the privacy loss will not be greater than $\varepsilon$.

DP is immune to post-processing, that is, if an algorithm protects an individual’s privacy, then no post-processing of its output can increase the privacy loss. Stated more formally: let $\mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R$ be an $(\varepsilon, \delta)$-differentially private mechanism and let $f : R \to R'$, then $f \circ \mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R'$ is $(\varepsilon, \delta)$-differentially private.

2.4. Key elements of Differential Privacy

DP arose as the principal setting for preserving the privacy of sensitive data when delivering trained models to untrusted parties. The possibilities of DP are built upon the modular structure of its elements, which makes it possible to construct more sophisticated DP mechanisms, and to design, analyse and post-process DP mechanisms for a specific privacy-preserving learner [12, 19]. These necessary or key elements of DP are the DP mechanisms, the composition of DP mechanisms, and the subsampling techniques to increase privacy. We detail them below.

DP mechanisms. We describe the main privacy-preserving mechanisms below:

• Randomised response mechanism. It is aimed at evaluating the frequency of an embarrassing or illegal practice. When asked whether they engaged in the aforementioned activity in the past period of time, respondents follow this procedure:
  1. Flip a coin;
  2. If tails, respond truthfully;
  3. If heads, flip a second coin and respond “Yes” if heads, and “No” if tails.
  This approach provides privacy due to “plausible deniability”, since the response “Yes” may have been submitted because both coin flips turned out heads. By direct computation it can be shown that this is an $\varepsilon$-differentially private mechanism with $\varepsilon = \ln 3$ [12, Section 3.2].

• Laplace mechanism [18]. It is usually employed for preserving privacy in numeric queries $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, which map databases $x \in \mathbb{N}^{|\mathcal{X}|}$ to $k$ real numbers. At this point, it is important to introduce a key parameter associated with the accuracy of such queries, namely the $\ell_1$ sensitivity:

$$\Delta f := \max_{\|x - y\|_1 = 1} \|f(x) - f(y)\|_1 \tag{8}$$

  Since the above definition must hold for every pair of neighbouring $x, y \in \mathbb{N}^{|\mathcal{X}|}$, it is also denoted as global sensitivity [19]. This parameter measures the maximum magnitude of change in the output of $f$ associated with a single data element; thus, intuitively, it establishes the amount of uncertainty (i.e., noise) to be introduced in the output to preserve the privacy of a single individual.
  Moreover, we denote as $\mathrm{Lap}(b)$ the Laplace distribution with scale $b$ and centred at 0. Given any function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, the Laplace mechanism can be defined as

$$M_L(x, f(\cdot), \varepsilon) := f(x) + (Y_1, \ldots, Y_k) \tag{9}$$

  where the components $Y_i$ are IID drawn from the distribution $\mathrm{Lap}(\Delta f / \varepsilon)$. In other words, each component of the output of $f$ is perturbed by Laplace noise according to the sensitivity of the function, $\Delta f$. It can be shown that this is an $\varepsilon$-differentially private mechanism with $\varepsilon = \Delta f / b$ [12, Section 3.3].

• Exponential mechanism [20]. It is a general DP mechanism that has been proposed for situations in which adding noise directly to the output function (as in the Laplace mechanism) would completely ruin the result.
Thus the exponential mechanism constitutes the building component for queries with arbitrary utility, where the goal is to maximise the utility while preserving privacy. For a given arbitrary range $R$, the utility function $u : \mathbb{N}^{|\mathcal{X}|} \times R \to \mathbb{R}$ maps database/output pairs to utility values. We introduce the sensitivity of the utility function as

$$\Delta u := \max_{r \in R} \; \max_{\|x - y\|_1 \leq 1} |u(x, r) - u(y, r)| \tag{10}$$

  where only the sensitivity of $u$ with respect to the database matters, while $u$ can be arbitrarily sensitive with respect to the range $r \in R$. The exponential mechanism $M_E(x, u, R)$ is defined as a randomised algorithm which picks as output an element of the range $r \in R$ with probability proportional to $\exp(\varepsilon u(x, r) / (2 \Delta u))$. When normalised, the mechanism defines a probability distribution over the possible responses $r \in R$. Nevertheless, the resulting distribution can be rather complex and over an arbitrarily large domain, thus the implementation of such a mechanism might not always be efficient [12]. With the exponent scaled by $2 \Delta u$ as above, it can be shown that this is an $\varepsilon$-differentially private mechanism [12]; in the original formulation [20], which samples with probability proportional to $\exp(\varepsilon u(x, r))$, the guarantee is $(2 \varepsilon \Delta u)$-differential privacy.

• Gaussian mechanism [12]. It is a DP mechanism that adds Gaussian noise to the output of a numeric query. It has two great advantages over the differentially private mechanisms stated previously:
  – Common source noise: the added Gaussian noise is the same as the one which naturally appears when dealing with a database.
  – Additive noise: the sum of two Gaussian distributions is a new Gaussian distribution, therefore it is easier to statistically analyse this DP mechanism.
  Instead of scaling the noise to the $\ell_1$ sensitivity, as we previously did with the Laplace mechanism, it is scaled to the $\ell_2$ sensitivity:

$$\Delta_2(f) := \max_{\|x - y\|_1 = 1} \|f(x) - f(y)\|_2 \tag{11}$$

  Moreover, we denote as $N(0, \sigma^2)$ the Gaussian distribution with mean 0 and variance $\sigma^2$. Given any function $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, the Gaussian mechanism can be defined as:

$$M_G(x, f(\cdot), \varepsilon) := f(x) + (Y_1, \ldots, Y_k) \tag{12}$$

  where the components $Y_i$ are IID drawn from the distribution $N(0, \sigma^2)$. However, it needs to satisfy the following restrictions to ensure it is an $(\varepsilon, \delta)$-differentially private mechanism: for $\varepsilon \in (0, 1)$ and variance $\sigma^2 > 2 \ln(1.25/\delta) \, (\Delta_2(f)/\varepsilon)^2$, the Gaussian mechanism is $(\varepsilon, \delta)$-differentially private.

To sum up, the main idea behind DP mechanisms is adding a certain amount of noise to the query output, while preserving the utility of the original data. Such noise is calibrated to the privacy parameters $(\varepsilon, \delta)$ and the sensitivity of the query function.

Composition of DP mechanisms. An appealing property of DP is that more advanced private mechanisms can be devised by combining DP mechanisms, such as the general building components described in Section 2.4. The resulting mechanism then still preserves DP, and the new values of $\varepsilon$ and $\delta$ can be computed according to the composition theorems. Before the composition theorems are provided, we state an experiment with an adversary which proposes a composition scenario for DP [12].

Composition experiment $b \in \{0, 1\}$ for adversary $A$ with a given set, $M$, of DP mechanisms. For $i = 1, \ldots, k$:
1. $A$ generates two neighbouring databases $x_i^0$ and $x_i^1$ and selects a mechanism $\mathcal{M}_i$ from $M$.
2. $A$ receives the output $y_i \in \mathcal{M}_i(x_i^b)$.

In the experiments the adversary preserves its state between iterations, and we define $A$’s view of the experiment $b$ as $V^b = \{y_1, \ldots, y_k\}$. In order to ensure DP in these Composition experiments we need to introduce a statistical distance which resembles the privacy loss (Equation 7). The $\delta$-Approximate Max Divergence between random variables $Y$ and $Z$ is defined as:

$$D_\infty^\delta(Y \| Z) = \max_{S : \, P[Y \in S] > \delta} \ln \frac{P[Y \in S] - \delta}{P[Z \in S]} \tag{13}$$

We say that the composition of a sequence of DP mechanisms under the Composition experiment is $(\varepsilon, \delta)$-differentially private if $D_\infty^\delta(V^0 \| V^1) \leq \varepsilon$. Now, we are ready to introduce the composition theorems:

• Basic composition theorem.
The composition of a sequence $\{\mathcal{M}_i\}_{i=1}^{k}$ of $(\varepsilon_i, \delta_i)$-differentially private mechanisms under the Composition experiment with $M = \{\mathcal{M}_i\}_{i=1}^{k}$ is $\left(\sum_{i=1}^{k} \varepsilon_i, \sum_{i=1}^{k} \delta_i\right)$-differentially private.

• Advanced composition theorem. For all $\varepsilon, \delta, \delta' \geq 0$, the composition of a sequence $\{\mathcal{M}_i\}_{i=1}^{k}$ of $(\varepsilon, \delta)$-differentially private mechanisms under the Composition experiment with $M = \{\mathcal{M}_i\}_{i=1}^{k}$ satisfies $(\varepsilon', \delta'')$-DP with:

$$\varepsilon' = \varepsilon \sqrt{2k \ln(1/\delta')} + k \varepsilon (e^{\varepsilon} - 1) \quad \text{and} \quad \delta'' = k\delta + \delta' \tag{14}$$

More advanced versions of Equation 14, which allow the composition of private mechanisms with diverse $\varepsilon$ and $\delta$ values and provide tighter bounds, can be found in [21].

Privacy filters [22]. While composition theorems are quite useful, they require some parameters to be defined upfront, such as the number of mechanisms to be composed. Therefore, no intermediate result can be observed and the privacy budget can be wasted. Such situations require a more fine-grained composition technique that allows the result of each mechanism to be observed without compromising the privacy budget spent. In order to remove some of the stated constraints, a more flexible composition experiment is introduced [22]:

Adaptive composition experiment $b \in \{0, 1\}$ for adversary $A$. For $i = 1, \ldots, k$:
1. $A$ generates two neighbouring databases $x_i^0$ and $x_i^1$ and selects a mechanism $\mathcal{M}_i$ that is $(\varepsilon_i, \delta_i)$-differentially private.
2. $A$ receives the output $y_i \in \mathcal{M}_i(x_i^b)$.

In these situations, the $\varepsilon_i$ and $\delta_i$ of each mechanism are adaptively selected based on the outputs of previous iterations. For the adaptive composition experiment, the privacy loss of the adversary’s view $V = \{y_1, \ldots, y_k\}$ for each pair of neighbouring databases $x, y$ is defined as follows:

$$L^{V} = \ln \frac{\prod_{i=1}^{k} P[\mathcal{M}_i(x) = y_i \mid V_i]}{\prod_{i=1}^{k} P[\mathcal{M}_i(y) = y_i \mid V_i]} \tag{15}$$

where we write $V_i = \{y_1, \ldots, y_{i-1}\}$, that is, the adversary’s view at the beginning of the $i$-th iteration of the adaptive composition experiment.
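As a small numerical illustration of Equation 15, the sketch below evaluates the privacy loss of a view from the per-iteration conditional probabilities. It is only a sketch under stated assumptions: the function name `view_privacy_loss` and the probabilities in the example are hypothetical, and in practice these conditional probabilities would have to be derived from the mechanisms selected in each iteration.

```python
import math


def view_privacy_loss(probs_x, probs_y):
    """Privacy loss of a view as in Equation 15.

    probs_x[i] stands for P[M_i(x) = y_i | V_i] and probs_y[i] for the
    same probability under the neighbouring database y (hypothetical
    inputs). The logarithm of the ratio of products is evaluated as a sum
    of per-iteration log-ratios, which avoids underflow when k is large.
    """
    if len(probs_x) != len(probs_y):
        raise ValueError("one probability per iteration is required")
    if any(p <= 0.0 for p in list(probs_x) + list(probs_y)):
        raise ValueError("probabilities of observed outputs must be positive")
    return math.fsum(
        math.log(px) - math.log(py) for px, py in zip(probs_x, probs_y)
    )


# Assumed example with k = 3 iterations: the first two observed outputs
# are more likely under x, the third one is more likely under y.
probs_x = [0.75, 0.75, 0.25]
probs_y = [0.25, 0.25, 0.75]

loss = view_privacy_loss(probs_x, probs_y)
print(loss)  # close to ln 3
```

In this example the per-iteration contributions are ln 3, ln 3 and -ln 3, so they partly cancel and the loss of the whole view is ln 3.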
In particular, if the adaptive composition experiment has only one iteration ($k = 1$), Equation 15 is the same as the definition of privacy loss (see Equation 7).

The function $\mathrm{COMP}_{\epsilon_g, \delta_g}: \mathbb{R}^{2k}_{\ge 0} \to \{\mathrm{HALT}, \mathrm{CONT}\}$ is a valid privacy filter for $\epsilon_g, \delta_g \ge 0$ if, for all adversaries in the adaptive composition experiment, the following "bad event" occurs with probability at most $\delta_g$ for the adversary's view $V$:

$$|L(V)| > \epsilon_g \quad \text{and} \quad \mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \mathrm{CONT} \tag{16}$$

A privacy filter can be used to guarantee that, with probability $1 - \delta_g$, the stated privacy budget $\epsilon_g$ is never exceeded. That is, given a fixed privacy budget $(\epsilon_g, \delta_g)$, the function $\mathrm{COMP}_{\epsilon_g, \delta_g}: \mathbb{R}^{2k}_{\ge 0} \to \{\mathrm{HALT}, \mathrm{CONT}\}$ controls the composition. It returns HALT if the composition of the $k$ given DP mechanisms surpasses the privacy budget; otherwise it returns CONT. Privacy filters have composition theorems similar to the ones given above:

• Basic composition for privacy filters. For any $\epsilon_g, \delta_g \ge 0$, $\mathrm{COMP}_{\epsilon_g, \delta_g}$ is a valid privacy filter, where:

$$\mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \begin{cases} \mathrm{HALT} & \text{if } \sum_{i=1}^{k} \delta_i > \delta_g \text{ or } \sum_{i=1}^{k} \epsilon_i > \epsilon_g, \\ \mathrm{CONT} & \text{otherwise} \end{cases}$$

• Advanced composition for privacy filters. We define $K$ as follows:

$$K := \sqrt{\left(\sum_{i=1}^{k} \epsilon_i^2 + H\right)\left(2 + \ln\left(\frac{\sum_{i=1}^{k} \epsilon_i^2}{H} + 1\right)\right) \ln(2/\delta_g)} + \sum_{j=1}^{k} \epsilon_j\, \frac{\exp(\epsilon_j) - 1}{2}$$

with $H = \dfrac{\epsilon_g^2}{28.04 \ln(1/\delta_g)}$.

Then $\mathrm{COMP}_{\epsilon_g, \delta_g}$ is a valid privacy filter for $\delta_g \in (0, 1/e)$ and $\epsilon_g > 0$, where:

$$\mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \begin{cases} \mathrm{HALT} & \text{if } \sum_{i=1}^{k} \delta_i > \delta_g/2 \text{ or } K > \epsilon_g, \\ \mathrm{CONT} & \text{otherwise} \end{cases}$$

The value of $K$ might look strange at first sight. However, if we assume $\epsilon_j = \epsilon$ for all $j$, it becomes:

$$K = \sqrt{\left(k\epsilon^2 + H\right)\left(2 + \ln\left(\frac{k\epsilon^2}{H} + 1\right)\right) \ln(2/\delta_g)} + k\epsilon\, \frac{\exp(\epsilon) - 1}{2}$$

which is quite similar to Equation 14.

Increase privacy by subsampling. The privacy of a DP mechanism can be further improved if, instead of querying all the stored data, a random subsample is queried.
That is, if an $(\epsilon, \delta)$-differentially private mechanism is used to query a random subsample from a database with $n$ records, then improved parameters $(\epsilon', \delta')$ can be provided according to the type of random subsample [23]. If the random subsample of size $m < n$ is drawn without replacement, then:

$$\epsilon' = \ln\left(1 + \frac{m}{n}\,(e^{\epsilon} - 1)\right) \quad \text{and} \quad \delta' = \frac{m}{n}\,\delta \tag{17}$$

This expression for $(\epsilon', \delta')$ is better than the original $(\epsilon, \delta)$ in the sense that it is smaller, and so is the privacy budget spent. The noise considered in this situation comes from a different source than the noise added by the DP mechanism itself. That is, the DP mechanism adds a certain quantity of noise, specified by the $(\epsilon, \delta)$ parameters, to a random subsample of the database, so the information extracted is influenced by the individuals contained in it. This random subsample is drawn each time the DP mechanism is used, which may produce slightly different results for the same query applied multiple times; that is, a new source of noise is added to the query.

In particular, the improvement is most noticeable when $\epsilon < 1$, which makes the Gaussian mechanism ideal, since $\epsilon$ must be smaller than 1 for it to achieve $(\epsilon, \delta)$-DP. That is, the Gaussian mechanism and the subsampling methods, when applied together, can ensure less noise and a smaller privacy budget expenditure at the cost of accessing only a small random subsample of the data.

This technique is particularly suited for FL, where the data does not come from all the clients in each iteration, but from a random sample of them. Moreover, it is well suited for programs in which the privacy parameters are hardcoded, so the privacy budget must be carefully spent.

3. Software tools: FL and DP frameworks analysis

The high demand for AI services at the edges which must preserve data privacy has pushed the release of several software tools or frameworks for FL and DP.
In this Section, we discuss the strengths and weaknesses of these software frameworks, compare them and stress their main shortcomings. The discussion covers the state of the development of the software tools until the end of May 2020.

3.1. PySyft

PySyft (https://github.com/OpenMined/PySyft) is a Python library for secure and private deep learning. PySyft decouples private data from model training, using FL, DP, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main deep learning frameworks like PyTorch and TensorFlow.

Features. It is compatible with existing deep learning frameworks such as TensorFlow and PyTorch. Its low level FL implementation allows developing and debugging projects with complex communication networks in a local environment with almost no overhead. It mainly focuses on providing secure MPC through HE, which allows computations on ciphertext; this is ideal for developing FL models while keeping the results of the computations private to the participants. Last, it offers many Python notebooks, which greatly softens the learning curve of this framework.

Shortcomings. Its low level FL support is missing some key features: it neither includes any dataset by default nor implements any model aggregation operators. Its low level implementation and the two drawbacks stated before make this framework quite complex to use, requiring considerable knowledge in this field to correctly assemble a FL model. While its webpage (https://www.openmined.org) advertises many DP mechanisms, they are nowhere to be found. As a matter of fact, its GitHub documentation states the following: “Do NOT use this code to protect data (private or otherwise) - at present it is very insecure. Come back in a couple of months”.

Overview. We conclude that PySyft is a low level FL framework for advanced users which is compatible with many well-known deep learning frameworks, and that it provides neither DP mechanisms nor DP algorithms.

3.2. TensorFlow

TensorFlow implements DP and FL through its libraries TensorFlow Privacy (https://github.com/tensorflow/privacy) and TensorFlow Federated (https://www.tensorflow.org/federated), respectively.

Features. TensorFlow Privacy is a Python library for training machine learning models with privacy for training data. It integrates seamlessly with existing TensorFlow models and allows the developer to train their models with DP techniques. In addition, it has many tutorials to quickly learn how to use it.

TensorFlow Federated is an open-source framework for machine learning and other computations on decentralised data. Like TensorFlow Privacy, it integrates easily with existing TensorFlow models. In addition, it has many well-known training datasets built in.

Shortcomings. TensorFlow Privacy only focuses on differentially private optimisers and it does not provide any DP mechanisms to implement your own differentially private optimisers. It does not officially support any other deep learning library and it is still not compatible with the latest TensorFlow 2.x. In addition, it is a ”library under continual development” according to its GitHub documentation, so it is not mature enough for production usage.

While TensorFlow Federated provides both low level and high level interfaces for FL settings and it has some high level interfaces to create aggregation operators, it does not provide any built-in aggregation operators. Last, it is not yet compatible with the latest TensorFlow 2.x.

Overview. These TensorFlow frameworks in conjunction allow us to develop FL models, but they are tied to the TensorFlow framework, which greatly limits the portability of the generated model.
They are neither compatible with the latest version of TensorFlow nor ready for final products. In addition, they lack DP mechanisms to implement new privacy-preserving algorithms.

3.3. FATE

FATE (https://fate.fedai.org/overview/) is an open-source project initiated by Webank’s AI Department to provide a secure computing framework to support the federated AI ecosystem.

Features. It provides many interesting FL algorithms and it exposes a high level interface driven by custom scripts.

Shortcomings. Its high level interface made of scripts relies too much on command line parameters and on a poorly documented domain specific language (https://fate.readthedocs.io/en/latest/examples/federatedml-1.x-examples/README.html). It is unclear how to implement a low level FL model, which makes us think this framework is designed as a black box model. Its modular architecture seems quite complex. Also, it does not feature any DP algorithm, and there are no signs of future plans for implementing them.

Overview. This framework is mainly focused on FL, and one of its biggest weaknesses is that it does not implement any DP algorithm, which would improve its compliance with data protection regulations. Secure computation protocols ensure that data is not eavesdropped by an adversary, but they do not ensure that individuals’ privacy, roughly speaking, is preserved. In addition, it is expected to be used through a high level interface which relies on a barely documented custom language.

3.4. LEAF

LEAF is a benchmarking framework for learning in federated settings, with applications including FL, multi-task learning, meta-learning, and on-device learning.

Features. This framework mainly focuses on benchmarking FL settings. It provides some basic FL mechanisms such as the Federated Averaging Aggregator and, given its modular design, it can be adapted to work on any existing framework.
Last, it has some well-known built-in datasets such as FEMNIST, Shakespeare and Celeba.

Shortcomings. It does not provide any benchmark for preserving privacy in a FL setting, even though privacy must be taken into consideration as it is a desired property of many FL settings. Moreover, it does not offer as much official documentation or as many tutorials as the other frameworks discussed in this section.

Overview. LEAF (https://leaf.cmu.edu/) offers a baseline implementation for some basic FL methods but its main purpose is benchmarking FL settings. However, DP benchmarks are not provided, even though nowadays privacy is a concern in most FL settings.

3.5. PaddleFL

PaddleFL (https://paddlefl.readthedocs.io/en/latest) is an open source FL framework based on PaddlePaddle (https://github.com/paddlepaddle/paddle). PaddlePaddle is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools and components, as well as service platforms.

Features. PaddleFL provides a high level interface to develop FL models with DP. In the FL field it implements the Federated Averaging Aggregator and its secure multi-party computation equivalent. When it comes to DP, it provides an implementation of differentially private stochastic gradient descent.

Shortcomings. This framework has little documentation, which makes it really hard to use and understand. It lacks any other DP algorithm, so developing alternative privacy-preserving techniques is very difficult. Last, since it is based on PaddlePaddle, it is not compatible with other frameworks.

Overview. PaddleFL provides a high level interface for some basic and well-known FL aggregators and implements a differentially private algorithm. Its main drawbacks are that it is poorly documented and that it does not implement any tool to easily extend its capabilities.

3.6. Frameworks analysis

The discussed software tools share some shortcomings for developing distributed AI services that preserve data privacy. Among them, we stress the following:

1. They focus on FL or DP, but they do not provide a unified approach for both of them.
2. They lack DP mechanisms and related methods from the DP area. Likewise, they do not allow new DP mechanisms to be developed and integrated into the frameworks.
3. Only the most basic federated aggregation operators are implemented. They are mainly focused on deep learning models, and they do not provide support for other machine learning algorithms that may also be used in the FL setting.

We summarise and compare the characteristics of the frameworks reviewed in Table 1. We conclude that a unified FL and DP framework is required, and this is the ambitious aim of Sherpa.ai FL, which we present in the following section.

Table 1: FL and DP features comparison among existing frameworks.
Frameworks compared (columns): TensorFlow, PySyft, LEAF, PaddleFL.
Cell legend: Complete, Partial, Do not work, Unknown.
FL & DP features (rows):
Federated Learning:
• Use federated models with different datasets
• Support for other libraries
• Sampling environment: IID or non-IID distribution
• Federated aggregation mechanisms
• Federated attack simulator
Differential Privacy:
• Mechanisms: Exponential, Laplacian, Gaussian
• Sensitivity sampler
• Subsampling methods to increase privacy
• Adaptive Differential Privacy
Desired properties:
• Documentation & tutorials
• High level API
• Ability to extend the framework with new properties

4. Sherpa.ai Federated Learning Framework

We develop Sherpa.ai FL, which is an open-research unified FL and DP framework that aims to foster the research and development of AI services at the edges and to preserve data privacy. We describe the hierarchical and modular software architecture of Sherpa.ai FL, which is related to the key elements of FL and DP, in Section 4.1.
Likewise, we detail the functionalities and the implementation details of Sherpa.ai FL (https://developers.sherpa.ai/privacy-technology/, https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework) in Section 4.2 and Section 4.3.

4.1. Software architecture

The software is structured in several modules that encapsulate the specific functionality of each key element of a FL setting. The architecture of these software modules allows the framework to be extended as research on FL progresses. Figure 3 shows the backbone of the software architecture of Sherpa.ai FL, and we describe each module as follows:

• data base: it is in charge of reading the data according to the chosen database. It is related to the data key element.
• data distribution: it performs the federated distribution of data among the clients involved in the FL process. It is also related to the data key element and completes its functionality.
• private: it includes several interfaces, such as the node interface, which represents the clients key element, and others that allow the federated data distribution to be accessed and modified.
• learning approach: it represents the whole FL scheme, including the federated server model and the communication and coordination among the federated server and the clients. It encapsulates the federated server and the communication key elements.
• federated aggregator: it defines the software structure to develop federated aggregation operators. It is linked to the federated aggregation operator key element.
• model: it defines the learning model using predefined models and their functionalities. This learning model could be any machine learning model that can be aggregated through its representation in parameters. It is related to the model key element, as we associate a model object with the clients and the federated server.
• differential privacy: it preserves DP of the clients by specifying the data access. It is related to the DP key elements, and also to the data, the clients and the communication FL key elements.

Figure 3: Links between the different modules of Sherpa.ai FL (boxes shown: Federated Government, Federated aggregator, Data distribution, Model, Differential privacy, Data base, Private).

4.2. Software functionalities

In this section we highlight the main contributions of Sherpa.ai FL, which are summarised in a wide range of functionalities, namely:

• To define and customise a FL simulation with a fixed number of clients using classical data sets.
• To define the previous FL simulation using high-level functionalities.
• To train machine learning models among different clients. Currently, Sherpa.ai FL offers support for Keras models (neural networks), and for several models from Scikit-Learn (linear regression, k-means clustering, logistic regression).
• To aggregate the information learned from each of the clients into a global model using classical federated aggregation operators such as FedAvg, weighted FedAvg [24] and an aggregation operator for the adaptation of the k-means algorithm to the federated setting [25].
• To apply modifications on federated data such as normalisation or reshaping.
• To evaluate the FL approach in comparison with the classical centralised one.
• To preserve DP of clients’ data and model parameters in the FL context. The platform currently offers support for the fundamental DP mechanisms (Randomized Response, Laplace, Exponential, Gauss), and the composition of DP mechanisms (Basic and Advanced adaptive composition using privacy filters for the maximum privacy loss). Moreover, it is possible to increase privacy by subsampling.

In Table 2, we summarise the main contributions of Sherpa.ai FL in comparison with the key points analysed for each framework in the previous Section.
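To make the fundamental DP mechanisms named in the list above more concrete, the following is a minimal NumPy sketch of three of them. It is not the Sherpa.ai FL API: the function names and signatures are hypothetical, and the Gaussian calibration assumes the variance condition on the Gaussian mechanism recalled in Section 2, with `gaussian_sigma` returning the boundary value of that condition.

```python
import numpy as np


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return int(bit) if rng.random() < p_truth else 1 - int(bit)


def laplace_mechanism(value, l1_sensitivity, epsilon, rng):
    """epsilon-DP release: add Laplace noise of scale sensitivity / epsilon."""
    scale = l1_sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Boundary of sigma^2 > 2 ln(1.25/delta) (sensitivity/epsilon)^2.

    The condition is strict, so a slightly larger sigma should be used in practice.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration assumes epsilon in (0, 1)")
    return np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    """(epsilon, delta)-DP style release using the calibration above."""
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return value + rng.normal(loc=0.0, scale=sigma, size=np.shape(value))


rng = np.random.default_rng(42)
true_mean = np.array([0.3, 0.7])   # e.g. a query answer with sensitivity 1
noisy_laplace = laplace_mechanism(true_mean, 1.0, 0.5, rng)
noisy_gauss = gaussian_mechanism(true_mean, 1.0, 0.5, 1e-5, rng)
noisy_bit = randomized_response(1, 1.0, rng)
```

The same random generator is passed explicitly so that experiments are reproducible; a smaller epsilon yields a larger noise scale in both additive mechanisms.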
Thanks to the hierarchical implementation of each module, the aforementioned functionalities can be extended and customised just by adding software classes that inherit from the original software classes. For example, the already available machine learning models, DP mechanisms and federated aggregation operators can be modified, or new ones can be created, simply by overriding the corresponding methods in the classes TrainableModel, DataAccessDefinition and FederatedAggregator, respectively.

Table 2: FL & DP features comparison between existing frameworks and Sherpa.ai FL.
Frameworks compared (columns): TensorFlow, LEAF, PySyft, PaddleFL.
Cell legend: Complete, Partial, Do not work, Unknown.
FL & DP features (rows):
Federated Learning:
• Use federated models with different datasets
• Support for other libraries
• Sampling environment: IID or non-IID distribution
• Federated aggregation mechanisms
• Federated attack simulator
Differential Privacy:
• Mechanisms: Exponential, Laplacian, Gaussian
• Sensitivity sampler
• Subsampling methods to increase privacy
• Adaptive Differential Privacy
Desired properties:
• Documentation & tutorials
• High level API
• Ability to extend the framework with new properties

4.3. Implementation details

Sherpa.ai FL has been developed by DaSCI Institute (https://dasci.es/) and Sherpa.ai (https://sherpa.ai/). We developed the software using the Python language for the whole architecture. Furthermore, the Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/) and scikit-learn (https://scikit-learn.org/stable/) APIs are employed for the machine learning part, which ensures efficiency and compatibility.

It can be run on computing devices such as CPUs or GPUs. In order to use GPUs, the adequate versions of TensorFlow and CUDA must be installed. For detailed installation instructions, please see the installation guide (https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/blob/master/install.md). The implementation details described in this paper correspond to release 0.1.0 of Sherpa.ai FL.

The framework is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0), a permissive license whose main conditions require preservation of copyright and license notices.

5. Machine learning matches federated learning. Methodological guidelines for preserving data privacy

FL is a paradigm of machine learning, but its particularities force us to adapt the machine learning settings to FL. For this reason, the Sherpa.ai FL functionalities outlined in Section 4 stem from the need to develop machine learning algorithms specialised for Federated Artificial Intelligence that protect clients’ data privacy. However, the adaptation is not only focused on the algorithms, but also on the workflow of machine learning.

In this section, we first discuss the key aspects of distributed computing for machine learning in the federated setting. Next, Section 5.2 defines a rather specific paradigm for adapting machine learning models to the federated setting, followed by some notable exceptions to the defined adaptation in Section 5.3. Finally, we also define the adaptation of the machine learning workflow to FL in Section 5.4 as methodological guidelines for preserving data privacy with FL using Sherpa.ai FL.

5.1. Key aspects of distributed computing for Federated Machine Learning

The recent introduction of FL [26, 14, 7] responds to the need for novel distributed machine learning algorithms in a setting that clashes with several assumptions of conventional parallel machine learning in a data centre. The differences stem largely from the unreliable and poor network connection of the clients, since clients are typically mobile phones. Thus, reducing the number of rounds of learning is essential, as communication constraints are more severe.
Additionally, the data is unevenly scattered across $K$ clients and must be considered as non-IID, that is, the data accessible locally is not in any way representative of the overall trend (Sherpa.ai FL allows for both IID and non-IID client data). The data is in general sparse, where the features of interest take place on a reduced number of clients or data points. Ultimately, the number of total clients greatly exceeds the number of training points available locally on each client ($K \gg n/K$).

In the federated machine learning setting, the training is decoupled from the access to the raw data. In fact, the raw data never leaves users’ mobile devices and a high-accuracy model is produced in a central server by aggregating locally computed updates (Sherpa.ai FL allows for weighted aggregation, emphasising the contribution of the most significant clients to the global model). At each FL round, an update vector $\theta \in \mathbb{R}^d$ is sent from each client to the central server to improve the global model, with $d$ the dimension of the parameters of the computed model. It is worth noting that the size of the update $\theta$ is thus independent of the amount of raw data available on the local client (e.g., $\theta$ might be a gradient vector). One of the advantages of this approach is the considerable bandwidth and time saved in data communication.

Another motivation for the FL setting (which also constitutes one of its intrinsic advantages) is the concern for privacy and security. By not transferring any raw data to the central server, the attack surface reduces to only the single client, instead of both client and server. On the other hand, the update $\theta$ sent by the client might still reveal some of its private information; however, this exposure will almost always be dramatically reduced with respect to the raw training data. Besides, after the current model has been improved with the update $\theta$, the update can (and should) be deleted.
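The server-side aggregation of the update vectors described above can be sketched as follows. This is a minimal NumPy illustration of plain and weighted federated averaging, not the Sherpa.ai FL implementation; the function name and the choice of weighting clients by their number of local samples are assumptions made for the example.

```python
import numpy as np


def federated_average(client_updates, client_weights=None):
    """Combine client update vectors theta_k in R^d into one global vector.

    client_updates: sequence of K arrays, each of length d.
    client_weights: optional sequence of K non-negative weights (for instance
        local sample counts); uniform weights give the plain average.
    """
    updates = np.stack([np.asarray(u, dtype=float) for u in client_updates])
    if client_weights is None:
        client_weights = np.ones(len(updates))
    weights = np.asarray(client_weights, dtype=float)
    weights = weights / weights.sum()   # normalise so the weights sum to 1
    return weights @ updates            # weighted sum over the K clients


# Each client sends only a d-dimensional vector, whatever its local data size.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
plain = federated_average(updates)
weighted = federated_average(updates, client_weights=[100, 300])
```

In the weighted call the second client holds three times as many samples, so its update contributes three quarters of the aggregated vector.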
Additional privacy can be provided by randomised algorithms providing DP [12], as detailed in Sections 2.3 and 2.4. In particular, the centralised algorithm could be equipped with a DP layer allowing the release of the global model without compromising the privacy of the individual clients who contributed to its training (see e.g., Abadi et al. [27]). On the other hand, in the case of a malicious or compromised server, or in the case of potential eavesdropping, DP can be applied on the local clients to protect their privacy [28, 29, 30].

5.2. The Machine Learning paradigm in a federated setting

In the following, we describe the federated machine learning paradigm by recognising some relevant attributes that ease the natural adaptation of a ML model to the federated setting.

Primarily, we observe that a great number of machine learning methods resemble the minimisation of an objective function with finite-sum structure, as in Equation 3. This problem structure encompasses both linear and logistic regressions, support vector machines, and also more elaborate techniques such as conditional random fields and neural networks [26]. Indeed, in neural networks predictions are made through a non-convex function, yet the resulting objective function can still be expressed as $F(\theta)$ and the gradients can be efficiently obtained by backpropagation, thus resembling Equation 3.

A variety of algorithms have been proposed to solve the minimisation problem in Equation 3 in the federated setting, where, as mentioned earlier, the primary constraint is communication efficiency, that is, reducing the number of FL rounds in the aggregation of local models. In this context, another characteristic trait of the federated machine learning paradigm is its intrinsic compatibility with baseline aggregation operators (e.g. Federated Averaging), where no ad-hoc adaptation is required.

Ultimately, several of these FL algorithms have been supplied with DP [27, 29, 30].
We thus identify a rather important aspect of the federated machine learning paradigm: it is prone to straightforward application of the common building components of DP (Sherpa.ai FL allows sophisticated and customised DP mechanisms to be applied on the model’s parameters, as well as on clients’ raw data; see Section 2.4). In addition, this feature eases the task of estimating the privacy loss in the FL rounds by the application of composition theorems for DP (Sherpa.ai FL offers support for the common building components for DP, as well as for its basic and advanced composition theorems using privacy filters; see Section 2.4).

To summarise, a machine learning method is prone to adaptation to the federated setting if it adheres to the principles of the federated machine learning paradigm described above, namely: (1) a problem structure resembling the minimisation of an objective function as in Equation 3, (2) easy aggregation of local models’ parameters, and (3) direct applicability of DP techniques for additional privacy. Among such machine learning models we cite neural networks [31], and linear [32, 33] and logistic [34] regressions.

5.3. Models deviating from the federated machine learning paradigm

It is worth mentioning specific machine learning models whose structure only partially fits the federated machine learning paradigm described above. Although the problem structure can still be represented by the minimisation of an objective function as in Equation 3, their adaptation to a federated setting requires additional, ad-hoc procedures.

One example is found in the k-means clustering algorithm for unsupervised learning [35, 36], where the non-IID nature of the data distribution is seen as a major obstacle in a federated setting. Namely, the direct application of average aggregation is unfeasible due to the potentially different number and ordering of local clusters, and more advanced algorithms need to be employed.
A workable solution is to fix the number of local clusters, and apply an additional k-means clustering in the average aggregation. Alternatively, one might try grouping the clients’ population sharing jointly trainable data distributions, as proposed by Sattler et al. [37] in the context of deep neural networks. An additional complication is the preservation of clients’ privacy. For instance, the baseline DP building components need some adjustments in order to be applied. Although not in a FL context, Zhang et al. [38] adjust the Laplace noise added to each centroid based on the contour coefficients.

Another notable example is the federated version of matrix factorization-based ranking algorithms for recommendation systems [39, 40]. The peculiar architecture of this algorithm involves the communication of only a portion of the update vector for improving the model, so each round of learning implies additional communications between the central server and the clients. Moreover, the multiple communication iterations require further caution with privacy loss in the DP context. A viable approach is to implement a two-stage randomised response mechanism on the local data, and to allow the clients to specify their privacy level [30].

5.4. Methodological guidelines for preserving data privacy with federated learning in Sherpa.ai FL

The experimental settings of FL and machine learning are very similar because FL is a machine learning paradigm. Nonetheless, the particularities of FL force us to revise the machine learning workflow and adapt it to the FL definition. In this section, we define the workflow of FL based on the machine learning workflow, and we present it as methodological guidelines. Sherpa.ai FL complies with these methodological guidelines, ensuring that good practices are followed in the development of AI services at the edges that preserve data privacy.
Notebooks on deep learning and on linear and logistic regressions (Section 5.2), and on k-means clustering (Section 5.3), are available in Sherpa.ai FL at https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/tree/master/notebooks

We distinguish two scenarios in FL: (1) a real one, where we do not actually know the underlying distribution of the data, and (2) a simulation of a FL scenario, where it is possible to emulate a federated data distribution in order to analyse its use case. The guidelines are focused on the real FL scenario, although we remark on the particularities of a simulated FL experiment. Moreover, we assume that the problem is properly formulated, that is, the data features and the target variable are previously defined and agreed upon by the clients. Based on this hypothesis, we show the scheme of the workflow of a FL experiment in Figure 4 and detail it in the following sections.

Figure 4: Flow chart of a FL experiment (stages shown: DATA: data collection, data preparation; LEARNING MODEL: learning model selection, aggregation operator selection, model training, model evaluation, hyper-parameter tuning; PREDICTION: predictions making).

5.4.1. Data collection

In a real FL scenario, the data naturally belongs to the clients. Therefore, the data collection takes place locally at each client, resulting in a distributed approach from the outset.

In a strict FL scenario, the server has no knowledge at all of the data. However, the server may gain minor prior knowledge of the problem if a global validation or test dataset is used. Here, we assume that the server does not have any information, which is the most restrictive and common situation.

Remark: When simulating a FL scenario in scientific research, data collection is reduced to accessing a database. The distribution of the data among clients is simulated in the data preparation step.

5.4.2. Data preparation

Data preparation involves two tasks: (1) data partition, where we split the data into training, evaluation and test sets, and (2) data preprocessing, where we transform the training data in order to improve its quality.

Data partition. The process of splitting data into training, evaluation and test datasets in FL is similar to the centralised machine learning process, with the difference that the process is replicated for the data stored on each client. That is, each client dataset is split into training, evaluation and test sets.

Remark: When it comes to a FL scenario for scientific research, it is feasible to have global evaluation and test datasets by extracting them before assigning the rest of the data to the clients as local training datasets. Moreover, in a simulation it could be good practice to use both global and local evaluation and test datasets, combining both methodologies.

Data preprocessing. Preprocessing is the trickiest task in FL due to the distributed and private character of the data. The challenge is to consistently preprocess datasets distributed over several clients without any clue about the underlying data distribution.

The process of adapting centralised preprocessing techniques to federated data is time-consuming. For the techniques based on statistics of data distributions (e.g. normalisation) it is necessary to use a robust aggregation of the statistics, which is a challenge in some situations. Algorithms based on intervals (e.g. discretisation) require a global interval that includes all the possible values. Moreover, some methods, such as feature selection [41], are complicated to adapt robustly. Because of these intricacies, it is advisable to rely on preprocessing techniques adapted to distributed scenarios.

Regarding distributed data preprocessing, we might take inspiration from different distributed preprocessing techniques that have already been developed [42].
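For statistics-based techniques such as normalisation, one simple adaptation is for each client to share only per-feature counts, sums and sums of squares, which the server combines into a global mean and standard deviation. The sketch below illustrates this idea under that assumption. The function names are hypothetical, plain sums are not a robust aggregation, and the shared statistics would still need protection (for example with DP) in a privacy-sensitive deployment.

```python
import numpy as np


def local_statistics(X):
    """Computed on each client: sample count, per-feature sum and sum of squares."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), (X ** 2).sum(axis=0)


def aggregate_statistics(client_stats):
    """Computed on the server: global per-feature mean and standard deviation."""
    n = sum(s[0] for s in client_stats)
    total = sum(s[1] for s in client_stats)
    total_sq = sum(s[2] for s in client_stats)
    mean = total / n
    variance = total_sq / n - mean ** 2          # E[x^2] - (E[x])^2
    std = np.sqrt(np.maximum(variance, 0.0))     # guard against round-off
    return mean, std


def normalise(X, mean, std):
    """Applied locally on each client with the broadcast global statistics."""
    safe_std = np.where(std > 0, std, 1.0)       # leave constant features unscaled
    return (np.asarray(X, dtype=float) - mean) / safe_std


# Two clients with different amounts of local data.
client_a = np.array([[1.0, 2.0], [3.0, 4.0]])
client_b = np.array([[5.0, 6.0]])

mean, std = aggregate_statistics([local_statistics(client_a),
                                  local_statistics(client_b)])
normalised_a = normalise(client_a, mean, std)
```

The result matches a centralised normalisation of the pooled data, while only three small aggregates per client leave the device.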
Most of these methods, however, need to be adapted in order to respect data privacy. A distributed model that suits privacy restrictions is MapReduce [43]. Therefore, big data preprocessing techniques [44] that are interactively applicable can be adapted to a FL scenario in compliance with data constraints.

Remark: When simulating FL, it is possible to use centralised preprocessing methods before splitting the data between the clients. It is not a recommended practice, but a useful trade-off in terms of experimentation.

5.4.3. Model selection

This step implies, besides the choice of the learning model as in any centralised approach, the choice of the parameter aggregation mechanism used in the server.

Choosing the learning model. This task consists of choosing the learning model structure stored both in the server and in the clients. Clearly, the model has to correspond to the type of problem being addressed. The only restriction is that the learning model has to be representable using parameters, so that the server learning model can be obtained by aggregating local parameters. The canonical example of a learning model that can be represented using parameters is deep learning, but it is not the only one.

Remark: When we simulate a FL scenario, the server learning model can be initialised using global previous information. However, in a strict FL scenario it is initialised with the first aggregation of local parameters.

Choosing the aggregation operator. At this point we also need to choose the aggregation operator used to aggregate the clients' parameters. There are different types of aggregation operators: (1) operators which aggregate the parameters of every client (such as FedAvg), (2) operators which select the clients that take part in the aggregation (e.g. based on their performance) and (3) asynchronous aggregation operators (such as CO-OP).

5.4.4. Model training

The iterative FL training process is divided into rounds of learning, and each round consists of:

1. training the local models on their local training datasets,
2. sharing the local parameters with the server,
3. aggregating the local models' parameters on the server using the aggregation operator, and
4. updating the local models with the aggregated global model.

5.4.5. Model evaluation

The evaluation of a FL model consists of assessing the aggregated model, after assigning it to each client, using the local evaluation datasets. After that, each client shares its performance with the server, which combines the local performances into global evaluation metrics. Since the amount of data per client can vary, we recommend using absolute metrics on the clients (e.g. the confusion matrix) and combining them on the server to derive the remaining evaluation metrics.

Remark 1: When simulating FL, we can use a global evaluation dataset in order to evaluate the performance of the aggregated model. Moreover, we can use cross-validation methodologies to evaluate the model's performance by partitioning all the folds at the beginning and replicating the whole workflow for each of the fold combinations.

Remark 2: Although it is not the main purpose of FL, it might be worthwhile to evaluate the local models prior to the aggregation in order to measure the customisation of the local model to each client.

5.4.6. Hyper-parameter tuning

We base the tuning of the hyper-parameters of the learning models on the metrics obtained in the previous step, and modify certain learning model parameters in order to improve the performance on the evaluation datasets.

Remark: Regarding the previously mentioned customisation, although it is not the objective of FL, we could tune each of the local models independently according to its local evaluation performance before the aggregation in order to improve customisation.

5.4.7. Predictions making

The last step in the machine learning workflow after the training of the learning model, and by extension in the corresponding FL one, is to predict the label of unknown examples. Those predictions are made on the test sets of each client.

Remark: When simulating FL, we can use a global test dataset for prediction. Moreover, we can test local learning models prior to aggregation using instances of other clients (unknown targets) in order to measure the generalisation capability of the local models.

6. Illustrative cases of study

One of the main characteristics of Sherpa.ai FL is that it is built upon the methodological guidelines for FL detailed in Section 5.4. In this section, we show how to follow these methodological guidelines with Sherpa.ai FL through two experimental use cases, namely:

1. Classification with FL (see Section 6.1): we show how to create each of the key elements of a FL experiment and combine them using our framework. To end this example, we also compare the FL approach with a centralised one.
2. Regression with FL and DP (see Section 6.2): we compare the centralised approach with the FL approach with DP. Moreover, we demonstrate how to limit the privacy loss through the Privacy Filters implemented in Sherpa.ai FL.

For more illustrative examples of the framework use, please see the notebook examples.

6.1. Classification with FL

In this section we provide a simple example of how to develop a classification experiment in a FL setting with Sherpa.ai FL. We use a popular dataset to start the experimentation in a federated environment, and we finish the example with a comparison between the federated and centralised approaches.

6.1.1. Case of study

In order to show the functionality of the software, we implement a simple and instructive case of study. We use the EMNIST Digits dataset [45]. It is an extended version of the classic MNIST dataset which includes the handwriting of several authors with different features.
This fact gives the data its non-IID character, which is useful for the simulation of federated environments. Table 3 shows the size of the dataset. For the simulation of the FL scenario we use 5 clients among which the instances of the dataset are distributed following a non-IID distribution. We use as learning model a simple CNN (Convolutional Neural Network) based neural network represented in Figure 5, and as federated aggregation operator the widely used FedAvg operator. The code of the illustrative example is detailed in the following section.

Notebook examples: https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/tree/master/notebooks
EMNIST dataset: https://www.nist.gov/itl/products-and-services/emnist-dataset

[Figure 5: CNN-based neural network used as learning model in the illustrative example. Labels recovered from the diagram: tensor shapes 28x28x1, 28x28x64, 14x14x64, 14x14x32, 7x7x32 and 1568; layers conv3x3 (64 and 32 filters, stride (1, 1)), maxpool2x2 (stride (2, 2)), flatten and three dense layers.]

Train set    Test set    Total
240 000      40 000      280 000

Table 3: Distribution of EMNIST Digits dataset.

6.1.2. Description of the code

We start the simulation with the first step of the methodological guidelines, i.e. data collection. Accordingly, we begin by loading the dataset. Sherpa.ai FL provides some functions to load the EMNIST Digits dataset.

[1]:
    import matplotlib.pyplot as plt
    import shfl
    from shfl.private.reproducibility import Reproducibility

    # Comment to turn off reproducibility:
    Reproducibility(1234)

    database = shfl.data_base.Emnist()
    train_data, train_labels, test_data, test_labels = database.load_data()

We can inspect some properties of the loaded data, for instance the size or the dimension of the data.

[2]:
    print(len(train_data))
    print(len(test_data))
    print(type(train_data[0]))
    train_data[0].shape

    <class 'numpy.ndarray'>

[2]: (28, 28)

As we see, our dataset is composed of a set of 28 by 28 matrices.
Before starting with the federated scenario, we can take a look at a sample of the training data.

[3]:
    plt.imshow(train_data[0])

[3]: <matplotlib.image.AxesImage at 0x105ea3450>

Now, we simulate a FL scenario with a set of 5 client nodes containing private data, and a central server which is responsible for coordinating the different clients. First of all, we simulate the data contained in every client with an IID distribution of the data.

[4]:
    iid_distribution = shfl.data_distribution.IidDataDistribution(database)
    federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5, percent=50)

As a result, we have created federated data from the EMNIST dataset with 5 nodes, using the share of the available data set by the percent argument (50 in the call above). Hence, the data collection process has finished. This data is a set of data nodes containing private data.

[5]:
    print(type(federated_data))
    print(federated_data.num_nodes())
    federated_data[0].private_data

    <class 'shfl.private.federated_operation.FederatedData'>
    Node private data, you can see the data for debug purposes but the data remains in the node
    <class 'dict'>
    {'112883278416': <shfl.private.data.LabeledData object at 0x1a486393d0>}

As we can see, private data in a node is not directly accessible, but the framework provides mechanisms to use this data in a machine learning model.

Once the data is prepared, the next step is the definition of the neural network architecture (model selection) used throughout the learning process. The framework provides a class to adapt a Keras (or Tensorflow) model to the framework, so you only have to create a function that will act as model builder.
[6]:
    import tensorflow as tf

    def model_builder():
        model = tf.keras.models.Sequential()
        model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
        model.add(tf.keras.layers.Dropout(0.4))
        model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
        model.add(tf.keras.layers.Dropout(0.3))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(128, activation='relu'))
        model.add(tf.keras.layers.Dropout(0.1))
        model.add(tf.keras.layers.Dense(64, activation='relu'))
        model.add(tf.keras.layers.Dense(10, activation='softmax'))

        model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

        # Wrap the Keras model so that the framework can handle it.
        return shfl.model.DeepLearningModel(model)

The following step is the definition of the federated aggregation operator, which completes the model selection in FL. The framework provides some aggregation operators that can be used immediately, as well as the possibility of defining your own operator. In this case, we use the provided FedAvg operator.

[7]:
    aggregator = shfl.federated_aggregator.FedAvgAggregator()
    federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)

The framework also makes it possible to transform the data in the data preprocessing step by defining federated operations through the FederatedTransformation interface. We first reshape the data and then normalise it, using the mean and standard deviation (std) of the training data as normalisation parameters.
[8]:
    import numpy as np

    class Reshape(shfl.private.FederatedTransformation):

        def apply(self, labeled_data):
            # Add a trailing channel dimension to each image.
            labeled_data.data = np.reshape(labeled_data.data, (labeled_data.data.shape[0], labeled_data.data.shape[1], labeled_data.data.shape[2], 1))

    shfl.private.federated_operation.apply_federated_transformation(federated_data, Reshape())

[9]:
    import numpy as np

    class Normalize(shfl.private.FederatedTransformation):

        def __init__(self, mean, std):
            self.__mean = mean
            self.__std = std

        def apply(self, labeled_data):
            labeled_data.data = (labeled_data.data - self.__mean) / self.__std

    mean = np.mean(train_data.data)
    std = np.std(train_data.data)
    shfl.private.federated_operation.apply_federated_transformation(federated_data, Normalize(mean, std))

We are now ready to train the FL algorithm. We run 2 rounds of learning, showing the test loss and accuracy of each client and of the global aggregated model (each pair below is [loss, accuracy]).

[10]:
    test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2], 1))
    federated_government.run_rounds(2, test_data, test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a485e2450>: [15.087034225463867, 0.9314000010490417]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x106a0ffd0>: [21.040000915527344, 0.9094250202178955]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639e90>: [11.712089538574219, 0.9425749778747559]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a486396d0>: [10.11756420135498, 0.9498249888420105]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639c50>: [24.04242706298828, 0.8968499898910522]
    Global model test performance : [7.954472064971924, 0.9403749704360962]

    Accuracy round 1
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a485e2450>: [21.94520378112793, 0.9227499961853027]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x106a0ffd0>: [16.780630111694336, 0.9445000290870667]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639e90>: [13.413337707519531, 0.9463250041007996]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a486396d0>: [9.085938453674316, 0.9628000259399414]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639c50>: [20.926694869995117, 0.918524980545044]
    Global model test performance : [10.171743392944336, 0.958299994468689]

If we focus our attention on the test accuracy of each client, we see that the results vary widely. This is because of the scattered nature of the data distribution, which causes disparity in the quality of the training data between clients.

6.1.3. Comparison with a centralised convolutional neural network approach

We analyse the behaviour of the FL approach in comparison with the equivalent centralised approach, which means training the neural network represented in Figure 5 on the same data using centralised learning. For this experiment, we use 25 clients and 10 rounds of learning with 5 epochs in both the IID and the non-IID scenario; in the latter, each node's data contains only a portion of all labels. For a fair comparison, in the classical approach we train for $\text{epochs}_{FL} \times \text{rounds}_{FL}$ epochs (here $5 \times 10 = 50$).

                        IID       non-IID
Centralised approach    0.9904    0.9901
Federated approach      0.9921    0.9855

Table 4: Accuracy of the FL and the classical approach, in both IID and non-IID scenarios. In the FL case, the data is distributed over 25 clients, and 10 FL rounds of learning with 5 epochs per client are employed.

The high performance of the federated approach stands out in Table 4, where the accuracy for the considered scenarios is reported. In the IID scenario, it slightly outperforms the centralised approach, which suggests that the approach is made robust by combining the information learned by each client. In the non-IID scenario, the federated approach attains lower results than the centralised one due to the additional challenge of the non-homogeneous distribution of data across clients. However, the results remain very competitive, highlighting the strength of the federated approach.

Running more learning rounds results in better performance, as in the next section. The purpose of this example is to show how it works.
The performance of the centralised approach using non-IID data is not perfectly identical to the IID case due to the random sampling employed when generating the non-IID nodes' data.

6.2. Linear regression with DP

This section presents a linear regression FL simulation with DP following the methodological guidelines with Sherpa.ai FL. The Laplace mechanism is used, with the model's sensitivity estimated by a sampling procedure [19]. Moreover, we demonstrate the application of the advanced composition theorem for DP in order not to exceed the maximum privacy loss allowed (see Section 2.4).

6.2.1. Case of study

We will use the California Housing dataset, which consists of approximately 20 000 samples of median house prices in California. Although the dataset possesses eight features, in this example we will only make use of the first two, in order to reduce the variance in the prediction. The (single) target is the cost of the house.
As can be observed in the code below, we retain 2 000 samples for later use with the sensitivity sampling for DP, and the rest of the data is split into train and test sets as detailed in Table 5.

Train set    Test set    Total
14 912       3 728       18 640

Table 5: Distribution of the California Housing dataset.

For the FL simulation we use 5 clients among which the train dataset is distributed in an IID fashion. FedAvg is chosen as the federated aggregation operator. The code of the example is detailed in the following section.

6.2.2. Description of the code

Sherpa.ai FL makes it easy to convert a generic dataset so that it can interact with the platform:

    import shfl
    from shfl.data_base.data_base import LabeledDatabase
    import sklearn.datasets
    import numpy as np
    from shfl.private.reproducibility import Reproducibility

    # Comment to turn off reproducibility:
    Reproducibility(1234)

    all_data = sklearn.datasets.fetch_california_housing()
    n_features = 2
    data = all_data["data"][:, 0:n_features]
    labels = all_data["target"]

    # Retain part for DP sensitivity sampling:
    size = 2000
    sampling_data = data[-size:, ]
    sampling_labels = labels[-size:, ]

    # Create database:
    database = LabeledDatabase(data[0:-size, ], labels[0:-size])
    train_data, train_labels, test_data, test_labels = database.load_data()

Sherpa.ai FL offers support for the linear regression model from scikit-learn: https://scikit-learn.org/stable/index.html
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

We will simulate a FL scenario by distributing the train data over a collection of clients, assuming an IID setting:

    iid_distribution = shfl.data_distribution.IidDataDistribution(database)
    federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5)

At this stage, we need to define the linear regression model, and we choose the aggregation operator to be the average of the clients' models:

    from shfl.model.linear_regression_model import LinearRegressionModel

    def model_builder():
        model = LinearRegressionModel(n_features=n_features)
        return model

    aggregator = shfl.federated_aggregator.FedAvgAggregator()

6.2.3. Running the model in a Federated configuration

We are now ready to run the FL model. Note that in this case we set the number of rounds n=1, since no iterations are needed in the case of linear regression. The performance metrics used are the Root Mean Squared Error (RMSE) and the $R^2$ score. It can be observed that the performance of the global model (i.e. the aggregated model) is comparable to, and for most nodes slightly better than, the performance of each node, so the federated learning approach proves to be beneficial:

    federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)
    federated_government.run_rounds(n=1, test_data=test_data, test_label=test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a08d0>: [0.8161535463006577, 0.5010049851923566]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0ac8>: [0.81637303674763, 0.5007365568636023]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a09b0>: [0.8155342443231007, 0.5017619784187599]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0be0>: [0.8158502097728687, 0.5013758352304256]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0cf8>: [0.8151607067608612, 0.5022182878756591]
    Global model test performance : [0.8154147770321544, 0.5019079411109164]

6.2.4. Differential Privacy: sampling the model's sensitivity

When applying the Laplace privacy mechanism (see Section 2.4), the noise added has to be of the order of the sensitivity of the model's output, i.e. the model parameters of our linear regression. In the general case, the model's sensitivity might be difficult to compute analytically. An alternative approach is to attain random differential privacy through a sampling over the data [19]. That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimation of it by sampling over the dataset. This approach is convenient since it allows for the sensitivity estimation of an arbitrary model or a black-box computer function. The Sherpa.ai FL framework provides this functionality in the class SensitivitySampler.

In order to carry out this approach, we need to specify a distribution of the data to sample from. This in general requires previous knowledge and/or model assumptions. In order not to make any specific assumption on the distribution of the dataset, we can choose a uniform distribution. To this end, we define our own ProbabilityDistribution class that samples uniformly over a data-frame. We use the previously retained part of the dataset for sampling:

    class UniformDistribution(shfl.differential_privacy.ProbabilityDistribution):
        """
        Implement Uniform sampling over the data
        """
        def __init__(self, sample_data):
            self._sample_data = sample_data

        def sample(self, sample_size):
            row_indices = np.random.randint(low=0, high=self._sample_data.shape[0], size=sample_size, dtype='l')
            return self._sample_data[row_indices, :]

    # Stack features and labels so that each sampled row is a complete example.
    sample_data = np.hstack((sampling_data, sampling_labels.reshape(-1, 1)))

The class SensitivitySampler implements the sampling given a query, i.e. the learning model itself in this case. We only need to add the method get to our model since it is required by the class SensitivitySampler.
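To make the idea of an empirical sensitivity estimate concrete, the following is a simplified sketch, assuming a plain least-squares query and synthetic data: it repeatedly draws two datasets that differ in one row and records the $\ell_1$ distance between the fitted parameters. It is not the algorithm implemented by SensitivitySampler nor the exact procedure of [19], which selects a specific order statistic of the sampled norms to obtain a probabilistic guarantee; all function names and values below are hypothetical.

```python
import numpy as np


def fit_linear(data_array):
    """Least-squares fit; the last column is the target.

    Returns the coefficients followed by the intercept.
    """
    features = data_array[:, :-1]
    design = np.hstack((features, np.ones((features.shape[0], 1))))
    params, *_ = np.linalg.lstsq(design, data_array[:, -1], rcond=None)
    return params


def sample_sensitivity(query, sample_fn, n_samples, dataset_size):
    """Largest and mean L1 change of the query over sampled neighbouring datasets."""
    norms = []
    for _ in range(n_samples):
        dataset = sample_fn(dataset_size)
        neighbour = dataset.copy()
        neighbour[-1] = sample_fn(1)[0]  # the two datasets differ in one row
        norms.append(np.abs(query(dataset) - query(neighbour)).sum())
    return max(norms), float(np.mean(norms))


# Synthetic pool of rows (two features and one target) to sample from uniformly.
rng = np.random.default_rng(0)
pool = np.hstack((rng.uniform(0, 10, size=(500, 2)),
                  rng.uniform(0, 5, size=(500, 1))))


def uniform_rows(k):
    return pool[rng.integers(0, pool.shape[0], size=k)]


max_s, mean_s = sample_sensitivity(fit_linear, uniform_rows,
                                   n_samples=200, dataset_size=100)
```

Each sampled pair requires fitting the model twice, which is why this kind of estimate becomes expensive for large models.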
We choose the sensitivity norm to be the $\ell_1$ norm and we apply the sampling. The value of the sensitivity depends on the number of samples n: as n increases, the estimated sensitivity becomes more accurate and typically decreases.

    from shfl.differential_privacy import SensitivitySampler
    from shfl.differential_privacy import L1SensitivityNorm

    class LinearRegressionSample(LinearRegressionModel):

        def get(self, data_array):
            # The last column holds the labels, the remaining ones the features.
            data = data_array[:, 0:-1]
            labels = data_array[:, -1]
            train_model = self.train(data, labels)

            return self.get_model_params()

    distribution = UniformDistribution(sample_data)
    sampler = SensitivitySampler()
    n_samples = 4000
    max_sensitivity, mean_sensitivity = sampler.sample_sensitivity(
        LinearRegressionSample(n_features=n_features, n_targets=1),
        L1SensitivityNorm(), distribution, n=n_samples, gamma=0.05)
    print("Max sensitivity from sampling: " + str(max_sensitivity))
    print("Mean sensitivity from sampling: " + str(mean_sensitivity))

    Max sensitivity from sampling: 0.008294354064053988
    Mean sensitivity from sampling: 0.0006633612087443363

Unfortunately, sampling over a dataset involves training the model on two datasets differing in one entry [19]. Thus, in general this procedure might be computationally expensive (e.g. in the case of training a deep neural network).

6.2.5. Running the model in a Federated configuration with Differential Privacy

At this stage we are ready to add a layer of DP to our federated learning model. Specifically, we will apply the Laplace mechanism from Section 2.4, employing the sensitivity obtained from the previous sampling, namely $\Delta f \approx 0.008$. The Laplace mechanism provided by the Sherpa.ai FL framework is then assigned as the private access type to the model's parameters of each client in a new FederatedGovernment object. This results in an $\epsilon$-differentially private FL model.
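For intuition, the Laplace mechanism itself can be sketched in a few lines of NumPy: each released value is perturbed with Laplace noise of scale $\Delta f / \epsilon$. This is an illustrative stand-in rather than the framework's LaplaceMechanism class; the function name and the parameter vector below are hypothetical.

```python
import numpy as np


def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Return values perturbed with Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)


# Hypothetical model parameters released with sensitivity 0.008 and epsilon 0.5,
# so the noise scale is 0.008 / 0.5 = 0.016.
params = np.array([0.43, -0.01, 0.07])
noisy_params = laplace_mechanism(params, sensitivity=0.008, epsilon=0.5,
                                 rng=np.random.default_rng(0))
```

A smaller $\epsilon$ or a larger sensitivity increases the noise scale, which is the privacy and accuracy trade-off at work.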
For example, picking the value $\epsilon = 0.5$, we can run the FL experiment with DP:

    from shfl.differential_privacy import LaplaceMechanism

    params_access_definition = LaplaceMechanism(sensitivity=max_sensitivity, epsilon=0.5)
    federated_governmentDP = shfl.federated_government.FederatedGovernment(
        model_builder, federated_data, aggregator,
        model_params_access=params_access_definition)
    federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a08d0>: [0.8161535463006577, 0.5010049851923566]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0ac8>: [0.81637303674763, 0.5007365568636023]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a09b0>: [0.8155342443231007, 0.5017619784187599]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0be0>: [0.8158502097728687, 0.5013758352304256]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0cf8>: [0.8151607067608612, 0.5022182878756591]
    Global model test performance : [0.8309024800913748, 0.48280707735516126]

In the above example we observe that the performance of the model has slightly deteriorated due to the addition of DP. In general, privacy increases (i.e. for smaller values of $\epsilon$) at the expense of accuracy.

6.2.6. Comparison with centralised and non private approaches

It is of practical interest to assess the performance loss due to DP in the FL context. Table 6 reports the performance metrics for the centralised model, for the FL non-private model, and for the FL differentially-private model. In all the cases, the models have learned on the train set, and the performance results have been computed over the test set.
The data is IID over 5 clients in the FL cases. For the federated DP cases, different values of $\epsilon \in \{0.2, 0.5, 0.8\}$ are used and the total privacy expense is limited to $\epsilon = 4$. Thus, for each case, we take the average over the total number of runs performed before the budget is spent (see the discussion about advanced composition theorems for privacy filters in Section 2.4). The sensitivity is fixed, employing the value obtained from the sampling above.

Approach                                                 RMSE       $R^2$
Classical                                                0.81540    0.50192
Federated non-private                                    0.81541    0.50190
Federated DP ($\epsilon = 0.2$, average of 20 runs)      1.05541    0.04224
Federated DP ($\epsilon = 0.5$, average of 8 runs)       0.84501    0.46457
Federated DP ($\epsilon = 0.8$, average of 5 runs)       0.82171    0.49414

Table 6: Federated linear regression: comparison between the classical centralised model, the non-private FL model, and the FL model with a DP layer using the Laplace mechanism. For the DP cases, the results are the average over the total runs allowed for the maximum privacy budget $\epsilon = 4$. Different values of $\epsilon \in \{0.2, 0.5, 0.8\}$ are considered, and the sensitivity is fixed. The data is IID distributed over 5 clients.

It can be observed that the centralised model and the non-private FL model exhibit comparable performance, so the accuracy is not degraded by applying a FL approach. The application of the Laplace mechanism guarantees $\epsilon$-DP, and the accuracy of the FL model can be tuned by setting the value of $\epsilon$: for higher values, lower privacy is guaranteed, but the accuracy increases.

7. Concluding remarks

The characteristics of FL and DP make them good candidates to support AI services at the edges and to preserve data privacy. Hence, several software tools for FL and DP have been released. After a comparative analysis, we conclude that these software tools do not provide unified support for FL and DP, and that they do not follow any particular methodological guidelines that direct the development of AI services to preserve data privacy.
Note that, when applying the composition theorems for privacy filters in the present example, we are assuming that the estimated sensitivity is a good enough approximation of the analytic sensitivity [22].

Since FL is a machine learning paradigm, we have studied how to adapt the machine learning principles to the FL ones, and consequently we have also defined the workflow of an experimental setting of FL. The main result of that study is Sherpa.ai FL, a new software framework with unified support for FL and DP that makes it possible to follow the defined methodological guidelines for FL.

The combination of the methodological guidelines and Sherpa.ai FL is shown by means of a classification and a regression use case. Those illustrative examples also show that the centralised and federated settings of the same experiments achieve similar results, which means that the joint use of FL and DP can support the development of AI services at the edges that preserve data privacy.

Sherpa.ai FL is in continuous development. Since the FL and DP fields are constantly growing, we plan to extend the framework's functionalities with new federated aggregation operators, machine learning models, and data distributions. Moreover, new DP mechanisms such as RAPPOR [46] will be added, together with relaxations of DP such as Concentrated DP [47] or Rényi DP [48].

Acknowledgments

This research work is partially supported by the contract OTRI-4137 with SHERPA Europe S.L. and the Spanish Government project TIN2017-89517-P. Nuria Rodríguez Barroso and Eugenio Martínez Cámara were supported by the Spanish Government fellowship programmes Formación de Profesorado Universitario (FPU18/04475) and Juan de la Cierva Incorporación (IJC2018-036092-I) respectively.

References

[1] E.
Parliament, Regulation (eu) 2016/679 of the european parliament and of the coun- cil of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation) (2016). [2] E. H.-L. E. G. on AI, The ethics guidelines for trustworthy artificial intelligence (ai) (2019). [3] A. Jalalirad, M. Scavuzzo, C. Capota, M. R. Sprague, A simple and efficient federated recommender system, 2019, pp. 53–58. [4] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, W. Shi, Federated learning of predictive models from federated electronic health records, International Journal of Medical Informatics 112 (2018) 59 – 67. [5] D. Kawa, S. Punyani, P. Nayak, A. Karker1, V. Jyotinagar, Credit risk assessment from combined bank records using federated learning, International Research Jour- nal of Engineering and Technology 6 (2019) 1355–1358. 42 [6] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, C. S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in: IEEE INFO- COM 2019 - IEEE Conference on Computer Communications, 2019, pp. 1387–1395. [7] J. Konecný, ˇ H. McMahan, F. Yu, P. Richtarik, A. Suresh, D. Bacon, Federated learning: Strategies for improving communication efficiency, in: NIPS Workshop on Private Multi-Party Machine Learning, 2016. [8] P. Kairouz, H. B. M. et al., Advances and open problems in federated learning (2019). [9] A. Bhagoji, S. Chakraborty, P. Mittal, S. Calo, Analyzing federated learning through an adversarial lens, in: Proceedings of the 36th International Conference on Machine Learning, volume 97, 2019, pp. 634–643. [10] M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322 – 1333. [11] K. Chaudhuri, N. 
Mishra, When random sampling preserves privacy, in: Annual International Cryptology Conference, 2006, pp. 198–213.
[12] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends® in Theoretical Computer Science 9 (2014) 211–407.
[13] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, H. Yu, Federated Learning, volume 13, Morgan & Claypool, 2019.
[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, 2017, pp. 1273–1282.
[15] Y. Wang, CO-OP: Cooperative Machine Learning from Mobile Devices, Ph.D. thesis, University of Alberta, 2017.
[16] M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Association for Computing Machinery, 2015, pp. 1322–1333.
[17] H. B. McMahan, D. Ramage, K. Talwar, L. Zhang, Learning differentially private recurrent language models, in: 6th International Conference on Learning Representations, 2018.
[18] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of Cryptography, 2006, pp. 265–284.
[19] B. I. P. Rubinstein, F. Aldà, Pain-free random differential privacy with sensitivity sampling, in: Proceedings of the 34th International Conference on Machine Learning, volume 70, 2017, pp. 2950–2959.
[20] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: 48th Annual IEEE Symposium on Foundations of Computer Science, 2007, pp. 94–103.
[21] P. Kairouz, S. Oh, P. Viswanath, The composition theorem for differential privacy, IEEE Transactions on Information Theory 63 (2017) 4037–4049.
[22] R. M. Rogers, A. Roth, J. Ullman, S.
Vadhan, Privacy odometers and filters: Pay-as-you-go composition, in: Advances in Neural Information Processing Systems, 2016, pp. 1921–1929.
[23] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, Advances in Neural Information Processing Systems (2018) 6277–6287.
[24] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, Federated learning of deep networks using model averaging (2016).
[25] A. Soliman, S. Girdzijauskas, M.-R. Bouguelia, S. Pashami, S. Nowaczyk, Decentralized and adaptive k-means clustering for non-iid data using hyperloglog counters, 2020, pp. 343–355.
[26] J. Konečný, H. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence (2016).
[27] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016) 308–318.
[28] M. J. Wainwright, M. I. Jordan, J. C. Duchi, Privacy aware learning, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, 2012, pp. 1430–1438.
[29] R. C. Geyer, T. J. Klein, M. Nabi, Differentially private federated learning: A client level perspective (2019).
[30] J.-Y. Jiang, C.-T. Li, S.-D. Lin, Towards a more reliable privacy-preserving recommender system, Information Sciences 482 (2019) 248–265.
[31] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. M. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, J. Roselander, Towards federated learning at scale: System design, in: SysML 2019,
[32] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, D. Evans, Secure linear regression on vertically partitioned datasets, IACR Cryptol 2016 (2016)
[33] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J.
Doerner, S. Zahur, D. Evans, Privacy-preserving distributed linear regression on high-dimensional data, Proceedings on Privacy Enhancing Technologies (2017) 345–364.
[34] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, B. Thorne, Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption (2017).
[35] S. P. Lloyd, Least squares quantization in pcm, IEEE Transactions on Information Theory 28 (1982) 129–137.
[36] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, New York, NY, 2013.
[37] F. Sattler, K.-R. Müller, W. Samek, Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints (2019).
[38] Y. Zhang, N. Liu, S. Wang, A differential privacy protecting k-means clustering algorithm based on contour coefficients, PLoS ONE 13 (2018) 1–15.
[39] S. Funk, Netflix update: Try this at home (2006). Accessed: 2020-05-09.
[40] M. Ammad-ud-din, E. Ivannikova, S. A. Khan, W. Oyomno, Q. Fu, K. E. Tan, A. Flanagan, Federated collaborative filtering for privacy-preserving personalized recommendation system (2019).
[41] L. Morán-Fernández, V. Bolón-Canedo, A. Alonso-Betanzos, Centralized vs. distributed feature selection methods based on data complexity measures, Knowledge-Based Systems 117 (2017) 27–45.
[42] U. S., N. Malaiyappan, Approaches and techniques of distributed data mining: A comprehensive study, International Journal of Engineering and Technology 9 (2017) 63–76.
[43] J. Dean, S. Ghemawat, Mapreduce: Simplified data processing on large clusters, in: OSDI’04: Sixth Symposium on Operating System Design and Implementation, 2004, pp. 137–150.
[44] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, Big Data Preprocessing: Enabling Smart Data, Springer, 2020.
[45] G. Cohen, S. Afshar, J. Tapson, A.
van Schaik, EMNIST: Extending MNIST to handwritten letters, in: 2017 International Joint Conference on Neural Networks, 2017, pp. 2921–2926.
[46] Úlfar Erlingsson, V. Pihur, A. Korolova, Rappor: Randomized aggregatable privacy-preserving ordinal response, in: Proceedings of the 21st ACM Conference on Computer and Communications Security, Scottsdale, Arizona, 2014.
[47] C. Dwork, G. N. Rothblum, Concentrated differential privacy (2016).
[48] I. Mironov, Rényi differential privacy, 2017 IEEE 30th Computer Security Foundations Symposium (CSF) (2017).

Statistics arXiv (Cornell University)

Federated Learning and Differential Privacy: Software tools analysis, the Sherpa.ai FL framework and methodological guidelines for preserving data privacy

ISSN
1566-2535
eISSN
ARCH-3347
DOI
10.1016/j.inffus.2020.07.009

Abstract

The high demand for artificial intelligence services at the edges that also preserve data privacy has pushed the research on novel machine learning paradigms that fit these requirements. Federated learning has the ambition to protect data privacy through distributed learning methods that keep the data in its storage silos. Likewise, differential privacy aims to improve the protection of data privacy by measuring the privacy loss in the communication among the elements of federated learning. The prospective matching of federated learning and differential privacy to the challenges of data privacy protection has caused the release of several software tools that support their functionalities, but they lack a unified vision of these techniques, and a methodological workflow that supports their usage. Hence, we present the Sherpa.ai Federated Learning framework that is built upon a holistic view of federated learning and differential privacy. It results from both the study of how to adapt the machine learning paradigm to federated learning, and the definition of methodological guidelines for developing artificial intelligence services based on federated learning and differential privacy. We show how to follow the methodological guidelines with the Sherpa.ai Federated Learning framework by means of a classification and a regression use case.

Keywords: federated learning, differential privacy, software framework, Sherpa.ai Federated Learning framework

Corresponding author

Email addresses: rbnuria@ugr.es (Nuria Rodríguez-Barroso), g.stipcich@sherpa.ai (Goran Stipcich), dajilo@ugr.es (Daniel Jiménez-López), jantonioruiz@ugr.es (José Antonio Ruiz-Millán), emcamara@decsai.ugr.es (Eugenio Martínez-Cámara), g.gonzalez@sherpa.ai (Gerardo González-Seco), luzon@ugr.es (M. Victoria Luzón), ma.veganzones@sherpa.ai (Miguel Ángel Veganzones), herrera@decsai.ugr.es (Francisco Herrera)

arXiv:2007.00914v2 [cs.LG] 6 Oct 2020

1. Introduction

The latest advances in fundamental and applied research in artificial intelligence (AI) have aroused interest in industry and end users. This interest goes beyond the traditional centralised setting of AI, and nowadays there is a high demand for AI services at the edges. One of the main pillars of AI is data, whose larger availability has boosted the progress of AI in the last years. However, data is a sensitive element, especially when it describes users’ personal features, such as clinical or financial data. This sensitive nature of personal data has raised the awareness of end users on data privacy protection, promoting the publication of legal frameworks [1] and recommendations for developing AI services that preserve data privacy [2].

In this context, the progress of AI applications is based on (1) using data generated or stored at the edges, (2) working with large amounts of data from a wide range of sources, and (3) protecting data privacy in order to comply with the legal restrictions and to pay attention to end users’ concerns. Some use cases of AI with these dependencies are:

• When data contains sensitive information, such as email accounts, personalised recommendations or health records, applications should employ privacy-preserving techniques to learn from a population of users whilst keeping the sensitive information on each user’s device [3].
• When information is located in data silos. For instance, the healthcare industry is usually reluctant to disclose its records, keeping them as sequestered data [4]. Nevertheless, joint learning from the data silos of different health institutions would improve the robustness of the resulting models.
• When data privacy legislation prevents sharing. Banks [5] and telecom [6] companies cannot share individual records. However, they would benefit from models that learn from several entities’ data.
The standard machine learning paradigm does not match the previous dependencies, as it learns from a centralised data source. Likewise, distributed machine learning does not meet the challenge of preserving data privacy, because data is shared among several computational elements. Moreover, distributed machine learning cannot cope with the challenges associated with decentralised data processing, such as the ability to work with a large number of clients with non-homogeneous data distributions [7].

Federated learning (FL) is a nascent machine learning paradigm where many clients, in the sense of electronic devices or entire organisations, jointly train a model under the orchestration of a central server, while keeping the training data decentralised [8]. Roughly speaking, data is not shared with the central server; instead, it is kept in the devices where it is stored or generated. Accordingly, FL addresses the challenges of developing AI services on data scattered across a large number of clients with non-homogeneous data distributions.

Maintaining the data in its corresponding storage silos does not completely assure privacy preservation, since several adversarial attacks can still be damaging [9]. Data obfuscation, anonymisation techniques, such as blindly trusting artificial intelligence black box models (e.g. convolutional neural networks), or randomly sampling data from the clients’ models have been proven to be inadequate to preserve privacy [10, 11]. Moreover, the complete obfuscation of the data greatly reduces its value, thus a balance between privacy and utility is needed. Differential privacy (DP) is proposed as a data access technique which aims to maintain personal data privacy while maximising its utility [12].

The characteristics of FL and DP, and by extension their combination, make them candidates to address the challenges of distributed AI services that preserve data privacy.
The research and progress of FL and DP need the support of software tools that ease the design of privacy-preserving AI services while not requiring development from scratch. Consequently, in recent years several software tools with FL and DP functionalities have been released with this aim. We perform a comparative analysis of the FL and DP software tools released so far, and we conclude that their lack of a holistic view of FL and DP hinders the development of unified FL and DP AI services, as well as the furtherance of addressing the challenges of AI services at the edges that preserve data privacy.

Therefore, we present the Sherpa.ai Federated Learning framework (footnotes 1 and 2), an open-source unified FL and DP framework for AI. Sherpa.ai FL aims to bridge the gap between fundamental and applied research. Moreover, it will facilitate open research and development of new solutions built upon FL and DP for the challenges posed by AI at the edges and data privacy protection. A flexible approach to a wide range of problems is assured by its modular design that takes into account all the key elements and functionalities of FL and DP, which consist of:

1. Data. Different data sets can be processed.
2. Learning model. Several core machine learning algorithms are incorporated.
3. Aggregation operator. Different operators for fusing the parameters of the clients’ learning models are embodied.
4. Clients. It is where the learning models are run.
5. Federated server. The clients can be orchestrated by different communication strategies.
6. Communication among clients and server. Different solutions are encompassed to reduce the communication iterations, to protect the learning from adversarial attacks, and to obfuscate the parameters with DP techniques.
7. DP mechanisms. The fundamental DP mechanisms, such as the Laplace mechanism, as well as the composition of DP mechanisms are incorporated.
Footnote 1: https://developers.sherpa.ai/privacy-technology/
Footnote 2: https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework

The progress of AI is not only supported by the release of software tools, but it also needs fundamental guidelines defining how to put together the different software tools’ attributes for reaching the intended learning goal while at the same time matching the problem restrictions. Accordingly, since FL is a machine learning paradigm, we first study the principles of machine learning and how to make them fit the FL requirements. We see that most machine learning methods can be directly adapted to a FL setting, but some of them require ad-hoc amendments. As a result of this study, we define the experimental workflow of FL in terms of methodological guidelines for preserving data privacy in the development of AI services at the edges. These methodological guidelines are grounded in the machine learning workflow, and they have guided the design and development of Sherpa.ai FL; therefore they can be followed with Sherpa.ai FL.

We show how to follow the mentioned methodological guidelines with Sherpa.ai FL through two examples encompassing a classification and a regression use case, namely:

1. Classification. We use the EMNIST Digits dataset to describe how to conduct a classification task with Sherpa.ai FL. We also compare the federated classification with its centralised counterpart. Both approaches achieve similar results.
2. Regression. We describe how to perform a regression experiment using the California Housing dataset. We compare the FL experiment with its centralised version. In addition, it is shown how to assess and limit the privacy loss using DP.

The main contributions of this paper are:

1. To analyse the most recently released FL and DP software tools, revealing the lack of a unified view of FL and DP that hinders the possibility of addressing the challenges of AI at the edges with data privacy.
2. To present Sherpa.ai FL, an open-source unified FL and DP framework for AI.
3. To study the adaptation of machine learning models to the principles of FL and, accordingly, to define the methodological guidelines which can be followed with Sherpa.ai FL for developing AI services that preserve data privacy with FL and DP.

The rest of the paper is organised as follows: the next section formally defines FL and DP as well as their key elements. Section 3 analyses the main FL and DP frameworks’ features. Section 4 introduces Sherpa.ai FL, including its software architecture and functionalities. Section 5 explains the adaptation of the machine learning paradigm to FL, taking into account the adaptation of core algorithms and the methodological guidelines of an experimental workflow. Section 6 shows some illustrative examples consisting of a classification and a regression problem. Finally, the concluding remarks and future work are reported in Section 7.

2. Federated Learning and Differential Privacy

The development of a framework for FL and DP requires a thorough understanding of what FL is and what its key elements are. Accordingly, we formally define FL in Section 2.1, and we detail each key element of a FL scheme in Section 2.2. Similarly, DP is defined in Section 2.3, and its key elements are described in Section 2.4.

2.1. The definition of Federated Learning

FL is a distributed machine learning paradigm that consists of a network of nodes where we distinguish two types of nodes: (1) Data owner nodes, $\{C_1, \ldots, C_n\}$, that possess a collection of data, $\{D_1, \ldots, D_n\}$, and (2) Aggregation nodes, $\{G_1, \ldots, G_k\}$, aiming at learning a model from data owners. The deployment of these two types of nodes defines, at least, two kinds of federated architectures according to Yang et al. [13], namely:

1. Peer-to-peer: It is the architecture in which all the nodes are both Data owner and Aggregation nodes. This scheme does not require a coordinator.
The main advantages are the elevated security and data privacy, while the main disadvantage is the computation cost. This FL architecture is illustrated in Figure 1.

2. Client-server: It consists of a coordinator Aggregation node named server and a set of Data owner nodes named clients. In this architecture, the client does not share its local data, ensuring its privacy. We represent the client-server scheme in Figure 2.

[Figure 1: Representation of peer-to-peer FL architecture.]

[Figure 2: Representation of client-server FL architecture.]

Footnote: In the literature we find different ways to refer to the clients in a FL architecture, namely: nodes, agents or clients. In this paper, we prefer the term clients.

Since the peer-to-peer model is a generalisation of the client-server model, we consider the latter for the formal definition of FL. In this architecture, each of the clients $C_i$ has a local learning model $LLM_i$ represented by the parameters $\theta_i$. FL aims at learning the global learning model $GLM$, represented by $\theta$, using the scattered data across clients through an iterative learning process known as round of learning. For that purpose, in each round of learning $t$, each client trains the $LLM_i$ over its local training data $D_i^t$, updating their local parameters $\theta_i^t$. Subsequently, the global parameters $\theta^t$ are computed aggregating the local parameters $\{\theta_1^t, \ldots, \theta_n^t\}$ using a specific federated aggregation operator $\Delta$:

$$\theta^t = \Delta(\theta_1^t, \theta_2^t, \ldots, \theta_n^t) \tag{1}$$

After the aggregation of the parameters in the GLM, the LLMs are updated with the aggregated parameters:

$$\theta_i^{t+1} \leftarrow \theta^t, \quad \forall i \in \{1, \ldots, n\} \tag{2}$$

The communication between server and clients can be synchronous or asynchronous. In the first option, the server waits for the clients’ updates, aggregates all the local parameters and sends them to each client. In the second option, the server merges the local parameters with the GLM as soon as it receives them, using a weighted scheme based on the age difference among the models.

We repeat this iterative process for as many rounds of learning as needed. Thus, the final value of $\theta$ will sum up the clients’ underlying knowledge. In particular, the learning goal is typically to minimise the following objective function:

$$\min_{\theta} F(\theta), \quad \text{with} \quad F(\theta) := \sum_{i=1}^{n} w_i F_i(\theta) \tag{3}$$

where $n$ is the number of clients, $F_i$ is the local objective function for the $i$-th client, which is the common objective function of the problem fitted to each client’s data, $w_i \geq 0$ and $\sum_i w_i = 1$.

2.2. Key elements of Federated Learning

The development of a FL environment requires the right combination of a set of necessary key elements. Since FL is a specific configuration of a machine learning environment, FL shares some key elements with it, namely: (1) data and (2) the learning model. However, the particularities of FL make additional key elements necessary, such as: (1) federated aggregation operators, (2) clients, (3) federated server and (4) communication among the federated server and the clients. The adaptation of the key elements common to FL and machine learning, and the FL-specific ones, are described as follows.

Data. Data plays a central role in FL as in machine learning. The distribution of data becomes crucial in FL since it is distributed among the different clients. Regarding the splitting of the data among clients, there are two possibilities depending on the data distribution:

• IID (Independent and Identically Distributed) scenario: when the data distribution in each client corresponds to the population data distribution.
In other words, the data in each client is independent and identically distributed, as well as representative of the population data distribution.
• Non-IID (non Independent and Identically Distributed) scenario: when the data distribution in each client is not independent or identically distributed from the population data distribution.

In a real FL scenario, each client only stores the data generated on the client itself, ensuring the non-IID property of the global data. Hence, the non-IID scenario is the most likely one and it represents a real challenge for FL.

Learning model. The learning model is the shared structure between the server and the clients, where each client trains a local model using its own data, while the global model on the server is never trained, but instead it is obtained by aggregating the clients’ model parameters. Thus multiple models are trained without explicit data sharing, a configuration that is essentially different from the classical (centralised) learning paradigm.

Federated aggregation operators. The aggregation operator is in charge of aggregating the parameters in the server. It has to: (1) assure a proper fusion of the local learning models in order to optimise the objective function in Equation 3; (2) reduce the number of communication rounds among the clients and the federated server and (3) be robust against clients with poor data quality or malicious clients. Some of the most commonly used federated aggregation operators in the literature are:

• Federated Averaging (FedAvg) [14]. It is based on keeping a shared global model that is periodically updated by averaging models that have been trained locally on clients. The training process is arranged by a central server which hosts the shared global model. However, the actual optimisation is done locally on clients.
• CO-OP [15]. It proposes an asynchronous approach, which merges any received client model with the global model.
Instead of directly averaging the models, the merging between a local model and the global model is carried out using a weighting scheme based on a measure of the difference in the age of the models. This is motivated by the fact that in an asynchronous framework, some clients will be trained on obsolete data while others will be trained on more up-to-date data.

Clients. Each client of a federated scenario represents a node of the distributed scheme. Typical clients in FL could be smartphones, IoT devices or connected vehicles. Each client owns its specific training dataset and its local model. Their principal aim is to train local models on their own private data and share the trained model parameters with the federated server, where the fusion of the parameters is performed.

Federated server. The federated server orchestrates the iterative learning of FL, which is composed of several rounds of learning. The server participates in: (1) receiving the trained parameters of the local models, (2) aggregating the trained parameters of each client model using federated aggregation operators and (3) updating every learning model with the aggregated parameters. The learning process involved in both training the local models and updating them is known as a round of learning. The global model, which is stored in the federated server, represents the final model after the learning process. Therefore it is used for predicting, testing or any posterior evaluation.

Communication among the federated server and the clients. Communication between clients and server is the trickiest element of a FL scheme. On the one hand, efficient communication is a crucial requirement due to the high communication times caused by network speed limitations and availability. For that reason, FL should minimise communications and maximise their efficiency by means of, for example, reducing the number of rounds of learning.
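As an illustration of the round of learning orchestrated by the federated server (Equations 1 and 2), the following minimal Python sketch runs a FedAvg-style loop over two clients. It is not the Sherpa.ai FL API: the function names (`local_update`, `fed_avg`, `run_rounds`), the linear model, the learning rate, and the choice of weighting each client by its number of samples are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Params = List[float]
Dataset = Sequence[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_update(theta: Params, data: Dataset, lr: float = 0.05, epochs: int = 5) -> Params:
    """Hypothetical local training: gradient descent on the mean squared
    error of the model y = w * x + b, starting from the global parameters."""
    w, b = theta
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return [w, b]


def fed_avg(local_params: List[Params], weights: List[float]) -> Params:
    """Aggregation operator (Equation 1): weighted average of client parameters."""
    total = sum(weights)
    dim = len(local_params[0])
    return [sum(wt * p[j] for wt, p in zip(weights, local_params)) / total
            for j in range(dim)]


def run_rounds(clients: List[Dataset], rounds: int = 300) -> Params:
    theta = [0.0, 0.0]  # global parameters stored on the server
    sizes = [float(len(d)) for d in clients]  # assumed weighting: samples per client
    for _ in range(rounds):
        # Every client starts from the current global parameters (Equation 2)
        # and trains on its own data; only parameters are sent to the server.
        local_params = [local_update(list(theta), d) for d in clients]
        theta = fed_avg(local_params, sizes)
    return theta


# Two clients whose private data follow y = 2x + 1 on different ranges of x.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]
theta = run_rounds(clients)
print(theta)
```

In this sketch the raw `(x, y)` pairs never leave `local_update`; the server only sees parameter vectors, which is the data flow the client-server architecture describes.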
On the other hand, the interchange of model parameters between the server and the clients constitutes a vulnerability of the federated server scheme, since the original data may be reconstructed from the model parameters through model-inversion adversarial attacks [16], resulting in a great risk of private data leakage. For this reason, DP techniques [12] are commonly used in order to share model parameters [17].

2.3. The definition of Differential Privacy

DP is the property of an algorithm whose input is typically a database, and whose encoded response allows relatively accurate answers to potential queries to be obtained [12, 18]. The motivation for DP stems from the necessity of ensuring the privacy of individuals whose sensitive details are part of a database, while at the same time being able to gain accurate knowledge about the whole population when learning from the database.

DP does not imply a binary concept, i.e., the guarantee or not of an individual’s data privacy. Instead DP establishes a formal measure of privacy loss, allowing for comparison between different approaches. Thus, DP will rigorously bound the possible harm to an individual whose sensitive information belongs to the database by fixing a budget for privacy loss.

The formal definition of DP requires a few preliminary notions. Namely, we define the probability simplex over a discrete set $B$, denoted $\Delta(B)$, as the set of real valued vectors whose $|B|$ components sum up to one and are non-negative:

$$\Delta(B) := \left\{ x \in \mathbb{R}^{|B|} : \sum_{i=1}^{|B|} x_i = 1, \; x_i \geq 0, \; i = 1, \ldots, |B| \right\} \tag{4}$$

A randomised algorithm $\mathcal{M} : A \to B$, with $B$ a discrete set, is defined as a mechanism which is associated with a mapping $M : A \to \Delta(B)$ such that, with input $a \in A$, the mechanism produces $\mathcal{M}(a) = b$ with probability $(M(a))_b = P(b \mid a)$, for each $b \in B$. The probability is taken over the randomness employed by the mechanism $\mathcal{M}$.

In general, databases are collections of records from a universe $\mathcal{X}$. It is convenient to express databases $x$ by their histogram $x \in \mathbb{N}^{|\mathcal{X}|}$, where each component $x_i$ stands for the number of elements in the database of type $i$ in $\mathcal{X}$. This interpretation naturally leads to the definition of a distance between databases: two databases $x, y$ are said to be $n$-neighbouring if they differ by $n$ entries, that is, $\|x - y\|_1 = n$, where $\|\cdot\|_1$ denotes the $\ell_1$ norm. In particular, if the databases only differ in a single data element ($n = 1$), the databases are simply addressed as neighbouring.

At this stage, DP can be formally introduced. A randomised algorithm (mechanism) $\mathcal{M}$ with domain $\mathbb{N}^{|\mathcal{X}|}$ preserves $\varepsilon$-DP for $\varepsilon > 0$ if for all neighbouring databases $x, y \in \mathbb{N}^{|\mathcal{X}|}$ and all $S \subseteq \mathrm{Range}(\mathcal{M})$ it holds that:

$$P[\mathcal{M}(x) \in S] \leq \exp(\varepsilon) \, P[\mathcal{M}(y) \in S] \tag{5}$$

If, on the other hand, for $0 < \delta < 1$ it holds that:

$$P[\mathcal{M}(x) \in S] \leq \exp(\varepsilon) \, P[\mathcal{M}(y) \in S] + \delta \tag{6}$$

then the mechanism possesses the weaker property of $(\varepsilon, \delta)$-DP. The probability is taken over the randomness employed by the mechanism $\mathcal{M}$.

In essence, Equation 5 tells us that for every run of the randomisation mechanism $\mathcal{M}(x)$, it is almost equally likely to observe the same output for every neighbouring database $y$, and that this probability is governed by $\varepsilon$. Equation 6 is weaker since it allows us to exceed $\varepsilon$ with probability $\delta$. In other words, DP specifies a “privacy budget” given by $\varepsilon$ and $\delta$. The way in which it is spent is given by the concept of privacy loss. We define the privacy loss incurred in observing the output $m$ when employing the randomised algorithm $\mathcal{M}$ on two neighbouring databases $x, y$:

$$\mathcal{L}^{(m)}_{\mathcal{M}(x) \| \mathcal{M}(y)} := \ln \left( \frac{P[\mathcal{M}(x) = m]}{P[\mathcal{M}(y) = m]} \right) \tag{7}$$

Since the privacy loss can be both positive and negative, we consider its absolute value in the following interpretation. The privacy loss allows us to reinterpret both $\varepsilon$ and $\delta$ in a more intuitive way:

• $\varepsilon$ limits the quantity of privacy loss permitted, that is, our privacy budget.
• $\delta$ is the probability of exceeding the privacy budget given by $\varepsilon$, so that we can ensure that with probability $1 - \delta$, the privacy loss will not be greater than $\varepsilon$.

DP is immune to post-processing, that is, if an algorithm protects an individual’s privacy, then there is not any way in which privacy loss can be increased. Stated in a more formal way: let $\mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R$ be a $(\varepsilon, \delta)$-differentially private mechanism and let $f : R \to R'$, then $f \circ \mathcal{M} : \mathbb{N}^{|\mathcal{X}|} \to R'$ is $(\varepsilon, \delta)$-differentially private.

2.4. Key elements of Differential Privacy

DP arose as the principal setting for preserving the privacy of sensitive data when delivering trained models to untrusted parties. The possibilities of DP are built upon the modular structure of its elements, which makes it possible to construct more sophisticated DP mechanisms, and to design, analyse and post-process DP mechanisms for a specific privacy-preserving learner [12, 19]. These necessary or key elements of DP are the DP mechanisms, the composition of DP mechanisms, and the subsampling techniques to increase the privacy. We subsequently detail them.

DP mechanisms. We describe the main privacy-preserving mechanisms as follows:

• Randomised response mechanism. It is aimed at evaluating the frequency of an embarrassing or illegal practice. When a participant is asked whether they engaged in the aforementioned activity in the past period of time, the following procedure is proposed:
  1. Flip a coin;
  2. If tails, respond truthfully;
  3. If heads, flip a second coin and respond “Yes” if heads, and “No” if tails.
  This approach provides privacy due to “plausible deniability”, since the response “Yes” may have been submitted when both coin flips turned out heads. By direct computation it can be shown that this is an $\varepsilon$-differentially private mechanism with $\varepsilon = \ln(3)$ [12, Section 3.2].
• Laplace mechanism [18]. It is usually employed for preserving privacy in numeric queries $f : \mathbb{N}^{|\mathcal{X}|} \to \mathbb{R}^k$, which map databases $x \in \mathbb{N}^{|\mathcal{X}|}$ to $k$ real numbers.
At this point, it is important to introduce a key parameter associated to the accuracy of such queries, namely the ` sensitivity: D f := max k f (x) f (y)k (8) jjxyjj =1 jXj Since the above definition must hold for every neighbouring x, y 2 N , it is also denoted as global sensitivity [19]. This parameter measures the maximum magnitude of change in the output of f associated to a single data element, thus, intuitively, it establishes the amount of uncertainty (i.e., noise) to be introduced in the output to preserve the privacy of a single individual. Moreover, we denote as Lap(b) the Laplace distribution with probability density jXj k function with scale b and centred at 0. Given any function f : N ! R , the Laplace mechanism can be defined as M (x, f (), e) := f (x) + (Y , . . . , Y ) (9) L 1 where the components Y are IID drawn from the distribution Lap(D f /e). In other words, each component of the output of f is perturbed by Laplace noise according to the sensitivity of the function D f . It can be shown that this is an e-differentially private mechanism with e = D f /b [12, Section 3.3]. • Exponential mechanism [20]. It is a general DP mechanism that has been proposed for situations in which adding noise directly to the output function (as for Laplace mechanism) would completely ruin the result. Thus the exponential mechanism constitutes the building component for queries with arbitrary utility, where the goal is to maximise the utility while preserving privacy. For a given arbitrary range R, jXj the utility function u : N R ! R maps database/output pairs to utility values. We introduce the sensitivity of the utility function as Du := max max ju(x, r) u(y, r)j (10) r2R jjxyjj 1 where the sensitivity of u with respect to the database is of importance, while it can be arbitrarily sensitive with respect to the range r 2 R. 
The exponential mechanism M (x, u,R) is defined as a randomised algorithm which picks as output an element of the range r 2 R with probability proportional to exp (eu(x, r)/(2Du)). When normalised, the mechanism details a probability density function over the possible responses r 2 R. Nevertheless, the resulting distribution can be rather complex and over an arbitrarily large domain, thus the implementation of such mechanism might not always be efficient [12]. It can be shown that this is a (2eDu)- differentially private mechanism [20]. 11 • Gaussian mechanism [12]. It is a DP mechanism that adds Gaussian noise to the output of a numeric query. It has two great advantages over the differentially pri- vate mechanisms stated previously: – Common source noise: the added Gaussian noise is the same as the one which naturally appears when dealing with a database. – Additive noise: the sum of two Gaussian distributions is a new Gaussian dis- tribution, therefore it is easier to statistically analyse this DP mechanism. Instead of scaling the noise to the ` sensitivity, as we previously did with the Lapla- cian mechanism, it is scaled to the ` sensitivity: D ( f ) := max k f (x) f (y)k (11) jjxyjj =1 Moreover, we denote as N(0, s ) the Gaussian distribution with probability density 2 jXj k function with mean 0 and variance s . Given any function f : N ! R , the Gaussian mechanism can be defined as: M (x, f (), e) := f (x) + (Y , . . . , Y ) (12) G 1 k where the components Y are IID drawn from the distribution N(0, s). However, it needs to satisfy the following restrictions to ensure it is a (e, d)- differentially private mechanism: for e 2 (0, 1) and variance s > 2 ln(1.25/d) (D ( f )/e) , the Gaussian mechanism is (e, d)-differentially private. To sum up, the main idea behind DP mechanisms is adding a certain amount of noise to the query output, while preserving the utility of the original data. 
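The additive-noise mechanisms and the randomised response procedure described above can be sketched in a few lines. This is a minimal illustration with NumPy, not the implementation of any of the frameworks discussed in this paper; the function names and their signatures are assumptions made for the example, and the caller is assumed to supply the correct sensitivity of the query.

```python
import numpy as np


def randomised_response(truth: bool, rng: np.random.Generator) -> bool:
    """Two-coin randomised response; epsilon = ln(3) for this procedure."""
    if rng.random() < 0.5:      # first coin: tails -> answer truthfully
        return truth
    return rng.random() < 0.5   # first coin: heads -> second coin decides Yes/No


def laplace_mechanism(value, sensitivity_l1, epsilon, rng):
    """Add IID Lap(sensitivity_l1 / epsilon) noise to each output component."""
    value = np.asarray(value, dtype=float)
    scale = sensitivity_l1 / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=value.shape)


def gaussian_sigma_bound(sensitivity_l2, epsilon, delta):
    """Threshold that sigma must exceed: sqrt(2 ln(1.25/delta)) * Delta_2 / epsilon."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the stated guarantee requires epsilon in (0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity_l2 / epsilon


def gaussian_mechanism(value, sensitivity_l2, epsilon, delta, rng):
    """Add IID N(0, sigma^2) noise with sigma just above the required bound."""
    value = np.asarray(value, dtype=float)
    bound = gaussian_sigma_bound(sensitivity_l2, epsilon, delta)
    sigma = np.nextafter(bound, np.inf)  # smallest float strictly above the bound
    return value + rng.normal(loc=0.0, scale=sigma, size=value.shape)


rng = np.random.default_rng(0)
counts = np.array([12.0, 7.0, 30.0])  # a histogram query, hypothetical data
noisy_laplace = laplace_mechanism(counts, sensitivity_l1=1.0, epsilon=0.5, rng=rng)
noisy_gauss = gaussian_mechanism(counts, sensitivity_l2=1.0, epsilon=0.5,
                                 delta=1e-5, rng=rng)
```

With a truthful answer of "Yes", the randomised response above returns "Yes" with probability 3/4 and, with a truthful "No", with probability 1/4; the ratio of these probabilities is 3, which is where $\epsilon = \ln 3$ comes from.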
Such noise is calibrated to the privacy parameters $(\epsilon, \delta)$ and the sensitivity of the query function.

Composition of DP mechanisms. An appealing property of DP is that more advanced private mechanisms can be devised by combining DP mechanisms, such as the general building components described in Section 2.4. The resulting mechanism still preserves DP, and the new values of $\epsilon$ and $\delta$ can be computed according to the composition theorems. Before the composition theorems are provided, we state an experiment with an adversary which defines a composition scenario for DP [12].

Composition experiment $b \in \{0, 1\}$ for adversary $A$ with a given set $\mathcal{M}$ of DP mechanisms. For $i = 1, \ldots, k$:
1. $A$ generates two neighbouring databases $x_i^0$ and $x_i^1$ and selects a mechanism $M_i$ from $\mathcal{M}$.
2. $A$ receives the output $y_i \in M_i(x_i^b)$.

In the experiment the adversary preserves its state between iterations, and we define $A$'s view of the experiment $b$ as $V^b = \{y_1, \ldots, y_k\}$. In order to ensure DP in these composition experiments we need to introduce a statistical distance which resembles the privacy loss (Equation 7). The $\delta$-Approximate Max Divergence between random variables $Y$ and $Z$ is defined as:

$$D_\infty^\delta(Y \| Z) = \max_{S:\, P[Y \in S] > \delta} \ln \frac{P[Y \in S] - \delta}{P[Z \in S]} \tag{13}$$

We say that the composition of a sequence of DP mechanisms under the Composition experiment is $(\epsilon, \delta)$-differentially private if $D_\infty^\delta(V^0 \| V^1) \le \epsilon$. Now we are ready to introduce the composition theorems:

• Basic composition theorem. The composition of a sequence $\{M_i\}_{i=1}^{k}$ of $(\epsilon_i, \delta_i)$-differentially private mechanisms under the Composition experiment with $\mathcal{M} = \{M_i\}_{i=1}^{k}$ is $\left(\sum_{i=1}^{k} \epsilon_i, \sum_{i=1}^{k} \delta_i\right)$-differentially private.

• Advanced composition theorem. For all $\epsilon, \delta, \delta' \ge 0$, the composition of a sequence $\{M_i\}_{i=1}^{k}$ of $(\epsilon, \delta)$-differentially private mechanisms under the Composition experiment with $\mathcal{M} = \{M_i\}_{i=1}^{k}$ satisfies $(\epsilon', \delta'')$-DP with:

  $$\epsilon' = \epsilon \sqrt{2k \ln(1/\delta')} + k\epsilon\,(e^{\epsilon} - 1) \quad \text{and} \quad \delta'' = k\delta + \delta' \tag{14}$$

More advanced versions of Equation 14 that allow the composition of private mechanisms with diverse $\epsilon$ and $\delta$ values and provide tighter bounds can be found in [21].

Privacy filters [22]. While composition theorems are quite useful, they require some parameters to be defined upfront, such as the number of mechanisms to be composed. Therefore, no intermediate result can be observed and the privacy budget can be wasted. Such situations require more fine-grained composition techniques which allow the result of each mechanism to be observed without compromising the privacy budget spent. In order to remove some of the stated constraints, a more flexible composition experiment is introduced [22]:

Adaptive composition experiment $b \in \{0, 1\}$ for adversary $A$. For $i = 1, \ldots, k$:
1. $A$ generates two neighbouring databases $x_i^0$ and $x_i^1$ and selects a mechanism $M_i$ that is $(\epsilon_i, \delta_i)$-differentially private.
2. $A$ receives the output $y_i \in M_i(x_i^b)$.

In these situations, the $\epsilon_i$ and $\delta_i$ of each mechanism are adaptively selected based on the outputs of previous iterations. For the adaptive composition experiment, the privacy loss of the adversary's view $V = \{y_1, \ldots, y_k\}$ for each pair of neighbouring databases $x, y$ is defined as follows:

$$L(V) = \ln \frac{\prod_{i=1}^{k} P[M_i(x) = y_i \mid V_i]}{\prod_{i=1}^{k} P[M_i(y) = y_i \mid V_i]} \tag{15}$$

where we write $V_i = \{y_1, \ldots, y_i\}$, that is, the adversary's view at the beginning of the $i$-th iteration of the adaptive composition experiment. In particular, if the adaptive composition experiment has only one iteration ($k = 1$), Equation 15 is the same as the definition of privacy loss (see Equation 7).

The function $\mathrm{COMP}_{\epsilon_g, \delta_g} : \mathbb{R}_{\ge 0}^{2k} \to \{\mathrm{HALT}, \mathrm{CONT}\}$ is a valid privacy filter for $\epsilon_g, \delta_g \ge 0$ if, for all adversaries in the adaptive composition experiment, the following "bad event" occurs with probability at most $\delta_g$ over the adversary's view $V$:

$$|L(V)| > \epsilon_g \quad \text{and} \quad \mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \mathrm{CONT} \tag{16}$$

A privacy filter can be used to guarantee that, with probability $1 - \delta_g$, the stated privacy budget $\epsilon_g$ is never exceeded. That is, for a fixed privacy budget $(\epsilon_g, \delta_g)$, the function $\mathrm{COMP}_{\epsilon_g, \delta_g}$ controls the composition. It returns HALT if the composition of the $k$ given DP mechanisms surpasses the privacy budget, otherwise it returns CONT. Privacy filters have composition theorems similar to the ones given above:

• Basic composition for privacy filters. For any $\epsilon_g, \delta_g \ge 0$, $\mathrm{COMP}_{\epsilon_g, \delta_g}$ is a valid privacy filter, where:

  $$\mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \begin{cases} \mathrm{HALT} & \text{if } \sum_{i=1}^{k} \delta_i > \delta_g \text{ or } \sum_{i=1}^{k} \epsilon_i > \epsilon_g, \\ \mathrm{CONT} & \text{otherwise} \end{cases}$$

• Advanced composition for privacy filters. We define $K$ as follows:

  $$K := \sum_{j=1}^{k} \epsilon_j \frac{\exp(\epsilon_j) - 1}{2} + \sqrt{\left(\sum_{i=1}^{k} \epsilon_i^2 + H\right)\left(2 + \ln\left(\frac{1}{H}\sum_{i=1}^{k} \epsilon_i^2 + 1\right)\right) \ln(2/\delta_g)}$$

  with $H = \dfrac{\epsilon_g^2}{28.04 \ln(1/\delta_g)}$. Then $\mathrm{COMP}_{\epsilon_g, \delta_g}$ is a valid privacy filter for $\delta_g \in (0, 1/e)$ and $\epsilon_g > 0$, where:

  $$\mathrm{COMP}_{\epsilon_g, \delta_g}(\epsilon_1, \delta_1, \ldots, \epsilon_k, \delta_k) = \begin{cases} \mathrm{HALT} & \text{if } \sum_{i=1}^{k} \delta_i > \delta_g / 2 \text{ or } K > \epsilon_g, \\ \mathrm{CONT} & \text{otherwise} \end{cases}$$

  The value of $K$ might look strange at first sight; however, if we assume $\epsilon_j = \epsilon$ for all $j$, it reduces to:

  $$K = \sqrt{\left(k\epsilon^2 + H\right)\left(2 + \ln\left(\frac{k\epsilon^2}{H} + 1\right)\right) \ln(2/\delta_g)} + k\epsilon\, \frac{\exp(\epsilon) - 1}{2}$$

  which is quite similar to Equation 14.

Increase privacy by subsampling. The privacy of a DP mechanism can be further improved if, instead of querying all the stored data, a random subsample is queried. That is, if an $(\epsilon, \delta)$-differentially private mechanism is used to query a random subsample from a database with $n$ records, then improved parameters $(\epsilon', \delta')$ can be provided according to the type of random subsample [23].
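Two of the bookkeeping rules in this part of the section are simple enough to state as code: the basic composition privacy filter defined above, and the amplification obtained when the mechanism is run on a subsample drawn without replacement, for which the sketch assumes the bound $\epsilon' = \ln\left(1 + \frac{m}{n}(e^{\epsilon} - 1)\right)$, $\delta' = \frac{m}{n}\delta$ given in Equation 17. The function names are hypothetical and the code is an illustration only, not the API of any framework.

```python
import math


def basic_privacy_filter(spent, eps_g, delta_g):
    """Basic composition filter: `spent` is a list of (eps_i, delta_i) pairs.

    Returns "HALT" when either accumulated sum exceeds the global budget,
    and "CONT" otherwise.
    """
    total_eps = sum(eps for eps, _ in spent)
    total_delta = sum(delta for _, delta in spent)
    if total_eps > eps_g or total_delta > delta_g:
        return "HALT"
    return "CONT"


def amplify_by_subsampling(epsilon, delta, m, n):
    """Parameters after querying m of n records sampled without replacement."""
    if not 0 < m < n:
        raise ValueError("the subsample size must satisfy 0 < m < n")
    ratio = m / n
    # ln(1 + ratio * (e^epsilon - 1)), written with log1p/expm1 for accuracy
    eps_prime = math.log1p(ratio * math.expm1(epsilon))
    delta_prime = ratio * delta
    return eps_prime, delta_prime


# A mechanism with (0.5, 1e-6)-DP applied to 10% of the records,
# checked against a global budget before each further access.
eps_round, delta_round = amplify_by_subsampling(0.5, 1e-6, m=100, n=1000)
history = [(eps_round, delta_round)] * 10
decision = basic_privacy_filter(history, eps_g=1.0, delta_g=1e-5)
```

Since $\ln(1 + q(e^{\epsilon} - 1)) < \epsilon$ for any sampling ratio $q < 1$, each subsampled access consumes less of the budget that the filter tracks than an access to the full database would.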
If the random subsample of size $m < n$ is drawn without replacement, then:

$$\epsilon' = \ln\left(1 + \frac{m}{n}\left(e^{\epsilon} - 1\right)\right) \quad \text{and} \quad \delta' = \frac{m}{n}\,\delta \tag{17}$$

This expression for $(\epsilon', \delta')$ is better than the original $(\epsilon, \delta)$ in the sense that it is smaller, and so is the privacy budget spent. The noise considered in this situation comes from a different source than the noise added by the DP mechanism itself. That is, the DP mechanism adds a certain quantity of noise, specified by the $(\epsilon, \delta)$ parameters, to a random subsample of the database, therefore the information extracted is influenced by the individuals contained in it. This random subsample is drawn each time the DP mechanism is used, which may yield slightly different results for the same query applied multiple times; that is, a new source of noise is added to the query.

In particular, the improvement is most noticeable when $\epsilon < 1$, which makes the Gaussian mechanism ideal, since to achieve $(\epsilon, \delta)$-DP it requires $\epsilon$ to be smaller than 1. That is, the Gaussian mechanism and the subsampling methods, when applied together, can ensure less noise and a smaller privacy budget expenditure at the cost of accessing only a small random subsample of the data.

This technique is particularly suited for FL, where the data does not come from all the clients in each iteration, but from a random sample of them. Moreover, it is well suited for programs in which the privacy parameters are hardcoded, so the privacy budget must be carefully spent.

3. Software tools: FL and DP frameworks analysis

The high demand for AI services at the edges which must preserve data privacy has pushed the release of several software tools or frameworks for FL and DP. In this section, we discuss the strengths and weaknesses of these software frameworks, compare them and highlight their main shortcomings. The discussion covers the state of development of the software tools until the end of May 2020.

3.1.
PySyft

PySyft (https://github.com/OpenMined/PySyft) is a Python library for secure and private deep learning. PySyft decouples private data from model training, using FL, DP, and Encrypted Computation (like Multi-Party Computation (MPC) and Homomorphic Encryption (HE)) within the main deep learning frameworks like PyTorch and TensorFlow.

Features. It is compatible with existing deep learning frameworks such as TensorFlow and PyTorch. Its low level FL implementation allows developing and debugging projects with complex communication networks in a local environment with almost no overhead. It is mainly focused on providing Secure MPC through HE; it thus allows computations to be applied on ciphertext, which is ideal for developing FL models while keeping the results of the computations private from the participants. Last, it offers many Python notebooks, which greatly soften the learning curve of this framework.

Shortcomings. Its low level FL support is missing some key features: it neither includes any dataset by default nor implements any model aggregation operators. Its low level implementation and the two drawbacks stated before make this framework quite complex to use, requiring considerable knowledge in this field to correctly assemble an FL model. While its webpage (https://www.openmined.org) advertises many DP mechanisms, they are nowhere to be found. As a matter of fact, its GitHub documentation states the following: "Do NOT use this code to protect data (private or otherwise) - at present it is very insecure. Come back in a couple of months".

Overview. We conclude that PySyft is a low level FL framework for advanced users which is compatible with many well-known deep learning frameworks, and that it provides neither DP mechanisms nor DP algorithms.

3.2. TensorFlow

TensorFlow implements DP and FL through its libraries TensorFlow Privacy (https://github.com/tensorflow/privacy) and TensorFlow Federated (https://www.tensorflow.org/federated), respectively.

Features. TensorFlow Privacy is a Python library for training machine learning models with privacy for training data. It integrates seamlessly with existing TensorFlow models and allows developers to train their models with DP techniques. In addition, it has many tutorials to quickly learn how to use it.

TensorFlow Federated is an open-source framework for machine learning and other computations on decentralised data. Like TensorFlow Privacy, it integrates easily with existing TensorFlow models. In addition, it has many well-known training datasets built in.

Shortcomings. TensorFlow Privacy only focuses on differentially private optimisers and it does not provide any DP mechanisms with which to implement your own differentially private optimisers. It does not officially support any other deep learning library and it is still not compatible with the latest TensorFlow 2.x. In addition, it is a "library under continual development" according to its GitHub documentation (https://github.com/tensorflow/privacy), thus it is not mature enough for production usage.

While TensorFlow Federated provides both low level and high level interfaces for FL settings and it has some high level interfaces to create aggregation operators, it does not provide any built-in aggregation operators. Last, it is not yet compatible with the latest TensorFlow 2.x.

Overview. These TensorFlow frameworks in conjunction allow us to develop FL models, but they are tied to the TensorFlow framework, which severely limits the portability of the generated model. They are neither compatible with the latest version of TensorFlow nor ready for final products. In addition, they lack DP mechanisms with which to implement new privacy-preserving algorithms.

3.3. FATE

FATE (https://fate.fedai.org/overview/) is an open-source project initiated by Webank's AI Department to provide a secure computing framework to support the federated AI ecosystem.

Features. It provides many interesting FL algorithms and it exposes a high level interface driven by custom scripts (https://fate.readthedocs.io/en/latest/examples/federatedml-1.x-examples/README.html).

Shortcomings. Its high level interface made of scripts relies too much on command line parameters and on a poorly documented domain specific language. It is unclear how to implement a low level FL model, which suggests that this framework is designed as a black box. Its modular architecture seems quite complex. Also, it does not feature any DP algorithm, and there are no signs of plans for implementing them.

Overview. This framework is mainly focused on FL, and one of its biggest weaknesses is that it does not implement any DP algorithm to improve its compliance with data protection regulation. Secure computation protocols ensure that data is not eavesdropped by an adversary, but they do not ensure that individuals' privacy, roughly speaking, is preserved. In addition, it is expected to be used through a high level interface which relies on a barely documented custom language.

3.4. LEAF

LEAF (https://leaf.cmu.edu/) is a benchmarking framework for learning in federated settings, with applications including FL, multi-task learning, meta-learning, and on-device learning.

Features. This framework mainly focuses on benchmarking FL settings. It provides some basic FL mechanisms such as the Federated Averaging Aggregator and, given its modular design, it can be adapted to work on any existing framework. Last, it has some well-known built-in datasets such as FEMNIST, Shakespeare and Celeba.

Shortcomings. It does not provide any benchmark for preserving privacy in an FL setting, even though privacy must be taken into consideration as it is a desired property of many FL settings. Moreover, it does not offer as much official documentation or as many tutorials as the other frameworks discussed in this section.

Overview. LEAF offers a baseline implementation for some basic FL methods but its main purpose is benchmarking FL settings. However, DP benchmarks are not provided, even though nowadays privacy is a concern in most FL settings.

3.5. PaddleFL

PaddleFL (https://paddlefl.readthedocs.io/en/latest) is an open source FL framework based on PaddlePaddle (https://github.com/paddlepaddle/paddle). PaddlePaddle is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools and components, as well as service platforms.

Features. PaddleFL provides a high level interface to develop FL models with DP. In the FL field it implements the Federated Averaging Aggregator and its secure multi-party computation equivalent. When it comes to DP, it provides an implementation of differentially private stochastic gradient descent.

Shortcomings. It lacks any other DP algorithm, so it is very difficult to develop alternative privacy-preserving techniques. Last, since it is based on PaddlePaddle it is not compatible with other frameworks, and its scarce documentation makes it really hard to use and understand.

Overview. PaddleFL provides a high level interface for some basic and well-known FL aggregators and implements a differentially private algorithm. Its main drawbacks are that it is poorly documented and that it does not implement any tool to easily extend its capabilities.

3.6. Frameworks analysis

The discussed software tools share some shortcomings for developing distributed AI services that preserve data privacy. Among them, we highlight the following:

1. They focus on FL or DP, but they do not provide a unified approach for both of them.
2. They lack DP mechanisms and related methods from the DP area. Likewise, they do not allow new DP mechanisms to be developed and integrated in the frameworks.
3. Only the most basic federated aggregation operators are implemented.
They are mainly focused on deep learning models, and they do not provide support for other machine learning algorithms that may also be used in the FL setting.

We summarise and compare the characteristics of the frameworks reviewed in Table 1. We conclude that a unified FL and DP framework is required, and this is the ambitious aim of Sherpa.ai FL, which we present in the following section.

Table 1: FL and DP features comparison among existing frameworks.
Columns: TensorFlow, PySyft, LEAF, PaddleFL.
Rows (FL & DP features):
  Federated Learning: Use federated models with different datasets; Support for other libraries; Sampling environment: IID or non-IID distribution; Federated aggregation mechanisms; Federated attack simulator.
  Differential Privacy: Mechanisms: Exponential, Laplacian, Gaussian; Sensitivity sampler; Subsampling methods to increase privacy; Adaptive Differential Privacy.
  Desired properties: Documentation & tutorials; High level API; Ability to extend the framework with new properties.
Legend: Complete, Partial, Do not work, Unknown.

4. Sherpa.ai Federated Learning Framework

We develop Sherpa.ai FL (https://developers.sherpa.ai/privacy-technology/, https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework), which is an open-research unified FL and DP framework that aims to foster the research and development of AI services at the edges and to preserve data privacy. We describe the hierarchical and modular software architecture of Sherpa.ai FL, which is related to the key elements of FL and DP, in Section 4.1. Likewise, we detail the functionalities and the implementation details of Sherpa.ai FL in Section 4.2 and Section 4.3.

4.1. Software architecture

The software is structured in several modules that encapsulate the specific functionality of each key element of an FL setting. The architecture of these software modules allows the framework to be extended as research on FL progresses. Figure 3 shows the backbone of the software architecture of Sherpa.ai FL, and we describe each module as follows:

• data base: it is in charge of reading the data according to the chosen database. It is related to the data key element.
• data distribution: it performs the federated distribution of data among the clients involved in the FL process. It is also related to the data key element and completes its functionality.
• private: it includes several interfaces, such as the node interface, which represents the clients key element, and others that allow the federated data distribution to be accessed and modified.
• learning approach: it represents the whole FL scheme, including the federated server model and the communication and coordination between the federated server and the clients. It encapsulates the federated server and the communication key elements.
• federated aggregator: it defines the software structure to develop federated aggregation operators. It is linked to the federated aggregation operator key element.
• model: it defines the learning model using predefined models and their functionalities. This learning model could be any machine learning model that can be aggregated through its representation in parameters. It is related to the model key element, as we associate a model object with the clients and the federated server.
• differential privacy: it preserves the DP of the clients by specifying the data access. It is related to the DP key elements, and also to the data, the clients and the communication FL key elements.

[Figure 3: Links between the different modules of Sherpa.ai FL. Module labels: Federated Government, Federated aggregator, Data distribution, Model, Differential privacy, Data base, Private.]

4.2. Software functionalities

In this section we highlight the main contributions of Sherpa.ai FL, which are summarised in a wide range of functionalities, namely:

• To define and customise an FL simulation with a fixed number of clients using classical data sets.
• To define the previous FL simulation using high-level functionalities.
• To train machine learning models among different clients. Currently, Sherpa.ai FL offers support for Keras models (neural networks), and for several models from Scikit-Learn (linear regression, k-means clustering, logistic regression).
• To aggregate the information learned from each of the clients into a global model using classical federated aggregation operators such as FedAvg, weighted FedAvg [24] and an aggregation operator for the adaptation of the k-means algorithm to the federated setting [25].
• To apply modifications on federated data such as normalisation or reshaping.
• To evaluate the FL approach in comparison with the classical centralised one.
• To preserve DP of clients' data and model parameters in the FL context. The platform currently offers support for the fundamental DP mechanisms (Randomised Response, Laplace, Exponential, Gaussian), and for the composition of DP mechanisms (Basic and Advanced adaptive composition using privacy filters for the maximum privacy loss). Moreover, it is possible to increase privacy by subsampling.

In Table 2, we summarise the main contributions of Sherpa.ai FL in comparison with the key points analysed for each framework in the previous section.

Thanks to the hierarchical implementation of each module, the aforementioned functionalities can be extended and customised just by adding software classes that inherit from the original software classes.
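The inheritance-based extension pattern, together with the FedAvg and weighted FedAvg operators listed above, can be illustrated with a self-contained sketch. The base class `Aggregator` and its `aggregate` method are hypothetical stand-ins written for this example; they are not the classes or signatures of Sherpa.ai FL, whose actual interfaces should be taken from the framework's documentation.

```python
from abc import ABC, abstractmethod

import numpy as np


class Aggregator(ABC):
    """Hypothetical base class standing in for an aggregation interface."""

    @abstractmethod
    def aggregate(self, client_params):
        """client_params: one list of parameter arrays (one per layer) per client."""


class FedAvg(Aggregator):
    """Plain federated averaging: layer-wise arithmetic mean over the clients."""

    def aggregate(self, client_params):
        return [np.mean(np.stack(layer), axis=0) for layer in zip(*client_params)]


class WeightedFedAvg(Aggregator):
    """Weighted federated averaging, e.g. with weights given by local data sizes."""

    def __init__(self, weights):
        weights = np.asarray(weights, dtype=float)
        self.weights = weights / weights.sum()  # normalise so the weights sum to 1

    def aggregate(self, client_params):
        # Contract the client axis of each stacked layer with the weight vector.
        return [np.tensordot(self.weights, np.stack(layer), axes=1)
                for layer in zip(*client_params)]


# Two clients, each holding a model with two parameter arrays (hypothetical values).
clients = [
    [np.array([1.0, 2.0]), np.array([[0.0]])],
    [np.array([3.0, 4.0]), np.array([[2.0]])],
]
plain = FedAvg().aggregate(clients)
weighted = WeightedFedAvg(weights=[1, 3]).aggregate(clients)
```

A new operator is obtained in this sketch by subclassing `Aggregator` and overriding `aggregate`, which mirrors the idea of extending the framework by inheritance without touching the existing classes.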
For example, the already available machine learning models, DP mechanisms and federated aggregation operators can be modified, or new ones can be created, simply by overriding the corresponding methods in the classes TrainableModel, DataAccessDefinition and FederatedAggregator, respectively.

Table 2: FL & DP features comparison between existing frameworks and Sherpa.ai FL.
Columns: TensorFlow, LEAF, PySyft, PaddleFL.
Rows (FL & DP features):
  Federated Learning: Use federated models with different datasets; Support for other libraries; Sampling environment: IID or non-IID distribution; Federated aggregation mechanisms; Federated attack simulator.
  Differential Privacy: Mechanisms: Exponential, Laplacian, Gaussian; Sensitivity sampler; Subsampling methods to increase privacy; Adaptive Differential Privacy.
  Desired properties: Documentation & tutorials; High level API; Ability to extend the framework with new properties.
Legend: Complete, Partial, Do not work, Unknown.

4.3. Implementation details

Sherpa.ai FL has been developed by DaSCI Institute (https://dasci.es/) and Sherpa.ai (https://sherpa.ai/). We developed the whole architecture in Python. Furthermore, the Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/) and scikit-learn (https://scikit-learn.org/stable/) APIs are employed for the machine learning part, which ensures efficiency and compatibility.

It can be run on computing devices such as CPUs or GPUs. In order to use GPUs, the adequate versions of TensorFlow and CUDA must be installed. For detailed installation instructions, please see the installation guide (https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/blob/master/install.md). The implementation details described in this paper correspond to release 0.1.0 of Sherpa.ai FL.

The framework is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0), a permissive license whose main conditions require preservation of copyright and license notices.

5. Machine learning matches federated learning. Methodological guidelines for preserving data privacy

FL is a paradigm of machine learning, but its particularities force us to adapt the machine learning settings to FL. For this reason, the Sherpa.ai FL functionalities outlined in Section 4 stem from the need to develop machine learning algorithms specialised for Federated Artificial Intelligence that protect clients' data privacy. However, the adaptation is not only focused on the algorithms, but also on the workflow of machine learning.

In this section, we first discuss the key aspects of distributed computing for machine learning in the federated setting. Next, Section 5.2 defines a rather specific paradigm for adapting machine learning models to the federated setting, followed by some notable exceptions to the defined adaptation in Section 5.3. Finally, we also define the adaptation of the machine learning workflow to FL in Section 5.4 as methodological guidelines for preserving data privacy with FL using Sherpa.ai FL.

5.1. Key aspects of distributed computing for Federated Machine Learning

The recent introduction of FL [26, 14, 7] responds to the need for novel distributed machine learning algorithms in a setting that clashes with several assumptions of conventional parallel machine learning in a data centre. The differences originate mainly from the unreliable and poor network connection of the clients, since clients are typically mobile phones. Thus, reducing the number of rounds of learning is essential, as communication constraints are more severe. Additionally, the data is unevenly scattered across $K$ clients and must be considered as non-IID, that is, the data accessible locally is not in any way representative of the overall trend. The data is in general sparse, where the features of interest occur on a reduced number of clients or data points. Ultimately, the total number of clients greatly exceeds the number of training points available locally on each client ($K \gg n/K$).

In the federated machine learning setting, the training is decoupled from the access to the raw data. In fact, the raw data never leaves users' mobile devices, and a high-accuracy model is produced in a central server by aggregating locally computed updates. At each FL round, an update vector $\theta \in \mathbb{R}^d$ is sent from each client to the central server to improve the global model, where $d$ is the number of parameters of the computed model. It is worth noting that the size of the update $\theta$ is thus independent of the amount of raw data available on the local client (e.g., $\theta$ might be a gradient vector). One of the advantages of this approach is the considerable bandwidth and time saved in data communication.

Another motivation for the FL setting (which also constitutes one of its intrinsic advantages) is the concern for privacy and security. By not transferring any raw data to the central server, the attack surface reduces to only the single client, instead of both client and server. On the other hand, the update $\theta$ sent by the client might still reveal some of its private information; however, this information will almost always be dramatically reduced with respect to the raw training data. Besides, after the current model has been improved with the update $\theta$, the update can (and should) be deleted.

Sherpa.ai FL allows for both IID and non-IID client data. It also allows for weighted aggregation, emphasising the contribution of the most significant clients to the global model.

Additional privacy can be provided by randomised algorithms providing DP [12], as detailed in Sections 2.3 and 2.4. In particular, the centralised algorithm could be equipped with a DP layer allowing the release of the global model without compromising the privacy of the individual clients who contributed to its training (see e.g., Abadi et al. [27]). On the other hand, in the case of a malicious or compromised server, or in the case of potential eavesdropping, DP can be applied on the local clients to protect their privacy [28, 29, 30].

5.2. The Machine Learning paradigm in a federated setting

In the following, we describe the federated machine learning paradigm by recognising some relevant attributes that ease the natural adaptation of an ML model to the federated setting.

Primarily, we observe that a great number of machine learning methods amount to the minimisation of an objective function with a finite-sum structure as in Equation 3. This problem structure encompasses linear and logistic regression, support vector machines, and also more elaborate techniques such as conditional random fields and neural networks [26]. Indeed, in neural networks predictions are made through a non-convex function, yet the resulting objective function can still be expressed as $F(\theta)$ and the gradients can be efficiently obtained by backpropagation, thus resembling Equation 3.

A variety of algorithms have been proposed to solve the minimisation problem in Equation 3 in the federated setting, where, as mentioned earlier, the primary constraint is communication efficiency, that is, reducing the number of FL rounds in the aggregation of local models. In this context, another characteristic trait of the federated ML paradigm is the intrinsic compatibility with baseline aggregation operators (e.g. Federated Averaging), for which no ad-hoc adaptation is required.

Ultimately, several of these FL algorithms have been supplied with DP [27, 29, 30].
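The paradigm described in this section, a finite-sum objective minimised locally and combined with a baseline aggregation operator, can be made concrete with a short simulation. The sketch below assumes logistic regression as the model, full-batch gradient steps on each client and size-weighted averaging on the server; the synthetic data, the hyperparameters and all function names are assumptions made for the example, the client data is IID for simplicity, and no DP noise is added.

```python
import numpy as np


def logistic_loss(theta, X, y):
    """Mean logistic loss: a finite sum over the local data points."""
    z = X @ theta
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def local_update(theta, X, y, lr=0.1, steps=5):
    """A few full-batch gradient steps on the client's own data."""
    theta = theta.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= lr * (X.T @ (p - y)) / len(y)
    return theta


def federated_round(theta, clients, lr=0.1, steps=5):
    """One FL round: local training, then averaging weighted by local data size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local = np.stack([local_update(theta, X, y, lr, steps) for X, y in clients])
    return (sizes / sizes.sum()) @ local


def global_loss(theta, clients):
    """Objective over the union of the clients' data (used only for monitoring)."""
    n = sum(len(y) for _, y in clients)
    return sum(len(y) * logistic_loss(theta, X, y) for X, y in clients) / n


# Synthetic clients; only model parameters are exchanged, never (X, y).
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
clients = []
for n_k in (50, 80, 120):
    X = rng.normal(size=(n_k, 2))
    y = (rng.random(n_k) < 1.0 / (1.0 + np.exp(-(X @ true_theta)))).astype(float)
    clients.append((X, y))

theta = np.zeros(2)
initial = global_loss(theta, clients)
for _ in range(30):
    theta = federated_round(theta, clients)
final = global_loss(theta, clients)
```

In this sketch the server only ever sees the vectors returned by `local_update`, whose size is fixed by the model dimension rather than by the amount of local data.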
We thus identify a rather important aspect of the federated machine learning paradigm: it lends itself to the straightforward application of the common building components of DP. In addition, this feature eases the task of estimating the privacy loss over the FL rounds through the application of composition theorems for DP.

To summarise, a machine learning method is amenable to adaptation to the federated setting if it adheres to the principles of the federated machine learning paradigm described above, namely: (1) a problem structure that amounts to the minimisation of an objective function as in Equation 3, (2) easy aggregation of the local models’ parameters, and (3) the direct applicability of DP techniques for additional privacy. Among such machine learning models we cite neural networks [31], and linear [32, 33] and logistic [34] regressions.

[Footnote: Sherpa.ai FL allows sophisticated and customised DP mechanisms to be applied to the model’s parameters, as well as to the clients’ raw data (see Section 2.4).]
[Footnote: Sherpa.ai FL offers support for the common building components of DP, as well as for its basic and advanced composition theorems using privacy filters (see Section 2.4).]

5.3. Models deviating from the federated machine learning paradigm

It is worth mentioning specific machine learning models whose structure only partially fits the federated machine learning paradigm described above. Although the problem structure can still be represented by the minimisation of an objective function as in Equation 3, their adaptation to a federated setting requires additional, ad-hoc procedures.

One example is the k-means clustering algorithm for unsupervised learning [35, 36], where the non-IID nature of the data distribution is seen as a major obstacle in a federated setting. Namely, the direct application of average aggregation is infeasible because the number and ordering of the local clusters may differ between clients, and more advanced algorithms need to be employed.
A workable solution is to fix the number of local clusters, and to apply an additional k-means clustering in the average aggregation. Alternatively, one might try grouping the clients whose data distributions are jointly trainable, as proposed by Sattler et al. [37] in the context of deep neural networks. An additional complication is the preservation of the clients’ privacy. For instance, the baseline DP building components need some adjustments before they can be applied. Although not in a FL context, Zhang et al. [38] adjust the Laplace noise added to each centroid based on the contour coefficients.

Another notable example is the federated version of matrix factorization-based ranking algorithms for recommendation systems [39, 40]. The peculiar architecture of this algorithm involves communicating only a portion of the update vector to improve the model, so each round of learning implies additional communications between the central server and the clients. Moreover, the multiple communication iterations necessitate further caution with the privacy loss in the DP context. A viable approach is to implement a two-stage randomised response mechanism on the local data, and to allow the clients to specify their privacy level [30].

5.4. Methodological guidelines for preserving data privacy with federated learning in Sherpa.ai FL

The experimental settings of FL and machine learning are very similar because FL is a machine learning paradigm. Nonetheless, the particularities of FL force us to revise the machine learning workflow and adapt it to the FL definition. In this section, we define the workflow of FL based on the machine learning workflow, and we present it as methodological guidelines. Sherpa.ai FL complies with these methodological guidelines, ensuring that good practices are followed in the development of AI services at the edges that preserve data privacy.
[Footnote: See also the notebooks on deep learning, linear and logistic regressions available in Sherpa.ai FL at https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/tree/master/notebooks]
[Footnote: See the notebook on k-means clustering in Sherpa.ai FL at https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/tree/master/notebooks]

We distinguish two scenarios in FL: (1) a real one, where we do not actually know the underlying distribution of the data, and (2) a simulation of a FL scenario, where it is possible to emulate a federated data distribution in order to analyse its use case. The guidelines focus on the real FL scenario, although we remark on the particularities of a simulated FL experiment. Moreover, we assume that the problem is properly formulated, that is, the data features and the target variable are defined and agreed upon by the clients beforehand. Based on this hypothesis, we show the scheme of the workflow of a FL experiment in Figure 4 and detail it in the following sections.

[Figure 4: Flow chart of a FL experiment. Stage labels: DATA, LEARNING MODEL, PREDICTION. Boxes: data collection, data preparation, learning model selection, aggregation operator selection, model training, model evaluation, hyper-parameter tuning, predictions making.]

5.4.1. Data collection

In a real FL scenario, the data naturally belongs to the clients. Therefore, the data collection takes place locally at each client, resulting in a distributed approach from the outset. In a strict FL scenario, the server has no knowledge at all of the data. However, the server may gain minor prior knowledge of the problem if a global validation or test dataset is used. Here, we assume that the server does not have any information, which is the most restrictive and common situation.

Remark: When simulating a FL scenario in scientific research, data collection is reduced to accessing a database. The distribution of the data among clients is simulated in the data preparation step.

5.4.2. Data preparation

Data preparation involves two tasks: (1) data partition, where we split the data into training, evaluation and test sets, and (2) data preprocessing, where we transform the training data in order to improve its quality.

Data partition. The process of splitting data into training, evaluation and test datasets in FL is similar to the centralised machine learning process, with the difference that the process is replicated for the data stored on each client. That is, each client dataset is split into training, evaluation and test sets.

Remark: In a FL scenario simulated for scientific research, it is feasible to have global evaluation and test datasets by extracting them before assigning the rest of the data to the clients as local training datasets. Moreover, in a simulation it could be good practice to use both global and local evaluation and test datasets, combining both methodologies.

Data preprocessing. Preprocessing is the trickiest task in FL due to the distributed and private character of the data. The challenge is to consistently preprocess datasets distributed over several clients without any clue about the underlying data distribution.

The process of adapting centralised preprocessing techniques to federated data is time-consuming. For the techniques based on statistics of the data distribution (e.g. normalisation), it is necessary to use a robust aggregation of the statistics, which is a challenge in some situations. Algorithms based on intervals (e.g. discretisation) require a global interval that includes all the possible values. Moreover, some methods, such as feature selection [41], are complicated to adapt robustly. Because of these intricacies, it is advisable to rely on preprocessing techniques adapted to distributed scenarios.

Regarding distributed data preprocessing, we might take inspiration from the different distributed preprocessing techniques that have already been developed [42].
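As an illustration of what aggregating statistics for normalisation can look like, the sketch below lets each client compute three summaries per feature (count, sum and sum of squares) and lets the server combine them into a global mean and standard deviation. This is a plain, non-robust aggregation written with NumPy only. The helper names `local_statistics` and `aggregate_statistics` are hypothetical and do not belong to Sherpa.ai FL, and exact sums may themselves leak information, so a real deployment would need additional protection such as DP.

```python
import numpy as np


def local_statistics(x):
    """Run on each client: per-feature count, sum and sum of squares."""
    x = np.asarray(x, dtype=float)
    return x.shape[0], x.sum(axis=0), (x ** 2).sum(axis=0)


def aggregate_statistics(stats):
    """Run on the server: combine the client summaries, never the raw data."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean ** 2            # population variance
    return mean, np.sqrt(np.maximum(var, 0.0))


# Two simulated clients with different sizes and distributions (toy data).
rng = np.random.default_rng(0)
clients = [rng.normal(size=(50, 3)), rng.normal(loc=2.0, size=(80, 3))]
mean, std = aggregate_statistics([local_statistics(c) for c in clients])

# Sanity check against the statistics of the pooled data, which the
# server would not have access to in a real FL scenario.
pooled = np.vstack(clients)
print(np.allclose(mean, pooled.mean(axis=0)), np.allclose(std, pooled.std(axis=0)))
```

Each client can then standardise its own data locally with the returned `mean` and `std`, so that all clients apply the same transformation.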
However, most of these methods need to be adapted in order to respect data privacy. A distributed model that suits privacy restrictions is MapReduce [43]. Therefore, big data preprocessing techniques [44] that are interactively applicable can be adapted to a FL scenario in compliance with data constraints.

Remark: When simulating FL, it is possible to use centralised preprocessing methods before splitting the data between the clients. It is not a recommended practice, but it is a useful trade-off in terms of experimentation.

5.4.3. Model selection

This step implies, besides the choice of the learning model as in any centralised approach, the choice of the parameter aggregation mechanism used in the server.

Choosing the learning model. This task consists of choosing the structure of the learning model stored both in the server and in the clients. Clearly, the model has to correspond to the type of problem being addressed. The only restriction is that the learning model has to be representable by parameters, so that the server learning model can be obtained by aggregating the local parameters. The canonical example of a learning model that can be represented by parameters is deep learning, but it is not the only one.

Remark: When we simulate a FL scenario, the server learning model can be initialised using previous global information. However, in a strict FL scenario it is initialised with the first aggregation of local parameters.

Choosing the aggregation operator. At this point we also need to choose the aggregation operator used to aggregate the clients’ parameters. There are different types of aggregation operators: (1) operators which aggregate the parameters of every client (such as FedAvg), (2) operators which select the clients that take part in the aggregation (e.g. based on their performance) and (3) asynchronous aggregation operators (such as CO-OP).

5.4.4. Model training

The iterative FL training process is divided into rounds of learning, and each round consists of:

1. training the local models on their local training datasets,
2. sharing the local parameters with the server,
3. aggregating the local models’ parameters on the server using the aggregation operator, and
4. updating the local models with the aggregated global model.

5.4.5. Model evaluation

The evaluation of a FL model consists of assessing the aggregated model, after assigning it to each client, using the local evaluation datasets. After that, each client shares its performance with the server, which combines the local performances into global evaluation metrics. Since the amount of data per client can vary, we recommend using absolute metrics on the clients (e.g. the confusion matrix) and combining them on the server to obtain the remaining evaluation metrics.

Remark 1: When simulating FL, we can use a global evaluation dataset in order to evaluate the performance of the aggregated model. Moreover, we can use cross-validation methodologies to evaluate the model’s performance by partitioning all the folds at the beginning and replicating the whole workflow for each of the fold combinations.

Remark 2: Although it is not the main purpose of FL, it might be worthwhile to evaluate the local models prior to the aggregation in order to measure the customisation of the local model to each client.

5.4.6. Hyper-parameter tuning

We base the tuning of the hyper-parameters of the learning models on the metrics obtained in the previous step, and modify certain learning model parameters in order to improve the performance on the evaluation datasets.

Remark: In line with the customisation mentioned previously, and although it is not the objective of FL, we could tune each of the local models independently according to its local evaluation performance before the aggregation in order to improve customisation.

5.4.7.
Predictions making

The last step in the machine learning workflow after the training of the learning model, and by extension in the corresponding FL workflow, is to predict the labels of unknown examples. Those predictions are made on the test set of each client.

Remark: When simulating FL, we can use a global test dataset for prediction. Moreover, we can test the local learning models prior to aggregation using instances from other clients (unknown targets) in order to measure the generalisation capability of the local models.

6. Illustrative cases of study

One of the main characteristics of Sherpa.ai FL is that it is developed upon the methodological guidelines for FL detailed in Section 5.4. In this section, we show how to follow these methodological guidelines with Sherpa.ai FL through two experimental use cases, namely:

1. Classification with FL (see Section 6.1): we show how to create each of the key elements of a FL experiment and combine them using our framework. To end this example, we also compare the FL approach with a centralised one.
2. Regression with FL and DP (see Section 6.2): we compare the centralised approach with the FL approach with DP. Moreover, we demonstrate how to limit the privacy loss through the Privacy Filters implemented in Sherpa.ai FL.

For more illustrative examples of the use of the framework, please see the notebook examples.

6.1. Classification with FL

In this section we provide a simple example of how to develop a classification experiment in a FL setting with Sherpa.ai FL. We use a popular dataset to start the experimentation in a federated environment, and we finish the example with a comparison between the federated and centralised approaches.

6.1.1. Case of study

In order to show the functionality of the software, we implement a simple and instructive case of study. We use the EMNIST Digits dataset [45]. It is an extended version of the classic MNIST dataset which includes the handwriting of several authors with different features.
This fact provides the non-IID character of the data, which is useful for the simulation of federated environments. Table 3 shows the size of the dataset. For the simulation of the FL scenario we use 5 clients among which the instances of the dataset are distributed following a non-IID distribution. We use as learning model a simple CNN (Convolutional Neural Network) based neural network represented in Figure 5, and as federated aggregation operator the widely used operator FedAvg. The code of the illustrative example is detailed in the following section.

[Footnote: https://github.com/sherpaai/Sherpa.ai-Federated-Learning-Framework/tree/master/notebooks]
[Footnote: https://www.nist.gov/itl/products-and-services/emnist-dataset]

[Figure 5: CNN-based neural network used as learning model in the illustrative example.]

Train set    Test set    Total
240 000      40 000      280 000

Table 3: Distribution of EMNIST Digits dataset.

6.1.2. Description of the code

We start the simulation with the first step of the methodological guidelines, i.e. the preprocessing of the data collection. Accordingly, we begin by loading the dataset. Sherpa.ai FL provides some functions to load the EMNIST Digits dataset.

[1]:
    import matplotlib.pyplot as plt
    import shfl
    from shfl.private.reproducibility import Reproducibility

    # Comment to turn off reproducibility:
    Reproducibility(1234)

    database = shfl.data_base.Emnist()
    train_data, train_labels, test_data, test_labels = database.load_data()

We can inspect some properties of the loaded data, for instance the size or the dimension of the data.

[2]:
    print(len(train_data))
    print(len(test_data))
    print(type(train_data[0]))
    train_data[0].shape

    <class 'numpy.ndarray'>

[2]: (28, 28)

As we see, our dataset is composed of a set of 28 by 28 matrices.
Before starting with the federated scenario, we can take a look at a sample of the training data.

[3]:
    plt.imshow(train_data[0])

[3]: <matplotlib.image.AxesImage at 0x105ea3450>

Now, we simulate a FL scenario with a set of 5 client nodes containing private data, and a central server which is responsible for coordinating the different clients. First of all, we simulate the data contained in every client with an IID distribution of the data.

[4]:
    iid_distribution = shfl.data_distribution.IidDataDistribution(database)
    federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5, percent=50)

As a result, we have created federated data from the EMNIST dataset with 5 nodes and using all the available data. Hence, the data collection process has finished. This data is a set of data nodes containing private data.

[5]:
    print(type(federated_data))
    print(federated_data.num_nodes())
    federated_data[0].private_data

    <class 'shfl.private.federated_operation.FederatedData'>
    Node private data, you can see the data for debug purposes but the data remains in the node
    <class 'dict'>
    {'112883278416': <shfl.private.data.LabeledData object at 0x1a486393d0>}

As we can see, private data in a node is not directly accessible, but the framework provides mechanisms to use this data in a machine learning model.

Once the data is prepared, the next step is the definition of the neural network architecture (model selection) used throughout the learning process. The framework provides a class to adapt a Keras (or TensorFlow) model to the framework, so you only have to create a function that will act as model builder.
[6]:
    import tensorflow as tf

    def model_builder():
        model = tf.keras.models.Sequential()
        model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
        model.add(tf.keras.layers.Dropout(0.4))
        model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
        model.add(tf.keras.layers.Dropout(0.3))
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(128, activation='relu'))
        model.add(tf.keras.layers.Dropout(0.1))
        model.add(tf.keras.layers.Dense(64, activation='relu'))
        model.add(tf.keras.layers.Dense(10, activation='softmax'))

        model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

        # Wrap the Keras model so that the framework can handle it
        return shfl.model.DeepLearningModel(model)

The following step is the definition of the federated aggregation operator in order to complete the model selection in FL. The framework provides some aggregation operators that we can use immediately, as well as the possibility of defining your own operator. In this case, we use the provided FedAvg operator.

[7]:
    aggregator = shfl.federated_aggregator.FedAvgAggregator()
    federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)

The framework also offers the possibility of transforming the data for the data preprocessing step, by defining federated operations using the FederatedTransformation interface. We first reshape the data and then normalise it using the training data mean and standard deviation (std) as normalisation parameters.
[8]:
    import numpy as np

    class Reshape(shfl.private.FederatedTransformation):

        def apply(self, labeled_data):
            # Add a trailing channel dimension: (n, 28, 28) -> (n, 28, 28, 1)
            labeled_data.data = np.reshape(labeled_data.data, (labeled_data.data.shape[0], labeled_data.data.shape[1], labeled_data.data.shape[2], 1))

    shfl.private.federated_operation.apply_federated_transformation(federated_data, Reshape())

[9]:
    import numpy as np

    class Normalize(shfl.private.FederatedTransformation):

        def __init__(self, mean, std):
            self.__mean = mean
            self.__std = std

        def apply(self, labeled_data):
            labeled_data.data = (labeled_data.data - self.__mean) / self.__std

    mean = np.mean(train_data.data)
    std = np.std(train_data.data)
    shfl.private.federated_operation.apply_federated_transformation(federated_data, Normalize(mean, std))

We are now ready to train the FL algorithm. We run 2 rounds of learning, showing the test accuracy and loss of each client and the test accuracy and loss of the global aggregated model.

[10]:
    test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2], 1))
    federated_government.run_rounds(2, test_data, test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a485e2450>: [15.087034225463867, 0.9314000010490417]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x106a0ffd0>: [21.040000915527344, 0.9094250202178955]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639e90>: [11.712089538574219, 0.9425749778747559]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a486396d0>: [10.11756420135498, 0.9498249888420105]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639c50>: [24.04242706298828, 0.8968499898910522]
    Global model test performance : [7.954472064971924, 0.9403749704360962]

    Accuracy round 1
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a485e2450>: [21.94520378112793, 0.9227499961853027]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x106a0ffd0>: [16.780630111694336, 0.9445000290870667]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639e90>: [13.413337707519531, 0.9463250041007996]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a486396d0>: [9.085938453674316, 0.9628000259399414]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a48639c50>: [20.926694869995117, 0.918524980545044]
    Global model test performance : [10.171743392944336, 0.958299994468689]

If we focus our attention on the test accuracy in each client, we realise that the results vary widely. This is because of the scattered nature of the data distribution, which causes disparity in the quality of the training data between clients.

6.1.3. Comparison with a centralised convolutional neural network approach

We analyse the behaviour of the FL approach in comparison with the equivalent centralised approach, which means training the neural network represented in Figure 5 on the same data using centralised learning. For this experiment, we use 25 clients and 10 rounds of learning with 5 epochs in both the IID and non-IID scenarios, where, in the non-IID scenario, the nodes’ data contain only a portion of all labels. For a fair comparison, in the classical approach we train for $\text{epochs}_{FL} \times \text{rounds}_{FL}$ epochs.

                        IID       non-IID
Centralised approach    0.9904    0.9901
Federated approach      0.9921    0.9855

Table 4: Accuracy of the FL and the classical approach, in both IID and non-IID scenarios.
In the FL case, the data is distributed over 25 clients, and 10 FL rounds of learning with 5 epochs per client are employed.

The high performance of the federated approach stands out in Table 4, where the accuracy for the considered scenarios is reported. In the IID scenario, it beats the results of the centralised approach, which shows the robustness that results from combining the information learned by each client. In the non-IID scenario, the federated approach attains lower results than the centralised one due to the additional challenge of the non-homogeneous distribution of data across clients. However, the results are very competitive, highlighting the strength of the federated approach.

[Footnote: Running more learning rounds results in better performance, as in the next section. The purpose of this example is to show how it works.]
[Footnote: The performance of the centralised approach using non-IID data is not perfectly identical to the IID case due to the random sampling employed when generating the non-IID nodes’ data.]

6.2. Linear regression with DP

This section presents a linear regression FL simulation with DP following the methodological guidelines with Sherpa.ai FL. The Laplace mechanism is used, with the model’s sensitivity estimated by a sampling procedure [19]. Moreover, we demonstrate the application of the advanced composition theorem for DP in order not to exceed the maximum privacy loss allowed (see Section 2.4).

6.2.1. Case of study

We will use the California Housing dataset, which consists of approximately 20 000 samples of median house prices in California. Although the dataset has eight features, in this example we will only make use of the first two, in order to reduce the variance in the prediction. The (single) target is the cost of the house.
As can be observed in the code below, we retain 2 000 samples for later use in the sensitivity sampling for DP, and the rest of the data is split into train and test sets as detailed in Table 5.

Train set    Test set    Total
14 912       3 728       18 640

Table 5: Distribution of the California Housing dataset.

For the FL simulation we use 5 clients among which the train dataset is distributed in an IID fashion. FedAvg is chosen as the federated aggregation operator. The code of the example is detailed in the following section.

6.2.2. Description of the code

Sherpa.ai FL allows a generic dataset to be easily converted to interact with the platform:

    import shfl
    from shfl.data_base.data_base import LabeledDatabase
    import sklearn.datasets
    import numpy as np
    from shfl.private.reproducibility import Reproducibility

    # Comment to turn off reproducibility:
    Reproducibility(1234)

    all_data = sklearn.datasets.fetch_california_housing()
    n_features = 2
    data = all_data["data"][:, 0:n_features]
    labels = all_data["target"]

    # Retain part for DP sensitivity sampling:
    size = 2000
    sampling_data = data[-size:, ]
    sampling_labels = labels[-size:, ]

    # Create database:
    database = LabeledDatabase(data[0:-size, ], labels[0:-size])
    train_data, train_labels, test_data, test_labels = database.load_data()

[Footnote: Sherpa.ai FL offers support for the linear regression model from scikit-learn https://scikit-learn.org/stable/index.html]
[Footnote: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html]

We will simulate a FL scenario by distributing the train data over a collection of clients, assuming an IID setting:

    iid_distribution = shfl.data_distribution.IidDataDistribution(database)
    federated_data, test_data, test_labels = iid_distribution.get_federated_data(num_nodes=5)

At this stage, we need to define the linear regression model, and we choose the aggregation operator to be the average of the clients’ models:

    from shfl.model.linear_regression_model import LinearRegressionModel

    def model_builder():
        model = LinearRegressionModel(n_features=n_features)
        return model

    aggregator = shfl.federated_aggregator.FedAvgAggregator()

6.2.3. Running the model in a Federated configuration

We are now ready to run the FL model. Note that in this case we set the number of rounds n=1, since no iterations are needed in the case of linear regression. The performance metrics used are the Root Mean Squared Error (RMSE) and the $R^2$ score. It can be observed that the performance of the global model (i.e. the aggregated model) is in general superior to the performance of each node, thus the federated learning approach proves to be beneficial:

    federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)
    federated_government.run_rounds(n=1, test_data=test_data, test_label=test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a08d0>: [0.8161535463006577, 0.5010049851923566]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0ac8>: [0.81637303674763, 0.5007365568636023]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a09b0>: [0.8155342443231007, 0.5017619784187599]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0be0>: [0.8158502097728687, 0.5013758352304256]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0cf8>: [0.8151607067608612, 0.5022182878756591]
    Global model test performance : [0.8154147770321544, 0.5019079411109164]

6.2.4.
Differential Privacy: sampling the model’s sensitivity

In the case of applying the Laplace privacy mechanism (see Section 2.4), the noise added has to be of the order of the sensitivity of the model’s output, i.e. the model parameters of our linear regression. In the general case, the model’s sensitivity might be difficult to compute analytically. An alternative approach is to attain random differential privacy through a sampling over the data [19]. That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimate of it by sampling over the dataset. This approach is convenient since it allows the sensitivity of an arbitrary model or of a black-box computer function to be estimated. The Sherpa.ai FL framework provides this functionality in the class SensitivitySampler.

In order to carry out this approach, we need to specify a distribution of the data to sample from. This in general requires previous knowledge and/or model assumptions. In order not to make any specific assumption on the distribution of the dataset, we can choose a uniform distribution. To this end, we define our own class of ProbabilityDistribution that samples uniformly over a data-frame. We use the previously retained part of the dataset for sampling:

    class UniformDistribution(shfl.differential_privacy.ProbabilityDistribution):
        """
        Implement Uniform sampling over the data
        """
        def __init__(self, sample_data):
            self._sample_data = sample_data

        def sample(self, sample_size):
            row_indices = np.random.randint(low=0, high=self._sample_data.shape[0], size=sample_size, dtype='l')
            return self._sample_data[row_indices, :]

    sample_data = np.hstack((sampling_data, sampling_labels.reshape(-1, 1)))

The class SensitivitySampler implements the sampling given a query, i.e. the learning model itself in this case. We only need to add the method get to our model, since it is required by the class SensitivitySampler.
We choose the sensitivity norm to be the $\ell_1$ norm and we apply the sampling. The value of the sensitivity depends on the number of samples n: the more samples we perform, the more accurate the sensitivity estimate. Indeed, as the number of samples n increases, the sensitivity becomes more accurate and typically decreases.

    from shfl.differential_privacy import SensitivitySampler
    from shfl.differential_privacy import L1SensitivityNorm

    class LinearRegressionSample(LinearRegressionModel):

        def get(self, data_array):
            # The last column holds the labels, the remaining ones the features
            data = data_array[:, 0:-1]
            labels = data_array[:, -1]
            train_model = self.train(data, labels)
            return self.get_model_params()

    distribution = UniformDistribution(sample_data)
    sampler = SensitivitySampler()
    n_samples = 4000
    max_sensitivity, mean_sensitivity = sampler.sample_sensitivity(
        LinearRegressionSample(n_features=n_features, n_targets=1),
        L1SensitivityNorm(), distribution, n=n_samples, gamma=0.05)
    print("Max sensitivity from sampling: " + str(max_sensitivity))
    print("Mean sensitivity from sampling: " + str(mean_sensitivity))

    Max sensitivity from sampling: 0.008294354064053988
    Mean sensitivity from sampling: 0.0006633612087443363

Unfortunately, sampling over a dataset involves training the model on two datasets differing in one entry [19]. Thus, in general this procedure might be computationally expensive (e.g. in the case of training a deep neural network).

6.2.5. Running the model in a Federated configuration with Differential Privacy

At this stage we are ready to add a layer of DP to our federated learning model. Specifically, we will apply the Laplace mechanism from Section 2.4, employing the sensitivity obtained from the previous sampling, namely $\Delta f \approx 0.008$. The Laplace mechanism provided by the Sherpa.ai FL framework is then assigned as the private access type to the model’s parameters of each client in a new FederatedGovernment object. This results in an $\epsilon$-differentially private FL model.
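To illustrate what the Laplace mechanism does to a parameter vector, the following sketch adds Laplace noise with scale $\Delta f / \epsilon$ to each parameter using NumPy. It is a generic illustration and not the code of the framework's LaplaceMechanism class: the function name `laplace_mechanism` and the parameter values are hypothetical, and the sensitivity is simply rounded from the sampled value reported above.

```python
import numpy as np


def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Return `values` perturbed with Laplace noise of scale sensitivity/epsilon."""
    values = np.asarray(values, dtype=float)
    scale = sensitivity / epsilon          # larger epsilon means less noise
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)


rng = np.random.default_rng(1234)
params = np.array([0.42, -0.03, 0.45])     # hypothetical model parameters
noisy = laplace_mechanism(params, sensitivity=0.008, epsilon=0.5, rng=rng)
print(noisy - params)                      # noise with scale 0.008 / 0.5 = 0.016
```

With this parametrisation, halving $\epsilon$ doubles the noise scale, which is the privacy versus accuracy trade-off that the experiments in this section explore.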
For example, picking the value ε = 0.5, we can run the FL experiment with DP:

    from shfl.differential_privacy import LaplaceMechanism

    params_access_definition = LaplaceMechanism(sensitivity=max_sensitivity, epsilon=0.5)
    federated_governmentDP = shfl.federated_government.FederatedGovernment(
        model_builder, federated_data, aggregator,
        model_params_access=params_access_definition)
    federated_governmentDP.run_rounds(n=1, test_data=test_data, test_label=test_labels)

    Accuracy round 0
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a08d0>: [0.8161535463006577, 0.5010049851923566]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0ac8>: [0.81637303674763, 0.5007365568636023]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a09b0>: [0.8155342443231007, 0.5017619784187599]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0be0>: [0.8158502097728687, 0.5013758352304256]
    Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x7f63606a0cf8>: [0.8151607067608612, 0.5022182878756591]
    Global model test performance : [0.8309024800913748, 0.48280707735516126]

In the above example we observe that the performance of the model has slightly deteriorated due to the addition of DP. In general, privacy increases at the expense of accuracy (i.e. for smaller values of ε).

6.2.6. Comparison with centralised and non private approaches

It is of practical interest to assess the performance loss due to DP in the FL context. Table 6 reports the performance metrics for the centralised model, for the non-private FL model, and for the differentially private FL model. In all cases, the models are trained on the training set and the performance results are computed over the test set.
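The DP results in this comparison are averaged over as many runs as a fixed total privacy budget allows (ε = 4 in this example). One simple way to picture this, assuming plain sequential composition in which every run at privacy level ε consumes ε of the budget, is to count how many runs fit. This is only a sketch: the privacy filters of Section 2.4 may account for the budget differently, and the function name is hypothetical.

```python
from fractions import Fraction


def runs_within_budget(epsilon_per_run, total_budget):
    """Number of runs that fit in the budget if each run costs epsilon_per_run."""
    # Exact rational arithmetic avoids floating-point rounding in the division.
    epsilon = Fraction(str(epsilon_per_run))
    budget = Fraction(str(total_budget))
    if epsilon <= 0 or budget < 0:
        raise ValueError("epsilon_per_run must be > 0 and total_budget >= 0")
    return int(budget // epsilon)


for epsilon in (0.2, 0.5, 0.8):
    print(epsilon, runs_within_budget(epsilon, total_budget=4))
```

Under this assumption, a total budget of 4 allows 20, 8 and 5 runs for ε = 0.2, 0.5 and 0.8 respectively, which coincides with the number of runs averaged in Table 6.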
The data is IID over 5 clients in the FL cases. For the federated DP cases, different values ε ∈ {0.2, 0.5, 0.8} are used and the total privacy budget is limited to ε = 4. Thus, for each case, we take the average over all the runs performed before the budget is spent (see the discussion about advanced composition theorems for privacy filters in Section 2.4). The sensitivity is fixed, employing the value obtained from the sampling above. Note that, when applying the composition theorems for privacy filters in the present example, we are assuming that the estimated sensitivity is a good enough approximation of the analytic sensitivity [22].

    Approach                                      RMSE      R²
    Classical                                     0.81540   0.50192
    Federated non-private                         0.81541   0.50190
    Federated DP (ε = 0.2, average of 20 runs)    1.05541   0.04224
    Federated DP (ε = 0.5, average of 8 runs)     0.84501   0.46457
    Federated DP (ε = 0.8, average of 5 runs)     0.82171   0.49414

Table 6: Federated linear regression: comparison between the classical centralised model, the non-private FL model, and the FL model with a DP layer using the Laplace mechanism. For the DP cases, the results are the average over the total runs allowed for the maximum privacy budget ε = 4. Different values ε ∈ {0.2, 0.5, 0.8} are considered, and the sensitivity is fixed. The data is IID over 5 clients.

It can be observed that the centralised model and the non-private FL model exhibit comparable performance, thus the accuracy is not degraded by applying an FL approach. The application of the Laplace mechanism guarantees ε-DP, and the accuracy of the FL model can be tuned through the value of ε: for higher values, lower privacy is guaranteed, but the accuracy increases.

7. Concluding remarks

The characteristics of FL and DP make them good candidates to support AI services at the edges and to preserve data privacy. Hence, several software tools for FL and DP have been released. After a comparative analysis, we conclude that these software tools do not provide unified support for FL and DP, and that they do not follow any particular methodological guidelines that direct the development of AI services that preserve data privacy.

Since FL is a machine learning paradigm, we have studied how to adapt machine learning principles to FL, and consequently we have also defined the workflow of an experimental FL setting. The main result of that study is Sherpa.ai FL, a new software framework with unified support for FL and DP that allows the defined methodological guidelines for FL to be followed. The combination of the methodological guidelines and Sherpa.ai FL is shown by means of a classification and a regression use case. These illustrative examples also show that the centralised and federated settings of the same experiments achieve similar results, which means that the joint use of FL and DP can support the development of AI services at the edges that preserve data privacy.

Sherpa.ai FL is in continuous development. Since the FL and DP fields are constantly growing, we plan to extend the framework's functionalities with new federated aggregation operators, machine learning models, and data distributions. Moreover, new DP mechanisms such as RAPPOR [46] will be added, together with relaxations of DP such as Concentrated DP [47] or Rényi DP [48].

Acknowledgments

This research work is partially supported by the contract OTRI-4137 with SHERPA Europe S.L. and the Spanish Government project TIN2017-89517-P. Nuria Rodríguez Barroso and Eugenio Martínez Cámara were supported by the Spanish Government fellowship programmes Formación de Profesorado Universitario (FPU18/04475) and Juan de la Cierva Incorporación (IJC2018-036092-I), respectively.

References

[1] E. Parliament, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016).
[2] E. H.-L. E. G. on AI, The ethics guidelines for trustworthy artificial intelligence (AI) (2019).
[3] A. Jalalirad, M. Scavuzzo, C. Capota, M. R. Sprague, A simple and efficient federated recommender system, 2019, pp. 53–58.
[4] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, W. Shi, Federated learning of predictive models from federated electronic health records, International Journal of Medical Informatics 112 (2018) 59–67.
[5] D. Kawa, S. Punyani, P. Nayak, A. Karker1, V. Jyotinagar, Credit risk assessment from combined bank records using federated learning, International Research Journal of Engineering and Technology 6 (2019) 1355–1358.
[6] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, C. S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in: IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 2019, pp. 1387–1395.
[7] J. Konečný, H. McMahan, F. Yu, P. Richtarik, A. Suresh, D. Bacon, Federated learning: Strategies for improving communication efficiency, in: NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[8] P. Kairouz, H. B. M. et al., Advances and open problems in federated learning (2019).
[9] A. Bhagoji, S. Chakraborty, P. Mittal, S. Calo, Analyzing federated learning through an adversarial lens, in: Proceedings of the 36th International Conference on Machine Learning, volume 97, 2019, pp. 634–643.
[10] M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.
[11] K. Chaudhuri, N. Mishra, When random sampling preserves privacy, in: Annual International Cryptology Conference, 2006, pp. 198–213.
[12] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends® in Theoretical Computer Science 9 (2014) 211–407.
[13] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, H. Yu, Federated Learning, volume 13, Morgan & Claypool, 2019.
[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, 2017, pp. 1273–1282.
[15] Y. Wang, CO-OP: Cooperative Machine Learning from Mobile Devices, Ph.D. thesis, University of Alberta, 2017.
[16] M. Fredrikson, S. Jha, T. Ristenpart, Model inversion attacks that exploit confidence information and basic countermeasures, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Association for Computing Machinery, 2015, pp. 1322–1333.
[17] H. B. McMahan, D. Ramage, K. Talwar, L. Zhang, Learning differentially private recurrent language models, in: 6th International Conference on Learning Representations, 2018.
[18] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of Cryptography, 2006, pp. 265–284.
[19] B. I. P. Rubinstein, F. Aldà, Pain-free random differential privacy with sensitivity sampling, in: Proceedings of the 34th International Conference on Machine Learning, volume 70, 2017, pp. 2950–2959.
[20] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: 48th Annual IEEE Symposium on Foundations of Computer Science, 2007, pp. 94–103.
[21] P. Kairouz, S. Oh, P. Viswanath, The composition theorem for differential privacy, IEEE Transactions on Information Theory 63 (2017) 4037–4049.
[22] R. M. Rogers, A. Roth, J. Ullman, S. Vadhan, Privacy odometers and filters: Pay-as-you-go composition, in: Advances in Neural Information Processing Systems, 2016, pp. 1921–1929.
[23] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, Advances in Neural Information Processing Systems (2018) 6277–6287.
[24] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, Federated learning of deep networks using model averaging (2016).
[25] A. Soliman, S. Girdzijauskas, M.-R. Bouguelia, S. Pashami, S. Nowaczyk, Decentralized and adaptive k-means clustering for non-IID data using HyperLogLog counters, 2020, pp. 343–355.
[26] J. Konečný, H. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence (2016).
[27] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016) 308–318.
[28] M. J. Wainwright, M. I. Jordan, J. C. Duchi, Privacy aware learning, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, 2012, pp. 1430–1438.
[29] R. C. Geyer, T. J. Klein, M. Nabi, Differentially private federated learning: A client level perspective (2019).
[30] J.-Y. Jiang, C.-T. Li, S.-D. Lin, Towards a more reliable privacy-preserving recommender system, Information Sciences 482 (2019) 248–265.
[31] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. M. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, J. Roselander, Towards federated learning at scale: System design, in: SysML 2019.
[32] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, D. Evans, Secure linear regression on vertically partitioned datasets, IACR Cryptol 2016 (2016).
[33] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, D. Evans, Privacy-preserving distributed linear regression on high-dimensional data, Proceedings on Privacy Enhancing Technologies (2017) 345–364.
[34] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, B. Thorne, Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption (2017).
[35] S. P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28 (1982) 129–137.
[36] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, New York, NY, 2013.
[37] F. Sattler, K.-R. Müller, W. Samek, Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints (2019).
[38] Y. Zhang, N. Liu, S. Wang, A differential privacy protecting k-means clustering algorithm based on contour coefficients, PLoS ONE 13 (2018) 1–15.
[39] S. Funk, Netflix update: Try this at home (2006). Accessed: 2020-05-09.
[40] M. Ammad-ud-din, E. Ivannikova, S. A. Khan, W. Oyomno, Q. Fu, K. E. Tan, A. Flanagan, Federated collaborative filtering for privacy-preserving personalized recommendation system (2019).
[41] L. Morán-Fernández, V. Bolón-Canedo, A. Alonso-Betanzos, Centralized vs. distributed feature selection methods based on data complexity measures, Knowledge-Based Systems 117 (2017) 27–45.
[42] U. S., N. Malaiyappan, Approaches and techniques of distributed data mining: A comprehensive study, International Journal of Engineering and Technology 9 (2017) 63–76.
[43] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, in: OSDI'04: Sixth Symposium on Operating System Design and Implementation, 2004, pp. 137–150.
[44] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, Big Data Preprocessing: Enabling Smart Data, Springer, 2020.
[45] G. Cohen, S. Afshar, J. Tapson, A. van Schaik, EMNIST: Extending MNIST to handwritten letters, in: 2017 International Joint Conference on Neural Networks, 2017, pp. 2921–2926.
[46] Úlfar Erlingsson, V. Pihur, A. Korolova, RAPPOR: Randomized aggregatable privacy-preserving ordinal response, in: Proceedings of the 21st ACM Conference on Computer and Communications Security, Scottsdale, Arizona, 2014.
[47] C. Dwork, G. N. Rothblum, Concentrated differential privacy (2016).
[48] I. Mironov, Rényi differential privacy, 2017 IEEE 30th Computer Security Foundations Symposium (CSF) (2017).

Journal

Statistics, arXiv (Cornell University)

Published: Jul 2, 2020
