Compressing weight updates by nearly 2000x at every communication round using Autoencoders, to reduce communication overhead in Federated Learning
Let's start with: what is Federated Learning (FL)?
FL is machine learning in a distributed-device paradigm: instead of bringing the data to the computation, the computation is brought to the data. This addresses the age-old concern about privacy, because the data never leaves your device! A server (like Google's or Microsoft's) sends a global model to all participating devices (also called collaborators or edges). Each device trains this global model locally for a few rounds on its own data, and then communicates the weight updates back to the server. The updates from all edges are then aggregated (a form of averaging), and the communication rounds proceed until training ends for each device. The final aggregated weights are assumed to form a model that is more robust and, more often than not, works as a "one size fits all" (the data distribution, or "heterogeneity", determines the extent of this, but don't worry if you aren't familiar with it; you get the general overview). Finally, this model is communicated back to all devices. A popular example is the Google Keyboard with its autosuggestions.
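To make that flow concrete, here is a minimal sketch of one communication round with simple weight averaging (FedAvg-style). The `clients` objects and their `local_train` method are hypothetical stand-ins; real systems add client sampling, weighting by dataset size, and more.

```python
import numpy as np

def federated_round(global_weights, clients):
    """One communication round: every client trains locally, the server averages the results."""
    client_updates = []
    for client in clients:
        # Hypothetical call: each client trains the global model on its own local data
        local_weights = client.local_train(global_weights)  # returns a flat weight vector
        client_updates.append(local_weights)
    # Aggregation: element-wise average of all client weight vectors
    return np.mean(client_updates, axis=0)
```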
"Federated Learning brings the SERVER to YOU, instead of taking your data to the server"
Communication Optimization
Now that you have a picture of what FL is, you may have spotted an obvious drawback in this approach: the communication of models and weights to and from the server and clients consumes significant network bandwidth. There has been considerable work over the years on optimizing this communication, aimed at reducing the overall cost of the setup.
Autoencoders
Autoencoders (AE) are artificial neural networks that can be used for compression or encoding of input data, by virtue of their funnel-shaped architecture. They are popular in the image compression space, and you can check this out for a detailed introduction to AEs. Long story short, an AE architecture is such that the intermediate layers are smaller (have fewer neurons) than the equally sized input and output layers. The network is trained to replicate its inputs at the output stage, via these smaller intermediate representations. The part of the network from the input to the compressed intermediate stage is the encoder, and the remainder is the decoder.
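To make the funnel shape concrete, here is a minimal PyTorch sketch of such an AE for flattened weight vectors. The layer sizes are purely illustrative (though 100,000 in and 50 out happens to give the ~2000x ratio discussed here), not the ones used in this project.

```python
import torch.nn as nn

class WeightAutoencoder(nn.Module):
    """Funnel-shaped AE: input and output have the same size, the middle is much smaller."""
    def __init__(self, input_dim=100_000, latent_dim=50):
        super().__init__()
        # Encoder: squeezes the flattened weight vector into a small latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )
        # Decoder: reconstructs the original-sized weight vector from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```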
AE in the context of FL
The AE defined earlier is used in the context of FL to compress the weight updates at the end of every communication round. The weight parameters are not isolated quantities; there are certain non-linear relations between them, and the AE aims to find these hidden patterns in order to (lossily) compress the input by removing redundant dimensions. To learn this lower-dimensional representation, we need data on the weights of the client model. To make this data available, we propose running a pre-pass round prior to the Federated Learning. The pre-pass round begins with the server communicating a global model to all clients. Next, each client trains this model locally on its own data (no FL here; it is traditional ML at this stage). The weights at the end of every batch are stored in a dataset, to capture how the weights change over the course of training. The AE is then trained on this dataset, and the encoder part of the network stays on the client. The decoder is communicated back to the server, and this marks the end of the pre-pass round.
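Here is a minimal sketch of the client-side pre-pass, assuming the `WeightAutoencoder` sketch above and a standard PyTorch training loop (the model, dataloader, loss function, and optimizer are placeholders):

```python
import torch

def collect_weight_snapshots(model, dataloader, loss_fn, optimizer):
    """Pre-pass: train locally and record the flattened weights after every batch."""
    snapshots = []
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Flatten all parameters into one vector; each snapshot is a training sample for the AE
        flat = torch.cat([p.detach().flatten() for p in model.parameters()])
        snapshots.append(flat)
    return torch.stack(snapshots)

def train_autoencoder(ae, snapshots, epochs=10, lr=1e-3):
    """Train the AE to reconstruct the recorded weight vectors (lossy compression)."""
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for w in snapshots:
            opt.zero_grad()
            loss = mse(ae(w), w)
            loss.backward()
            opt.step()
    # Encoder stays on the client, decoder is communicated back to the server
    return ae.encoder, ae.decoder
```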
Now the FL training starts and follows the approach described earlier, except that at the end of every communication round, instead of sending the weights directly from client to server, the weights are AE-compressed. The weights are compressed on the client side using the encoder, by a factor of around 2000x (the highest recorded compression ratio in FL!), and communicated to the server. The server then reconstructs these weights using the decoder.
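Conceptually, a compressed round then looks something like this (a sketch; `client_model`, `send_to_server`, and `aggregate` are hypothetical placeholders, and the encoder/decoder come from the pre-pass above):

```python
import torch

# Client side: flatten the weight update and squeeze it through the encoder
with torch.no_grad():
    flat_weights = torch.cat([p.flatten() for p in client_model.parameters()])
    latent = encoder(flat_weights)       # e.g. 100,000 floats -> 50 floats
send_to_server(latent)                   # hypothetical transport call

# Server side: reconstruct the (approximate) weights and aggregate as usual
with torch.no_grad():
    reconstructed = decoder(latent)      # back to the original dimensionality
aggregate(reconstructed)                 # hypothetical FedAvg-style aggregation
```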
To conclude, the main advantage of this setup is obviously the large compression ratio, but another is its flexibility. The AE architecture can be modified to exactly suit a given FL setup's requirements, in terms of available computational capacity and complexity, the compression ratio required, the accuracy of the weights required (because the compression is lossy), and so on. One thing to note is the overhead of communicating the decoder at the end of the pre-pass round, as that adds to the cost of communication. It is therefore useful to perform a trade-off analysis; such AE compression is more advantageous in a large-scale FL setup with many collaborators and communication rounds, as there is that much more compression being achieved in contrast to the one-time overhead of communicating the decoder in the first step.
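One rough, back-of-the-envelope way to frame that trade-off (every number below is invented purely for illustration): the compression pays off once the bytes saved across all rounds exceed the one-time cost of shipping the decoder.

```python
# Illustrative break-even check; all sizes here are made-up placeholders
update_size_mb  = 0.4    # uncompressed weight update per client per round
compression     = 2000   # AE compression ratio
decoder_size_mb = 5.0    # one-time cost of sending the decoder to the server
rounds          = 100    # number of communication rounds

saved_per_round = update_size_mb * (1 - 1 / compression)
net_saving = rounds * saved_per_round - decoder_size_mb
print(f"Net saving per client: {net_saving:.1f} MB")  # positive means the AE pays for itself
```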
Finally, this method is orthogonal in nature and can be used alongside traditional compression techniques such as quantization and sparsification, as well as other SOTA FL compression techniques like DGC, CMFL, and STC (not linking them, I'll let you hunt for these papers :D ), to make the FL setup even better and as communication-efficient as possible!
Thanks to Pravin Chandran, Raghavendra Bhat, and Avinash Chakravarthi for making this project possible.