Personalized Federated Learning on NLOS Acoustic Signal Classification

Hucheng Wang, Suo Qiu, Jingjing Wang, Lei Zhang, Zhi Wang, Xiaonan Luo

Abstract-To identify non-line-of-sight (NLOS) signals, acoustics-based indoor positioning needs to collect audio recordings of the sound fields in multiple rooms and upload them to a central server for training. If the transmission path or the server is attacked, private data can be leaked. To address the training difficulty and the privacy issue at the same time, we propose a novel Personalized Federated Learning (PFL) model combined with user frequency and room data capacity, taking into account the significant differences in positioning data across room layouts. The proposed model can accurately identify the differences between the data of different rooms when aggregating on the server side. Using data collected in an actual indoor environment and compared against existing algorithms, the proposed method reaches an accuracy of 90% when validated on data from unfamiliar rooms.

I. INTRODUCTION
Indoor positioning technology based on acoustic signals has the advantages of compatibility, stability, and high positioning accuracy. Outliers caused by NLOS propagation are the most severe cause of degraded positioning accuracy [1]. To identify NLOS signals, neural-network-based methods first collect signal data, then apply neural network algorithms to the collected data, and finally identify the NLOS signals (e.g., [2], [3]). However, a globally trained model cannot accurately predict results for rooms with occlusion problems, because layout types differ. Moreover, the dataset describing the indoor environment is uploaded to the server as a whole. Once a malicious attacker appears on the transmission path or the server, the indoor privacy information is completely exposed [4].
Federated learning (FL) [5] proposes a distributed learning architecture in which each client uploads model parameters, such as gradients, to the server instead of its original dataset. FL also embodies the idea of edge computing, enabling beacons in the room to respond to user data more quickly. FL not only prevents malicious attackers from directly obtaining users' original private data but also greatly reduces the bandwidth pressure of the upload process.
However, the general FL aggregation model requires the data to be independent and identically distributed. For non-independent and identically distributed (Non-IID) data, the private model based on each user node proposed by PFL [6] solves the problem of a single global model overfitting on Non-IID data. Federated averaging (FedAvg) [7] fluctuates widely and reduces the validation accuracy on some room datasets. Bui [8] updated FL training by embedding personalized parameters. Liang [9] proposed the LG-FedAvg algorithm, which combines local representation learning and global federated training. However, because room layouts differ greatly, LG-FedAvg cannot effectively solve the problem of verification failure. Based on the above problems, we propose our algorithm and compare and verify it against other methods.

II. FL AIDED NLOS CLASSIFICATION DESIGN
The general indoor environment is considered to be composed of multiple independent rooms, and the sound propagation caused by different indoor layouts and occlusion distributions follows a Non-IID model. Assume that the number of speakers in each room is $S$ and the total number of rooms is $R$. The audio received by the microphone at time $t$ from speaker $s$ in room $r$ is denoted $A_t^{r,s}$. Each $A_t^{r,s}$ is considered to be independent of the others, and reverberation only occurs between the LOS signal and the self-reflected signal.

A. Capture audio model
We assume that the indoor acoustic field consists only of direct sound waves, first-reflected waves, and diffuse reflection waves, ignoring the weak multiple reflections and other related waves. The signal-to-noise ratio (SNR) of the transmit power is limited to $p_{snr}$, which limits the propagation distance of a single speaker. Given the propagation speed $c$ of sound waves, the sound cycle $T_c$ of each speaker only needs to satisfy $T_c > S \cdot d_{max}/c$ to conform to $A_t^{r}$, where multiplying by the number of speakers $S$ means that the next round of sound can be broadcast only after all speaker signals have been received. The distance between adjacent speakers shall not exceed $d_{max}$.
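As a quick numerical check of the constraint $T_c > S \cdot d_{max}/c$, the sketch below computes the lower bound on the sound cycle. The speed of sound (343 m/s) and the value of $d_{max}$ (10 m) are illustrative assumptions, not values from the paper; the function names are hypothetical.

```python
def min_sound_cycle(num_speakers: int, d_max_m: float, c_mps: float = 343.0) -> float:
    """Lower bound (in seconds) on the cycle T_c from T_c > S * d_max / c."""
    if num_speakers <= 0 or d_max_m <= 0 or c_mps <= 0:
        raise ValueError("all arguments must be positive")
    return num_speakers * d_max_m / c_mps


def cycle_is_valid(t_c_s: float, num_speakers: int, d_max_m: float,
                   c_mps: float = 343.0) -> bool:
    """True if the chosen cycle strictly exceeds the bound."""
    return t_c_s > min_sound_cycle(num_speakers, d_max_m, c_mps)


# S = 4 speakers as in Section III; d_max = 10 m is a hypothetical value.
bound = min_sound_cycle(4, 10.0)
print(f"T_c must exceed {bound * 1000:.1f} ms")  # T_c must exceed 116.6 ms
```

Under these assumed values, a cycle of 200 ms would satisfy the constraint, while 100 ms would not.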
Considering the complex indoor environment, the captured audio signal $A_t^{r,s}$ is a mixture of three kinds of signals: LOS, reflection, and diffusion, which can be described as in [10], where $S_t^{r,s}$ denotes the raw acoustic signal. To simplify the system, the raw signal $S$ is the same for all $R$ rooms and $S$ speakers. $\alpha_L$, $\alpha_{Re}$, and $\alpha_D$ are the attenuations of LOS, reflection, and diffusion, respectively. The superscripts $m$ and $n$ denote the $m$th and $n$th paths of the reflection and diffusion waves. The Blackman window $w(t)$ is used to filter out the low-SNR signals that cannot be perceived.

B. Training model
The real-time positioning system needs to exchange data between the user and the server instantaneously. The user to be positioned sends the collected acoustic signal $A_t^{r,s}$ to the server, and the server processes it to determine the sight status and calculate the coordinates of the user. The sight status is judged by a deep neural network architecture built from CNN and Bi-LSTM models. The short-time Fourier transform (STFT) is one of the most effective feature extraction methods available. After the STFT, the complex-valued spectrum matrix of $A_t^{r,s}$ at each time $t$ is obtained and used as the input layer of the CNN. The CNN extracts multi-dimensional features through iterated convolution and pooling layers and passes them to the Bi-LSTM network for classification training. Then, the probability of the current state is calculated by the fully connected layer, and the NLOS state judgment is obtained. In this process, $N$ is additive white noise, and $F$ and $T$ are the dimensions of the spectrum matrix $U_t^{r,s}$, where $F$ is the number of frequency segments and $T$ is the number of time segments in the STFT, determined by the window length and the overlap length. In fact, $\phi$ is only a binary classifier, where 0 means NLOS and 1 means LOS.
The federated learning model protects data privacy by transmitting gradients or weights instead of training data. The server weights and aggregates the models of the clients into the personal server model and then distributes it to each room. Each room combines its own training state with the personal server model, updates its training model, and thereby completes one update from the server to each room. Because the indoor layouts and occlusion scenarios differ between rooms, the geometric dilution of precision (GDOP) of the speaker arrangement also differs, which makes this a typical Non-IID setting. The specific steps are described below and illustrated in Figure 1.
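To make the relation between the STFT settings and the dimensions $F$ and $T$ concrete, the following is a minimal pure-Python sketch of an STFT. The Hann window, the one-sided spectrum, the frame-count convention (no padding), and the 8 kHz sampling rate are all assumptions made for illustration; the paper does not specify them here.

```python
import cmath
import math


def hann(n: int) -> list:
    """Periodic Hann window of length n (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal: list, win_len: int, overlap: int) -> list:
    """Return an F x T complex spectrum matrix (list of rows).

    F = win_len // 2 + 1 one-sided frequency bins,
    T = 1 + (len(signal) - win_len) // hop frames, with hop = win_len - overlap.
    """
    hop = win_len - overlap
    if hop <= 0 or len(signal) < win_len:
        raise ValueError("need overlap < win_len <= len(signal)")
    window = hann(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    n_bins = win_len // 2 + 1
    spec = [[0j] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        frame = [signal[t * hop + i] * window[i] for i in range(win_len)]
        for f in range(n_bins):
            # Direct DFT of the windowed frame at frequency bin f.
            spec[f][t] = sum(
                x * cmath.exp(-2j * math.pi * f * i / win_len)
                for i, x in enumerate(frame)
            )
    return spec


# Hypothetical test tone: 1 kHz sine, 8 kHz sampling rate, 400 samples.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]
spec = stft(tone, win_len=64, overlap=32)
print(f"F = {len(spec)}, T = {len(spec[0])}")  # F = 33, T = 11
```

With a 64-sample window and 32-sample overlap, the 1 kHz tone falls in bin 1000 / 8000 * 64 = 8, which is where the magnitude of each column peaks.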
1) Room model training: The same method is used to train each room separately; the model of the $r$th room is denoted $\phi^{(r)}$.
2) Server Aggregation: The accuracy of individual clients may decrease when the data distributions of the participants in federated learning are inconsistent. On the cloud server, each room model $\phi^{(r)}$ has a corresponding private model $\phi'^{(r)}$, and the total server model is $\phi' = \{\phi'^{(1)}, ..., \phi'^{(R)}\}$. To enable models with similar NLOS distributions to adapt better to training, instead of using a global model, the distance $\|\phi^{(i)}_{E-1} - \phi^{(j)}_{E-1}\|^2$ between $\phi^{(i)}$ and $\phi^{(j)}$ becomes important for the global update of the $r$th room in the server. The amount of data is also an important indicator for aggregation. We allow each room $r$ to perform $E$ epochs of local room model updates via minibatch SGD with batch size $n_r$, so that $\sum_{r \in R} n_r$ is the whole of the $E$ training batches. The update of the server personal model $\phi'^{(r)}$ corresponding to room $r$ is
$$\phi'^{(r)} = \xi^{(r,1)} \phi^{(1)} + \cdots + \xi^{(r,R)} \phi^{(R)},$$
where $\xi^{(r,1)}, ..., \xi^{(r,R)}$ are the linear combination weights of the model parameter sets. $\phi'^{(r)}$ is thus a convex combination of the model sets, with $\xi^{(r,1)} + \cdots + \xi^{(r,R)} = 1$. This completes the process of $E$ epochs from room model training to room model updating.
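The aggregation step can be sketched as follows. The text states only that the weights form a convex combination and that both the model distance and the amount of data matter; the specific weighting rule used here (data share multiplied by an exponential of the negative squared distance, then normalized) is an assumption for illustration and not the paper's formula. The `temperature` parameter and all function names are hypothetical.

```python
import math


def squared_distance(a: list, b: list) -> float:
    """Squared Euclidean distance between two flattened parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def personalized_weights(models: list, sizes: list, r: int,
                         temperature: float = 1.0) -> list:
    """Convex weights xi^(r,1..R) for room r (illustrative rule, see lead-in)."""
    if any(n <= 0 for n in sizes):
        raise ValueError("data sizes must be positive")
    total = sum(sizes)
    scores = [
        (sizes[j] / total)
        * math.exp(-squared_distance(models[r], models[j]) / temperature)
        for j in range(len(models))
    ]
    norm = sum(scores)  # > 0 because the j == r term is sizes[r] / total
    return [s / norm for s in scores]


def aggregate(models: list, sizes: list, temperature: float = 1.0) -> list:
    """Return one personal server model per room as a convex combination."""
    personal = []
    for r in range(len(models)):
        xi = personalized_weights(models, sizes, r, temperature)
        dim = len(models[r])
        personal.append([
            sum(xi[j] * models[j][k] for j in range(len(models)))
            for k in range(dim)
        ])
    return personal


# Toy parameters for R = 4 rooms: two similar pairs (made-up numbers).
room_models = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
room_sizes = [100, 100, 100, 100]
personal_models = aggregate(room_models, room_sizes)
```

In this toy setup, each personal model stays close to the rooms that resemble it, instead of being pulled toward a single global average as in FedAvg.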

III. EXPERIMENTS AND RESULTS
We set $R = 4$ rooms and $S = 4$ speakers in each room. The four rooms are Lab 1, Lab 2, Office 1, and Office 2. The speakers are mounted on the ceiling at the corners of each room to fully cover the sound field. Each speaker sends 800 acoustic signals, 400 of which are captured by the microphone in the LOS range and the other 400 in the NLOS range. Each recording is a 16-bit, 1-second file in lossless .wav format, for a total of 15Gb of audio data. The dataset is uploaded to IEEE Dataport [11].

A. Results on general training
To evaluate the performance, we compare the centralized training model, the global model (FedAvg) [7], and the personal model (proposed) on the existing rooms. To balance the total training epochs, we set the number of local training epochs multiplied by the number of updates, E = 10, to equal 100 epochs. Although centralized training runs for up to 100 epochs, its accuracy constantly fluctuates and does not stabilize until after 80 epochs. For Non-IID room acoustic signals, FedAvg does not consider the difference in NLOS distribution between rooms, which causes unstable validation accuracy. The proposed aggregation method takes the differences between rooms into account. In the early validation stage, the accuracy is not ideal because the personal model of each room is insufficiently trained. From epoch 30 onward, the proposed method is significantly better than the other methods, and there are no repeated fluctuations in the validation.

B. Results on unfamiliar room sound field
To verify the adaptability of the existing model to unfamiliar rooms, we remove the training data of a specific room. The removed room data is then used to validate the trained model, and we calculate the NLOS classification accuracy of each model for the unfamiliar room.
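The hold-out protocol described above can be sketched as a leave-one-room-out loop. The room names come from Section III, but the scalar features, the labels, and the threshold classifier standing in for the trained network are made up for illustration only.

```python
def leave_one_room_out(room_data: dict):
    """Yield (held_out_name, training_rooms, held_out_samples) for each room."""
    for held_out in room_data:
        train = {name: samples for name, samples in room_data.items()
                 if name != held_out}
        yield held_out, train, room_data[held_out]


def accuracy(predict, samples: list) -> float:
    """Fraction of (feature, label) pairs that predict() classifies correctly."""
    correct = sum(1 for feature, label in samples if predict(feature) == label)
    return correct / len(samples)


def train_threshold(train_rooms: dict):
    """Stand-in trainer: threshold halfway between the two class means."""
    samples = [s for room in train_rooms.values() for s in room]
    los = [x for x, y in samples if y == 1]
    nlos = [x for x, y in samples if y == 0]
    threshold = (sum(los) / len(los) + sum(nlos) / len(nlos)) / 2
    return lambda x: 1 if x > threshold else 0


# Toy data: (scalar feature, label) with 1 = LOS and 0 = NLOS (made-up values).
rooms = {
    "Lab 1": [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)],
    "Lab 2": [(0.7, 1), (0.95, 1), (0.3, 0), (0.15, 0)],
    "Office 1": [(0.85, 1), (0.75, 1), (0.25, 0), (0.05, 0)],
    "Office 2": [(0.6, 1), (0.9, 1), (0.35, 0), (0.2, 0)],
}

results = {}
for held_out, train, test in leave_one_room_out(rooms):
    model = train_threshold(train)          # trained without the held-out room
    results[held_out] = accuracy(model, test)
```

Each entry of `results` is the accuracy on a room the model never saw during training, which corresponds to one position on the x-axis of the unfamiliar-room comparison.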
In Figure 3, the x-axis indicates the room that was removed from training. The PFL model showed the best performance when tested on unfamiliar rooms. The proposed method takes the room similarity distance into account. A model that does not take individual considerations into account may generate large outliers when the data in an unfamiliar room does not conform to the global model data distribution.