LRP-based network pruning and policy distillation of robust and non-robust DRL agents for embedded systems

We use the relevance scores computed by Layer-wise Relevance Propagation (LRP), an explainable AI method, to rank and prune the convolutional filters in the policy network of a deep reinforcement learning (DRL) agent, combined with fine-tuning via policy distillation. Performance evaluation based on several Atari games indicates that our proposed approach is effective in reducing the model size and inference time of RL agents. We also consider robust RL agents trained with RADIAL-RL versus standard RL agents, and show that a robust RL agent can achieve better performance (higher average reward) after pruning than a standard RL agent for different attack strengths and pruning rates.

Network pruning can reduce the parameter count of a trained network by over 90%, decreasing storage requirements and improving the computational performance of inference without compromising accuracy. Unstructured pruning techniques remove individual weights and can achieve high degrees of sparsity, but the resulting unstructured sparse models are not hardware-friendly and cannot achieve significant speedup in inference time on typical hardware accelerators 11,12 for deep learning. Structured pruning techniques remove groups of neurons and weights together, for example, convolutional filters (also called kernels) or channels in a CNN, and the resulting structured sparse model is more amenable to hardware acceleration. During the pruning procedure, an importance metric, or pruning criterion, is needed to rank the elements to be pruned (weights, neurons, channels, filters, etc.) in increasing order of importance, so that less important elements are pruned before more important ones. Importance metrics may be computed based on (1) weight magnitudes, 13,14 (2) gradient magnitudes, 15 (3) Taylor expansion/derivatives, 16 and (4) other criteria, for example, LRP-based relevance. 17

Knowledge distillation 18 works by distilling the knowledge from a well-trained teacher network T into a student network S. By mimicking the teacher, for example, matching the softmax output of the teacher, the student may be able to learn more effectively than with supervised learning on hard labels. Knowledge distillation is an effective technique for model compression when the teacher network is a large network and the student network is a smaller, more efficient network. Policy distillation 19 refers to the application of knowledge distillation to the policy network of an RL agent: the student network S, that is, the policy network of the RL agent after pruning, is trained to match the policy of the teacher network T. Compared to RL, where the agent interacts with the environment and learns by trial and error, policy distillation can be much more sample-efficient.
Czarnecki et al. 20 present a comprehensive review and comparison of different policy distillation algorithms. Applications of policy distillation include: training otherwise un-trainable agent architectures; 21 kickstarting/speeding up learning; 22 building stronger policies; 23 driving multi-task learning; 19 and model compression. 24 Most prior work on knowledge and policy distillation focuses on the knowledge to be distilled and the distillation algorithm. The student network architecture is assumed to be given as the input to the distillation algorithm, and its design is typically manual and not addressed explicitly in the articles.
As shown in Figure 1, we achieve model compression of the policy network of an RL agent by performing network pruning and policy distillation iteratively, until the desired pruning rate is reached.
• Network pruning: Instead of manually designing the student network, we obtain the student network S from the pre-trained teacher network T by gradually pruning away unimportant filters starting from T. We adopt the relevance score computed by LRP as the importance metric used for filter pruning, inspired by the work of Yeom et al. 17 on LRP-based filter pruning for CNNs.
• Policy distillation: Instead of fine-tuning the student network S with RL, which is not sample-efficient, we fine-tune it with policy distillation, that is, we freeze the parameters of the pre-trained teacher network T, and let it interact with the environment to build a replay buffer, which is used for policy distillation after each pruning step of the student network S.
We adopt Deep Q-Network (DQN), 6 one of the most widely used DRL algorithms, in this article, but the techniques presented here are also applicable to any other DRL algorithm as long as the policy network is a CNN.
This article is an extension of our conference publication, 25 with the following new contents: a new Figure 7 to measure the memory efficiency more precisely for different pruning rates; a new Section 4.2 with additional experimental results, to show that robust RL agents obtained by a robust training algorithm 26 can generally achieve better performance (higher average reward) after pruning than standard RL agents.
This article is structured as follows: we first discuss background knowledge on DQN and LRP in Section 2; our approach of LRP-based policy pruning and distillation in Section 3; performance evaluation results in Section 4, including Section 4.1 for experiments with versus without fine-tuning, and Section 4.2 for experiments with robust versus non-robust models; and conclusions in Section 5.

Deep Q-learning (DQN)
A reinforcement learning (RL) environment is characterized by a Markov decision process (MDP), $\mathcal{M} = (S, A, P, R, \gamma, s_0)$, where $S$ is the set of states; $A$ is the set of actions that an agent can take; $P : S \times A \times S \to \mathbb{R}$ defines the state transition probabilities; $R : S \times A \to \mathbb{R}$ is the scalar reward function; $s_0$ is the initial state distribution; and $\gamma$ is the discount factor. An RL algorithm aims to learn a policy $\pi : S \times A \to \mathbb{R}$ describing the probability of taking an action $a$ given state $s$. The optimal policy $\pi^*$ is the policy that maximizes the cumulative time-discounted reward of an episode, $\sum_t \gamma^t r_t$, where $t$ is the timestep, and $r_t$, $a_t$, and $s_t$ denote the reward, action, and state at timestep $t$.
An action-value function $Q(s, a)$ describes the expected cumulative reward given the current state $s$ and action $a$. In Q-learning, a policy is constructed by taking the action with the highest Q-value. The optimal Q-function $Q^*(s, a)$ satisfies the Bellman optimality equation

$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a')\right], \tag{1}$$

where $r$ is the reward, $\gamma$ is the discount factor, and $s'$, $a'$ are the next state and an action that can be taken in the next state.
One of the most popular DRL algorithms is Deep Q-Learning (DQN), with a wide range of applications, including game playing, 6 computation offloading in edge computing, 27 and many others. With DQN, a deep neural network with parameters $\theta$ is used to approximate $Q^*(s, a)$. The neural network is trained by minimizing the loss function

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2\right]. \tag{2}$$

Double-DQN 28 is an improved version of DQN that uses two Q-networks: a target network $Q_{target}$ to evaluate the target Q-value and an actor network $Q_{actor}$ that is being trained, with $\theta_{actor}$ optimized by minimizing the loss

$$L(\theta_{actor}) = \mathbb{E}\left[\left(r + \gamma\, Q_{target}\left(s', \operatorname*{argmax}_{a'} Q_{actor}(s', a'; \theta_{actor})\right) - Q_{actor}(s, a; \theta_{actor})\right)^2\right]. \tag{3}$$
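To make the Double-DQN objective concrete, the following is a minimal PyTorch sketch of one TD-loss computation; the network objects, batch layout, and function name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_actor, q_target, batch, gamma=0.99):
    # batch holds tensors: states, integer actions, rewards, next states, done flags
    s, a, r, s_next, done = batch
    # Q_actor(s, a) for the actions actually taken
    q_sa = q_actor(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # action selection by the online (actor) network ...
        a_next = q_actor(s_next).argmax(dim=1, keepdim=True)
        # ... action evaluation by the target network (the Double-DQN decoupling)
        q_next = q_target(s_next).gather(1, a_next).squeeze(1)
        td_target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, td_target)
```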

Layer-wise relevance propagation (LRP)
Layer-wise relevance propagation (LRP) 29 is an Explainable AI (XAI) method that assigns importance scores, also called relevance, to the different input dimensions (e.g., pixels of an image) of a neural network, reflecting the contribution of each input dimension to the model's decision; it has been applied to different fields of computer vision. The relevance is backpropagated from the output layer to the input layer and assigned to every neuron at every layer. These relevance scores reflect the importance of every neuron in terms of its contribution to the information flow through the network, hence they can serve as a natural importance metric used as the pruning criterion.
For a given input $x$, LRP operates by propagating the prediction $f(x)$ backward through the network, layer by layer, using local propagation rules, until the input $x$ is reached. Let $j$ and $k$ be the indices of neurons at two consecutive layers $(l)$ and $(l+1)$ of the neural network. The basic LRP rule propagates relevance scores $R_k^{(l+1)}$ at layer $l+1$ onto the neurons of the preceding layer $l$:

$$R_j^{(l)} = \sum_k \frac{z_{jk}}{\sum_{j'} z_{j'k}} R_k^{(l+1)}, \tag{4}$$

where $z_{jk}$ models the extent to which neuron $j$ has contributed to making neuron $k$ relevant. The propagation procedure terminates upon reaching the input $x$.
FIGURE 2 An example to illustrate the basic LRP rule.
Consider the application of LRP to a neural network with the ReLU nonlinear activation function. Each neuron's activation is computed as

$$a_k = \max\left(0, \sum_{0,j} a_j w_{jk}\right), \tag{5}$$

where the sum $\sum_{0,j}$ runs over all lower-layer activations $a_j$, each with a connection of weight $w_{jk}$ to neuron $k$, plus an extra neuron representing the bias, whose activation is $a_0 = 1$ and whose weight $w_{0k}$ equals the bias value.
The basic LRP rule, Equation (4), then becomes

$$R_j^{(l)} = \sum_k \frac{a_j w_{jk}}{\sum_{0,j'} a_{j'} w_{j'k}} R_k^{(l+1)}, \tag{6}$$

where $z_{jk} = a_j w_{jk}$. A larger activation $a_j$ and a larger weight $w_{jk}$ cause a larger share of $R_k^{(l+1)}$ to be distributed to $R_j^{(l)}$. If neuron $j$ is inactive, that is, its output after the ReLU activation function is 0, then its relevance $R_j^{(l)} = 0$, since it does not affect the subsequent layers. Consider the neuron at layer $l$ with activation $a_1$ in Figure 2. Applying Equation (4), we can compute its relevance $R_1^{(l)}$ from the activations, weights, and upper-layer relevance scores shown in the figure; the relevance of the other two neurons at layer $l$, $R_2^{(l)}$ and $R_3^{(l)}$, can be computed similarly. There are several variants of the basic LRP rule, as discussed in Reference 30. We use the LRP-$\alpha_1\beta_0$ variant 17 in our experiments.
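As a concrete illustration of the rule above, here is a minimal sketch of relevance propagation through one fully connected ReLU layer in PyTorch; the function name is ours, and the bias term in the denominator is omitted for brevity.

```python
import torch

def lrp_linear(a, weight, relevance_out, eps=1e-9):
    # a: activations of layer l, shape (n_in,)
    # weight: weight matrix of the layer, shape (n_out, n_in)
    # relevance_out: relevance scores R_k of layer l+1, shape (n_out,)
    z = a.unsqueeze(0) * weight                    # z_jk = a_j * w_jk, shape (n_out, n_in)
    z_sum = z.sum(dim=1, keepdim=True) + eps       # denominator: sum over j for each neuron k
    return (z / z_sum * relevance_out.unsqueeze(1)).sum(dim=0)  # R_j, shape (n_in,)
```

For a convolutional layer, the same share-of-contribution computation is applied within each filter's receptive field, and the per-neuron relevance scores can then be aggregated per filter as described in Section 3.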
Yeom et al. 17 present LRP-based filter pruning for CNNs, using the relevance computed by LRP as the importance metric to decide which filters to prune. They show that this approach is competitive or better compared to state-of-the-art pruning criteria when successive retraining/fine-tuning is performed, and it also outperforms previous criteria for pruning without fine-tuning in the resource-constrained application scenario where labeled data are scarce. In this article, we apply LRP-based filter pruning to an RL agent's policy network, which is a CNN trained with RL, instead of a CNN trained with supervised learning addressed in Reference 17.

OUR APPROACH: LRP-BASED POLICY PRUNING AND DISTILLATION
We propose a structured pruning scheme based on the LRP algorithm, which prunes network filters according to the importance of each filter as calculated by the LRP method. Algorithm 1 shows our overall algorithm. The key parameters used in Algorithm 1 are:
• Pruning rate (total percentage of filters pruned): r.
• Pruning step (percentage of filters pruned in each iteration): PStep.
• Number of fine-tuning epochs in each iteration: FTEpochs.
The key steps of Algorithm 1 are as follows:
• Line 1: Let the teacher network T interact with the environment to build a replay buffer RB used for policy distillation.
• Line 2: Initialize the student network S to be the same as the teacher network T.
• Line 7: Compute each filter's importance as the L1 norm of the importance of all its constituting neurons, that is, the sum of the absolute values of all the neurons' importance.
• Lines 8-10: Sort all filters by importance, and prune PStep percent of the filters with the lowest importance. Remove orphaned connections of each pruned filter. (This corresponds to hard filter pruning, as opposed to soft filter pruning, 31 where pruned filters are reset to zero weight but may be recovered during fine-tuning if they are critical to the model architecture.)
• Lines 11-13: At a high pruning rate, it is possible for the student network to become disconnected from the input layer to the output layer after a pruning iteration. If this happens, skip Step 2 and return the student network S from the previous iteration.
• Lines 14-16 (Step 2): Fine-tune S by policy distillation for FTEpochs epochs using the replay buffer RB collected at Line 1.
We would like to comment on the possible network disconnections mentioned in Lines 11-13: our current pruning algorithm, as well as all the baseline methods, does not consider network connectivity during pruning, hence the CNN may become disconnected from the input layer to the output layer after a pruning iteration. It is conceivable to add an additional check for network connectivity before pruning each filter, but this would add significant complexity without much benefit. We can observe from Figures 3 and 4 that just before the network becomes disconnected, the average reward has already dropped to a very low value because the network has been so severely pruned. A connectivity check that prevents the network from becoming disconnected might enable even higher pruning rates, but the resulting network's performance is likely to be too low to be useful at such high pruning rates.
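For readers who prefer code to pseudocode, the iterative prune-and-distill loop of Algorithm 1 can be sketched as follows; the helper functions passed in (collect_with_teacher, lrp_filter_relevance, prune_lowest, is_disconnected, distill) are hypothetical placeholders standing in for the steps described above, not the authors' implementation.

```python
import copy

def prune_and_distill(teacher, env, r_total, p_step, ft_epochs,
                      collect_with_teacher, lrp_filter_relevance,
                      prune_lowest, is_disconnected, distill):
    # Line 1: build a replay buffer by letting the frozen teacher act in the environment
    replay_buffer = collect_with_teacher(teacher, env)
    # Line 2: the student starts as a copy of the teacher
    student = copy.deepcopy(teacher)
    pruned = 0.0
    while pruned < r_total:
        # Step 1: rank filters by the L1 norm of their neurons' LRP relevance
        relevance = lrp_filter_relevance(student, replay_buffer)
        candidate = prune_lowest(student, relevance, p_step)   # hard filter pruning
        if is_disconnected(candidate):
            # Lines 11-13: stop if pruning would disconnect input from output
            return student
        student = candidate
        # Step 2 (Lines 14-16): fine-tune with policy distillation on the replay buffer
        distill(student, teacher, replay_buffer, epochs=ft_epochs)
        pruned += p_step
    return student
```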
The trained teacher agent T is used to collect a dataset $\{(x_i, \mathbf{q}_i)\}_{i=1}^N$, where each sample consists of an input video frame $x_i$ and a vector $\mathbf{q}_i$ of Q-values output by the policy network, with one value per action. We use only the highest-valued action from T, $a_{i,best} = \operatorname{argmax}(\mathbf{q}_i)$, and construct the training dataset $D_T = \{(x_i, a_{i,best})\}_{i=1}^N$ for supervised learning, where $x_i$ is one input sample and $a_{i,best}$ is the label for $x_i$ given by the teacher. The student agent S is trained with supervised learning, using the negative log-likelihood (NLL) loss, to predict the same action as T: 19

$$L(\theta_S) = -\sum_{i=1}^{N} \log P\left(a_i = a_{i,best} \mid x_i; \theta_S\right),$$

where $\theta_S$ denotes the parameters of S. (Actions $a_i$ and $a_{i,best}$ correspond to $a_s$ and $a_t$ in Figure 1, respectively.)
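A minimal sketch of one policy-distillation update with the NLL loss above might look as follows in PyTorch; the variable names and the use of the student's Q-values as action logits are our illustrative assumptions.

```python
import torch.nn.functional as F

def distill_step(student, optimizer, frames, teacher_actions):
    # frames: batch of input frames x_i; teacher_actions: labels a_{i,best} from the teacher
    logits = student(frames)                       # student's Q-values used as action scores
    log_probs = F.log_softmax(logits, dim=1)       # turn scores into log-probabilities
    loss = F.nll_loss(log_probs, teacher_actions)  # NLL of the teacher's greedy action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```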

PERFORMANCE EVALUATION
The hardware platform is a GPU workstation equipped with a CPU (Intel i5-10400 at 2.90 GHz), a GPU (NVIDIA 2080Ti), and 32 GB of RAM. We use the PyTorch framework for deep learning and the OpenAI Gym toolkit for the RL environments. We consider three Atari games in OpenAI Gym: Road Runner, Bank Heist, and Pong, which are commonly used as benchmark problems for RL algorithms.
As comparison baselines, we consider the following commonly used techniques for ranking and pruning filters:
• Weight-based ($L_1$ or $L_2$ norm): 14 Filters with smaller $L_p$-norm values have smaller-magnitude weights and/or fewer weights, and hence are deemed less important than filters with larger $L_p$-norm values. We consider both the $L_1$ norm and the $L_2$ norm in our experiments. Suppose a filter F contains N weights $w_i$, $i = 1, \ldots, N$; then its $L_1$ norm is defined as $||F||_1 = \sum_{i=1}^N |w_i|$, and its $L_2$ norm is defined as $||F||_2 = \sqrt{\sum_{i=1}^N w_i^2}$.
• Gradient-based: The authors of Reference 15 propose a sparsified back-propagation approach for neural network training, using the magnitude of the gradient to find essential and nonessential features in multi-layer perceptron (MLP) and long short-term memory (LSTM) models, which are then used for pruning.
• Taylor expansion-based: 16 The importance of a filter is estimated by a first-order Taylor expansion of the change in the loss function caused by removing the filter.
We perform two sets of experiments: in Section 4.1, we evaluate the RL agent performance with versus without fine-tuning with knowledge distillation; in Section 4.2, we evaluate the RL agent performance starting from adversarially robust versus non-robust models.
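For reference, the weight-based criterion described above can be computed per filter with a few lines of PyTorch; the function below is a sketch with our own naming, treating each output channel of a convolutional layer as one filter.

```python
import torch

def filter_lp_norms(conv_layer, p=1):
    # conv_layer.weight has shape (out_channels, in_channels, kH, kW);
    # each output channel is one filter F, so reduce over the remaining dimensions.
    w = conv_layer.weight.detach().abs()
    return w.pow(p).sum(dim=(1, 2, 3)).pow(1.0 / p)  # one L_p norm per filter
```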

Experiments with versus without fine-tuning
For pruning with fine-tuning, Figure 3 shows each game's average reward, that is, its cumulative reward per episode averaged over 10 episodes, at increasing pruning rates. We make the following observations:
• For all methods, including our method (marked LRP in the figure) and the baseline methods, the average reward generally decreases with increasing pruning rate, as expected, but it does not decrease strictly monotonically and exhibits significant fluctuations at different pruning rates. This is typical for an RL agent due to the inherent randomness of each RL game environment. Nonetheless, our LRP-based method achieves the highest average reward at most pruning rates for each game.
• For each method, pruning may sometimes lead to higher cumulative rewards, especially at low pruning rates, for example, for BankHeist and RoadRunner. This is not surprising; for example, Han et al. 13 show that model compression by pruning and quantization may occasionally cause a slight increase in model accuracy due to the regularization effect of model compression.
• The agent can achieve close to perfect average reward for Pong for pruning rates up to about 90%. We think this is because it is a simple game in terms of the input image, which consists of two paddles and one ball against a uniform-color background, hence the neural network is highly redundant in the number of filters.
Next, we perform pruning without fine-tuning by removing the fine-tuning step after pruning (Step 2 in Algorithm 1) to observe the effect of filter pruning alone on agent performance. This is a more direct measurement of how well the filter pruning criteria correlate with the actual importance of filters. (This setting is also applicable to domain adaptation/transfer learning tasks in resource-constrained application scenarios in which labeled data for the target task are very scarce. 17 ) Figure 4 shows the results. We make the following observations:
• Fine-tuning with policy distillation is crucial to achieving good performance for the pruned/compressed model.
• For RoadRunner and BankHeist, LRP-based pruning achieves significantly better performance than the baselines.
• For Pong, LRP-based pruning achieves performance comparable to the baselines, likely due to the relative simplicity of the game.
Figure 5 shows the number of remaining convolutional filters in each pruning iteration. For LRP-based and weight-based methods, as the pruning rate increases, filters in earlier layers (e.g., Conv1), which are associated with generic features such as edge and blob detectors, tend to be preserved more than those in later layers (e.g., Conv3), which are associated with abstract, task-specific features. For gradient-based and Taylor-based methods, the filters at different layers are pruned more uniformly. This is consistent with Yeom et al., 17 who show that LRP-based and weight-based methods tend to retain more filters in the early layers (as long as they serve a purpose) despite the iterative global pruning process.
To measure the computational efficiency of a neural network after deployment, we use the number of FLOPs (floating-point operations) required for forward inference. Figure 6 shows that the number of FLOPs for forward inference of the pruned model decreases roughly linearly with increasing pruning rate, as expected. It also shows that the pruned models from our method consistently require slightly fewer FLOPs than the baselines, which is advantageous for resource-constrained devices. Our experience indicates that this is because LRP-based pruning tends to remove filters with a larger number of weights.
Since the filters in the CNN may have different sizes (e.g., 4×4, 8×8, and so forth), to measure the memory efficiency more precisely at different pruning rates, Figure 7 plots the number of remaining parameters (as obtained by the THOP PyTorch OpCounter) versus the pruning rate (as measured by the number of pruned filters). For the same pruning rate on the x-axis, a higher value on the y-axis implies that more parameters remain after pruning, and hence that more smaller filters than larger filters were pruned. We can see that LRP pruning results in almost a straight line, indicating roughly even pruning of smaller and larger filters during LRP pruning, and its slope and shape are relatively stable across different games.
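For instance, the parameter and operation counts can be obtained with THOP roughly as follows; the input shape shown (four stacked 84×84 Atari frames) and the model variable name are assumptions for illustration.

```python
import torch
from thop import profile  # THOP: PyTorch-OpCounter

# Profile the pruned policy network on a dummy observation batch.
dummy_obs = torch.randn(1, 4, 84, 84)  # 4 stacked 84x84 grayscale frames (typical Atari input)
macs, params = profile(pruned_policy_net, inputs=(dummy_obs,))
print(f"MACs per forward pass: {macs:.3e}, remaining parameters: {params:.3e}")
```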

Experiments with robust versus non-robust models
Well-known adversarial attacks against DNNs include algorithms such as the fast gradient sign method (FGSM) 33 and projected gradient descent (PGD). 34 Huang et al. 35 showed that deep RL agents are also vulnerable to adversarial perturbations, including perturbations of the agents' observations and actions, among others. Robust training methods can be used to train robust models that are less susceptible to adversarial attacks. It has been shown that robust models learn meaningful feature representations that align well with salient data characteristics, 36 hence they may also be more resilient to model pruning. In this section, we demonstrate this point experimentally by comparing RL agent performance at different pruning rates, starting either from a robust RL agent trained with robust DRL training or from a standard RL agent trained with standard DRL.
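As background for the attacks used in our evaluation, the sketch below shows an $L_\infty$ PGD-style perturbation of an agent's observation that pushes down the Q-value of the originally chosen action; it is an illustrative reimplementation under our own assumptions (step size, projection, pixel range), not the exact attack configuration used in the experiments.

```python
import torch

def pgd_observation_attack(q_net, obs, epsilon, alpha=0.01, steps=10):
    # Find the action the unperturbed agent would take.
    with torch.no_grad():
        a_orig = q_net(obs).argmax(dim=1)
    adv = obs.detach().clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Q-value of the original action under the current perturbation.
        q_val = q_net(adv).gather(1, a_orig.unsqueeze(1)).sum()
        grad = torch.autograd.grad(q_val, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                   # decrease the chosen action's Q-value
            adv = obs + (adv - obs).clamp(-epsilon, epsilon)  # project back into the epsilon ball
            adv = adv.clamp(0.0, 1.0)                         # keep observations in a valid pixel range
    return adv.detach()
```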
Robust training with RADIAL-RL

RADIAL-RL 26 trains robust RL agents by minimizing a combined loss function

$$L = \kappa\, L_{nom} + (1 - \kappa)\, L_{adv},$$

where $L_{nom}$ is the nominal loss function for DQN defined in Equation (3), $L_{adv}$ is the adversarial loss, and $\kappa$ is a hyperparameter with value between 0 and 1 that controls the trade-off between standard performance and robust performance. The authors propose two approaches to constructing $L_{adv}$: (1) construct a strict upper bound of the perturbed standard loss; or (2) design a regularizer that minimizes the overlap between the output bounds of actions with a large difference in outcome. For DQN, the second approach works better, so we briefly describe it here.
Suppose the Q-function is lower- and upper-bounded by $\underline{Q}(s, a; \theta)$ and $\overline{Q}(s, a; \theta)$ under $\epsilon$-bounded perturbations of the input $s$. The goal of robust training is to minimize the weighted overlap of the activation bounds for different actions. If there is no overlap, the original action's Q-value is guaranteed to be higher than the others even under perturbation, so the agent will not change its behavior under perturbation. However, not all overlap is equally important: if two actions have similar Q-values, then overlap may be acceptable, as taking a different but equally good action under perturbation is not a problem. Therefore, a weighting factor $Q_{diff}$ is added, which multiplies overlaps between actions with similar Q-values by a small number and overlaps between actions with very different Q-values by a large number. $L_{adv}$ is thus constructed to minimize the overlap between the output bounds of actions with large differences in their Q-values, as shown in Equations (11) and (12), where $a$ is the taken action, and the margin that the network is incentivized to maintain is $0.5\, Q_{diff}(s, y)$.

Experimental results with robust versus non-robust models

Our evaluation of model robustness is based on the PGD attack with different attack strengths $\epsilon = 0.3/255, 1/255, 3/255, 8/255$. In Figures 8 and 9, the horizontal axis is the pruning rate, and the vertical axis is the average reward of the RL agent, that is, its cumulative reward per episode averaged over 10 episodes. The suffix robust denotes that the agent is trained with the robust training algorithm RADIAL-RL with the loss function defined in Equation (11), and these results are drawn with solid lines; the suffix std denotes that the agent is trained with standard DQN, and these results are drawn with dotted lines.
From Figures 8 and 9, we can see that while the average reward generally decreases with increasing pruning rate for both robust and standard RL agents, robust RL agents generally achieve higher rewards than standard RL agents, as indicated by the solid lines lying mostly above the dotted lines. (Since RL algorithms typically exhibit significant variations in performance across runs due to the stochasticity of the environment, we use the term "generally" instead of "always.") In addition, LRP pruning generally leads to more stable RL agent performance compared to the other pruning methods. For example, comparing Figures 8A and 9A (no attack) with the other subfigures in Figures 8 and 9, we can see that the standard RL agent suffers significant performance degradation even at the low attack strength of $\epsilon = 0.3/255$, while the robust RL agent generally suffers only minor performance degradation at $\epsilon = 0.3/255$. For the game Bank Heist in Figure 8, with no attack and no pruning (pruning rate 0%), the standard RL agent achieves an average reward of about 700 (Figure 8A); with attack strength $\epsilon = 0.3/255$, its average reward drops significantly to less than 50 (Figure 8B). For the game Road Runner in Figure 9, with no attack and no pruning, the standard RL agent achieves an average reward of about 31,500 (Figure 9A); with attack strength $\epsilon = 0.3/255$, its average reward drops significantly to less than 14,000 (Figure 9B). In contrast, Figure 9B shows that the robust RL agent can achieve the same average reward even at pruning rates greater than 80% with most pruning methods (except the gradient-based method, which achieves a reward of 10,000). We conclude that robust RL agents trained by RADIAL-RL can achieve better performance (in terms of average reward) than standard RL agents for different attack strengths and pruning rates. This implies that a robust RL agent trained by RADIAL-RL is more resilient against model pruning and maintains good performance even after significant pruning.

CONCLUSIONS
In this article, we perform model compression of the policy network of a DQN agent for deployment on resource-constrained embedded systems by performing network pruning and policy distillation iteratively. We use LRP-based network pruning to obtain the student network S from the pre-trained teacher network T, gradually pruning away unimportant filters starting from T. We then fine-tune the student network with policy distillation from the teacher network. Performance evaluation based on several Atari games indicates that our proposed approach is effective in reducing the model size and inference time of DQN agents, facilitating their deployment on resource-constrained embedded systems. We also compared robust RL agents trained with RADIAL-RL against standard RL agents, and showed that a robust RL agent can achieve better performance (higher average reward) after pruning than a standard RL agent for different attack strengths and pruning rates.