Efficient field-programmable gate array-based reconfigurable accelerator for deep convolution neural network


Deep convolutional neural networks (DCNNs) have been widely applied in various modern artificial intelligence (AI) applications. DCNN inference is a computationally expensive process, usually requiring billions of multiply-accumulate operations. On mobile platforms such as embedded systems or robotics, an efficient implementation of DCNNs is significant. However, most previous field-programmable gate array-based works on accelerators for DCNNs support only one DCNN or only convolution layers. To address this limitation, this work proposes a reconfigurable accelerator. The accelerator is flexible and can support multiple DCNNs and different layer types, such as convolution, pooling, activation function, and full connection layers. It is equipped with a five-level pipeline convolution engine whose main components are two processing element arrays. Furthermore, a design space exploration method is proposed to take full advantage of the proposed accelerator. The accelerator is implemented on the ZYNQ-7 ZC706 evaluation board and achieves a high performance of 53.29 Giga operations per second (GOPS) on AlexNet and 45.09 GOPS on YOLOv2-tiny at 100 MHz. The accelerator's performance is further compared with previous works, and it achieves multiple advantages: high performance, high configurability, and efficient resource utilisation.
Introduction: Deep convolutional neural networks (DCNNs) have become one of the most popular approaches to many visual processing tasks in robotics, such as object detection, image classification, and scene understanding. DCNNs are usually implemented on platforms including graphics processing units (GPUs), field-programmable gate arrays (FPGAs) [1], and application-specific integrated circuits (ASICs) [2]. GPUs have high power consumption, and ASICs require significant fabrication cost. FPGAs offer high configurability, low power, and a reasonable price, which makes them attractive for DCNN implementation in robotics. Besides DCNNs, neural networks using a non-iterative training mechanism [3, 4] can also be accelerated by FPGAs.
Numerous FPGA-based works have been proposed for accelerating DCNNs. However, most works support only convolution layers and do not support pooling, activation function, or full connection layers, such as the designs in [1, 5-7]. Furthermore, among these works, the designs in [1, 5, 6] support only one DCNN. The design in [5] supports AlexNet and achieves a performance of 38.4 GOPS. The design in [6] supports YOLOv2-tiny and achieves a low performance of 21.6 GOPS. NullHop [8] can perform the operations of different layers; however, it supports only limited kernel sizes (1 × 1, 3 × 3, 5 × 5, 7 × 7) and one kernel stride (S = 1), and it has a low performance of 17.196 GOPS on an FPGA platform. There are some exceptions, such as the designs in [9, 10], which support multiple complete DCNNs.
With the ongoing advancements in visual processing applications in robotics, a wide variety of DCNNs have appeared. Therefore, designing a flexible accelerator supporting different DCNNs is essential. This work emphasises the applicability and configurability of the accelerator for DCNNs, as well as its performance. A five-level pipeline convolution engine (ConvEngine) and a configurable processing element (PE) are designed to support different DCNNs and layer types. To achieve better performance, a design space exploration method is proposed to search for the optimal design corner. To show the effectiveness of the proposed accelerator, it is compared with recent FPGA-based accelerators [5-8]. The experimental results show that our accelerator outperforms other state-of-the-art works in the metrics of configurability and performance.
Fundamental theory: Figure 1 shows YOLOv2-tiny, which consists of convolution, pooling, activation function, and batch normalisation layers.

Convolution and full connection layers
Convolution is used to extract different features from the image. It involves billions of multiplications and additions between the filters and the input feature maps. For output channel $m$, input channel $n$, kernel size $K$, and stride $S$, the operation can be described as

$$\mathrm{Out}[m][r][c] = \sum_{n=0}^{N-1}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1}\mathrm{In}[n][r \cdot S + i][c \cdot S + j] \times W[m][n][i][j] \qquad (1)$$

It is notable that the full connection layer can also be described by formula (1).
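The loop nest of formula (1) can be sketched as follows. This is an illustrative software model only (the accelerator implements it in hardware); the function name and the absence of padding are assumptions for the sketch.

```python
import numpy as np

def conv_layer(inp, weights, stride=1):
    """Plain convolution as in formula (1): for every output channel m and
    output pixel (r, c), accumulate products over input channels n and the
    K x K kernel window. Variable names follow the formula; no padding."""
    N, H, W = inp.shape            # input channels, height, width
    M, _, K, _ = weights.shape     # output channels, kernel size
    R = (H - K) // stride + 1      # output height
    C = (W - K) // stride + 1      # output width
    out = np.zeros((M, R, C))
    for m in range(M):
        for r in range(R):
            for c in range(C):
                for n in range(N):
                    for i in range(K):
                        for j in range(K):
                            out[m, r, c] += (inp[n, r * stride + i, c * stride + j]
                                             * weights[m, n, i, j])
    return out
```

The full connection layer is the special case in which the kernel covers the whole input feature map, so R = C = 1 and each output channel is a single value.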

Batch normalisation layer
Batch normalisation often follows the convolution layer to provide the next layer with inputs that have zero mean and unit variance. It is illustrated as follows:

$$y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \xi}} + \beta$$

where $\mu$ stands for the mean, $\sigma^{2}$ for the variance, $\xi$ is a small constant, and $\gamma$ and $\beta$ are learned parameters. Since the scale and bias terms can be precomputed, the layer can be implemented with one multiplication and one addition.
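The reduction to one multiply and one add can be sketched by folding the batch-normalisation parameters into a single scale and bias, which matches the PE's multiply-add data path. The function names and the default value of the constant are assumptions for illustration.

```python
import numpy as np

def fold_batchnorm(gamma, beta, mean, var, xi=1e-5):
    """Fold batch-normalisation parameters into one scale and one bias,
    so the layer costs a single multiply-add per pixel. xi is the small
    constant added for numerical stability."""
    scale = gamma / np.sqrt(var + xi)
    bias = beta - mean * scale
    return scale, bias

def batchnorm(x, gamma, beta, mean, var, xi=1e-5):
    scale, bias = fold_batchnorm(gamma, beta, mean, var, xi)
    return x * scale + bias   # one multiplication and one addition
```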

Activation function layer
The activation function transforms the input before the pooling layer. On hardware, the sigmoid activation function is often implemented by a piecewise linear function, which can be realised with multiplication and addition. The rectified linear unit (ReLU) and LeakyReLU activation functions can also be realised with multiplication and addition. They are illustrated as follows:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$
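The three activation functions can be sketched as below. The three-segment sigmoid approximation and the leak slope of 0.1 are illustrative assumptions, not the exact parameters used on the chip.

```python
def relu(x):
    # Sign check only: the PE's multiplier can be bypassed for ReLU.
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    # One multiplication when x is negative; alpha is the leak slope
    # (0.1 is a common choice, assumed here).
    return x if x > 0 else alpha * x

def sigmoid_pwl(x):
    """Piecewise linear approximation of the sigmoid (an illustrative
    three-segment scheme): each segment costs one multiply and one add."""
    if x <= -4.0:
        return 0.0
    if x >= 4.0:
        return 1.0
    return 0.125 * x + 0.5  # central linear segment
```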

Pooling layer
Pooling is a form of dimensionality reduction in DCNNs: redundant information is discarded so that the critical information is preserved. Max and mean pooling over a window $\mathcal{R}$ are given as follows:

$$\text{max pooling: } y = \max_{(i,j)\in\mathcal{R}} x_{i,j}, \qquad \text{mean pooling: } y = \frac{1}{|\mathcal{R}|}\sum_{(i,j)\in\mathcal{R}} x_{i,j}$$

where max pooling can be completed by subtraction and comparison of the sign bit, and mean pooling by addition and shifting.
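The two hardware tricks mentioned above can be sketched as follows: the maximum is selected by subtracting and testing the sign, and the mean over a 2 × 2 window is a sum followed by a right shift. Function names and integer pixels are assumptions for the sketch.

```python
def max_pool_2x2(window):
    """2x2 max pooling via pairwise subtraction and sign tests, mirroring
    how an adder can select the maximum without a dedicated comparator."""
    best = window[0]
    for v in window[1:]:
        if v - best > 0:   # subtraction, then sign-bit test
            best = v
    return best

def mean_pool_2x2(window):
    """2x2 mean pooling: sum the four pixels, then divide by 4 with a
    right shift (valid for the integer pixel values used in hardware)."""
    return sum(window) >> 2
```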
Reconfigurable accelerator: As described above, all layers can be implemented by combinations of multiplications and additions, so a configurable PE can realise all of these computing functions. In this work, we propose a reconfigurable accelerator based on customised PEs for DCNNs, as shown in Figure 2.

ConvEngine: ConvEngine is a five-level structure, as shown in Figure 3. The first level is two input register arrays (IREG), which exchange pixels with InBuf in ping-pong mode. The second level is the input share register array (ISREG), which broadcasts pixels to improve PE utilisation for different kernel strides. The third level contains a weight register array (WREG) and a partial sum register array (PREG). ISREG and WREG are used to improve PE utilisation and to reduce the critical path delay. The fourth level is the PE array, which efficiently supports the different computations of the various layer types. The fifth level is the output register array (OREG). ConvEngine contains two CEs, and each CE has an individual WREG, PREG, PE array, and OREG. The two CEs share the same IREG and ISREG: they use the same input with different weights and partial sums (psums) of different channels, and then generate outputs of different channels. In this way, the hardware resources for IREG and ISREG are reduced.
Processing element: As described earlier, convolution, full connection, batch normalisation, activation function, and pooling can all be realised with multiplications and/or additions. Therefore, a flexible PE is proposed in this work, as shown in Figures 4-6. It contains a multiplier, an adder, a register, and several multiplexers. The multiplier and adder are reused in almost every operation, except for pooling and ReLU, so the utilisation of the PE across different operations is improved as much as possible. Figures 4-6 show the configured data paths of the major operations, where red and pink lines are data paths. Since the convolution layer adopts a tiling strategy, convolution operations have two kinds of data paths: the psum may be stored in the psum buffer between different tiling convolutions (Figure 4(a)), or in the PE's register during the same tiling convolution (Figure 4(b)). Since the full connection layer can be described by formula (1), it reuses the convolution data paths.

Design space exploration: The design space exploration is used to find the optimal tiling parameter combination {Tr, Tc, Tm, Tn}. For a reconfigurable architecture, different tiling parameter combinations result in performances that can vary considerably, so finding the optimal tiling parameters is significantly important. The tiling strategy adopted in this work is shown in Figure 7. We modify the roofline model to fit our accelerator, illustrated as follows:

$$\text{Attainable Perf.} = \min(\text{Peak Perf.},\ \text{OptInt} \times BW)$$

where Peak Perf. stands for the highest performance supported by the PEs, OptInt × BW represents the highest performance supported by off-chip memory bandwidth, and OptInt is the operational intensity. The maximum attainable performance is the smaller of these two terms.
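A roofline-guided search over the tiling parameters can be sketched as below. The cost model (compute per tile, and traffic as input tile + weight tile + output tile words) is a simplified assumption for illustration, not the paper's exact model; the function names and the exhaustive search strategy are likewise assumptions.

```python
from itertools import product

def attainable_perf(tr, tc, tm, tn, peak_macs_per_cycle, bw_words_per_cycle,
                    K=3, S=1):
    """Roofline estimate for one tiling choice, following
    Attainable Perf. = min(Peak Perf., OptInt x BW).
    Simplified cost model: 2*Tm*Tn*Tr*Tc*K*K ops per tile, and off-chip
    traffic equal to the input, weight, and output tile sizes in words."""
    ops = 2 * tm * tn * tr * tc * K * K                     # each MAC = 2 ops
    traffic = (tn * (S * tr + K - 1) * (S * tc + K - 1)     # input tile
               + tm * tn * K * K                            # weight tile
               + tm * tr * tc)                              # output tile
    opt_int = ops / traffic                                 # ops per word
    peak = 2 * peak_macs_per_cycle                          # Peak Perf.
    return min(peak, opt_int * bw_words_per_cycle)

def explore(limits, peak_macs_per_cycle, bw_words_per_cycle):
    """Exhaustively search {Tr, Tc, Tm, Tn} for the best roofline corner."""
    best, best_perf = None, -1.0
    for tr, tc, tm, tn in product(*(range(1, l + 1) for l in limits)):
        perf = attainable_perf(tr, tc, tm, tn,
                               peak_macs_per_cycle, bw_words_per_cycle)
        if perf > best_perf:
            best, best_perf = (tr, tc, tm, tn), perf
    return best, best_perf
```

A real exploration would additionally constrain {Tr, Tc, Tm, Tn} by the on-chip buffer and PE-array sizes; the search here only illustrates the min(compute roof, bandwidth roof) selection.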

Experiment: The accelerator is written in Verilog HDL and implemented on the Xilinx Zynq XC7Z045 device. The design space exploration is simulated in MATLAB to obtain the optimal design corner for each layer. Then, according to these design corners, Mentor ModelSim is used to simulate the actual performance of each layer. Table 1 summarises the implementation results of our accelerator, and Table 2 compares it with other state-of-the-art accelerators. Among the designs supporting only one DCNN or only convolution layers, our accelerator has the greatest configurability, supporting different kernels (sizes from 1 × 1 to 11 × 11, strides 1/2/4), pooling operations (2 × 2 max/mean pooling, 3 × 3 max pooling, strides 1/2), activation function operations (ReLU, LeakyReLU, sigmoid), the full connection operation, and the batch normalisation operation. Moreover, our accelerator achieves a good performance of 57.02 GOPS on ResNet-34 with 392 digital signal processing (DSP) slices, 175.9k LUTs, and 190.5 BRAMs. The performance of our accelerator is higher than that of the designs in [5, 6, 8].
There are some architectures adopting quantisation and approximation techniques to accelerate DCNNs, such as the designs in [2, 11,