FPGA‐accelerated deep convolutional neural networks for high throughput and energy efficiency

Recent breakthroughs in deep convolutional neural networks (CNNs) have led to great improvements in the accuracy of both vision and auditory systems. Characterized by their deep structures and large numbers of parameters, deep CNNs challenge the computational performance of today's systems. Hardware specialization in the form of field-programmable gate arrays (FPGAs) offers a promising path towards major leaps in computational performance while achieving high energy efficiency.


INTRODUCTION
Emerging applications such as micro-UAVs (unmanned aerial vehicles), domestic robots, and internet media data analysis need fast computing systems to perceive the real world. Deep neural networks achieve state-of-the-art perception in both vision and auditory systems [1,2]. Meanwhile, characterized by their deep structures and large numbers of parameters, deep neural networks challenge today's computational performance. Take the well-known AlexNet model proposed in [1] as an example: one forward propagation alone needs 1.36 × 10^9 operations (usually 32-bit single-precision floating-point operations) with 60 million parameters. A typical vision perception application, object detection, which applies the model repeatedly, usually requires a computation capability of 10^9 to 10^12 operations per second [3], a demand that general-purpose accelerators [6,7] (such as GPGPUs [8]) can hardly satisfy. Specialized accelerators have thus become a research hot spot.
Specialized accelerators in the form of Field-Programmable Gate Arrays (FPGAs) offer a promising path towards major leaps in performance and energy efficiency. In this paper, we focus on accelerating the forward propagation of deep Convolutional Neural Networks (CNNs) using an FPGA-based accelerator.
We adopt a matrix multiplier-based accelerator architecture, which has two advantages: flexibility and consistency. The structures of different CNNs may differ greatly, and CNN accelerators built for fixed network structures [4,9] are inflexible; a common design that can efficiently handle variable network structures is needed. Recently, the authors of [10] discussed the design space of a CNN accelerator, but their approach still needs to adjust the accelerator structure to a fixed network. Converting convolutions to matrix multiplications (we will use 'unrolling' to describe it) is a good choice for handling a broader spectrum of network structures [11,12]. With respect to consistency, traditional works [4,9,10] design different accelerating units for different parts of CNNs. As most of the computational workload of a CNN exists in the convolutional layers and full connection layers, the matrix multiplier-based accelerator can handle both kinds of layers. Compared with traditional architectures, the on-chip resources that would be used to build a separate accelerating module for the full connection layers are saved.
It is common practice to develop CNN applications first with an established software framework running on CPUs or GPUs (such as Caffe [13] or Torch [14]). Most existing CNN accelerators use custom frameworks for simplicity, which makes it difficult to take advantage of the rich achievements based on widely used frameworks. We assert that it is important to make the accelerator support representative established frameworks. In this paper, we choose Caffe, a popular high-performance CNN framework implementation.
In the process of designing our accelerator, we faced several challenges: (1) The overhead of unrolling the convolutions to matrix multiplications cannot be ignored. If we only use a matrix multiplier to accelerate the matrix multiplication part and still leave the unrolling part to the host processor, the upper bound of speed-up is limited according to Amdahl's law. (2) Non-sequential external memory access brings an obvious memory bandwidth slow-down that greatly limits the accelerator's performance. (3) The unrolled matrix multiplications in different CNN structures have different sizes. It is challenging to efficiently map these matrix multiplications of different sizes to a fixed-structure accelerator. (4) The Caffe framework relies on Linux, which has its own memory management mechanism. The FPGA accelerator accesses Dynamic Random Access Memory (DRAM) through the physical address space, but the Caffe framework accesses DRAM through the operating system's user space. The data communication between the different spaces may cause extra overhead.
To overcome these challenges, this paper makes several novel contributions. A stream-mapper unit is designed to handle the convolution unrolling task; because convolution unrolling and matrix multiplication can be overlapped, the time overhead of unrolling is eliminated. We propose a prefetch strategy and implement a prefetch unit to make the address stream to the external memory sequential. As memory accesses with sequential addresses benefit from the burst characteristic of the memory port, the memory bandwidth can be fully used. We optimize the blocking strategy to make matrix multiplications of different sizes perform efficiently.
We propose a novel memory management scheme for our FPGA accelerator to efficiently share the data with the host processor.
We implement an FPGA-extended version of Caffe based on two Xilinx Zynq-zq7045 FPGA SoC chips, using 1600 DSP48Es (the basic computational resource of the Xilinx FPGA) at a 150 MHz clock rate. A performance of 77.8 Gflops is achieved. Compared with an Intel Xeon X5675 (3.4 GHz, 6-core) processor, a 3.54× speed-up is achieved. The energy efficiency is also better than that of an Nvidia K20 GPGPU by a factor of 4.7.
The rest of this paper is organized as follows. Section 2 presents brief background knowledge. Section 3 describes our design and implementation details. Section 4 shows the experimental results and makes a simple comparison between our implementation and existing works. Section 5 reports related work. Section 6 concludes the paper.

Overview of the convolutional neural network algorithm
Convolutional neural network is a trainable architecture inspired by research in neuroscience [15]. It has two computational paths: a forward propagation path for classification and a backward propagation path for training. In practice, many applications first train the CNN off-line using high-performance clusters to reduce the training time and then run the trained CNN on site using dedicated devices. In this paper, we only focus on speeding up the forward path. A typical CNN structure consists of a feature extractor and a classifier. The feature extractor extracts an input image's features and sends them to the classifier. According to these features, the classifier decides the category that the input image belongs to. A feature extractor consists of several similar stages. The input and output of a stage are called feature maps. The output feature maps of a stage are the input of the next stage, and the input image is the input to the first stage. Each stage consists of three layers: a convolutional layer, a non-linearity layer, and a sub-sample layer. The output feature maps of the last stage are organized as a feature vector of the original input image and sent to the classifier. A classifier is a traditional MLP (multi-layer perceptron) composed of several full connection layers. It takes the feature vector as input and calculates the probability of each category that the input image may belong to. At last, the classifier chooses the category with the highest probability as the output. Figure 1 shows the structure of a representative deep CNN, AlexNet, which won the ImageNet 2012 contest [1].
Most of the workload of a CNN exists in the convolutional layers. Figure 2 shows the computational procedure of a convolutional layer. A convolutional layer has Q input feature maps X_0 ... X_{Q-1} and R output feature maps Y_0 ... Y_{R-1}. To produce an output feature map Y_r, all input feature maps X_0 ... X_{Q-1} are first individually convolved with convolutional kernels K_{r,0} ... K_{r,Q-1}. Then, all Q convolved maps are summed into one map. At last, a bias value is added to each pixel of the map to get Y_r. Q × R convolutions are performed per convolutional layer. Equations (1) and (2) give the mathematical form of the procedure. Here, conv<X_q, K_{r,q}> means the convolution between input feature map X_q and convolutional kernel K_{r,q}. In (2), Ksize is the size of the convolutional kernel, and stride means the distance that the convolutional window slides each time.
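Since the equations did not survive extraction in this copy, the procedure can be restated explicitly; the following is a reconstruction from the surrounding prose (indexing conventions are assumed), not the paper's original typesetting:

```latex
% Output map r: bias plus the sum of per-input-map convolutions (Eq. (1))
Y_r = b_r + \sum_{q=0}^{Q-1} \mathrm{conv}\langle X_q, K_{r,q} \rangle

% One convolution, with the window sliding by `stride' (Eq. (2))
\mathrm{conv}\langle X_q, K_{r,q} \rangle[i][j] =
  \sum_{x=0}^{Ksize-1} \sum_{y=0}^{Ksize-1}
    X_q[i \cdot stride + x][j \cdot stride + y] \, K_{r,q}[x][y]
```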

Matrix multiplication accelerators
Traditional matrix multiplication accelerators [16] adopt the architecture with a linear array of processing elements (PEs). Such a structure divides a matrix multiplication task into several sub-block computational tasks.
The matrix blocking method is described by Equations (3) and (4). To get the product matrix C (M × N) of matrix A (M × K) and matrix B (K × N), we take the following steps: first, divide C into m rows and n columns of blocks, as shown in Equation (3), and then compute each C_{i,j} (i = 0, 1, ..., m-1; j = 0, 1, ..., n-1) individually, as Equation (4) describes. A_{i,k} (k = 0, 1, ..., K-1) are column vectors, and B_{j,k} (k = 0, 1, ..., K-1) are row vectors. The algorithm finishes after all the C_{i,j} blocks are computed.

    C = [ C_{0,0}   ...  C_{0,n-1}
          ...       ...  ...
          C_{m-1,0} ...  C_{m-1,n-1} ]                        (3)

    C_{i,j} = [A_{i,0}, A_{i,1}, ..., A_{i,K-1}] [B_{j,0}; B_{j,1}; ...; B_{j,K-1}]
            = sum_{k=0}^{K-1} A_{i,k} B_{j,k}                 (4)

A classic matrix multiplier structure [16] is organized as Figure 3 shows. Each time, the matrix multiplier works as described by Equation (4): it gets column vectors A_{i,k} and row vectors B_{j,k} (k = 0, 1, ..., K-1) to compute a C_{i,j} block. The matrix multiplier consists of a PE chain through which A_{i,k} and B_{j,k} are passed. Each PE holds one element of A_{i,k} at a time, multiplies it with all the elements of B_{j,k} to get a row vector of the intermediate result C^k_{i,j}, and accumulates it. FIFO_A and FIFO_B are used to pass the elements of A_{i,k} and B_{j,k} to the next PE. The local ports allow local registers RA (working as a double buffer; selector MUX1 chooses the correct data to use) and RB to get the required elements. A multiplier and an accumulator form a MAC (multiply-accumulate) unit. The multiplier receives its input elements from RA and RB. The result of the multiplier is sent to the accumulator to be added to the intermediate result from the local memory unit MEM_C, and the sum is written back to MEM_C. The address of MEM_C is generated by a local state machine. The final result is written to FIFO_C to be passed to the external memory.
The writing back order is determined by a selector MUX2, and the local result must be written back before FIFO_C receives the result from the next PE. For one PE chain, only one element from matrix A and one element from matrix B need to be loaded in one clock cycle. The time of writing back usually can be ignored when K is far larger than the block size.
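The blocked dataflow of Equations (3) and (4) can be modeled in a few lines of software. The following sketch (pure Python; names such as `block_product` are ours, not the paper's) accumulates each C_{i,j} as a sum of outer products, mirroring what the PE chain does:

```python
def block_product(A, B, i0, j0, si, sj, K):
    """One C_{i,j} block as in Eq. (4): a sum over k of the outer product
    of column vector A_{i,k} and row vector B_{j,k}."""
    C = [[0.0] * sj for _ in range(si)]
    for k in range(K):
        col = [A[i0 + r][k] for r in range(si)]   # A_{i,k}
        row = [B[k][j0 + c] for c in range(sj)]   # B_{j,k}
        for r in range(si):        # each PE holds one element of A_{i,k} ...
            for c in range(sj):    # ... and multiplies it with all of B_{j,k}
                C[r][c] += col[r] * row[c]
    return C

def matmul_blocked(A, B, si, sj):
    """Full product C = A * B via the Eq. (3)/(4) blocking."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, si):
        for j0 in range(0, N, sj):
            blk = block_product(A, B, i0, j0,
                                min(si, M - i0), min(sj, N - j0), K)
            for r, rowv in enumerate(blk):
                for c, v in enumerate(rowv):
                    C[i0 + r][j0 + c] = v
    return C
```

The inner loop order matches the hardware description: one element of A is held while a whole row vector of B streams past it.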

DESIGN AND ARCHITECTURE
The top view of our SoC design with a CNN accelerator is given in Figure 4. The host processor communicates with the accelerator through a system bus (an AXI bus for a typical ARM SoC or a Wishbone bus for an OpenRISC SoC) and handles all workload except the convolutional layers and full connection layers. The accelerator and the host processor share the external memory (the host processor's main DRAM). The accelerator consists of several chains. Each chain has a stream-prefetcher, a stream-mapper, and a matrix multiplier. The matrix multiplier accelerates the matrix multiplication workload of the convolutional layers and the full connection layers. It consists of a stream S/L (Store/Load) unit and hundreds of PEs (processing elements). The stream S/L unit loads operands into the PE chain (the computational unit of the matrix multiplier) and then stores the results. The stream-mapper remaps the data stream to the stream S/L unit to unroll convolutions into matrix multiplications. The stream-prefetcher ensures efficient external memory access.

Using stream-mapper to unroll convolutions to matrix multiplications
To utilize the efficient matrix multiplier structure, we need to unroll the convolutions into matrix multiplications first. Figure 5 shows a simple example of unrolling the convolutions of one convolutional layer into a matrix multiplication. (A) is the normal computational procedure of the convolutional layer as the last section introduced. The input feature maps of this layer are X_0, X_1, X_2, and the output feature maps are Y_0, Y_1. There are R × Q = 6 convolutional kernels in this layer. (B) shows how to unroll the convolutions in this layer into a matrix multiplication. Input feature maps, convolutional kernels, and output feature maps are organized as Inmap_matrix, Kmatrix, and Outmap_matrix, respectively. Because a two-dimensional matrix is stored as an array in memory, both Kmatrix and Outmap_matrix keep their data forms unchanged; only Inmap_matrix needs to be reorganized. There are four 2 × 2 convolutional windows in every input feature map. Each of them is organized as a length-four column vector, and the four vectors are combined into a 4 × 4 matrix block. For the three input feature maps, we obtain three matrix blocks, and then we pile the three blocks up to obtain the 12 × 4 Inmap_matrix. After multiplying Kmatrix with Inmap_matrix, we obtain the Outmap_matrix, and the result is the same as procedure (A).
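The unrolling in Figure 5 is the familiar im2col transformation. A minimal sketch, assuming the ordering described above (per-map blocks piled vertically, one column per window); the function names are illustrative:

```python
def unroll_inputs(maps, ksize, stride):
    """Build Inmap_matrix: one column per convolutional window, with the
    per-map blocks piled up vertically (Q * ksize^2 rows in total)."""
    img = len(maps[0])
    win = (img - ksize) // stride + 1          # windows per dimension
    rows = []
    for m in maps:                             # one block of rows per map
        for ky in range(ksize):
            for kx in range(ksize):
                rows.append([m[wy * stride + ky][wx * stride + kx]
                             for wy in range(win) for wx in range(win)])
    return rows

def unroll_kernels(kernels):
    """Build Kmatrix: row r holds all Q kernels of output map r, flattened
    in the same (map, ky, kx) order as Inmap_matrix's rows."""
    return [[k[ky][kx]
             for k in kset
             for ky in range(len(k)) for kx in range(len(k))]
            for kset in kernels]
```

Multiplying the Kmatrix produced by `unroll_kernels` with this Inmap_matrix yields the same values as direct convolution.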
Using software to rearrange the data first is time-consuming and requires extra memory space. We run Caffe with AlexNet on a CPU. Figure 6 shows the unrolling overhead of its five convolutional layers. About 10% of the total time is consumed on average. If we only use a matrix multiplier to accelerate the computational part and still leave the data rearrangement part on the host processor, according to Amdahl's law, the upper bound of speed-up is only 10×.
A hardware stream-mapper is designed to eliminate the overhead of unrolling convolutions to matrix multiplications. The operation (Kmatrix × Inmap_matrix = Outmap_matrix) is executed on the matrix multiplier. The matrix multiplier is driven by the data stream from the stream S/L unit, which loads elements of Kmatrix and Inmap_matrix to the PE chain and stores the result. A stream-mapper is placed between the stream S/L and the system bus. All data accesses to the elements of Inmap_matrix are directly mapped to the elements of the input feature maps. Thus, Inmap_matrix does not occupy any memory space; the data are only stored in the form of input feature maps.
The mapping algorithm is given in Figure 7. The input is the location of the element Inmap_matrix[Bx,By] that the stream S/L attempts to access, and the output is the address of Inmap_matrix[Bx,By] in the input feature maps, which are stored in the external memory. All the parameters needed to compute the address are set by the host processor. Ksize denotes the convolutional kernel size, and win_num is the number of convolutional windows in one dimension of an input feature map. The image_size parameter is the number of pixels in one dimension of an input feature map, and img_addr is the address of the first input feature map.
To get the address, the algorithm first calculates the element's offset in a convolutional window (ofs_cwin_x, ofs_cwin_y). Then, it calculates the position of the convolutional window in an input feature map (cwin_x, cwin_y). Thereafter, it can calculate the pixel (the required element) offset in the feature map (ofs_pix), followed by calculating which input feature map the element belongs to (im_num) and that feature map's offset (ofs_im). Because the first feature map's address (img_addr) is known, the mapper gets the address of the element by adding all the offsets to img_addr.
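The mapping in Figure 7 reduces to pure address arithmetic. The sketch below is a software model; the toy parameters and the row/column convention (rows indexed by kernel position and map, columns by window) are our assumptions for illustration:

```python
KSIZE, STRIDE, IMAGE_SIZE = 2, 1, 3            # toy layer parameters
WIN_NUM = (IMAGE_SIZE - KSIZE) // STRIDE + 1   # windows per dimension
IMG_ADDR = 0                                   # address of the first map

def map_address(bx, by):
    """Address of Inmap_matrix[bx][by] inside the stored feature maps."""
    # offset of the element inside its convolutional window
    ofs_cwin_y, ofs_cwin_x = divmod(bx % (KSIZE * KSIZE), KSIZE)
    # position of the convolutional window in the feature map
    cwin_y, cwin_x = divmod(by, WIN_NUM)
    # pixel offset inside the feature map
    ofs_pix = ((cwin_y * STRIDE + ofs_cwin_y) * IMAGE_SIZE
               + cwin_x * STRIDE + ofs_cwin_x)
    im_num = bx // (KSIZE * KSIZE)             # which input map
    ofs_im = im_num * IMAGE_SIZE * IMAGE_SIZE  # that map's base offset
    return IMG_ADDR + ofs_im + ofs_pix
```

In hardware, the same arithmetic is spread over the 35-stage pipeline described next; here it is one function call per element.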
The structure of the stream-mapper is shown in Figure 8. It is a logic unit that implements the mapping algorithm. The stream-mapper receives one element location from the stream S/L and generates one address every clock cycle; the addresses are generated as a stream. It is too time-consuming to map an address in a single clock cycle, so a 35-level pipeline is used. Three divisions are the most time-consuming parts of the mapper, and each of them needs 16 cycles to obtain one result. Although we overlap two of the dividers, 32 cycles are still needed. Some stack registers for synchronization are not shown in the figure for simplicity; most of them are located where the blue dashed lines are.

Equations (5)-(13) quantify the memory access counts of software and our hardware data rearrangement, respectively. Software rearrangement needs two memory areas: one for the input feature maps, whose size is Size_inmap, and the other for the Inmap_matrix, whose size is Size_inmap_matrix. The host processor first reads the elements from the input feature maps and then writes them to the Inmap_matrix. After the data are organized in matrix form, the host processor starts the multiplier. The multiplier accesses Inmap_matrix α times and Kmatrix β times; the values of α and β are determined by the matrix size and the matrix blocking method. The hardware implementation only needs the memory area for the input feature maps: it maps all accesses to the Inmap_matrix directly onto the input feature maps. Compared with a software implementation, a memory space of Size_inmap_matrix and Size_inmap + Size_inmap_matrix memory accesses are eliminated, and the data rearrangement and the matrix multiplication can be overlapped completely.
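As a back-of-the-envelope check, the sizes involved can be computed directly; the formulas below restate Size_inmap and Size_inmap_matrix from the prose, and the function name is ours:

```python
def unrolling_savings(Q, img_size, ksize, stride):
    """Extra memory space and traffic of *software* rearrangement:
    the host reads every input-map element once (Size_inmap reads) and
    writes every Inmap_matrix element once (Size_inmap_matrix writes);
    the hardware mapper eliminates both."""
    win = (img_size - ksize) // stride + 1
    size_inmap = Q * img_size * img_size
    size_inmap_matrix = Q * ksize * ksize * win * win
    extra_space = size_inmap_matrix
    extra_accesses = size_inmap + size_inmap_matrix
    return extra_space, extra_accesses
```

For the 12 × 4 example of Figure 5 (Q = 3, 3 × 3 maps, 2 × 2 kernels, stride 1), the rearrangement pass alone would cost 48 extra words of space and 75 extra memory accesses.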

A prefetch strategy to make the address stream sequential
The address stream from the stream-mapper to external memory is not sequential, which prevents use of the incremental-addressing burst characteristic of the memory port (a burst memory transaction processes multiple memory accesses in a pipelined way, making full use of the memory bandwidth). This significantly reduces memory bandwidth utilization. Figure 9 shows a simple memory test on an FPGA platform with DDR3 memory. The physical memory interface is 32 bits wide at 1066 MHz, and the frequency of memory requests is 150 MHz. The horizontal axis presents the communication mode: s means simplex, d means duplex, nb means a non-sequential address stream without burst transmission, and b means a sequential address stream with burst transmission. The vertical axis presents the memory accesses per cycle. Obviously, memory access with the burst characteristic performs much better than non-burst access.
To keep the address stream sequential, we prefetch the data in a sequential way. A novel prefetch strategy for CNNs is shown in Figure 10, taking the situation where Ksize = 3, Img_size = 8, and stride = 1 as an example. There are two input feature maps in this example, and the length of the PE chain is 8. The original memory access order is given in Figure 10(a). Each time, the stream S/L loads PE_NUM elements from the Inmap_matrix as a row vector. Although the location stream of the Inmap_matrix elements is sequential, the mapped addresses from the stream-mapper are not. We divide the Inmap_matrix into two parts as shown in Figure 10(b): the upper part consists of the elements from X0, and the lower part consists of the elements from X1. Back in Figure 10(a), the stream of row vectors is divided into blocks from different input feature maps. From Figure 10(c), we find that, according to the mapping algorithm, all the elements of a block are located in a sequential area of one input feature map. We prefetch these sequential areas to an on-chip prefetch buffer before the matrix multiplier accesses them. Figure 10(b) shows the first four blocks' prefetch areas. Two prefetch parameters need to be determined: the start address and the prefetch length. The start address is the location of the first element of a block. It is time-consuming to compute the ideal prefetch length because of the variable network structure, so we use the approximate value shown in Equation (14), which is a little bigger than the ideal one. In most cases, the prefetch area is smaller than the bound PE_NUM × Ksize², especially when stride = 1.
A prefetch unit has been designed to overlap prefetching and computing. We use the double-buffering technique: while the stream S/L unit is loading a block from one prefetch buffer, the next prefetch area is being loaded into the other prefetch buffer. The address stream becomes sequential, and the number of memory accesses decreases after the prefetch unit is integrated.
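The block-to-region computation can be sketched as follows, using the Figure 10 parameters (Ksize = 3, Img_size = 8, stride = 1, a PE chain of 8). This software model computes the exact covering region by scanning all the block's mapped addresses, whereas the hardware uses the cheaper approximation of Equation (14); the names are illustrative:

```python
KSIZE, STRIDE, IMAGE_SIZE, PE_NUM = 3, 1, 8, 8   # the Figure 10 example
WIN_NUM = (IMAGE_SIZE - KSIZE) // STRIDE + 1     # windows per dimension

def pixel_addr(row, col):
    """Stream-mapper arithmetic: address of Inmap_matrix[row][col]."""
    oy, ox = divmod(row % (KSIZE * KSIZE), KSIZE)  # offset inside window
    wy, wx = divmod(col, WIN_NUM)                  # window position
    im = row // (KSIZE * KSIZE)                    # which input map
    return (im * IMAGE_SIZE * IMAGE_SIZE
            + (wy * STRIDE + oy) * IMAGE_SIZE + wx * STRIDE + ox)

def prefetch_region(im_num, col0):
    """Start address and length of the sequential area covering one block:
    all Ksize^2 rows of map `im_num` for PE_NUM consecutive columns."""
    addrs = [pixel_addr(im_num * KSIZE * KSIZE + r, c)
             for r in range(KSIZE * KSIZE)
             for c in range(col0, min(col0 + PE_NUM, WIN_NUM * WIN_NUM))]
    start = min(addrs)
    return start, max(addrs) - start + 1
```

The start address coincides with the block's first element, and the region length stays below the PE_NUM × Ksize² bound mentioned above.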
Figure 10. The procedure of prefetching.

Accelerating matrix multiplications of different sizes
Most of the computational workload can now be formulated as matrix multiplications. The matrices unrolled from different convolutional layers may differ greatly in size, and mapping such differently sized matrix multiplications efficiently to our accelerator poses two challenges. The first is that the scale of a matrix may be too small to be allocated to multiple chains, or even to the PEs of one chain, which limits the parallel potential of the network structure. Although processing multiple data at the same time may help, this paper does not focus on this approach. The other problem is that some dimensions of the matrices are too small to obtain appropriate sub-blocks that make full use of the PEs of one chain. As shown in Figure 11, in general, matrix A can be divided into a part A1, whose row count S_i is a multiple of the number of PEs in a chain, and a leftover part A2 of S'_i rows, and likewise B into B3 and B4, so the computation divides into four parts A1·B3, A1·B4, A2·B3, and A2·B4 (M > S_i > S'_i; N > S_j > S'_j). According to the matrix multiplier mentioned in Section 2, when the size of a sub-block equals the number of PEs in a chain (S_i), all PEs are used; when S'_i < S_i, only S'_i PEs can be used. The computation of the leftover parts may affect the computational efficiency of the whole matrix. If M is large enough, the effect can be ignored, but for many unrolled matrices in a CNN, the efficiency drop is obvious.
The blocking strategy can be improved to alleviate the problem. First, we construct an objective function to calculate the computation time of the workload based on a single-chain structure. T_{1,3}, T_{1,4}, T_{2,3}, and T_{2,4} are the computation cycles of the four parts A1·B3, A1·B4, A2·B3, and A2·B4, respectively. The objective function is the sum of T_{1,3}, T_{1,4}, T_{2,3}, and T_{2,4}, because the matrix multiplier calculates the four parts sequentially.
Figure 11. An illustration of non-uniform matrix blocking.

The algorithm needs to prefetch S_i elements of matrix A at first, which takes S_i cycles; thus, the setup time of the PE chain is S_i cycles. Each element of A is used S_j times. When S_i ≠ S_j, according to the algorithm, it takes max{S_i, S_j} × K cycles to calculate a sub-block. We assume m = floor(M/S_i) and n = floor(N/S_j), so an S_i × S_j sub-block computation requires S_i + max{S_i, S_j} × K cycles, which gives the total computation cycles of A1·B3. Because the prefetch time of the first column of each S_i × K sub-block can be ignored compared with the computation time, we simplify Equation (16) accordingly. In a similar way, we obtain the cycles of the other three parts. The values of k_1 and k_2 follow these relations. There are constraints on the values of S_i and S_j. First, the value of S_i cannot be greater than the number of PEs, according to the blocking algorithm. Second, data hazards in the pipeline should be taken into consideration; these occur when data required for a MAC operation are delayed in the addition pipeline. Therefore, we have

    S_i ≤ P
    max{S_i, S_j} ≥ Stage_add
    max{M - S_i·m, N - S_j·n} ≥ Stage_add        (23)

where Stage_add is the number of additional stages needed to avoid data hazards and P is the number of PEs in a chain. To simplify the discussion, we assume S_i = S_j and S'_i ≥ S'_j; therefore, the following conclusions hold:
When k_1 = 0 and k_2 = 1: because m = M/S_i and n = floor(N/S_j), we obtain the corresponding bound. When k_1 = 1 and k_2 = 0: similar to the previous case, because m = floor(M/S_i) and n = N/S_j, the analogous result holds. When k_1 = 1 and k_2 = 1, both leftover parts contribute. When k_1 and k_2 are both 0, it can be seen that the optimal block size is equal to P according to Equations (24) and (25). When k_1 and k_2 are not simultaneously 0, we can, for example, use MATLAB to find the minimum value of f(S_i, S_j).
Because all the sub-blocks' optimized sizes are the same, we directly parallelize the sub-blocks across the multi-chain structure.
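The optimization can also be done by brute force rather than with MATLAB. The sketch below uses an illustrative cost model, max{S_i, S_j} × K cycles per full sub-block with setup time ignored, matching the simplified counts above, together with the constraints of Equation (23); the function names are ours:

```python
def cycles(M, N, K, si, sj):
    """Total cycles for the four parts A1*B3, A1*B4, A2*B3, A2*B4."""
    m, n = M // si, N // sj
    ri, rj = M - si * m, N - sj * n          # leftover sizes S'_i, S'_j
    t = m * n * max(si, sj) * K              # full blocks (A1*B3)
    if rj: t += m * max(si, rj) * K          # right edge (A1*B4)
    if ri: t += n * max(ri, sj) * K          # bottom edge (A2*B3)
    if ri and rj: t += max(ri, rj) * K       # corner (A2*B4)
    return t

def best_block(M, N, K, P, stage_add):
    """Search (S_i, S_j) subject to S_i <= P and the hazard constraints."""
    best = None
    for si in range(1, min(P, M) + 1):
        for sj in range(1, N + 1):
            ri, rj = M % si, N % sj
            if max(si, sj) < stage_add:
                continue                     # MAC pipeline hazard
            if (ri or rj) and max(ri, rj) < stage_add:
                continue                     # hazard in the leftover blocks
            c = cycles(M, N, K, si, sj)
            if best is None or c < best[0]:
                best = (c, si, sj)
    return best
```

For a matrix that divides evenly by the chain size (e.g. M = N = 200 with P = 50), the search confirms the case-0 conclusion above: the optimal block size equals P.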

Batch processing for full connection layers
In the full connection layers, the main computation is matrix-vector multiplication, whose ratio of computation to memory access is low: every element of the parameter matrix W is used in only one multiply-accumulate operation. Thus, the computational resources cannot be fully used because of the limited memory bandwidth. We therefore compute multiple images' full connection layers together, which merges a batch of matrix-vector multiplications into one matrix-matrix multiplication, as Figure 12 (merging matrix-vector multiplications into a matrix-matrix multiplication) shows. Every element in matrix W is multiplied with batch-size elements in X, so the ratio of computation to memory access increases with the batch size. We bypass the stream-mapper and prefetch unit in the full connection layers. In this design, the full connection layers share the matrix multiplier with the convolutional layers, which saves computational resources and improves efficiency.
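The effect of batching can be seen in a few lines: loading W dominates the memory traffic, and batching amortizes it over the whole batch (the names below are illustrative):

```python
def fc_one(W, x):
    """Matrix-vector product: each element of W feeds exactly one MAC."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fc_batch(W, xs):
    """Matrix-matrix product over a batch: each element of W is reused
    len(xs) times, raising the compute/memory-access ratio."""
    return [[sum(w * v for w, v in zip(row, x)) for x in xs] for row in W]
```

Column j of the batched result equals `fc_one(W, xs[j])`, so batching changes the memory behavior without changing the outputs.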

Supporting the software framework Caffe
To ease the implementation, we make use of the popular CNN framework Caffe. The implementations of the convolutional layer and full connection layer of Caffe are replaced, keeping the interfaces to all other parts of the framework unchanged. All implementation details are transparent to users; thus, they can use Caffe as usual. To change the network structure, they only need to modify Caffe's configuration file.
For a traditional accelerator (like a GPU), the host processor first copies the data to be processed to a dedicated memory device (like video memory) by DMA. After the data are processed by the accelerator, they are copied back to the host processor's main memory. In our case, we only offload the convolutional layers and the full connection layers to the accelerator and leave the other parts of the Caffe framework on the host processor. If we used the traditional way, the overhead of frequent data copies would be unacceptable. It is a better choice to make the host processor and the accelerator work in the same memory space.
Although the main workload of Caffe is offloaded to the accelerator, the framework still runs on the host processor. Caffe relies on many libraries of the operating system (Linux) and accesses DRAM through the operating system's user space. The user space is a virtual space mapped to discontinuous physical memory pages; the mapping from user space to physical space is managed by the host processor's memory management unit (MMU). In today's FPGA-based SoCs, the host processor exists as an IP core, and it is hard for the accelerator to use the host processor's MMU to access the user space. Most accelerators can only access DRAM through the physical space, which is continuous, so Caffe needs to allocate continuous physical space for the accelerator. Two problems exist. First, the maximum continuous physical space the operating system can allocate is limited in Linux (only several MB) [17], which cannot satisfy Caffe's requirement (at least hundreds of MB). Second, continuous physical space can only be allocated in the kernel space in Linux, so an extra data copy between the user space and the kernel space is needed. We thus propose a unified virtual memory support mechanism to solve these two problems.
Unlike the traditional accelerator memory access mechanism that allocates continuous physical space in the kernel space, as Figure 13(a) shows, we reserve enough continuous physical memory space as device memory space at Linux boot time. Such device memory space can be based on dedicated memory devices of the FPGA or just a part of the main memory device of the host processor. The space can be accessed by both the host processor and the accelerator using physical addresses. All the data to be processed by the accelerator are allocated in the device memory space. Although the accelerator cannot use the MMU to access the host processor's user space, the host processor uses the MMU to map a part of the user space to the device memory space, which can be achieved with the Linux system function mmap(). Figure 13(b) shows our implementation. In this way, the continuous physical space for the accelerator is large enough, and the framework can access the device memory in the user space. There is no extra copy overhead anymore. We discuss the implementation details of our memory management scheme in [18].
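On the host side, the user-space mapping is obtained with mmap(). The Python sketch below demonstrates the idea with an ordinary file standing in for the reserved device memory region; in the real system, the mapping would target the physical address window reserved at boot (e.g. via /dev/mem), and the file path here is purely illustrative:

```python
import mmap
import os
import tempfile

# Stand-in for the device memory space reserved at boot time.
path = os.path.join(tempfile.mkdtemp(), "devmem")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    view = mmap.mmap(f.fileno(), 4096)   # user-space view of the region
    view[0:4] = b"\xde\xad\xbe\xef"      # the framework writes data here...
    view.flush()                         # ...without any intermediate copy
    view.close()

# A second reader (standing in for the accelerator, which would use the
# physical address of the same region) sees the data directly.
with open(path, "rb") as f:
    shared = f.read(4)
```

Both sides touch the same backing store, which is the zero-copy property the device memory scheme provides.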

Evaluation methodology
To evaluate our design, we have implemented a prototype system based on two Xilinx Zynq-zq7045 FPGA SoC chips. The experimental environment is summarized in Table I. The accelerator is implemented in the FPGA part of the Zynq chip, working at 150 MHz. As a baseline software implementation, we first port Caffe to the host processor, a dual-core ARM Cortex-A9 integrated in the Zynq chip, working at 800 MHz. Then, we offload the workload of the convolutional layers and full connection layers from the host processor to the accelerator. Data parallelism is implemented between the two chips, and the only communication between them is for collecting the results via Ethernet. We first evaluate the impact of the stream-mapper and stream-prefetcher and test the performance of the accelerator for different CNN structures. Then, we make a comparison between our design and existing FPGA-based CNN accelerators. At last, we also compare with general-purpose devices: a high-end GPU and a high-end CPU. The specifications of the general-purpose devices are shown in Table II. Figure 14 shows the accelerator performance of the structure with only a matrix multiplier on AlexNet. The columns denoted by 'soft' refer to the software implementation running on the host processor. For the five convolutional layers (conv1-conv5), we use the host processor to rearrange the input feature map data into Inmap_matrix. Different accelerator sizes are tested; 50*2 means that there are two accelerator chains in each chip and each of them has 50 PEs. The result shows that the performance does not improve noticeably when the number of PEs or chains increases. For conv1, the speed-up compared with the software implementation is even less than 10× for all accelerator sizes. The reason is the software unrolling overhead: as shown in Figure 6, the unrolling overhead of conv1 is more than 18%, which means the maximum achievable speed-up is 5.5×.
As we use two host processors to parallelize the task, the maximum speed-up increases to 11×. The problem does not exist in the three full connection layers (fc1-fc3): there is no software data rearrangement in these layers, and they use the matrix multiplier directly. For this reason, the speed-up of the full connection layers is much higher than that of the convolutional layers. Figure 15 shows the accelerator performance with the stream-mapper but without the prefetch unit. Although an obvious improvement is achieved compared with the bare matrix multiplier, there is still a big gap to the speed-up of the full connection layers. Because of the non-sequential external memory address stream from the stream-mapper, the memory requests are discrete, and the burst characteristic of the memory cannot be used. This causes low bandwidth utilization and decreases performance. The phenomenon is serious when all four chains are used: frequent memory access conflicts even make the performance poorer than using two chains. Figure 16 shows the accelerator performance with both the stream-mapper and the prefetch unit. The time overhead of unrolling is eliminated, and the memory burst characteristic can be used. Compared with the former two structures, the performance of the convolutional layers improves dramatically. Because the software edition has the extra overhead of unrolling, the convolutional layers obtain an even higher speed-up than the full connection layers. This is most obvious in conv1, whose software unrolling overhead is the highest, where a speed-up of more than 160× is achieved.

Performance portability of different convolutional neural network structures
We used three different CNN structures to evaluate the performance portability of our accelerator. Figure 17 shows the results; the accelerator size is 4*50 for each chip. The first structure is a four-convolutional-layer network for street scene recognition (SR) [19]. The second is AlexNet for ImageNet classification (IM). The third is a 10-convolutional-layer network for face verification (FV) [20]. For each structure, we give two groups of results: one using our optimized blocking (OB) method and one without it.
The sizes of the unrolled matrices for all the convolutional layers are shown in Table III. It can be observed that some layers perform poorly. The performance is determined by the scale of each convolutional layer rather than by the depth of the whole network: FV has the deepest network, but its first convolutional layer is too small to utilize the accelerator efficiently. There are two reasons for the performance decline. First, a layer's scale may be too small to be distributed over multiple chains; in conv3 and conv4 of SR, for example, only two or three chains can be used. Second, some dimensions of the matrices may be too small to yield sub-blocks that make full use of the PEs of one chain. As mentioned in Section 3, our optimized blocking method is meant to reduce the negative effect of the second cause. Table IV shows the OB size for each layer. The results show that our blocking strategy improves performance markedly. However, in some extreme cases the OB strategy does not work well. In conv1 of SR, the dimension M is only 12, far smaller than the chain size of 50 (one full block), while the dimension N is far larger than the chain size; thus the blocking strategy cannot help in dimension M, and in dimension N the incomplete block hardly affects performance. In conv1 of FV, the dimension K is too small relative to the computational pipeline, so the start-up costs cannot be hidden. In conv1 of IM, the original blocking method already works efficiently enough.
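The conv1-of-SR case can be made concrete with a simple utilization model; this is our own illustrative sketch, not the OB algorithm of Section 3, and it assumes each chain processes one block of height `chain_size` at a time.

```python
def chain_utilization(m, chain_size):
    """Fraction of PE slots doing useful work when dimension M is tiled
    into blocks of height chain_size (illustrative model only)."""
    full, rem = divmod(m, chain_size)
    blocks = full + (1 if rem else 0)
    return m / (blocks * chain_size)

# conv1 of SR: M = 12 against a chain of 50 PEs. Only one (mostly empty)
# block exists, so no blocking choice along M can recover the idle PEs.
print(chain_utilization(12, 50))   # 0.24
# A layer whose M is a multiple of the chain size fills its blocks exactly.
print(chain_utilization(100, 50))  # 1.0
```

The same model applied along N explains why the incomplete block there hardly matters: when N spans many full blocks, one partially filled trailing block changes the average utilization only slightly.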

Comparison with other Field Programmable Gate Array Convolutional Neural Network accelerator designs
A comparison between our design and existing FPGA-based floating-point CNN accelerators is shown in Table V. The on-chip resources are fully used by our accelerator prototype system, as shown in Table VI. The comparison shows that our accelerator achieves performance comparable to state-of-the-art CNN accelerator designs. The design in [5] is the most up to date, and our accelerator attains a comparable performance density using GOPS/DSP as the metric. (Design [5] uses a high-end device that can achieve a higher clock frequency than ours; because the clock frequency is not revealed in [5], we cannot provide a more detailed comparison here.) The architecture in [10] can only handle the convolutional layers; extra on-chip resources would be needed to build an accelerating module for the full connection layers. Our matrix multiplier-based accelerator uses a linear array of PEs, in which modules connect only with their neighbors and almost no global interconnect is needed, so the accelerator can achieve a relatively high clock frequency. The design in [4] is a typical highly customized structure that uses fixed-point PEs and a fixed-size convolution engine. It is three times faster than ours using only one Zynq 7045 chip, for two reasons. First, a 16-bit fixed-point PE needs only two DSPs in today's FPGAs, while a 32-bit floating-point PE needs four; a fixed-point computation engine needs fewer DSP units and less on-chip memory, but it also reduces computational precision. We chose the floating-point data representation to support Caffe: most state-of-the-art trained CNN models are released in floating-point format, so our design can use them directly. Second, [4] works efficiently only for CNNs whose convolutional kernel size is 10*10; such a fixed structure is very inflexible.

Comparison with general-purpose devices
The power consumption of the board, measured with a watt-meter, is shown in Figure 18. The comparison between our CNN FPGA accelerator and high-end CPU and GPGPU implementations is shown in Table VII. The Intel Xeon X5675 (3.4 GHz) CPU is inferior in both performance and energy efficiency: it achieves only 22 GFLOPS, or 0.23 Gflops per watt. The GPGPU provides the highest performance, at an energy efficiency of 2.07 Gflops/W. Although our FPGA implementation cannot match the GPGPU's raw performance, it provides the highest energy efficiency, 5.4 Gflops/W. In fact, the PCB board used to build the prototype is not customized for this application, so extra energy is wasted on unused devices. If we consider only the power consumption of the two FPGA chips (8 W), the energy efficiency is 9.725 Gflops/W, 4.7x better than the GPU. With respect to performance, a speed-up of 3.54x over a CPU implementation with MKL (Intel Math Kernel Library) is achieved.
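The headline figures above are mutually consistent, which this small arithmetic check makes explicit; the implied ~77.8 GFLOPS total throughput is derived here from the stated 9.725 Gflops/W at 8 W, not quoted from the paper.

```python
# Reproduce the energy-efficiency arithmetic from the comparison.
fpga_watts = 8.0            # power of the two FPGA chips alone
fpga_eff = 9.725            # Gflops/W when counting only the chips
cpu_gflops = 22.0           # Intel Xeon X5675 baseline
gpu_eff = 2.07              # GPGPU energy efficiency, Gflops/W

fpga_gflops = fpga_eff * fpga_watts      # implied total throughput
print(round(fpga_gflops, 1))             # 77.8
print(round(fpga_gflops / cpu_gflops, 2))  # 3.54  (speed-up over the CPU)
print(round(fpga_eff / gpu_eff, 1))      # 4.7   (efficiency gain over the GPU)
```

So the 3.54x CPU speed-up and the 4.7x GPU efficiency advantage both follow directly from the measured 8 W chip power and 9.725 Gflops/W figure.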

RELATED WORK
Many CNN accelerators have been proposed. The authors of [9] propose the CNP structure based on a fixed-size convolver, but it cannot efficiently handle convolutions with different kernel sizes. The design in [21] presents an improved CNP whose accelerator structure can be reconfigured dynamically to adapt to different network structures, with multiple feature maps processed in parallel, but the size of the convolver is still fixed. The authors of [11] propose an accelerator for several machine learning algorithms. Their design is also based on a matrix multiplier, but it must store all the network parameters on-chip, which is unacceptable for a deep CNN; moreover, for CNNs it requires the host processor to first unroll the convolutions into matrix multiplications, an overhead that cannot be ignored. The authors of [10,22] work to balance the communication overhead and computation of the convolutional layers, but they do not discuss the full connection layers; in fact, the convolutional layers are typically compute-bound, while the full connection layers are memory-bound. The authors of [23] present an ASIC implementation of CNNs; it also needs high external-memory bandwidth to support the full connection layers. Implementations based on GPUs [13,24] adopt the convolution-as-matrix-multiplication method and obtain state-of-the-art performance; these works inspired us to rethink the CNN accelerator architecture. However, GPUs suffer from relatively high energy consumption compared with an FPGA design. Our design also benefits from [16], in which the authors used an FPGA to implement a 64-bit double-precision matrix multiplier.
Researchers have also tried to decrease deep networks' computational workload. Using a fixed-point data format [25] is one attractive approach. The authors of [26] observe that many connections in a deep CNN (especially in the full connection layers) are unnecessary and can be pruned to reduce the computational workload and memory footprint. A pruned connection carries a weight of zero, so the weight matrix may become sparse. The authors of [27,28] accelerate sparse-matrix computation using GPGPUs.
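The payoff of pruning comes from storing and multiplying only the surviving weights, for example in compressed sparse row (CSR) form. The sketch below is our own minimal illustration of that idea, not code from [26-28].

```python
def to_csr(dense):
    """Convert a pruned (mostly-zero) weight matrix, given as a list of
    rows, into CSR form: (values, column indices, row pointers)."""
    vals, cols, ptrs = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                vals.append(v)
                cols.append(j)
        ptrs.append(len(vals))
    return vals, cols, ptrs

def csr_matvec(vals, cols, ptrs, x):
    """y = W @ x touching only the surviving (non-pruned) weights."""
    return [sum(vals[k] * x[cols[k]] for k in range(ptrs[r], ptrs[r + 1]))
            for r in range(len(ptrs) - 1)]

# A full-connection weight matrix with most connections pruned to zero:
w = [[0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, -1.0],
     [3.0, 0.0, 0.0, 0.0]]
vals, cols, ptrs = to_csr(w)          # only 3 of 12 weights are stored
y = csr_matvec(vals, cols, ptrs, [1.0, 2.0, 3.0, 4.0])
print(y)  # [4.0, -4.0, 3.0]
```

The multiply-add count drops from 12 to 3 here; the trade-off is the irregular, index-driven memory access pattern, which is why [27,28] turn to GPGPUs to accelerate it.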

CONCLUSION
In this paper, we have proposed a CNN accelerator architecture that can efficiently handle a broad spectrum of network structures. We focus on accelerating the convolutional layers and full connection layers, which constitute most of a CNN's computational workload, and adopt a matrix multiplier-based accelerator architecture. Effort is made to optimize the data arrangement, memory access, and workload balance among the accelerator PEs. To make the accelerator easy to use, we extend the Caffe framework to the FPGA. With the prototype system, a performance speed-up of 3.54x is achieved compared with an Intel Xeon X5675 CPU, and the energy efficiency is 4.7x better than that of an Nvidia K20 GPGPU.