MSR-FAN: Multi-scale residual feature-aware network for crowd counting

Crowd counting aims to estimate the number of people in crowded scenes, which is important to security systems, traffic control and so on. Existing methods that rely on local features typically cannot properly handle the perspective distortion and the varying scales in congested scene images, and therefore produce inaccurate counts. To alleviate this issue, this study proposes a multi-scale residual feature-aware network (MSR-FAN) that combines multi-scale features using multiple receptive field sizes and learns the feature-aware information on each image. The MSR-FAN is trained end-to-end to generate a high-quality density map and estimate the crowd number. The method consists of three parts. To handle the perspective change problem, the first part, the direction-based feature-enhanced network, is designed to encode the perspective information in four directions based on the initial image feature. The second part, the proposed multi-scale residual block, captures global information to represent the regional features better; this module explores features at different scales and reinforces the global feature. The third part, the feature-aware block, is designed to extract the features hidden in the different channels. Experimental results on benchmark datasets show that the proposed approach outperforms the existing state-of-the-art methods.

Crowd counting is of practical significance and has hence attracted wide attention.
The methods for the crowd counting task fall into two categories: detection-based methods [4] and map-based methods. Detection-based methods use detection algorithms to find all the people in the image and count them. Their benefit is that they can locate each person, but for very dense scenes such methods [5,6] tend to fail. Map-based methods extract image features and perform a regression to generate the density map and obtain the number of people. Such approaches [7,8] can overcome the crowd occlusion problem well; however, the generalization ability of these models is poor. Some excellent works also address real street scenes, such as [9]. These methods have driven the development of real-world, data-driven crowd counting.
To alleviate the above problems, a multi-scale residual feature-aware network is proposed. This method combines local and global features using multiple receptive field sizes and learns the feature-aware information on each image. It is an end-to-end approach that can generate a high-quality density map and count the crowd number. Specifically, the initial image is sent into the convolution layers of VGG to extract the image feature. To handle the perspective change problem and enhance the initial image feature, the direction-based feature-enhanced network is employed. In a high-density crowd environment, the occlusion between people is relatively serious, and a single-pixel feature cannot provide a reliable estimate of the number of people. So, this work proposes a multi-scale residual block that encodes information at different scales and reinforces the initial image feature. In addition, the feature-aware block is used to learn the importance of the regional features. Experiments on benchmark datasets show that the proposed approach outperforms the existing methods.
The main contributions of this study are summarized as follows:
• This study proposes the MSR-FAN for crowd counting, which outperforms the existing state-of-the-art methods.
• The multi-scale residual block is proposed to detect image features at different scales adaptively and encode multi-level contextual information into the features.
• The combination of the direction-based feature-enhanced network and the feature-aware block is used to solve the perspective change problem and to extract the difference between local features and image content for crowd counting.
The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 introduces the whole method and the details of the proposed MSR-FAN for crowd counting. Section 4 shows the experiment results, and Section 5 gives the conclusions and future research directions.

RELATED WORK
Crowd counting is a rising research topic, and many excellent methods have been proposed for this difficult task. Generally, crowd counting approaches can be divided into two main categories: one is detection-based and the other is map-based.

Detection-based methods for crowd counting
In the early period, object detection methods were used for crowd counting. Researchers applied detection-based methods to detect all the people in an image, which requires that the people in the image can be clearly detected. Wang et al. [10] proposed a new framework for adapting a pre-trained generic pedestrian detector to a traffic scene; it can automatically select both confident positive and negative examples from the target scene. Zhao et al. [11] proposed a model-based approach that combines a Bayesian framework and a joint image likelihood for multiple humans based on their appearance; the visibility of the body is obtained by occlusion reasoning and foreground separation. In addition, many multi-class object detection methods were developed for this problem. Barinova et al. [12] proposed a new approach using the Hough transform for object detection. [13][14][15][16] introduced models for multi-class object detection that cast the problem as a structured prediction task.

Map-based methods for crowd counting
With the development of deep learning methods [17,18], and as crowd scenes have become denser, individual people in an image often cannot be detected by object detection methods. So, researchers turned their attention to map-based methods, which can handle large-scale and dense crowd scenes. Wu et al. [19] proposed a method based on a featured channel enhancement block, which can enhance the positive characteristic channels. Xiong et al. [20] proposed a spatial divide-and-conquer network; they argued that a dense region can always be divided until the sub-region counts are within a previously observed closed set. Liu et al. [21] introduced a structured feature enhancement module based on conditional random fields to refine features mutually with a message-passing mechanism. [22][23][24][25][26][27][28] proposed data-driven and adaptive methods that can understand highly congested scenes and perform well. Other works [29][30][31][32][33] used multi-scale networks for image crowd counting, which are both accurate and cost-effective for practical applications. Cheng et al. [34] proposed a novel architecture called the spatial awareness network to incorporate spatial context for crowd counting. Liu et al. [35] introduced an end-to-end trainable deep architecture that combines features obtained with multiple receptive field sizes and learns the importance of each feature. Besides these methods, some researchers also constructed novel models and posed new challenges for this task: [36] first constructed a lightweight model composed of an image feature encoder and a simple but effective decoder, and Wang et al. [37] proposed a neuron linear transformation method for cross-domain crowd counting, which can transfer the source model to the target model. Although all the above methods can be applied to crowd counting, the perspective changes and the variation in person scale are still problems that need to be solved, and better works should be proposed.

THE PROPOSED MSR-FAN FOR CROWD COUNTING
The goal of this work is to estimate the density map and count the crowd by using a novel framework named the multi-scale residual feature-aware network (MSR-FAN). The proposed method mainly contains three submodules: the direction-based feature-enhanced network, the multi-scale residual block, and the feature-aware block. The general framework of the proposed method is shown in Figure 1; it can generate a high-quality crowd density map.

Figure 2: The feature maps generated by the first ten layers of a pre-trained VGG. Row 1: the initial input image. Row 2: the heatmap generated by the VGG network; the red regions are activated by the convolution operation.
The initial image is sent into convolution layers to create the initial feature map, which contains the initial detailed information about the image. The direction-based feature-enhanced network is proposed to encode the perspective changes in four directions. This operation enhances the initial image feature and effectively improves the quality of the density maps in densely crowded areas. The proposed multi-scale residual block combines the multi-scale feature maps and fuses information from multiple channels. The 3 × 3, 5 × 5 and 7 × 7 kernel sizes are used in this work; they are three common sizes in many other convolutional neural networks and have been shown to perform well. In addition, the feature-aware block is employed to capture the differences between the features at a specific location and those in its neighbourhood. The whole approach is trained end-to-end: the input crowd image is converted to the density map directly, from which the crowd count is obtained.

The direction-based feature-enhanced network
To get the initial image feature, the image is sent into the first ten layers of a pre-trained VGG network. As Equation (1) shows, the image $I$ is processed by $F_{\mathrm{VGG}}$ to obtain the initial feature map $f_D$:

$f_D = F_{\mathrm{VGG}}(I)$  (1)

The images come from natural crowd scenes, and most of the crowd areas are activated by the convolution operation, as Figure 2 shows. This means that the CNN can capture the visual cues about the crowd, which helps count the crowd. In fact, many other works also use such a structure to get the image feature. But the feature obtained by VGG has a limitation: [22] discussed that it encodes the same receptive field over the entire image. So, the feature $f_D$ needs to be processed further.
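As a concrete illustration, the following PyTorch sketch builds such a front-end. It assumes the widely used "first ten convolutional layers of VGG-16" configuration; the exact VGG variant and the torchvision weights API are assumptions, not stated in the text:

```python
import torch.nn as nn
from torchvision import models

class VGGFrontEnd(nn.Module):
    """First ten convolutional layers of a pre-trained VGG-16
    (indices 0-22 of vgg16().features cover conv1_1 .. conv4_3)."""
    def __init__(self):
        super().__init__()
        # Older torchvision versions use models.vgg16(pretrained=True).
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.frontend = nn.Sequential(*list(vgg.features.children())[:23])

    def forward(self, x):
        # Returns the initial feature map f_D (512 channels, stride 8).
        return self.frontend(x)
```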
To handle this limitation, this work uses a direction-based feature-enhanced network to enhance the initial image feature, as shown in Figure 3. The direction-based module was proposed in [31] to solve the perspective change problem for crowd counting and is a useful tool. The network takes $f_D$ as input and contains four submodules: the down layer, the up layer, the right layer, and the left layer. Each submodule is a convolution layer with a ReLU activation function. The goal of this structure is to encode the perspective information in four directions.
The reason for this network is that, due to the perspective change problem, the convolution over each image row should differ. However, the traditional convolution operation is identical in every image region, which cannot fully represent the local and global aggregation effects. Moreover, the image actually presents a set of density characteristics with different aggregation effects for different rows. Therefore, this information needs to be reprocessed by the direction-based feature-enhanced network to extract more useful information that represents the crowd density in the image.
This network performs the convolution operation in four directions to extract as much of the information hidden by perspective as possible. In addition, this network has other benefits: it does not change the input feature dimension, it introduces global information into the local feature map, and it encodes crowd scenes efficiently. After encoding the direction-based image feature, the output of the direction-based module is added to the initial image feature to enhance it, as sketched below. Then, the fused image feature is sent into the multi-scale residual block.
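A hedged PyTorch sketch of this module follows. The text specifies four conv + ReLU submodules (down, up, right, left) whose output is added back to the input; the directional 3 × 1 and 1 × 3 kernel shapes and the summation of the four branches are assumptions, since the exact layer shapes of [31] are not reproduced here:

```python
import torch
import torch.nn as nn

class DirectionEnhance(nn.Module):
    """Four conv+ReLU submodules (down, up, right, left) whose summed
    output is added back to the input feature (a sketch)."""
    def __init__(self, channels=512):
        super().__init__()
        def branch(kh, kw):
            return nn.Sequential(
                nn.Conv2d(channels, channels, (kh, kw),
                          padding=(kh // 2, kw // 2)),
                nn.ReLU(inplace=True))
        self.down = branch(3, 1)   # vertical kernel (assumption)
        self.up = branch(3, 1)     # vertical kernel (assumption)
        self.right = branch(1, 3)  # horizontal kernel (assumption)
        self.left = branch(1, 3)   # horizontal kernel (assumption)

    def forward(self, f_d):
        out = self.down(f_d) + self.up(f_d) + self.right(f_d) + self.left(f_d)
        return f_d + out  # residual add keeps the feature dimension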

The multi-scale residual block

As shown in Figure 4, the multi-scale residual block (MSRB) is designed to extract the image features at different scales. It mainly includes the multi-scale convolution layers and a residual connection. The MSRB has two convolution layers, each consisting of three bypasses. The input feature map, whose size is $W_0 \times H_0 \times C_0$, is first sent into the three bypasses with different convolution kernels. Each convolution operation is followed by a ReLU activation function, as Equation (2) shows. The three bypasses generate three outputs. This structure with three kinds of convolution kernels can extract image features at different scales and fuse them. The traditional convolution operation keeps the channels independent, but such independence limits the network's representation ability. So, the features from the different channels are concatenated and sent to the next convolution layer.
The concatenation operation combines multiple features: the features extracted by the multiple convolutions are fused, so the crowd information of the image can be represented better. This work concatenates the feature maps along the channel dimension. Because of the scale variation, each network bypass uses a different kernel size for its convolution operation, and the different kernel sizes extract multi-scale information.
Given the initial feature map $f$, when the feature map is convolved with an $i \times i$ kernel, the output feature map is denoted $f_i$. As Equation (3) shows, $f_i$ is obtained from $f$:

$f_i = \mathrm{ReLU}(w^{j}_{i \times i} * f + b^{j})$  (3)

Here $w^{j}_{i \times i}$ represents the weight matrix of layer $j$ with convolution kernel $i \times i$, $f$ denotes the input feature map, and $b^{j}$ denotes the bias. After the ReLU activation function, the three output feature maps are concatenated. This operation helps integrate the image features from channels of different scales, which fits the scale variation of crowd scenes. The new concatenated feature map is denoted as Equation (4), where the parameter $k$ distinguishes the different output features:

$F_k = \mathrm{concat}(f_3, f_5, f_7)$  (4)
$F_k$, which contains the multi-scale information, is then sent into the second layer of the MSRB. The second layer also has three kinds of convolution kernels; the difference from the first layer is that the second layer performs only one concatenation operation, which enlarges the channel dimension of the output feature map. To keep the final feature map the same size as the initial input despite the concatenation, the output feature $F_{k}'$ after the second concatenation is processed with a $1 \times 1$ convolution kernel, which reduces the channel dimension of the feature map. The function of the $1 \times 1$ kernel resembles that of a pooling layer, but a pooling layer only adjusts the spatial size rather than reducing the channels. The $1 \times 1$ kernel also helps obtain richer information from the combination of the three kinds of convolution kernels.
In addition, a residual connection is used in the MSRB. This structure is often used to solve the gradient vanishing problem. In this work, the most efficient way to use the multi-scale information is to add it to the original input features. This operation adds the input feature to the output feature to obtain $f_O$, as Equation (5) shows:

$f_O = f + \mathrm{Conv}_{1 \times 1}(F_{k}')$  (5)

In the multi-scale network structure, this operation also keeps more of the initial information after the multi-scale convolutions. It combines the initial image feature with the multi-scale information, which makes the network more efficient.
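The following PyTorch sketch assembles the MSRB as described: two layers of three parallel 3 × 3 / 5 × 5 / 7 × 7 bypasses, channel concatenation, a 1 × 1 fusion convolution, and the residual connection of Equation (5). The intermediate channel widths are assumptions:

```python
import torch
import torch.nn as nn

class MSRB(nn.Module):
    """Multi-scale residual block sketch: two multi-scale layers plus
    a residual connection (Equations (3)-(5))."""
    def __init__(self, channels=512):
        super().__init__()
        def bypass(in_c, k):
            # i x i convolution followed by ReLU, as in Eq. (3).
            return nn.Sequential(
                nn.Conv2d(in_c, channels, k, padding=k // 2),
                nn.ReLU(inplace=True))
        # First layer: three bypasses on the input feature f.
        self.l1 = nn.ModuleList([bypass(channels, k) for k in (3, 5, 7)])
        # Second layer: three bypasses on the concatenated feature F_k.
        self.l2 = nn.ModuleList([bypass(3 * channels, k) for k in (3, 5, 7)])
        # 1 x 1 convolution restores the original channel count.
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, f):
        f_k = torch.cat([b(f) for b in self.l1], dim=1)     # Eq. (4)
        f_k2 = torch.cat([b(f_k) for b in self.l2], dim=1)  # 2nd concat
        return f + self.fuse(f_k2)                          # Eq. (5)
```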

The feature-aware block
After the MSRB, the output feature $f_O$ is sent into the feature-aware block, as Figure 5 shows. Previous work [35] also used such a feature-aware module, which can extract multi-scale features hidden in the different channels of the feature map. The whole procedure can be described by Equation (6):

$U_O = U(\Phi_1(\mathrm{Avg}(f_O)))$  (6)

With the use of average pooling, the feature map $f_O$, whose size is $W_j \times H_j \times C$ (width, height, channel), is averaged into $K(j) \times K(j)$ blocks by the $\mathrm{Avg}$ function, which operates over the $C$ channels. Such a combination of the different channels avoids the independence of each feature channel, which would limit the representation power. [35] used four different scales here, with corresponding block sizes $K(j)$; this work applies the MSRB beforehand, which has already extracted the scale information, so the size $K(j)$ is decided by $f_O$. The averaged blocks are then processed by a convolutional network $\Phi_1$ with kernel size 1. This operation fuses the features of different channels without changing the channel dimension. The function $U$ denotes bilinear interpolation, which up-samples the feature array so that $U_O$ recovers the size of $f_O$.
The $1 \times 1$ convolution kernel can fuse the features of different channels. To capture the feature-aware information hidden in the different channels, this paper defines the feature-aware feature $S_O$, shown as Equation (7):

$S_O = U_O - f_O$  (7)
Equation (7) denotes the difference between $U_O$ and $f_O$. It makes the salient information more prominent and captures the differences between the multiple channels better. Then, $S_O$ is concatenated with the input feature $f_O$ to enhance the representation of the image feature, as Equation (8) shows:

$C_O = \mathrm{concat}(f_O, S_O)$  (8)
Finally, the feature $C_O$ is passed to a decoder consisting of several dilated convolutions, which produces the density map.
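The sketch below implements Equations (6)-(8) together with a small dilated-convolution decoder. The block size $K$ and the decoder widths are assumptions, since the text leaves $K(j)$ to be decided by $f_O$ and does not give the decoder configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAware(nn.Module):
    """Feature-aware block (Eqs. (6)-(8)) plus an assumed dilated
    decoder that outputs a one-channel density map."""
    def __init__(self, channels=512, block_size=4):
        super().__init__()
        self.k = block_size                            # K (assumption)
        self.phi1 = nn.Conv2d(channels, channels, 1)   # Phi_1, channel fusion
        self.decoder = nn.Sequential(                  # widths are assumptions
            nn.Conv2d(2 * channels, 256, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1))                      # density map

    def forward(self, f_o):
        h, w = f_o.shape[2:]
        pooled = F.adaptive_avg_pool2d(f_o, self.k)    # K x K averaged blocks
        u_o = F.interpolate(self.phi1(pooled), size=(h, w),
                            mode="bilinear", align_corners=False)  # Eq. (6)
        s_o = u_o - f_o                                # Eq. (7)
        c_o = torch.cat([f_o, s_o], dim=1)             # Eq. (8)
        return self.decoder(c_o)
```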
The following details reflect the structure of the proposed MSR-FAN:
• The combination of the direction-based feature-enhanced network, the multi-scale residual block, and the feature-aware block is used to construct the MSR-FAN, as assembled in the sketch after this list.
• The multi-scale residual block can extract the image features at different scales and encode multi-level contextual information into the features.
• The feature-aware block can fuse the features of different channels and extract the difference between local features and content features, which improves the quality of the generated density map.
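For completeness, the following sketch wires together the modules sketched above (VGGFrontEnd, DirectionEnhance, MSRB and FeatureAware are the hypothetical class names from the earlier blocks) in the order the text describes for Figure 1; all layer widths remain assumptions:

```python
import torch
import torch.nn as nn

class MSRFAN(nn.Module):
    """End-to-end assembly sketch of the MSR-FAN pipeline."""
    def __init__(self):
        super().__init__()
        self.frontend = VGGFrontEnd()           # initial feature f_D
        self.direction = DirectionEnhance(512)  # four-direction enhancement
        self.msrb = MSRB(512)                   # multi-scale residual block
        self.aware = FeatureAware(512)          # feature-aware block + decoder

    def forward(self, image):
        f_d = self.frontend(image)
        f = self.direction(f_d)   # enhanced feature (residual add inside)
        f_o = self.msrb(f)        # Eq. (5) output
        return self.aware(f_o)    # predicted density map

# The crowd count is the integral (sum) over the density map, e.g.:
# count = MSRFAN()(img).sum().item()
```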

EXPERIMENTS
This work uses the Adam optimizer with a learning rate of 0.00001 and the L2 loss function.
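A minimal training sketch under the stated setup follows; the number of epochs, the data loader, and the device are placeholders, while the Adam learning rate of 0.00001 and the L2 (MSE) loss on density maps come from the text:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=100, device="cpu"):
    """Train a density-map model with Adam (lr=1e-5) and L2 loss.
    `loader` yields (image, gt_density) pairs; epochs is an assumption."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    criterion = nn.MSELoss()  # L2 loss between predicted and true maps
    for _ in range(epochs):
        for image, gt_density in loader:
            image, gt_density = image.to(device), gt_density.to(device)
            optimizer.zero_grad()
            loss = criterion(model(image), gt_density)
            loss.backward()
            optimizer.step()
```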

Evaluation metrics
This work uses the common standard evaluation metrics to test the methods: the mean absolute error (MAE) and the root mean squared error (RMSE) [45]. They are defined in Equations (9) and (10):

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$  (9)

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$  (10)
The parameter $N$ represents the total number of test images, $y_i$ is the ground-truth number of people in the whole $i$th image, and $\hat{y}_i$ is the estimated number of people.
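These two metrics translate directly into code; the following NumPy function is a straightforward implementation of Equations (9) and (10):

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """MAE and RMSE over N test images; y_true holds the ground-truth
    counts y_i, y_pred the estimated counts."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))         # Eq. (9)
    rmse = float(np.sqrt(np.mean(err ** 2)))  # Eq. (10)
    return mae, rmse

# Example: mae_rmse([105, 230], [98, 241]) -> (9.0, ~9.22)
```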

Comparisons with state-of-the-art
This work is compared with state-of-the-art methods on four benchmark datasets: ShanghaiTech [38], UCF_CC_50 [39], UCSD [40] and Mall [41]. The ShanghaiTech dataset contains 1198 annotated images with 330,165 people. It consists of two parts: part_A with 482 images and part_B with 716 images; both parts are divided into training and testing data. The UCF_CC_50 dataset contains only 50 images, and the count per image varies greatly, which makes it a challenging image set. The UCSD dataset has 2000 frames captured by surveillance cameras; each image contains fewer than 50 people, but the initial image size is small, which makes it difficult to generate a high-quality density map. The Mall dataset contains 2000 annotated images of pedestrians in a shopping mall. The experimental results show that the proposed MSR-FAN achieves satisfactory results on these four datasets.

On the UCF_CC_50 dataset, the proposed multi-scale model also obtains good performance: it gets 210.7 MAE and 239.6 RMSE, which is better than the other state-of-the-art methods. It can also be observed that MSR-FAN and CANNet [35] are close to each other in MAE. The feature-aware module can extract the difference between local features and context features; it can be inferred that the direction-based feature-enhanced network and the MSRB pay more attention to the scale information, which helps handle the perspective changes and the scale variation.

Table 3 shows the comparison of MSR-FAN with other methods on the UCSD dataset. This dataset is mainly collected in outdoor scenes with sparse pedestrian traffic. This paper compares with [8,22,23,34], and the experimental results show that the proposed MSR-FAN can handle this kind of simpler dataset. The MAE value of ours is close to that of [34], but the accuracy obtained is sufficient for practical applications on this dataset.

Table 4 shows the comparison of MSR-FAN with other methods on the Mall dataset. The Mall dataset is similar to UCSD: it contains a single scene and a small number of pedestrians. On this dataset, this work gets the best result compared with [7,8,42,43,44], which shows that MSR-FAN has good generalization ability for general datasets. The other methods all achieve good performance as well, which indicates that this dataset is not very challenging.
In addition to the comparisons on these four benchmark datasets, this paper also visualizes the experimental results on the ShanghaiTech part_A dataset. As Figure 6 shows, the MSR-FAN and the MSR-FAN without the direction-based feature-enhanced network are tested. Five images are chosen, each containing a large number of people, which is representative of crowd scenes. Row 3 shows the estimated results of the MSR-FAN; the estimated density map reflects the density of the crowd as well as the ground truth (row 4). To show the usefulness of the direction-based feature-enhanced network, this work also removes it from the MSR-FAN to test the model performance. From row 2, it can be found that the MSR-FAN without this network loses some detailed information compared with the ground truth.

Ablation studies for the proposed MSR-FAN
To verify the validity of the different submodules, this work conducts ablation studies on the ShanghaiTech dataset. Different feature-extraction feed-forward networks and the MSR-FAN without the direction-based feature-enhanced network are tested. This work considers VGG and ResNet as the feature-extraction networks because of their good representational capacity. The detailed configurations are given in Table 5. From Table 5, it can be found that replacing VGG with ResNet does not improve accuracy and even leads to worse MAE and RMSE. This work considers that the deep structure of ResNet loses more useful and significant information when facing the perspective problem and very dense scenes. Configuration (b) confirms the effectiveness of the DB-FEN, which encodes the input feature in four different directions. The ResNet feed-forward network also performs poorly here: the MAE on part_A only reaches 78.6, which is far from that of VGG-MSR-FAN (the full MSR-FAN). So, the combination of the direction-based feature-enhanced network, the MSRB, and the feature-aware block achieves a satisfactory result.

CONCLUSION
This paper proposes a multi-scale residual feature-aware network that combines multi-scale features using multiple receptive field sizes and learns the feature-aware information of each image. To represent the regional features better, this work proposes the multi-scale residual block to capture the multi-scale information. Besides, the feature-aware block is also employed to learn the importance of the regional features. Experimental results on benchmark datasets show that the proposed approach outperforms the existing methods.
In the future, this work will continue to improve the multi-scale structure and combine new technologies to improve the quality of the generated density map.