Multi‐level feature fusion network for crowd counting

Luyang Wang, 96 Jinzhai Rd, Hefei, Anhui 230026, China. Email: ly1105@mail.ustc.edu.cn

Abstract

Crowd counting has become a noteworthy vision task due to the needs of numerous practical applications, but it remains challenging. State-of-the-art methods generally estimate the density map of the crowd image with the high-level semantic features of various deep convolutional networks. However, the absence of low-level spatial information may result in counting errors in the local details of the density map. To this end, a novel framework named Multi-level Feature Fusion Network (MFFN) for single image crowd counting is proposed. The proposed MFFN, which is constructed in an encoder-decoder fashion, incorporates semantic and spatial information for generating high-resolution density maps of input crowd images. Skip connections are developed between the encoder and the decoder so that low-level spatial information and high-level semantic features can be combined by element-wise addition. In addition, a dense dilated convolution block is placed behind the encoder, extracting multi-scale context features to guide feature fusion by a channel attention mechanism. The model is trained by multi-task learning; semantic segmentation supervision is introduced to enhance feature representation. Extensive experiments are conducted on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF), and the results show that MFFN outperforms state-of-the-art methods. In addition, sufficient ablation studies are performed to verify the effectiveness of each component in our proposed method.


1 | INTRODUCTION
The rapid growth of the urban population has made massive crowd gatherings commonplace in settings such as sports events, concerts and subway stations. In recent years, stampedes have occurred frequently around the world, causing tremendous damage to human life and property, which reminds us of the importance of crowd control. Crowd counting, which aims to precisely estimate the number of people in an image, is essential for crowd control and public safety in crowded scenes. In addition, accurate crowd estimates can also play an important role in journalism and urban planning. However, it remains a challenge due to various issues such as occlusions, non-uniform distribution, scale variations and perspective distortion. Figure 1 shows some representative images reflecting the above problems.
Early solutions for crowd counting rely on detection [1-3] and regression [4-6] with hand-crafted features, and limited feature representation capabilities restrict the increase in counting performance. Recently, numerous CNN-based methods for crowd counting have been proposed to estimate the density map of the crowd image. The integral of the density map denotes the global count of the crowd image. Most of these methods [7-9] focus on multi-column architectures to address large variations of crowd sizes in an image. Multiple CNN columns with receptive fields of different sizes are designed in these architectures to extract features at different scales. More recently, researchers have further improved the performance of crowd counting by deepening the network to enlarge the receptive field [10] or generating high-resolution density maps [11, 12]. Compared with previous works using low-resolution density maps, the intuitive advantage is that the high-resolution density map contains more subtle details that contribute to crowd density estimation.
However, inherent drawbacks exist in these methods. On the one hand, multi-column networks consume a lot of training time, computing and memory resources, and the limited number of columns restricts the scale diversity of features. Furthermore, different columns in a multi-column architecture learn similar features, which may result in ineffective branch structures [10]. On the other hand, all of the above methods generate density maps using only high-level semantic features, and none of them take spatial details into account. Recent solutions typically leverage the backbone of classification networks to extract crowd features and then stack layers to transform the crowd features into the density map. As the network grows deeper, the spatial detail information contained in the lower layers of the network gradually disappears, and the high-level features are dominated by semantic information due to the increase in the size of the receptive field. The lack of spatial information may lead to inaccurate crowd density estimation in local details.
To cope with the above issues, we introduce an encoder-decoder architecture called Multi-level Feature Fusion Network (MFFN) to fuse semantic and spatial information and generate high-resolution density maps, as shown in Figure 2. The high-resolution density map is designed as the regression objective, since it can describe crowds with more pixel points to retain more accurate details of small objects. The encoder-decoder network can output maps with the same size as inputs. It balances feature extraction and spatial resolution preservation by learning a pixel-wise mapping. In MFFN, we divide the first ten layers of VGG-16 [13] into four blocks as the encoder and deploy a set of convolutions and upsampling layers as the decoder. Meanwhile, we hold the view that crowd density estimation, as a low-level vision task, differs from classification. For density map estimation, semantic features alone are not enough; spatial details are also indispensable. Spatial information can compensate for the limited ability of semantic information to describe local details in density maps. Therefore, we fuse semantic and spatial information in MFFN to promote crowd density regression.
As mentioned above, the features of different layers in the network have different levels of abstraction. A natural idea is to combine low-level and high-level features together to fuse semantic and spatial information. Inspired by FPN [14], we build skip connections between the encoder and the decoder by element-wise addition. However, we believe that not all fused features can facilitate crowd density map estimation, due to the diverse consistency of different stages and the lack of global context during feature fusion. While skip connections enhance multi-level feature fusion, the fact that some fused features may have a negative impact is neglected. Based on this consideration, it is desirable to adjust the weights of the fused features at each stage by consulting the global context. A dense dilated convolution block (DDCB), which contains four densely connected dilated convolutional layers, is placed behind the encoder to extract global context features. Extra semantic segmentation supervision is introduced to classify pixels into backgrounds and crowds for auxiliary learning. Then, we adjust the weight of each feature channel at every fusion stage under the guidance of the DDCB context features. We recalibrate the fused feature responses using a channel attention module (CAM) to highlight the channels that facilitate performance improvements. The main difference between our MFFN and FPN is that we refine the fusion process by introducing fused feature adjustments on the basis of feature fusion.
Figure 1. Examples of crowd scenes in the ShanghaiTech dataset. The large variations in background, perspective, crowd distribution and pedestrian size limit the performance of crowd counting.

The full framework of our MFFN is trained in a multi-task learning fashion. Different from some other works that use crowd density level classification as an auxiliary task, we integrate semantic segmentation supervision during training to make the context features more distinguishable. In summary, the main contributions are as follows:
• We propose a novel encoder-decoder network called Multi-level Feature Fusion Network to incorporate spatial and semantic information for generating high-resolution crowd density maps.
• We improve the fusion process at every stage by fused feature channel adjustment. A dense dilated convolution block in combination with a channel attention module is designed to recalibrate the weight of the fused feature channel in accordance with global context.
• Extensive experiments conducted on three crowd counting datasets demonstrate that our MFFN achieves superior performance compared with recent state-of-the-art methods.
The rest of this paper is organized as follows. We first briefly introduce the related work of crowd counting in Section 2. Section 3 describes the architecture of our proposed MFFN in detail, and Section 4 shows the experimental results. Finally, we conclude in Section 5.

2 | RELATED WORK
A large number of methods have been proposed for crowd counting, and they can be roughly summarized into two categories: traditional methods and CNN-based methods. Early traditional approaches focus on either detection or regression using hand-crafted features. CNN-based approaches significantly improve counting performance due to the powerful feature representation capabilities of CNNs. We briefly review several representative methods in each category, and a more comprehensive review of crowd counting can be found in previous studies [15-18].

2.1 | Traditional methods
Early solutions [2, 3, 19-21] for crowd counting generally tend to detect pedestrians in an image. These methods typically train a head or body detector to locate people and sum detected individuals up as the final count. Lin and Davis [22] matched a part-template tree to images hierarchically to detect humans and estimate their poses. Wu and Nevatia [2] trained a human body part detector by a boosting method based on edgelet features. However, detection-based methods suffer from occlusions and clutter in highly congested scenes. Obscured pedestrians and small-scale individuals in dense crowd scenes are difficult to detect.
To address the above issues, researchers turned their attention to regression-based approaches [23-27], which learn a mapping from the features of local patches to their counts. Feature extraction and regression modelling are the two main components of these methods. On this basis, Idrees et al. [28] estimated crowd counts by incorporating features of multiple sources, such as Fourier analysis, head detection and SIFT [29] features. Chen et al. [5] proposed a multi-output regression model to discover the inherent importance of different features for crowd counting. Although these methods perform well when dealing with occlusions and clutter, only the final count is provided and the crowd spatial distribution information is ignored. Lempitsky et al. [30] introduced a method for counting tasks that learns a linear mapping from features to object density maps. The density map depicts the distribution of crowds, and its integral gives the global count. Pham et al. [31] learned a nonlinear mapping using random forest regression for estimating density maps. The density map is also widely used as the regression objective in CNN-based methods, since it can balance the descriptions of spatial distribution and the global count. All of the above approaches use hand-crafted features, which is the main difference from CNN-based methods.

2.2 | CNN-based methods
Inspired by the success of deep learning in various vision tasks such as image classification [32], semantic segmentation [33,34] and object detection [35], a large number of CNN-based approaches have also been proposed to address crowd counting and achieve performance improvements. In the study by Wang et al. [36], the authors modified AlexNet [37] to directly predict the number of people in a crowd image. Shang et al. [38] designed an end-to-end CNN to estimate the global and local counts of crowd images, simultaneously. In contrast, more methods employ CNNs to estimate crowd density maps. Zhang et al. [39] incorporated perspective information into the density map and proposed an alternately trained CNN for cross-scene crowd counting. However, the use of perspective maps limits the availability of this method in practical applications. Huang et al. [40] formulated the human body, head and context structure as semantic models and converted the crowd counting task into a multi-task learning problem.
To address the issue of scale variations, multi-column networks are introduced to cover diverse pedestrian scales. Zhang et al. [7] proposed the Multi-column CNN (MCNN) containing three convolution columns with receptive fields of different sizes. A similar idea was depicted in the study by Onoro-Rubio and López-Sastre [41], where the authors introduced a model with multi-scale inputs to extract features at different scales. Inspired by MCNN, Sam et al. [8] found that generating density maps by late fusion of features from multiple columns may degrade performance. They instead introduced Switch-CNN, which uses a density level classifier to select the optimal regressor for a particular input patch. Sindagi et al. [9] incorporated global and local density information into a multi-column network for generating high-quality density maps. Multiple CNN columns result in more parameters to be optimized and consume more memory and computing resources. Due to the lack of explicit supervision, it is difficult for different convolutional branches to learn distinguishable features, which is contrary to the original design intent.
Recently, deeper network structures and the use of high-resolution density maps have further improved the performance of crowd counting. In the study by Li et al. [10], a deep single-column network referred to as CSRNet was presented, which uses dilated convolutions to enlarge the receptive field and extract deep features. Cao et al. [11] proposed an encoder-decoder structure named SANet for generating high-resolution density maps. Wu et al. [42] presented an adaptive scenario discovery framework with two deep pathways for modelling the dynamic scenarios implicitly. In the study by Babu Sam et al. [43], the authors tackled crowd counting with a growing CNN which can progressively increase its capacity to account for the wide variability seen in crowd scenes. In [44], a deep detection network requiring only point supervision is proposed. It can simultaneously detect the size and location of human heads and count them in crowds. Gao et al. [45] designed a multi-task counting architecture, which simultaneously performs density map estimation, high-level density classification and foreground segmentation. Liu et al. [46] introduced an attention-injective deformable convolutional network. It first detects crowd regions in images and then generates crowd density maps. Zeng et al. [47] proposed a deep-scale purifier network that can encode multi-scale features and reduce the loss of contextual information for dense crowd counting. However, none of these methods consider including spatial details in crowd density map estimation. Based on the above observations, we propose an encoder-decoder network to incorporate spatial details and semantic features for generating high-resolution and high-quality crowd density maps.

3 | PROPOSED METHOD
A detailed description of MFFN is presented in this section. We first introduce the overall network architecture illustrated in Figure 2, and the setting of each module, and then give the training details.

3.1 | Architecture
As aforementioned, the multi-column architecture for crowd counting has inherent defects, and existing methods only use the high-level features of the network to estimate crowd density maps. The higher layers of the network mainly contain semantic information, since the spatial details are gradually lost as the network deepens. The generation of density maps relying solely on semantic information leads to estimation errors in local details, as the density map directly describes the crowd spatial distribution. Moreover, most of these methods generate low-resolution density maps due to the presence of pooling layers, which may also lose spatial details. Therefore, the main idea of the proposed MFFN is to incorporate high-level semantic features and low-level spatial features and generate the density map with the same size as the original image.
The backbone of MFFN is an encoder-decoder structure, as shown in Figure 2. Following previous works [10,48], we choose the first ten convolutional layers of VGG-16 as the encoder because of its powerful transfer learning capabilities and appropriate depth. The ten layers are divided into four blocks according to the locations of the three pooling layers. These four blocks have different feature representation capabilities and receptive field sizes. The encoder takes the crowd image as input, and its output feature maps are directly input to the decoder. The decoder transforms high-dimensional features into the high-resolution density map by a set of convolutional layers and upsampling operations. Two convolutional layers are first used to refine the feature maps, and then bilinear interpolation is performed for upsampling. To incorporate low-level and high-level features, we build skip connections between the encoder and the decoder. After a convolution, the features of Block 3 in the encoder are fused with the upsampled features in the decoder by element-wise addition. Although the fused features merge spatial details and semantic components, not all of them are helpful for crowd density estimation due to the lack of global context information. To emphasize informative feature maps and suppress useless ones, the channel-wise feature responses are recalibrated by the collaboration between two modules we designed, DDCB and CAM. The DDCB placed behind the encoder is used to extract the global multi-scale context with extra semantic segmentation supervision. CAM leverages the global context that DDCB outputs as a guide to assign weights to each channel of the fused feature. The same operations are performed one more time to fuse the features of Block 2, and the fused features are also recalibrated by CAM. After that, the size of the feature map is further increased to the same size as the input image through a series of convolutions and an upsampling layer. 
The features of Block 1 are not chosen for feature fusion, considering that only two convolutions are conducted on the input image. Finally, the decoder uses a 1 × 1 convolutional layer to generate the high-resolution density map.
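As a rough illustration of this design (our own sketch, not the authors' released code), the resolution bookkeeping of the encoder-decoder can be traced in a few lines of Python:

```python
# Trace how spatial resolution changes through the MFFN encoder-decoder so
# that the output density map matches the input size. The block layout
# follows the first ten VGG-16 layers with three 2x2 max-pooling layers.

def encoder_trace(h, w):
    """Return the (height, width) after each VGG-16 encoder block.

    Blocks 1-4 are separated by three 2x2 poolings, so the resolutions
    are 1/1, 1/2, 1/4 and 1/8 of the input.
    """
    sizes = [(h, w)]                      # Block 1: full resolution
    for _ in range(3):                    # three 2x2 pooling layers
        h, w = h // 2, w // 2
        sizes.append((h, w))              # Blocks 2, 3, 4
    return sizes

def decoder_trace(h8, w8):
    """Bilinear x2 upsampling three times, back to the input resolution."""
    sizes = [(h8, w8)]
    for _ in range(3):
        h8, w8 = h8 * 2, w8 * 2
        sizes.append((h8, w8))
    return sizes

enc = encoder_trace(384, 512)
dec = decoder_trace(*enc[-1])
# Skip connections pair encoder Block 3 (1/4) and Block 2 (1/2) with the
# matching decoder stages, so fused tensors always share a spatial size.
assert enc[2] == dec[1] and enc[1] == dec[2]
assert dec[-1] == (384, 512)              # density map matches the input
```

The two assertions mirror the skip connections in Figure 2: fusion by element-wise addition is only possible because the paired encoder and decoder stages share the same spatial size.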
The backbone of the proposed encoder-decoder network fuses multi-level features by skip connections, which is closely related to FPN [14] for object detection. Both models combine high-level semantic features and low-level spatial features using element-wise addition. However, this operation just simply sums features up, which ignores the diverse consistency of features at different stages. The fused features are employed directly by subsequent layers or used for prediction, but lack global multi-scale context. Therefore, the main difference between FPN and our MFFN is that we improve the fusion process by adjusting fused feature channels as mentioned above. For crowd counting, this design uses crowd context to promote the screening of fused features, and experimental results prove its effectiveness in Section 4.5.
In addition to high-resolution density map estimation as the major training objective, a semantic segmentation task is also introduced for auxiliary context feature learning. The crowd image in actual application scenarios inevitably contains various complex backgrounds. Some background parts are similar to congested crowds, leading to incorrect estimation of crowd density, since a single pedestrian in a dense crowd only occupies limited pixels. Semantic segmentation supervision can force the model to focus on the crowd areas and provide crowd location information, thereby promoting crowd density map estimation. Therefore, pixel-wise classification is employed as an auxiliary task in our proposed MFFN. As shown in Figure 2, a convolutional layer behind DDCB classifies pixels into two categories, backgrounds and crowds, which makes DDCB features distinguishable and further improves feature fusion. Annotating the foreground may consume a lot of labour and resources; we therefore generate the ground truth label map of semantic segmentation by density map binarization. Pixels that are not equal to zero in the density map belong to crowds, and the others are backgrounds. Although the segmentation label is relatively coarse, accurate classification for each object is not critical. As an auxiliary signal, it is enough to provide attentive information about crowds for density map estimation. Previous works use density-level classification as an auxiliary task and only employ a rough global density feature, while the segmentation task forces the model to learn discriminative features for the distinction between backgrounds and crowds. Experimental results show that these discriminative features contribute more to density map estimation.

3.2 | Dense dilated convolution block
As shown in Figure 1, people in unconstrained scenes exhibit large variations posing great challenges for recognition, and therefore discriminative feature extraction is crucial. To enhance feature learning, we propose a dense dilated convolution block (DDCB) to exploit discriminative context features. DDCB consists of a set of densely connected dilated convolutional layers, which not only increases the size of the receptive field but also covers a wide range of scales. Li et al. [10] first introduced the dilated convolution for crowd counting, because it is able to achieve a larger receptive field without losing the size of the feature map by inserting zeros into the convolution kernel. However, we believe that only a large receptive field is not enough to extract discriminative context features. Motivated by DenseNet [49], we therefore consider connecting dilated convolutional layers in a dense way. Figure 3 depicts the structure of DDCB containing four dilated convolutional layers and a traditional 1 × 1 convolutional layer. The dilated convolution is formulated as follows:

$$y(l, m) = \sum_{i=1}^{L} \sum_{j=1}^{M} x(l + r \cdot i, m + r \cdot j)\, w(i, j), \tag{1}$$

where x and y are the input and output of the dilated convolution, respectively. w denotes a filter with the length of L and the width of M. r is the dilation rate. To simplify notations, we use D_{k,r}(x) to represent a dilated convolution with a filter of k × k and a dilation rate of r. Consequently, the densely connected dilated convolution in our DDCB can be written as:

$$x_l = D_{3,r}([x_0, x_1, \ldots, x_{l-1}]), \tag{2}$$

where x_l means the l-th output features, and [x_0, …, x_{l−1}] is the concatenation of the features in layers 0, …, l − 1. DDCB stacks four dilated convolutional layers with 256 filters of size 3 × 3 and dilation rates of (2, 2, 4, 4). A 1 × 1 convolutional layer can reduce the number of output feature maps. Finally, the output features y of DDCB can be formulated as:

$$y = D_{1,1}([x_0, x_1, x_2, x_3, x_4]). \tag{3}$$

Figure 3. The architecture of the dense dilated convolution block (DDCB). d denotes the dilation rate of a convolutional layer.
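A quick side calculation (ours, not from the paper) shows how the four dilated layers grow the receptive field. A k × k convolution with dilation rate r adds (k − 1)·r pixels to the receptive field of a stacked chain:

```python
# Receptive field of a chain of dilated convolutions: each k x k layer with
# dilation rate r widens the field of a centred output unit by (k - 1) * r.

def receptive_field(kernel, rates):
    rf = 1
    for r in rates:
        rf += (kernel - 1) * r  # contribution of one dilated layer
    return rf

# DDCB: four 3x3 dilated convolutions with rates (2, 2, 4, 4)
print(receptive_field(3, (2, 2, 4, 4)))   # -> 25
# A plain chain of four 3x3 convolutions (rate 1) only reaches 9
print(receptive_field(3, (1, 1, 1, 1)))   # -> 9
```

Because of the dense connections, the 1 × 1 fusion layer sees not only the 25-pixel field of the deepest path but also the intermediate fields (5, 9 and 17 pixels), which is what gives DDCB its multi-scale coverage.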

Figure 4. The architecture of the channel attention module (CAM).

The proposed DDCB aims to extract multi-scale context features and guide fused feature adjustments for dealing with various variations in crowd counting, rather than simply combining dense connections and dilated convolutions. In [50], the authors demonstrate through experiments that extracting multi-scale features at a relatively late stage of the network is more efficient than at early layers. DDCB, located behind the encoder, densely concatenates high-level features with large receptive fields, which can facilitate multi-scale context representation. Different from DenseNet, which aims to deepen the network, DDCB obtains receptive fields of different sizes by dense connections and covers a wide scale range of objects. Dense connections in DDCB mainly play the role of extracting multi-scale features and coping with scale variations.

3.3 | Channel attention module
The crowd features at different stages retain inconsistencies, which influences the fused feature representation. To improve the representation ability of fused features, a CAM is introduced to adjust the weights of fused features at each stage, as illustrated in Figure 4. According to the context features of DDCB, CAM adaptively assigns larger weights to the feature channels that contribute to density estimation and suppresses the useless feature channels.
The output features of DDCB are first fed into a global average pooling layer to aggregate spatial information and produce a channel descriptor embedding the global context. The descriptor z_c of the c-th channel is generated by shrinking the spatial dimensions H × W, which can be calculated by:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \tag{4}$$

where u_c is the c-th channel of the DDCB output features. It is followed by two 1 × 1 convolutional layers to capture channel-wise dependencies. The channel attention is defined as:

$$s = \sigma\left(W_2\, \delta(W_1 z)\right), \tag{5}$$

where δ is the ReLU function, and σ means the sigmoid activation. W_1 and W_2 denote the two convolutions in CAM.
The final output f̃ of the module is obtained by scaling the fused features f with the attention weights s, as formulated in Equation (6):

$$\tilde{f}_c = s_c \cdot f_c. \tag{6}$$
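The whole recalibration step can be made concrete with a small numeric sketch (plain Python lists instead of tensors; the weights W1 and W2 are hand-set here, whereas the real module learns them as 1 × 1 convolutions):

```python
import math

# Channel attention in three steps: global average pooling of the DDCB
# context, a two-layer bottleneck with ReLU and sigmoid, and channel-wise
# scaling of the fused features.

def global_avg_pool(features):
    """features: [C][H][W] -> channel descriptor z of length C."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in features]

def channel_attention(z, w1, w2):
    """s = sigmoid(W2 * relu(W1 * z)); w1: [C'][C], w2: [C][C']."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]

def recalibrate(fused, s):
    """Scale each fused channel by its attention weight."""
    return [[[s_c * v for v in row] for row in ch]
            for ch, s_c in zip(fused, s)]

# Toy example: 2 context channels and 2 fused channels of size 2x2
context = [[[4.0, 4.0], [4.0, 4.0]], [[0.0, 2.0], [0.0, 2.0]]]
fused = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
z = global_avg_pool(context)              # [4.0, 1.0]
s = channel_attention(z, w1=[[0.5, 0.0]], w2=[[1.0], [-1.0]])
out = recalibrate(fused, s)
assert all(0.0 < s_c < 1.0 for s_c in s)  # sigmoid keeps weights in (0, 1)
```

Note that the attention weights come from the DDCB context rather than from the fused features themselves, which is the point of difference from SENet discussed below.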
A similar technique is proposed in SENet [51] to model interdependencies between channels. SENet learns the channel attention of a layer based only on the features of that layer; therefore, its context information is limited. Our CAM, however, uses the global context derived from DDCB to guide channel attention learning, promoting more accurate adjustments. The multi-scale context features extracted by DDCB provide a more diverse feature representation than single-layer features.

3.4 | Training details
The proposed network is trained in an end-to-end manner with two supervisions. We introduce the training details about the ground truth and loss function in this section.

3.4.1 | Ground truth
For density map estimation, we generate the ground truth following the method in [7]. The geometry-adaptive Gaussian kernel normalized to one is used to blur head annotations, and the ground truth density map can be defined as follows:

$$F(x) = \sum_{i=1}^{N_h} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \cdot d_i, \tag{7}$$

where x_i is the two-dimensional coordinate of the i-th head annotation, represented with a delta function δ(x − x_i). N_h means the total number of head annotations. The parameter σ_i, which equals β · d_i, denotes the standard deviation of the Gaussian kernel G_{σ_i}, in which d_i is the average distance to the k nearest neighbours. In our experiments, we set β = 0.3 and k = 5.
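The construction above can be reproduced in a short script (an illustrative reconstruction, not the authors' code; we use k = 2 instead of the paper's k = 5 because the toy example has only three heads):

```python
import math

# Geometry-adaptive ground truth density map: each head annotation is
# blurred by a Gaussian whose sigma is beta times its average distance to
# the k nearest neighbouring heads; each kernel is normalised to sum to one.

def density_map(points, h, w, beta=0.3, k=2):
    dmap = [[0.0] * w for _ in range(h)]
    for i, (px, py) in enumerate(points):
        dists = sorted(math.hypot(px - qx, py - qy)
                       for j, (qx, qy) in enumerate(points) if j != i)
        sigma = beta * sum(dists[:k]) / k        # geometry-adaptive sigma
        kern = [[math.exp(-((x - px) ** 2 + (y - py) ** 2)
                          / (2 * sigma ** 2))
                 for x in range(w)] for y in range(h)]
        total = sum(map(sum, kern))
        for y in range(h):                       # normalised: each head adds 1
            for x in range(w):
                dmap[y][x] += kern[y][x] / total
    return dmap

heads = [(20, 20), (10, 10), (30, 25)]
dmap = density_map(heads, 40, 40)
count = sum(map(sum, dmap))
assert abs(count - len(heads)) < 1e-6  # the integral recovers the head count
```

Since each Gaussian kernel is normalized over the grid, the integral of the density map equals the number of annotated heads, which is exactly the property that lets the count be read off the estimated map.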
The ground truth segmentation label is generated based on the density map due to the lack of precise annotations. The pixels are roughly classified into the background and the crowd. The segmentation label is rough but effective, since we focus on density map estimation and mainly employ segmentation for auxiliary learning. It can be defined as follows:

$$S(x) = \begin{cases} 1, & D[F(x)] \neq 0 \\ 0, & D[F(x)] = 0 \end{cases} \tag{8}$$

where D[⋅] denotes 8-times down-sampling, and F(x) means the ground truth density map. 1 and 0 represent crowds and backgrounds, respectively. Pixels that are not equal to 0 in the down-sampled density map correspond to the crowd category of the segmentation label; in contrast, the background of the segmentation corresponds to the points with a value of 0.
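Our reading of this step as code (a sketch under stated assumptions; the paper does not specify the pooling used for D[⋅], so sum-pooling over 8 × 8 blocks is assumed here):

```python
# Segmentation label from the density map: 8x downsample by sum-pooling,
# then mark every non-zero cell as crowd (1) and the rest as background (0).

def segmentation_label(dmap, factor=8):
    h, w = len(dmap), len(dmap[0])
    label = []
    for by in range(h // factor):
        row = []
        for bx in range(w // factor):
            block_sum = sum(dmap[by * factor + dy][bx * factor + dx]
                            for dy in range(factor) for dx in range(factor))
            row.append(1 if block_sum > 0 else 0)   # crowd vs background
        label.append(row)
    return label

# A 16x16 density map with mass only in the top-left 8x8 region
dmap = [[0.0] * 16 for _ in range(16)]
dmap[3][4] = 0.5
print(segmentation_label(dmap))   # -> [[1, 0], [0, 0]]
```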

3.4.2 | Loss function
The pixel-wise Euclidean loss is used to measure the difference between the estimated density map and the corresponding ground truth, which is formulated as follows:

$$L_E = \frac{1}{2N_p} \left\| F(X; \Theta) - F \right\|_2^2, \tag{9}$$

where N_p is the number of pixels, and Θ is the set of parameters to be optimized. F and F(X; Θ) are the ground truth and estimated density map, respectively. As we know, the Euclidean loss is sensitive to outliers and leads to blurred results. Therefore, another attempt aims at minimizing the SSIM loss to overcome these issues and improve the quality of results. The SSIM loss measures the local pattern consistency of the ground truth and estimated density map in the study by Cao et al. [11], which is defined below:

$$L_S = 1 - \frac{1}{N_p} \sum_{x} \mathrm{SSIM}(x). \tag{10}$$

For the semantic segmentation task, we choose the cross-entropy loss as the loss function:

$$L_{seg} = -\sum_{i} \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right], \tag{11}$$

where t_i and y_i are the ground truth and estimated values, respectively. The overall loss L is calculated by weighting the three loss functions:

$$L = L_E + \alpha L_S + \beta L_{seg}, \tag{12}$$

where α and β are hyperparameters, and we empirically set α = 10⁻³ and β = 10⁻⁵.
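A simplified numeric sketch of the three-part objective (ours; for brevity the SSIM term is computed globally over the whole map rather than with the local windows of SANet [11], so it should be read as an approximation):

```python
import math

# Combined loss = Euclidean + alpha * SSIM loss + beta * cross-entropy,
# on flattened maps represented as plain Python lists.

def euclidean_loss(pred, gt):
    n = len(pred)
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / (2 * n)

def ssim_loss(pred, gt, c1=1e-4, c2=9e-4):
    # Global SSIM over the whole map (an approximation of windowed SSIM)
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gt) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gt)) / n
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1.0 - ssim

def cross_entropy(pred, gt, eps=1e-12):
    # Binary cross-entropy for the crowd/background segmentation head
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, gt)) / len(pred)

def total_loss(d_pred, d_gt, s_pred, s_gt, alpha=1e-3, beta=1e-5):
    return (euclidean_loss(d_pred, d_gt)
            + alpha * ssim_loss(d_pred, d_gt)
            + beta * cross_entropy(s_pred, s_gt))

gt = [0.0, 0.2, 0.8, 0.1]
perfect = total_loss(gt, gt, [0.0, 1.0], [0, 1])
noisy = total_loss([0.1, 0.4, 0.5, 0.3], gt, [0.6, 0.4], [0, 1])
assert perfect < noisy  # the loss penalises deviations from the ground truth
```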

4 | EXPERIMENTS
We evaluate the performance of our MFFN on three crowd counting datasets including ShanghaiTech [7], UCF_CC_50 [28] and UCF-QNRF [52]. Data augmentation is conducted to increase the number of training samples as well as avoid overfitting. We crop nine patches from each training image and flip them horizontally so that each image produces 18 training samples. The size of each patch is 1/4 of the original image. The encoder in MFFN is initialized with the parameters of a well-trained VGG-16 on ImageNet. The initial values of the other layers follow a Gaussian distribution with a standard deviation of 0.01. The Adam [53] optimizer with a learning rate of 10⁻⁵ is applied to train the model. All the experiments are based on the PyTorch [54] framework. In this section, we first introduce the evaluation metrics, and then compare the proposed network with previous state-of-the-art methods on the three datasets. Finally, ablation studies report the effectiveness of each module in MFFN.
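The augmentation protocol can be sketched as follows (the 3 × 3 grid of evenly spaced crop positions is our assumption; the paper only states that nine quarter-size patches are cropped and horizontally flipped):

```python
import itertools

# Nine quarter-area crops per image (half height x half width, taken at a
# 3x3 grid of evenly spaced positions) plus a horizontal flip of each,
# yielding 18 training samples per image.

def augment(image):
    """image: [H][W] list of rows -> list of 18 patches of size H/2 x W/2."""
    h, w = len(image), len(image[0])
    ph, pw = h // 2, w // 2        # quarter area = half height x half width
    patches = []
    for iy, ix in itertools.product(range(3), range(3)):
        top, left = iy * (h - ph) // 2, ix * (w - pw) // 2
        crop = [row[left:left + pw] for row in image[top:top + ph]]
        patches.append(crop)
        patches.append([row[::-1] for row in crop])   # horizontal flip
    return patches

image = [[x + 10 * y for x in range(8)] for y in range(8)]
patches = augment(image)
assert len(patches) == 18
assert all(len(p) == 4 and len(p[0]) == 4 for p in patches)
```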

4.1 | Evaluation metrics
Following previous works [7, 11, 55], we choose the mean absolute error (MAE) and the mean squared error (MSE) as evaluation metrics. They are defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \quad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}, \tag{13}$$

where N is the number of test images, and y_i and ŷ_i are the ground truth and estimated crowd count of the i-th image, respectively. MAE denotes the accuracy of the estimated count, and MSE indicates robustness.
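Both metrics, following the definitions above, take only a few lines of plain Python:

```python
import math

# MAE and (root) MSE over per-image crowd counts, as defined in Eq. (13).

def mae(gt, est):
    return sum(abs(g - e) for g, e in zip(gt, est)) / len(gt)

def mse(gt, est):
    return math.sqrt(sum((g - e) ** 2 for g, e in zip(gt, est)) / len(gt))

gt_counts = [120.0, 300.0, 45.0]
est_counts = [110.0, 310.0, 50.0]
print(mae(gt_counts, est_counts))   # ~8.33
print(mse(gt_counts, est_counts))   # ~8.66
```

Note that, as in most crowd counting papers, "MSE" here is actually the root of the mean squared error, which keeps both metrics in units of people.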

4.2 | ShanghaiTech dataset
The ShanghaiTech dataset was first introduced in the study by Zhang et al. [7] and consists of 1198 annotated crowd images with a total of 330,165 people. It is divided into two parts: Part_A and Part_B. Part_A has 482 images collected from the Internet, of which 300 are training images and the remaining 182 are for testing. Part_B includes 716 images taken from the busy streets of Shanghai. The training and test sets of Part_B contain 400 and 316 images, respectively. The scenes of Part_A are highly congested, while the images in Part_B are relatively sparse with large-scale variations. Figure 5 shows examples of ground truth and estimated density maps in the ShanghaiTech dataset. We compare the count error of MFFN on the ShanghaiTech dataset with 11 recent state-of-the-art methods, as shown in Table 1. All of these methods are CNN-based approaches proposed in recent years which have delivered superior results. Our proposed method obtains the lowest MAE and MSE among all the methods on both Part_A and Part_B. Specifically, MFFN achieves an improvement of 4.3% for MAE in comparison to the second best solution RAZ-Net on Part_A. It indicates that MFFN is effective in both congested and sparse scenes.
To further evaluate the count errors of our proposed method at different density levels, we follow the setting of this dataset and divide the test images into 10 groups according to the number of people in the image. Each of the first nine groups has 18 and 31 test images in Part_A and Part_B, respectively, and there are 20 and 37 images in the last group. Figure 6 depicts the ground truth and estimated counts at different density levels. On both Part_A and Part_B, the lines of the ground truth and MFFN are very close and almost coincident, which proves that our method can achieve small count errors at every density level.

4.3 | UCF_CC_50 dataset
The UCF_CC_50 dataset [28] is extremely challenging, since only 50 images with head annotations are provided. The crowd count of each image in this dataset varies widely, ranging from 94 to 4543. A total of 63,874 annotations are marked with an average of 1280 per image. Figure 7 shows examples in UCF_CC_50 and the visual comparison of the ground truth and estimated density maps.
We perform a fivefold cross-validation to evaluate the performance of the proposed method, following the standard setting in the study by Idrees et al. [28]. As tabulated in Table 2, our method achieves the best MAE and a competitive MSE compared with other recent methods. The MAE of MFFN is 10.1% lower than that of the second best method TEDNet. The results denote that our MFFN not only performs well in relatively sparse scenes but also works in dense scenarios. Moreover, it also indicates that the fusion of spatial details and semantic information is effective for count error reduction, since many objects in UCF_CC_50 are quite dense and only occupy a few pixels.
We also compare the estimated count with the ground truth for each image in UCF_CC_50, as shown in Figure 8. Most points in the estimated line are close to their actual counts, while several images with extremely large crowd counts present obvious estimation errors. We attribute this to the very small proportion of extremely dense images in the dataset.

4.4 | UCF-QNRF dataset
To the best of our knowledge, UCF-QNRF [52] is currently the largest crowd counting dataset in terms of the number of people annotated. This dataset consists of 1535 crowd images of different resolutions with 1,251,642 people. There are 1201 and 334 images in the training set and test set, respectively. The number of people in an image ranges from 49 to 12,865, whereas the median and mean counts are 425 and 815. Figure 9 shows the density maps estimated by MFFN and their ground truth maps in the UCF-QNRF test set. Although the scenes and scales in the dataset vary greatly, our MFFN can produce results that approximate the ground truth density maps.

5 | CONCLUSION
Herein, we propose a novel encoder-decoder network named MFFN to incorporate spatial details and semantic information for generating high-resolution crowd density maps. The first ten layers of VGG-16 are used as the encoder, and the decoder, containing a set of convolutional layers and upsampling operations, generates the final density map. The low-level spatial features and high-level semantic features are combined by skip connections between the encoder and the decoder. In addition, we improve the fusion process to deal with the diverse consistency at different stages. A dense dilated convolution block extracts context features to guide channel-wise fused feature adjustment by a channel attention module. The overall architecture is trained by multi-task learning, with foreground segmentation as an auxiliary task promoting crowd density map estimation. Extensive experiments on three crowd counting datasets are conducted to compare with state-of-the-art methods, and our proposed method outperforms all recent approaches.