SSRNet: A Deep Learning Network via Spatial-Based Super-resolution Reconstruction for Cell Counting and Segmentation

Cell counting and segmentation are critical tasks in biology and medicine. The traditional methods for cell counting are labor-intensive, time-consuming, and prone to human error. Recently, deep learning-based cell counting methods have become a trend, including point-based counting methods, such as cell detection and cell density prediction, and nonpoint-based counting, such as cell number regression prediction. However, point-based counting methods rely heavily on well-annotated datasets, which are scarce and difficult to obtain. Nonpoint-based counting, on the other hand, is less interpretable. The task of cell counting is approached by dividing it into two subtasks: cell number prediction and cell distribution prediction. To accomplish this, a deep learning network for spatial-based super-resolution reconstruction (SSRNet) is proposed that predicts the cell count and segments the cell distribution contour. To effectively train the model, an optimized multitask loss function (OM loss) is proposed that coordinates the training of multiple tasks. In SSRNet, a spatial-based super-resolution fast upsampling module (SSR-upsampling) is proposed for feature map enhancement and one-step upsampling, which can enlarge the deep feature map by 32 times without blurring and achieves fine-grained detail and fast processing. SSRNet uses an optimized encoder network. Compared with the classic U-Net, SSRNet's running memory read and write consumption is only 1/10 of that of U-Net, and the total number of multiply and add calculations is 1/20 of that of U-Net. Compared with traditional sampling methods, SSR-upsampling can complete the upsampling of the entire decoder stage in one step, reducing the complexity of the network and achieving better performance. Experiments demonstrate that the method achieves state-of-the-art performance in cell counting and segmentation tasks. The method achieves nonpoint-based counting, eliminating the need for exact position annotation of each cell in the image during training. The code is public on GitHub (https://github.com/Roin626/SSRnet).


Introduction
Cell counting is an important process in medical testing as it provides valuable information about the number of cells in a sample, which can help determine the presence or absence of disease. [1] Cell counting also plays a key role in various medical research studies, where accurate and precise measurement of cell numbers and distribution is required. This information helps researchers better understand cell behavior, growth, and response to different treatments, which is crucial in developing new treatments and medications. Furthermore, cell counting helps in quality control for various medical products and in monitoring the health of patients undergoing treatment. [2] Therefore, cell counting is a fundamental aspect of medical science and plays a crucial role in advancing our understanding of human health and diseases.
The traditional method of cell counting, the plate-based counting method, involves medical workers manually counting cells under a microscope to determine the density of cells in the sample. However, this method has several drawbacks. First, it is a time-consuming and repetitive manual task. Second, manual counting can lead to subjective errors and a lack of accuracy due to the repetitive nature of the work. Third, the manual counting method does not estimate the error range. Finally, while there are devices that can count cells quickly, they can be expensive and may not be able to capture images of the counted cells.
[5,6] In contrast to traditional "physics-based or manual-based" methods, deep convolutional neural networks (DCNNs) are a type of data-driven artificial neural network that is particularly well suited to processing image data and has proven highly effective in recognizing patterns and features in images. [7] This has led to increased interest in using DCNNs for cell segmentation and counting in medical imaging, as they can automate these tasks and improve their accuracy compared with traditional manual counting. [5,8,9] Earlier methods for cell counting were mainly based on target detection or on creating a density map. These methods required the precise coordinate information of each cell in the image as ground truth for training the model; based on these marker points, the neural network learnt the features of the cells and their surroundings. As a result, these methods are point-based and require detailed cell coordinate annotations, which can be time-consuming and labor-intensive. The transformation of these annotations into a density map can also introduce errors, affecting the accuracy of the results. Currently, most cell datasets only provide count annotations and segmented ground truth without specific cell location markers. If counting can be implemented without point labeling, these datasets, which are difficult to use with traditional point-based counting methods, become readily usable.
If a model can be developed that can count cells in an image and detect their distribution range, then such datasets can be used without the need for detailed cell location markers, reducing the time and effort required for cell counting and increasing its accuracy. This method can also potentially improve the scalability of cell counting and make it more accessible to a wider range of researchers and medical professionals. Additionally, using deep learning models for cell counting can help overcome some of the limitations of traditional manual counting methods, such as subjectivity and a lack of quantification of the error range. By automating the cell counting process and providing more accurate results, this method can lead to better and more reliable medical diagnoses and treatments.
[12] These models show good performance. Therefore, we propose a deep learning network for spatial-based super-resolution reconstruction (SSRNet) to predict both the number and the distribution of cells simultaneously. We call this method, which combines direct counting with cell distribution prediction, the "nonpoint-based counting method".
The model is based on an encoder-decoder architecture and performs two tasks: cell counting and generation of the cell distribution. In the encoder part, we have greatly simplified visual geometry group (VGG)16 so that our model can quickly extract the cell features in the image. At the same time, we propose a spatial-based super-resolution fast upsampling module (SSR-upsampling). This module reduces the loss or distortion of image information caused by fast upsampling by performing a one-step fast upsampling of the feature map and enriching the spatial information of the features. To process the two tasks at the same time, we propose an optimized multitask loss function. In this way, we use one module to complete the entire decoder process, greatly reducing the complexity of the model.
In addition, to better train the multitask network, we propose an optimized multitask loss function to coordinate the training of the counting task and the area division task. In short, our contributions are as follows. 1) We propose an SSRNet to predict both the number and distribution of cells. 2) Our SSRNet overcomes the limitation of traditional point-based counting methods and achieves nonpoint-based counting. 3) We propose a slimming VGG, which only uses five down-sampling units and is one-third of the size of VGG16. 4) We propose SSR-upsampling, which can quickly enlarge the feature map 32 times without losing information. 5) We propose an optimized multitask loss function (OM loss).

Background
At present, the mainstream cell counting methods can be divided into detection-based counting and density map-based counting. [5,8,13] Some early studies mark each individual cell in an image with a bounding box and count the number of boxes to determine the total cell count. [14,15] These methods are called detection-based counting methods and are known for the highest accuracy. However, creating the ground truth for each image can be time-consuming and requires a high level of expertise. In addition, it may not be feasible to use this method on certain types of images or with certain types of cells.
On the other hand, in the density map-based counting method, a Gaussian filter is used to create a density map based on the cells' coordinates. [16,17] This method uses the brightness of the density map to distinguish the positions of cells and uses the total pixel value of the density map as the counting result. [16] It strikes a good balance between counting and localization, but it also requires detailed object coordinates, similar to the detection-based counting method. [18] Additionally, its counting accuracy depends on the parameters of the Gaussian filter used and the choice of filter size. [19] The density map-based counting method has some advantages over the detection-based method: it can handle images with varying cell densities, and it can be more robust to noise, as the Gaussian filter helps smooth the image. In addition, it does not require edge annotation of individual cells, which can save time and reduce errors. [20] However, it still requires annotation of approximate cell locations, which may demand a certain level of expertise. It may also be less accurate than the detection-based method, especially for crowded or overlapping cells. [21] Overall, the density map-based counting method can be a good alternative to the detection-based method when the objects are incomplete or occluded, but it also has its own limitations.
Both detection-based and density-based cell counting methods require a dataset in which each individual cell is accurately labeled. Accurate labeling of cell coordinates is a time-consuming and labor-intensive task, which may be why there are few datasets with detailed coordinate point annotation. Most datasets only provide cell counts, cell classification, and segmentation information.
Direct regression counting methods offer an alternative to point-based counting methods, such as detection-based and density map-based counting, as they do not rely on datasets with precise object coordinates. In direct regression counting, foreground and edge feature extraction is based on the relationship between object features and numbers, as in background subtraction [22] and blob detection. [23,24] However, direct counting methods are sensitive to variations in lighting, background, and object appearance, and they lack the position information of the objects, which may be needed in some applications. [25] The current methods for cell counting, such as detection-based and density map-based counting, thus all have their own limitations. We would like a method that can accurately count cells and display their distribution while remaining easy to design and train. We have found a new method called nonpoint-based counting, which combines the simplicity and efficiency of basic regression methods with density map regression's ability to retain spatial distribution information. [26,27] Whether counting or locating the target cells, the model relies on recognizing the same individual cell characteristics. As a result, the counting task and the positioning task require the same type of feature information, and we only need to attach different task predictions to highly abstract features to predict the number and distribution of the target cells at the same time. This method only requires a simple and efficient network to extract effective information and train two different predictors.

SSRNet
Based on the idea of nonpoint-based counting, [26] we need to predict both the number and the distribution of cells in an image. This requires a multitask generator, which we also want to keep as lightweight as possible.
VGG networks are known for their strong feature extraction capabilities. [29] They use many convolutional layers with small filters, which allows them to learn numerous features from the input image, and the depth of VGG networks also contributes to their ability to extract rich feature representations. [32,33] In medical image processing, fully convolutional networks (FCNs) are often used for medical image segmentation, as seen in studies by Schmitz et al. [33] and Vigueras-Guillén et al., [34] who found that an FCN-based network is better for cell segmentation than a traditional detection network.
Previous research has mainly used encoder-decoder architecture models. This kind of model gradually reduces the size of the input data through multiple down-sampling units in the encoder stage, abstracting the required image features step by step. This architecture preserves spatial information while extracting high-level features, making it useful for tasks that require both detailed information and semantic understanding of the input image. However, the process may cause some details to be lost. To address this problem, the authors of ResNet introduced residual layers to retain high-level features. [35] However, in ResNets there is no direct connection between adjacent bottleneck layers, so features may still be lost between them.
Based on the encoder-decoder architecture and FCNs, U-Net has demonstrated outstanding performance in medical imaging tasks, particularly segmentation. [28,36] The U-Net architecture, first proposed for biomedical image segmentation, uses skip connections to transfer features learnt by the encoder to the decoder, allowing the network to make use of both high-level and low-level features. [28] This improves the network's ability to preserve detailed information and spatial resolution in the segmentation output. In addition, U-Net uses multiple parallel convolutional layers of different scales to capture features at different levels of abstraction, which is effective for image segmentation tasks. [28,39] Therefore, we propose SSRNet, a deep learning network for multitask prediction. We constructed an encoder-decoder FCN and incorporated a U-Net-based feature fusion structure at the beginning and end of the network to improve the quality of the generated images. To enable the model to generate both the number and distribution of cells through nonpoint-based counting, we devised two distinct predictors within the model to predict the cells' number and distribution, respectively.
Through an investigation of existing classical models, we noticed that VGG is a good basic network model that is widely used in deep learning. [29] It can simplify the network while maintaining good feature extraction ability. We therefore take the VGG network as the feature extractor, leveraging its excellent feature extraction ability while exploiting the redundancy in the VGG architecture. As shown in Figure 1, we propose a slimming VGG as the encoder of SSRNet, which only uses five down-sampling units and is one-third of the size of VGG16.
In contrast to U-Net, which conducts feature fusion at each stage, our network performs feature fusion only twice, at the bottom and top layers. This reduction in the frequency of feature fusion simplifies the model and lowers computational complexity. To ensure high-quality image generation, we propose SSR-upsampling as the decoder of SSRNet. After fast down-sampling through the slimming VGG in the encoder stage, we input the extracted bottom-layer features into the spatial-based super-resolution fast upsampling module to generate a high-resolution feature map. SSR-upsampling not only reduces the model's parameter count but also improves its computational efficiency, thereby enhancing the quality of the predicted images as well as the model's pixel-level representation ability and accuracy.
We used this module to perform a 32-fold upsampling of the deepest feature map in one step and then conducted a round of feature fusion with the topmost down-sampling output feature map. In this way, the feature map is integrated with the upper-level feature map that carries the richest information, and the upsampling process is completed quickly. Compared with the traditional encoder-decoder model, this fast upsampling method greatly reduces the number of upsampling rounds and feature fusions.
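To make the data flow concrete, the following minimal PyTorch sketch mirrors the pipeline described above: five down-sampling units, a one-step 32x enlargement of the deepest feature map, a single round of fusion with the topmost features, and two predictors sharing the fused map. The channel widths, the nearest-upsampling stand-in for SSR-upsampling, and both prediction heads are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSRNetSketch(nn.Module):
    """Illustrative skeleton of the SSRNet data flow (not the authors' exact model)."""

    def __init__(self, channels=(16, 32, 64, 128, 256)):
        super().__init__()
        units, c_in = [], 3
        for c_out in channels:  # five down-sampling units (slimming-VGG style)
            units.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            c_in = c_out
        self.encoder = nn.ModuleList(units)
        # Stand-in for SSR-upsampling: one-step 32x enlargement of the deepest map.
        self.upsample = nn.Sequential(
            nn.Conv2d(channels[-1], channels[0], 1),
            nn.Upsample(scale_factor=32, mode="nearest"))
        # Two predictors share the same fused feature map (nonpoint-based counting).
        self.seg_head = nn.Conv2d(channels[0], 1, 1)      # cell distribution map
        self.count_head = nn.Sequential(                  # cell count regressor
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[0], 1))

    def forward(self, x):
        top = None
        for i, unit in enumerate(self.encoder):
            x = unit(x)
            if i == 0:
                top = x                        # topmost down-sampling output
        up = self.upsample(x)                  # 32x enlargement in one step
        # One round of fusion with the (re-enlarged) topmost features.
        fused = up + F.interpolate(top, scale_factor=2, mode="nearest")
        return self.seg_head(fused), self.count_head(fused)

seg, count = SSRNetSketch()(torch.rand(1, 3, 128, 128))
print(seg.shape, count.shape)  # torch.Size([1, 1, 128, 128]) torch.Size([1, 1])
```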

Spatial-Based Super-Resolution Fast Upsampling Module
In the decoder stage of an image generation network, up-sampling the feature map is a crucial step. This is typically achieved through traditional image interpolation methods that expand the image by operating on adjacent pixels. Interpolation upsampling increases the resolution of an image by introducing new pixels between existing ones, estimating the value of each new pixel from its surrounding pixels. Several interpolation methods can be used for image upsampling, each with its own advantages and disadvantages. Nearest-neighbor interpolation is the most basic upsampling method: it assigns the value of the nearest known pixel to each new pixel. It is simple to implement but can produce jagged edges and stair-like patterns in the up-sampled image. Another popular method is bilinear interpolation, which uses a weighted average of the surrounding pixels to estimate the value of each new pixel. Bilinear interpolation produces smoother results than nearest-neighbor interpolation but can still blur the up-sampled image.
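For reference, the two classical baselines behave as follows in PyTorch; this is a plain illustration of the interpolation modes discussed above, not part of SSRNet itself.

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 8, 8)  # a small single-channel feature map

# Nearest-neighbor: copies the closest known pixel; fast but jagged.
up_nearest = F.interpolate(x, scale_factor=4, mode="nearest")

# Bilinear: weighted average of the surrounding pixels; smoother but blurrier.
up_bilinear = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

print(up_nearest.shape, up_bilinear.shape)  # both: torch.Size([1, 1, 32, 32])
```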
To avoid errors in interpolation calculation, researchers sometimes use noninterpolation methods. [43] However, these methods can cause the image to appear as an alternating pattern of brighter and darker squares, similar to a chessboard; when the image is enlarged on a large scale, image quality is seriously affected.
Because both interpolation and noninterpolation methods have their own advantages and disadvantages, many researchers have proposed ways to improve the performance of image upsampling. [44] One kind of method uses convolutional neural networks to enhance the feature map during upsampling and thereby further improve its quality. [45] Such learned upsampling methods are usually based on a spatial relationship mapping of the feature map constructed by a Markov random field (MRF) or conditional random field (CRF). [46] In convolutional neural networks, random fields are implemented by convolving channels with large kernels and applying standardizing operations such as activation layers. For example, in dense CRFs, the kernel size usually covers the entire image. In this process, the CNN output is usually treated as a unary potential, and the final output is obtained through repeated iterations.
However, MRF/CRF methods are often computationally redundant, and various studies have attempted to construct spatial relationships by other means. For example, Zhang et al. [47] used spatial pyramid pooling (SPP) to improve the detection accuracy of spatial-domain steganography, and Pan et al. [45] used a spatial convolutional neural network (SCNN) to capture the pixel spatial relationships across image rows and columns, which strengthens the network's ability to predict semantic objects with strong shape priors but weak appearance coherence. These methods demonstrated promising results in terms of both image quality and computational efficiency.
To reduce the number of network parameters and improve the network's operation speed, we propose a highly efficient upsampling method: SSR-upsampling. The SSR-upsampling module can enlarge the feature map by 32 times or more in one step while introducing almost no interference error.
As shown in Figure 2 and Algorithm 1, in the SSR-upsampling module, we first apply spatial fine-grained processing to the bottom feature map to improve its quality and exploit its information redundancy. It should be noted that the term "spatial" here does not refer to "spatial convolution"; rather, it refers to propagating spatial information through a specifically designed CNN structure. Since the convolutional kernel weights are shared across all slices, spatial fine-grained processing can be considered a type of recurrent neural network.
In the "spatial fine-grained processing," the input lowermost feature map will first be cut into m small slices on the width, and each small slice will first pass through a convolutional layer for spatial information learning.The features processed by a convolutional layer will compose of a new feature graph.Then these feature images with x-axis spatial information will be further cut into multiple small slices on the width, and a spatial convolutional calculation will be performed again so that the spatial information between the previous slices can be correlated.This process enables information to be transmitted sequentially along the order of pixel arrangement and forms a sequential information transmission logic.The feature map enhanced by spatial information is rich in more spatial information, which makes it possible to form an association between pixels in the process of fast upsampling so as not to lose too much information.The final feature map will receive the fusion of multiple features from different spatial scales, with stronger semantic expression ability and spatial information transmission ability, which is conducive to output more accurate and accurate answers in image processing tasks.
To be more precise, suppose we have a 3D kernel tensor W, where each element W_{i,j,k} represents the weight between an element in channel k of the previous slice and an element in channel j of the current slice, with an i-column offset between the two elements. We divide a 3D tensor X into m slices on the horizontal axis and n slices on the vertical axis. As shown in Figure 2, for X_{i,j,k}, where i, j, and k indicate the feature map's horizontal slice, vertical slice, and channel, respectively, the convolutional kernel W operates successively from the leftmost slice to the rightmost slice and then back from the rightmost slice to the leftmost, for a total of 2m rounds. Similarly, in the vertical direction, the kernel operates from the top slice to the bottom slice and then from the bottom slice back to the top. The forward calculation for a single sweep is

X'_{i,j,k} = X_{i,j,k},  for i = 1
X'_{i,j,k} = X_{i,j,k} + f( \sum_{m,n} X'_{i-1, j+m-1, n} \cdot W_{m,k,n} ),  for i > 1    (1)

where f represents a nonlinear activation function, which is the ReLU function by default in our method. X', X'', X''', and X'''' are the iterative results of the feature map in the four information transfer directions. In general, spatial fine-grained processing needs four rounds of sequential convolutional calculation, one per direction (left to right, right to left, top to bottom, and bottom to top):

X' = f_H(X; W_H),  X'' = f_{H'}(X'; W_{H'}),  X''' = f_V(X''; W_V),  X'''' = f_{V'}(X'''; W_{V'})    (2)-(4)

where H and V indicate the horizontal and vertical axes, respectively; the prime on H or V indicates the calculation opposite to the original direction; and W represents the weight parameters of the corresponding convolutional layers.
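A minimal sketch of this four-direction propagation is given below, assuming one-pixel-wide slices and a shared kernel of size 3 per direction (the slice width and kernel size are not specified in the text and are assumptions here). Each sweep adds to the current slice a nonlinear function of the convolved previous slice, following Equations (1)-(4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFineGrained(nn.Module):
    """Sketch of spatial fine-grained processing: four sequential sweeps
    (left-right, right-left, top-bottom, bottom-top) with weights shared
    across slices, so each sweep behaves like a recurrent cell."""

    def __init__(self, channels, k=3):
        super().__init__()
        # Horizontal sweeps mix neighbouring rows; vertical sweeps mix columns.
        self.conv_h = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.conv_v = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))

    def _sweep(self, x, conv, dim, reverse=False):
        order = range(x.size(dim) - 1, -1, -1) if reverse else range(x.size(dim))
        slices, prev = list(x.unbind(dim)), None
        for i in order:
            if prev is not None:
                # X'_i = X_i + f(W * X'_{i-1}): message from the previous slice
                slices[i] = slices[i] + F.relu(conv(prev.unsqueeze(dim)).squeeze(dim))
            prev = slices[i]
        return torch.stack(slices, dim)

    def forward(self, x):                                  # x: (N, C, H, W)
        x = self._sweep(x, self.conv_h, 3)                 # Eq. (1): left -> right
        x = self._sweep(x, self.conv_h, 3, reverse=True)   # Eq. (2): right -> left
        x = self._sweep(x, self.conv_v, 2)                 # Eq. (3): top -> bottom
        x = self._sweep(x, self.conv_v, 2, reverse=True)   # Eq. (4): bottom -> top
        return x

out = SpatialFineGrained(8)(torch.rand(1, 8, 16, 16))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```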
For a more visual comparison, as shown in Figure 3, in a dense MRF/CRF each pixel receives messages from all other pixels directly, resulting in much redundancy. In contrast, our method uses a sequential propagation scheme in which each pixel only receives information from a few neighboring pixels, resulting in efficient sequential information transfer. For example, if an image has H rows and W columns, each of the W x H pixels in a dense MRF/CRF receives messages from all W x H pixels; for N iterations, the total number of MRF/CRF calculations is therefore N·W²·H². In spatial fine-grained processing, each pixel only exchanges information with its k neighboring pixels, and because of the sequential scheme the number of iterations is fixed at 4, so the total calculation count is 4·k·W·H.
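As a quick sanity check on these counts (with illustrative sizes; k is the number of neighbours each pixel exchanges messages with):

```python
H = W = 256          # feature-map height and width
N, k = 3, 3          # CRF iterations; neighbours per pixel in our scheme

dense_crf = N * W**2 * H**2   # every pixel receives messages from every pixel
ours = 4 * k * W * H          # four directional sweeps, k neighbours each

print(f"{dense_crf:.2e} vs {ours:.2e}")  # ~1.29e+10 vs 7.86e+05
```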
As shown in Figure 3, in the same case where each pixel is iterated three times, our method is much simpler than the traditional MRF/CRF method, meaning the total number of calculations is very small. This makes spatial fine-grained processing more efficient for larger images while still allowing each pixel to receive information from all other pixels via propagation along a few directions. This sequential information transmission can effectively support the identification of continuous structures. [45] In the fast upsampling stage, we use an image amplification method based on depth expansion. This method reduces the parameter requirements of the convolutional layer used to predict new pixels by adding pixel information along the channel dimension, and it increases the integrity and compactness of pixel prediction. That is, each newly predicted pixel and its corresponding original pixel form a back-to-back relationship along the channel.
For upsampling a feature map by a factor of d, from x ∈ R^{w,h,c} to a new feature map û ∈ R^{ŵ,ĥ,ĉ}, the channels of x are first expanded from c to ĉ·d. Then, after the first spatial expansion, the channels are expanded again from ĉ to ĉ·d. It is worth noting that the feature maps before and after enlargement must remain mathematically consistent:

ŵ = d·w,  ĥ = d·h,  w · h · (ĉ·d²) = ŵ · ĥ · ĉ

where w, h, and c are the width, height, and channels of the feature map.
The feature map x is expanded along the channel dimension through a convolutional layer. The expanded feature map is flattened into a 1D vector v and then recombined in the horizontal direction as a matrix v ∈ R^{w, h·d, ĉ}; in this way, the new pixels are inserted along one spatial axis of the feature map. Similarly, the matrix v is expanded along the channels through a convolutional layer to v ∈ R^{w, h·d, ĉ·d}. The new matrix is then stretched into a vector u and rebuilt into the new feature map û ∈ R^{w·d, h·d, ĉ} in the vertical direction.
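The two-stage expand-and-recombine pipeline can be sketched as follows; the 1x1 convolutions standing in for the channel-expansion layers and the exact reshape order are assumptions inferred from the description above, not the released code.

```python
import torch
import torch.nn as nn

class DepthExpansionUpsample(nn.Module):
    """Sketch of the depth-expansion enlargement: predict new pixels on the
    channel axis, then interleave them spatially, once per spatial axis."""

    def __init__(self, c_in, c_out, d):
        super().__init__()
        self.d = d
        self.expand_h = nn.Conv2d(c_in, c_out * d, 1)   # new pixels for the height axis
        self.expand_w = nn.Conv2d(c_out, c_out * d, 1)  # new pixels for the width axis

    def forward(self, x):                                # x: (N, c_in, h, w)
        n, _, h, w = x.shape
        d = self.d
        v = self.expand_h(x)                             # (N, c_out*d, h, w)
        # Recombine: interleave the d predicted pixels along the height.
        v = v.view(n, -1, d, h, w).permute(0, 1, 3, 2, 4).reshape(n, -1, h * d, w)
        u = self.expand_w(v)                             # (N, c_out*d, h*d, w)
        # Recombine again along the width to reach the final d x d enlargement.
        u = u.view(n, -1, d, h * d, w).permute(0, 1, 3, 4, 2).reshape(n, -1, h * d, w * d)
        return u                                         # (N, c_out, h*d, w*d)

up = DepthExpansionUpsample(256, 16, 32)
print(up(torch.rand(1, 256, 8, 8)).shape)  # torch.Size([1, 16, 256, 256])
```

With d = 32, a single module takes the deepest 1/32-resolution map back to the input resolution, matching the one-step enlargement described above.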
In addition, the super-resolution fast upsampling module can cooperate with the deep learning model to extract features from large-scale data effectively and achieve high model performance. By effectively learning the internal features of the image, the accuracy of semantic segmentation and object detection can be improved, saving additional computing time and memory consumption. In this study, we use the VGG feature extraction network together with the super-resolution fast upsampling module to further optimize the multitask generator: improving its training speed, reducing the complexity of the model, and maximizing its performance.
The super-resolution fast upsampling module enables the model to extract and preserve the most important features more accurately, better predict targets in the image, compress the image data effectively, and improve the performance of image processing tasks. It can also be used in semantic segmentation or object detection tasks to achieve fast feature extraction and provide effective information input for more accurate predictions.

Algorithm 1: The spatial-based super-resolution fast upsampling module.
Input: topmost feature map (x_t); deepest feature map (x_d). Output: final feature map (x_f)
1. function Spatial fine-grained processing(x_d):
2.   for i in horizontal slices m:
3.     extract the features of x_d from left to right through a convolutional layer according to Equation (1);
4.     put the slices back into the feature map in order as X';
5.   end for
6.   for i in horizontal slices m:
7.     extract the features of X' from right to left through a convolutional layer according to Equation (2);
8.     put the slices back into the feature map in order as X'';
9.   end for
10.  for i in vertical slices n:
11.    extract the features of X'' from top to bottom through a convolutional layer according to Equation (3);
12.    put the slices back into the feature map in order as X''';
13.  end for
14.  for i in vertical slices n:
15.    extract the features of X''' from bottom to top through a convolutional layer according to Equation (4);
16.    put the slices back into the feature map in order as X'''';
17.  end for
…
23.  expand the channels of the feature map x_s from c to ĉ·d through a convolutional layer, and flatten the result into a vector v;
24.  recombine vector v as a matrix v ∈ R^{w, h·d, ĉ} in the horizontal direction;
25.  expand the channels of the feature map v from ĉ to ĉ·d through a convolutional layer;
26.  stretch the expanded feature map into a vector u;
27.  recombine vector u as a matrix û ∈ R^{w·d, h·d, ĉ} in the vertical direction;
28.  fuse û with x_t as the final feature map x_f;
29.  return x_f;

Optimized Multitask Loss Function
To handle both tasks simultaneously in our model, we aim to construct a multitask loss function that is universally applicable. L1 loss and L2 loss are frequently utilized loss functions in machine learning, particularly in computer vision and deep learning.
L1 loss, also referred to as mean absolute error (MAE), calculates the average absolute difference between the ground truth and the predicted values: the absolute differences are summed and divided by the number of samples. L1 loss gives equal weight to large and small errors and is therefore comparatively robust to outliers; in practice, it is often utilized when the target outputs have an arbitrary scale and equal weighting of errors is desired. [48] In addition, L2 loss, also called mean squared error (MSE), is calculated by summing the squared differences between the ground truth and the predicted values and dividing the result by the number of samples. Unlike L1 loss, L2 loss gives more weight to large errors and is therefore more sensitive to outliers.
The L1 loss correctly captures the low frequencies, while the L2 loss penalizes the squared difference between the two variables. Therefore, when building a multitask loss function, L1 loss and L2 loss can meet the different error requirements of different tasks and better improve the learning efficiency of the model. [49] We use our optimized multitask loss function (OM loss) to coordinate the training of the two tasks. The OM loss consists of two components, the counting loss and the pixel loss, which work together to improve the performance of the model.
For the counting loss, L2 loss is used to measure the count accuracy. For N images, let c_i^{pred} be the counting result of the i-th image and c_i^{gt} its ground truth. The counting loss is

L_{count} = (1/N) \sum_{i=1}^{N} (c_i^{pred} - c_i^{gt})²

For the pixel loss, L1 loss is used to improve the structural similarity of the generated images. The L1 loss measures the absolute pixel difference between the generated image and the ground truth and is sensitive to differences in image structure. For the generated image I_i^{pred} and its ground truth I_i^{gt}, the pixel loss is

L_{pixel} = (1/N) \sum_{i=1}^{N} |I_i^{pred} - I_i^{gt}|

The use of both L2 loss and L1 loss allows for a comprehensive evaluation of the model's performance, ensuring both counting accuracy and the structural similarity of the generated images. The total training loss is therefore

L_{OM} = a · L_{pixel} + b · L_{count}

where a and b are the pixel loss weight and the counting loss weight, respectively.
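A minimal sketch of the combined objective, assuming scalar count outputs and the default weights discussed in the Loss Function section (the per-batch weight-selection logic of Algorithm 2 is not shown):

```python
import torch.nn.functional as F

def om_loss(count_pred, count_gt, img_pred, img_gt, a=10.0, b=1.0):
    pixel_loss = F.l1_loss(img_pred, img_gt)       # L1: structural similarity term
    count_loss = F.mse_loss(count_pred, count_gt)  # L2: counting accuracy term
    return a * pixel_loss + b * count_loss
```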
As shown in Algorithm 2, during training we utilize a matrix of loss-function weights and automatically select a weighting for each training batch. The best results achieved during training are recorded for each batch, and the average of these results is used as the threshold for the next batch. If the validation results are not satisfactory after several epochs, training for that batch is terminated and the next batch is trained. This approach saves time during model training.

Evaluation Methods
Our nonpoint-based counting method covers two tasks: predicting the area where cells are distributed and predicting the number of cells. For the cell distribution task, we evaluated performance using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) between the generated segmentation map and the ground truth map.
PSNR is a metric that measures the quality of an image by comparing it to the original image. It represents the ratio of the maximum possible power of a signal to the power of the noise that degrades the signal's accuracy. The PSNR value is expressed in decibels and is commonly used to evaluate the quality of image reconstructions. A higher PSNR indicates less distortion in the image.
SSIM measures the structural distortion between two images. The SSIM value ranges from 0 to 100%, with higher values indicating better image quality:

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_{xy} + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

where μ_x and μ_y are the mean pixel values of the input image x and the original image y, respectively; σ_x² and σ_y² are the variances of x and y, respectively; σ_{xy} is the cross-correlation of x and y; and c_1 and c_2 are constants that avoid numerical errors caused by a zero denominator, calculated as c_1 = (K_1 L)² and c_2 = (K_2 L)², where the constants K_1 and K_2 default to 0.01 and 0.03, respectively, and the dynamic range L defaults to 255 when the input is an 8-bit image.
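Both metrics are available off the shelf; for example, with scikit-image (the noise model below is only for illustration):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
noise = np.random.randint(-5, 6, gt.shape)
pred = np.clip(gt.astype(int) + noise, 0, 255).astype(np.uint8)

psnr = peak_signal_noise_ratio(gt, pred, data_range=255)  # in dB; higher is better
ssim = structural_similarity(gt, pred, data_range=255)    # defaults K1=0.01, K2=0.03
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```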
In addition, MAE and the root-mean-square error (RMSE) are used for the cell counting task. MAE measures the average counting error per image across N images, while RMSE reflects the spread of the counting error and can indicate the error deviation of the prediction:

MAE = (1/N) \sum_{i=1}^{N} |C_i^{gt} - C_i^{pred}|
RMSE = \sqrt{ (1/N) \sum_{i=1}^{N} (C_i^{gt} - C_i^{pred})² }

where C_i^{gt} and C_i^{pred} represent the ground truth number of cells and the predicted cell number of the i-th image, respectively. Both MAE and RMSE should be as small as possible, as smaller values indicate better performance. In addition, we also use MAE and mean squared error (MSE) to measure the pixel-value difference between the generated image and the ground truth as a supplement to PSNR and SSIM. To distinguish them from the counting metrics, we name them IMG_MAE and IMG_MSE.
Since PSNR and SSIM mainly reflect the visual similarity between the generated image and the original image, images with similar visual effects may still differ considerably in pixel values. When the model is used in a field that requires high pixel-value accuracy, such as density map-based crowd counting, deviations in pixel values may seriously affect the accuracy of the model's output. We therefore additionally use IMG_MAE and IMG_MSE, which are calculated from image pixel values, in the experiments that verify the image generation performance of the model. For a test set containing N images, IMG_MAE and IMG_MSE are calculated from the prediction I_i^{pred} of each image and its ground truth I_i^{gt}:

IMG_MAE = (1/N) \sum_{i=1}^{N} |I_i^{pred} - I_i^{gt}|
IMG_MSE = (1/N) \sum_{i=1}^{N} (I_i^{pred} - I_i^{gt})²

where the differences are averaged over all pixels of each image. IMG_MSE better reflects the homogeneity of the model's generation ability: if there are abnormal points, the value of IMG_MSE increases significantly.
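The four error metrics reduce to a few lines of NumPy; the per-pixel averaging in IMG_MAE and IMG_MSE is our reading of the formulas above.

```python
import numpy as np

def counting_metrics(c_gt, c_pred):
    """MAE and RMSE over per-image cell counts."""
    err = np.asarray(c_gt, float) - np.asarray(c_pred, float)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

def image_metrics(imgs_gt, imgs_pred):
    """IMG_MAE and IMG_MSE: pixel-value errors averaged over the test set."""
    diff = np.asarray(imgs_gt, float) - np.asarray(imgs_pred, float)
    return np.mean(np.abs(diff)), np.mean(diff ** 2)
```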
Additionally, we measured the complexity of our model using floating-point operations (FLOPs). For a convolutional layer, FLOPs can be calculated as

FLOPs = 2 · H · W · (C_in · K² + 1) · C_out

where H and W represent the height and width of the feature map, respectively; C_in and C_out represent the channels of the input and output feature maps, respectively; and K represents the kernel size of the convolutional layer.
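As a worked example of this per-layer formula (the +1 accounts for the bias term):

```python
def conv_flops(h, w, c_in, c_out, k):
    """FLOPs of one stride-1 convolutional layer: 2*H*W*(C_in*K^2 + 1)*C_out."""
    return 2 * h * w * (c_in * k * k + 1) * c_out

# A 3x3 convolution, 64 -> 64 channels, on a 256 x 256 feature map:
print(f"{conv_flops(256, 256, 64, 64, 3):.2e}")  # ~4.84e+09 FLOPs
```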

Datasets
We use the BBBC004 and BBBC005 datasets to evaluate our method (Figure 4). These datasets are from the Broad Bioimage Benchmark Collection (BBBC). [50] We have provided public links to the datasets. The BBBC004 dataset (https://bbbc.broadinstitute.org/BBBC004) is divided into five groups according to the degree of cell overlap, each with 20 images, and each image includes 300 cells. The images were created by the SIMCEP platform. [51,52] The dataset contains images of size 950 x 950; to make the images usable in the neural network, a border of 0-pixel values was added, resizing them to 960 x 960. According to BBBC's official website, our team is the first to use these data for training a DCNN for segmentation and counting tasks. We also used the BBBC005 dataset (https://bbbc.broadinstitute.org/BBBC005), which comprises 9600 images of stained human U2OS cells under the microscope with a clustering probability of 25%. The dataset provides images with different degrees of Gaussian blur, of which 1200 images are labeled.
In our experiments, the images in the dataset were not subjected to any additional image enhancement processing (e.g., enhancing brightness or increasing contrast). We first normalized the images in the training subset to enhance the robustness of the model. The images were cropped into pieces of size 128 x 128. Because the image size of BBBC005 is 696 x 520, we used zero padding on the left and bottom of each image to make its length and width multiples of 128 pixels for easy calculation. As there was no pre-existing training-testing split for the dataset, we evaluated the counting model using fivefold cross-validation.
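As a sketch of this tiling step (assuming PyTorch tensors and bottom/right padding for simplicity; the paper pads the left and bottom):

```python
import torch
import torch.nn.functional as F

def pad_and_crop(img, tile=128):
    """Zero-pad an image tensor (C, H, W) so H and W are multiples of `tile`,
    then cut it into tile x tile pieces."""
    c, h, w = img.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile
    img = F.pad(img, (0, pad_w, 0, pad_h))               # (left, right, top, bottom)
    tiles = img.unfold(1, tile, tile).unfold(2, tile, tile)
    return tiles.reshape(c, -1, tile, tile).transpose(0, 1)  # (n_tiles, C, tile, tile)

tiles = pad_and_crop(torch.rand(1, 520, 696))
print(tiles.shape)  # torch.Size([30, 1, 128, 128])
```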
Our experiments are based on a Python environment, using a Quadro P4000 graphics card. Due to the limitations of the training equipment, we used the minimum batch size of 1 to prevent out-of-memory errors. We used Adam as the optimizer, with the learning rate set to 0.001 and halved every 10 epochs. We trained each model for 50 epochs.
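This schedule translates directly into PyTorch; the data loader below is a dummy one-sample stand-in, and the model and loss refer to the sketches given in earlier sections.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

model = SSRNetSketch()                        # sketch model from the SSRNet section
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve lr every 10 epochs

# Dummy stand-in for the real data pipeline: one 128x128 crop per batch.
loader = [(torch.rand(1, 3, 128, 128), torch.tensor([[10.0]]), torch.rand(1, 1, 128, 128))]

for epoch in range(50):
    for image, count_gt, mask_gt in loader:   # batch size 1
        seg, count = model(image)
        loss = om_loss(count, count_gt, seg, mask_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```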

Upsampling Method
To confirm the down-sampling network structure of the model, we first compared the effect of applying SSR-upsampling to different down-sampling models. As shown in Table 1, VGG16 as a down-sampling model achieves better results than the other traditional models, so it was chosen as the baseline of our study; we then further reduced the model parameters to achieve better experimental results. In this experiment, we adopted the 16-times up-sampling scheme for the SSR-upsampling module.
As shown in Figure 5 and Table 1, from ResNet50 to VGG16 to our optimized model, the reduction in model size improves both the running speed and the accuracy of the counting task. The residual layer does not perform very well in our experiment: neither ResNet50 nor MobileCount, [53] a simplified model based on ResNet, performs well in the counting tasks, and the MobileCount model may even be too simplified to generate a good cell distribution map.
We then compared SSR-upsampling with other sampling methods, using the slimming VGG as the down-sampling model throughout. In Table 2, the results show that our SSR-upsampling module balances the training of the counting task and cell distribution prediction well and achieves good results in both tasks.
The SSR-upsampling method and the other classic upsampling methods produce very similar PSNR and SSIM results, which means that SSR-upsampling provides fast upsampling capability while achieving almost the same image restoration quality as the classical upsampling methods.
In addition, the IMG_MAE and IMG_MSE results show that our method also achieves excellent results in terms of pixel-value deviation. At the same time, because the inserted pixel values are predicted by the neural network rather than computed by simple mathematical interpolation, the SSR-upsampling method achieves better MAE and RMSE results, which means that the enlarged image has better fidelity for the predictor.
Nearest upsampling can produce a relatively good image segmentation result, but the pixel error introduced by this method leads to the worst performance in the counting results. The bilinear upsampling method improves the counting results over nearest upsampling and has similar image generation accuracy. However, these two methods lack learning ability, which may be why their performance is not as good as our SSR-upsampling method. Dilated convolution balances the training of the two tasks; however, because of its chessboard effect, the generated image is not good enough (Figure 6).
We evaluated the performance of SSR-upsampling with different upsampling scales. With a scale of 2x, the model needs four consecutive SSR-upsampling modules to enlarge the image 16 times; with a scale of 4x, it needs two consecutive SSR-upsampling modules.
From Table 3, the results show that using multiple SSR-upsampling modules improves the accuracy of count prediction because more modules are used. However, it does not bring a qualitative improvement in generation accuracy, and it causes a serious decline in image processing speed because of the increased number of spatial fine-grained processing rounds. The structure of the model with 32-times magnification is nearly identical to that of the 16-times model; however, unlike the latter, the former does not require a 2-times nearest upsampling of the feature map before feature fusion and instead fuses directly with the topmost feature map. While this method has a similar imaging effect to 16-times magnification, it reduces the number of introduced error pixels and significantly enhances counting accuracy.

Loss Function
We compared the performance of OM loss with other loss functions on the BBBC005 and BBBC004 datasets. As shown in Table 4 and Figure 7, the OM loss function performs better in balancing the training of cell number prediction and cell distribution prediction. The reversed OM loss function refers to swapping the positions of L1 loss and L2 loss in the OM loss function, that is, using L2 loss to train the area predictor and L1 loss to train the counting predictor. The reversed OM loss function can also coordinate the training of the two tasks to a certain extent and achieve good results, but it is still worse than the OM loss function. L2 loss is effective for cell counting, and L1 loss is effective for cell distribution prediction.
To achieve the best results for both cell counting and cell distribution, the proposed model utilizes a combination of L1 loss and L2 loss during training. By combining the strengths of both losses, the OM loss balances the learning of the two tasks and leads to improved accuracy for both cell counting and cell distribution predictions. In addition, a rise in the overlapping rate hardly interferes with our SSRNet.
In the proposed OM loss function, the emphasis on cell counting versus cell distribution prediction can be adjusted through the ratio of the pixel loss weight to the counting loss weight. In Table 5, the results show that as the ratio of the L1 loss weight to the L2 loss weight (a/b) increases, the image quality improves, but this improvement is not always guaranteed. The same trend can be observed when the L2 loss weight b increases: performance initially improves and then declines. The default OM loss function was selected based on the best overall results, with a weight of 10 for L1 loss and 1 for L2 loss.
However, these defaults are not necessarily the best parameters, and the weights can be adjusted based on the difficulty of the counting or distribution prediction tasks. For instance, if the counting task is more challenging, the value of b can be moderately increased; if the cell distribution prediction is more complex, the value of a can be moderately increased.

Comparing SSRNet with Other Studies
As shown in Table 6, our method has the best performance in the segmentation and counting tasks. Traditional methods such as VGG16 or U-Net cannot even be used directly for multitask work without transfer learning. Our method predicts both cell count and cell distribution and still surpasses these traditional methods on each single task. In addition, compared with the MFGAN model, which uses a GAN to first train the counting predictor with supervision and then train the segmentation task, our method obtains better prediction performance using the same final feature map to predict both tasks.
From Table 7, our method is faster and takes less space than traditional models. The SSR-upsampling module here uses a magnification of 32 times. FLOPs are the floating-point operations completed by the network. MAdd is the total number of multiply and add operations completed during processing. Memory is the memory required for node inference. Params is the parameter count of the network. MemR+W refers to the memory read and written while the network is running, that is, the size of the input plus the size of the network parameters. Speed is the average processing time per image after running 100 times on a 256 x 256 x 3 image. Our approach has been experimentally validated to achieve outstanding results in reducing the weight of the network, thereby improving the processing efficiency of the model. Compared with the classic U-Net, our SSRNet's running memory read and write consumption is only 1/10 of that of U-Net, and the total number of multiply and add calculations is 1/20 of that of U-Net.

Conclusion
We proposed a new deep learning network called SSRNet for cell counting and cell distribution prediction. SSRNet is designed with an encoder-decoder architecture and can perform both the cell counting and cell distribution range generation tasks. The encoder part of the model is a simplified VGG16 architecture that enables efficient cell feature extraction from images.
Additionally, we proposed SSR-upsampling to minimize the loss of image information during fast upsampling by performing a one-step fast upsampling of the feature map and enriching its spatial information. We also proposed an optimized multitask loss function to enable the simultaneous processing of both tasks, reducing the complexity of the model. Experiments show that our method has good counting accuracy and can predict both the number and distribution of cells. In the future, we hope to obtain image datasets with more complex morphology (such as neural cells) for further research on the segmentation of morphologically complex cells using convolutional neural networks.
Cell counting is a challenging task in image analysis, and developing accurate and efficient methods is essential for various biomedical applications. By demonstrating the effectiveness of nonpoint-based counting, researchers have opened up new avenues for addressing a wide range of object-counting problems.
Nonpoint-based counting methods have the potential to be further developed with semisupervised or self-supervised learning approaches, enabling the use of larger datasets without the need for labeling. As such, this counting method is expected to find applications beyond cell counting in various domains, including industrial inspection, environmental monitoring, and agriculture.


Figure 3. Information transfer directions. A) The MRF/CRF. B) One-way operation in spatial fine-grained processing.


Figure 4. Some samples from the datasets. The first row shows images from the BBBC004 dataset; from left to right, the overlapping degrees are 0, 15, 30, 45, and 60, respectively. The second and third rows show images with different densities from the BBBC005 dataset.

Figure 6. Samples of different upsampling methods. From top to bottom, the overlapping degrees are 0, 15, 30, 45, and 60, respectively.

Figure 7. Samples of different loss functions. From top to bottom, the overlapping degrees are 0, 15, 30, 45, and 60, respectively. The red circles indicate the common errors of rough, missing, or adhesive imaging in other methods.

Table 1. Comparison of the SSR-upsampling module applied to different down-sampling models on BBBC005.

Table 2. Comparison between different upsampling methods.

Table 3. Comparison between different upsampling scales based on the SSR-upsampling module on the BBBC005 dataset.

Table 4. Comparison of different loss functions.

Table 5. Comparison of different task weights.

Table 6. Comparison with different methods.

Table 7. Comparison of file size and running speed.