MP2020: Visual quality assessment database for macro photography images

With the development of mobile phone camera technology, mobile phones can now capture macro photography images that previously required professional cameras, so studying the quality of such images is of great significance. To this end, we establish a visual quality assessment database for macro photography images, named MP2020. The database contains 100 reference images and 800 distorted images covering four distortion types: JPEG 2000 compression, JPEG compression, white noise, and Gaussian blur, with 200 distorted images each. The DMOS values in the database were calculated from 48,000 ratings provided by 60 subjects. Ten classical image quality assessment algorithms were tested on the MP2020 database. The experimental results show that widely used image quality assessment algorithms do not perform well on macro photography images. MP2020 can therefore contribute to the improvement of existing algorithms and the development of new ones. MP2020 has been uploaded to GitHub for download.


INTRODUCTION
Macro photography refers to the use of the optical capabilities of lenses to capture images that are larger or smaller than the actual object. Because of this shooting technique, the main subject of a macro image is sharp while the background is blurred. Such an image highlights its subject and has strong aesthetic appeal, which makes it very popular. In the past, macro photography images could only be taken by skilled photographers with professional cameras. With the development of mobile camera technology, mobile phones can now take macro photography images easily, and the number of such images is growing. However, because of their blurred backgrounds, these images are more prone than ordinary images to a variety of distortions during acquisition, processing, transmission and storage, resulting in quality degradation and information loss. These distortions seriously affect both the practical use and the visual quality of the images, so research on the quality of macro photography images is particularly important.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Image quality assessment (IQA) is the process of analysing image quality. Evaluation methods are divided into subjective and objective image quality assessment [1]. In most image processing applications, human beings are the ultimate recipients of the images, so subjective assessment, in which human observers rate the images directly, is the most effective and reliable method. Objective image quality assessment, by contrast, uses a mathematical model to simulate the subjective perception of the human eye and produce a quality score. According to how much the evaluation depends on the reference image, objective image quality assessment can be divided into full-reference (FR), reduced-reference (RR) and non-reference (NR) image quality assessment [2].
In the past 40 years, image quality assessment has developed rapidly. Full-reference image quality evaluation algorithms are among the earliest and most mature. In the early full-reference algorithms, image quality was evaluated by directly calculating the pixel-wise error between the reference image and the distorted image.
IET Image Process. 2021;1-7.
Although full-reference image quality evaluation is mature, it is limited in practical applications because it requires a complete reference image. Non-reference image quality evaluation algorithms were developed in response and are widely used because they do not need a reference image. At present, the most common non-reference algorithms are BIQI [12], DIIVINE [13], BRISQUE [14], NIQE [15], SSEQ [16], and CNN-based methods such as CNN_IQA [17][18][19]. More recent algorithms include ASIQE [20] and ENIQA [21].
To support the research and development of IQA algorithms, multiple IQA image databases have been established, such as LIVE [22], TID2008 [23], TID2013 [24], and CSIQ [25]. The number of images per degradation category in each database is shown in Table 1.
A closer look shows that most of the images in these databases have an equally sharp subject and background, and few of them are macro photography images of the kind described above. Figure 1 shows the difference between the two kinds of image: Figure 1(a) shows a macro photography image, and Figure 1(b) shows a natural image. As the figure shows, a macro photography image is a special type of image. Unlike an ordinary image, which is uniformly sharp (or uniformly blurred), it has a sharp main subject and a blurred background. Most existing quality evaluation algorithms may therefore treat this kind of image as distorted. Figure 1(c) and (d) show the differences between the macro photography image and the natural image using the histogram of oriented gradients (HOG). Some existing databases do contain images of this type, such as parrots.bmp and monarch.bmp in the LIVE database, but there is no database dedicated to them. Now that mobile phones can also take macro photographs, the number of such images is increasing day by day, so studying the quality of this type of image is very meaningful. For this reason, we set up an image database dedicated to studying the quality of macro photography images, named MP2020.

Reference image and distorted image
The MP2020 database includes 100 reference images and 800 distorted images. The 100 reference images were carefully selected from a large number of macro photography images. Reference images follow a standard size of 255 × 255 pixels; if an image is larger, a representative region is cropped manually. At the beginning of the experiment, nearly 200 candidate images were screened by category, and 100 reference images were finally selected. The database includes 25 images of flowers, 20 of animals, 16 of human figures, 14 of insects, 10 of vegetation, 9 of delicacies, 5 of buildings and 1 of another type. Some of the images are shown in Figure 2. This selection makes the database comprehensive, makes the experimental images more diverse, and reduces visual fatigue during manual scoring.
A total of 800 distorted images are generated from the 100 reference images using four distortion types, with two distortion levels for each type. The distorted images are produced with MATLAB tools as follows:
1. JPEG compression (200 images): The forward discrete cosine transform (FDCT) converts the image from the spatial domain to the frequency domain. The DCT coefficients are quantized with a weighting function, and the quantized coefficients are then encoded with a Huffman variable-length encoder. In the experiment, the reference images are compressed with the JPEG algorithm at quality factors of 5 and 25.
[...] is used to obtain the point spread function, and the imfilter() function is then called to realize defocus blur. In this experiment, a standard two-dimensional Gaussian smoothing filter with a standard deviation of 0.9-6 is used.
Figure 3 shows a typical image and its corresponding distorted images in the MP2020 database. The image was chosen considering factors such as light, colour and whether it has the typical characteristics of macro photography images.
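The Gaussian blur step can be illustrated with a minimal pure-Python sketch. It is not the authors' MATLAB code: the function names `gaussian_kernel` and `blur`, the kernel size of 7, and the edge handling (replicating border pixels) are all assumptions made for illustration, and the image is a plain list of grayscale rows.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian point spread function (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    # Normalize so the weights sum to 1 and overall brightness is preserved.
    return [[v / total for v in row] for row in kernel]

def blur(image, kernel):
    """Filter a 2-D grayscale image (list of rows) with the kernel.

    Border pixels are replicated (an assumption; other padding modes exist).
    """
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr = min(max(r + dy, 0), h - 1)  # clamp row index to the image
                    cc = min(max(c + dx, 0), w - 1)  # clamp column index to the image
                    acc += kernel[dy + half][dx + half] * image[rr][cc]
            out[r][c] = acc
    return out

# Hypothetical usage: blur a tiny synthetic image with sigma = 0.9.
tiny = [[0.0] * 7 for _ in range(7)]
tiny[3][3] = 255.0
blurred = blur(tiny, gaussian_kernel(7, 0.9))
```

A larger standard deviation spreads each pixel over a wider neighbourhood, which is how a stronger blur level would be obtained in this sketch.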

Subjective score
The subjective scoring was performed in a normal indoor lighting environment. All images were displayed on an LCD with a screen resolution of 1680 × 1050 pixels. Each subject viewed two images displayed on the screen from a distance of approximately 25 cm; the left image was the high-quality reference image, and the right image was the distorted image. The double stimulus impairment scale (DSIS) method was used for subjective scoring: the subjects first looked at the reference image, then at the distorted image, and then compared the two. Based on this comparison, the subjects rated the distorted image with a score between 0 and 100. The lower the score, the more severe the quality damage. Sixty subjects took part in the experiment, recruited individually by laboratory personnel. The subjects comprised 27 students majoring in this field, 14 students from other majors and 19 graduates; the ratio of men to women was 3:2, and ages ranged from 21 to 33. Most of the subjects wore glasses. The group covered both professional and non-professional viewers, which helps make the experimental data comprehensive.
To facilitate scoring, stand-alone software named subjective quality assessment (SQA) was developed to collect the subjects' scores for the MP2020 database. SQA was developed in C# with Visual Studio 2017 and provides two main functions: training and subjective scoring. The software is described in detail below. It has its own installation package and can be installed locally. After installation, it is started by double-clicking the shortcut created on the desktop. The installation package can be downloaded from GitHub. The main interface of the scoring software consists mainly of two images placed on the left and right sides of the screen. The image on the left is the reference image, and the image on the right is the artificially distorted image. The interface is shown in Figure 4.
Before scoring, the subjects need to click the "Train" button and complete a training session to get a sense of the approximate scoring range. The training interface is shown in Figure 5.
The scores entered by the subjects are saved in .txt format. The subjects can click the "Save" button to select the save path and file name.

Data processing
Different evaluation indexes often have different dimensions and units, which affects the results of data analysis. To eliminate the influence of dimension and make the indexes comparable, the data need to be standardized. After standardization, all indexes are on the same order of magnitude, which makes them suitable for comparison.
(I) Min-max normalization This method linearly rescales the original data to the range [0, 1]. The conversion function is:
$x^{*} = \frac{x - \min}{\max - \min}$
where max is the maximum value of the sample data and min is the minimum value of the sample data. One drawback of this method is that adding new data may change max and min, in which case they need to be redefined.
(II) Z-score This method standardizes the original data using their mean and standard deviation. The processed data have a mean of 0 and a standard deviation of 1. The conversion function is:
$x^{*} = \frac{x - \mu}{\sigma}$
where $\mu$ is the mean value of all sample data and $\sigma$ is the standard deviation of all sample data.
The data analysis method used in this paper is min-max normalization. Normalization makes features of different dimensions numerically comparable.
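The two standardization methods described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to be a plain list of raw scores, and the Z-score uses the population standard deviation, which is one reasonable reading of "the standard deviation of all sample data".

```python
import statistics

def min_max_normalize(scores):
    """Linearly rescale scores to [0, 1] using the sample minimum and maximum."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("all scores are identical; min-max normalization is undefined")
    return [(s - lo) / (hi - lo) for s in scores]

def z_score_normalize(scores):
    """Shift and scale scores to mean 0 and standard deviation 1."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all scores are identical; Z-score is undefined")
    return [(s - mu) / sigma for s in scores]

# Hypothetical raw scores from one subject on the 0-100 scale.
raw = [20, 50, 80]
print(min_max_normalize(raw))
print(z_score_normalize(raw))
```

As the text notes for min-max normalization, adding a new score outside the current range changes the minimum or maximum, so every previously normalized value has to be recomputed.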
Generally speaking, the histogram of all subjects' scores tends toward a Gaussian (normal) distribution after normalization, as shown in Figure 6.

Statistical analysis of subjective scores
After several months of data collection, all the experimental data were analysed and screened. All scores in this experiment lie between 0 and 100. The Grubbs criterion [26] was used to eliminate data that deviate strongly from the rest of the dataset, reducing the impact of outliers on the overall data. Specifically, the 60 scores for each image are divided into 4 groups of 15, each group is screened for outliers, and the subjects to whom the outliers belong are recorded.
If a subject has more than 20 abnormal points, the subject is recorded as unreliable and disqualified.
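The screening procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: it tests only the single most extreme value in each group of 15, the function names and the `ratings[image][subject]` layout are hypothetical, and the critical value 2.549 (a commonly tabulated two-sided value for n = 15 at a 5% significance level) is an assumption, since the text does not state the significance level used.

```python
import statistics

# Assumed critical value for groups of 15 (two-sided, alpha = 0.05); verify against a table.
G_CRIT_N15 = 2.549

def grubbs_outlier(values, g_crit):
    """Return the index of the most extreme value if it fails Grubbs' test, else None."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return None
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    g = abs(values[idx] - mean) / sd
    return idx if g > g_crit else None

def count_outliers_per_subject(ratings, group_size, g_crit):
    """ratings[image][subject] holds one score; returns {subject index: outlier count}."""
    counts = {}
    for image_scores in ratings:
        for start in range(0, len(image_scores), group_size):
            group = image_scores[start:start + group_size]
            hit = grubbs_outlier(group, g_crit)
            if hit is not None:
                subject = start + hit
                counts[subject] = counts.get(subject, 0) + 1
    return counts

def unreliable_subjects(counts, max_outliers=20):
    """Subjects with more than max_outliers abnormal points are flagged as unreliable."""
    return sorted(s for s, n in counts.items() if n > max_outliers)
```

With 60 subjects per image and `group_size=15`, each image contributes four groups, matching the grouping described in the text.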
In the experiment, a boxplot is used to analyse the scores of the distorted images. A boxplot is a statistical chart that displays the dispersion of a group of data. It is composed of a "box" and "whiskers". A straight line inside the box represents the median of the sample. The upper and lower boundaries of the box show the 75% and 25% values, respectively. The two whiskers extend to the maximum and minimum values of the data. Outliers are usually plotted separately and indicated by "+".
We use the boxplot tool to summarize the average subjective scores of the distorted images under the four distortion types JP2K, JPEG, WN and GBLUR, including the minimum, the first quartile, the median, the third quartile and the maximum. As shown in Figure 7, each blue rectangle in the centre spans from the first quartile to the third quartile. The red line segment inside the rectangle represents the median, and the horizontal line segments above and below the rectangle represent the maximum and minimum values. The figure thus shows the overall range of variation (minimum to maximum), the range in which the scores are concentrated, and the typical value (median). There are 200 distorted images under each distortion type, so each box summarizes 200 data points.
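The five values that a boxplot displays can be computed directly. The sketch below is illustrative: the function name is hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since plotting tools differ slightly in how they interpolate quartiles.

```python
import statistics

def five_number_summary(scores):
    """Return the minimum, first quartile, median, third quartile and maximum."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }

# Hypothetical average scores for a handful of distorted images of one type.
summary = five_number_summary([10, 20, 30, 40, 50])
print(summary)
```

Applied to the 200 average scores of one distortion type, this would give the box boundaries, the median line and the whisker ends for that type.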
Figure 7 shows that the scores under JP2K distortion are generally low, while the scores under JPEG distortion span a wide range. The scores under WN distortion are slightly higher than those under the other distortion types, and the scores under GBLUR distortion are tightly clustered, generally low, and include many outliers. On the whole, JP2K, JPEG and WN score higher than GBLUR, which may be because a macro photography image is itself a kind of image with a clear foreground and blurred background.

Image quality assessment index
Two commonly used performance metrics were employed to evaluate the competing IQA methods. The first is the Spearman rank-order correlation coefficient (SROCC), which assesses the prediction monotonicity of an IQA method. This metric operates on the ranks of the data points and ignores the relative distances between them. The second is the Pearson linear correlation coefficient (PLCC) between the MOS and the objective scores after non-linear regression. For the non-linear regression, we used the following mapping function:
$\mathrm{Quality}(\alpha) = \beta_1 \left( 0.5 - \frac{1}{1 + \exp(\beta_2 (\alpha - \beta_3))} \right) + \beta_4 \alpha + \beta_5$
where $\alpha$ is the score obtained from the objective metric, and $\beta_k$ with k = 1, 2, 3, 4, 5 are parameters. The fitting, i.e. the determination of the parameters as in [27], is done by non-linear regression over the dataset.
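The two metrics and the mapping function can be sketched in pure Python. This is an illustrative sketch, not the evaluation code used for the tables: the function names are hypothetical, ties are given average ranks, and the mapping is assumed to be the five-parameter logistic form with a linear term that is commonly used with this kind of regression. The parameter fitting itself is omitted; in practice it would be done with a non-linear least-squares routine.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def plcc(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))

def logistic_map(score, b1, b2, b3, b4, b5):
    """Assumed five-parameter mapping from an objective score to the subjective scale."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (score - b3)))) + b4 * score + b5

# Hypothetical usage: a monotonic but non-linear relation gives SROCC 1 and PLCC below 1.
objective = [1, 2, 3, 4, 5]
subjective = [1, 4, 9, 16, 25]
print(srocc(objective, subjective), plcc(objective, subjective))
```

The example illustrates why the non-linear mapping is applied before PLCC: SROCC depends only on rank order, whereas PLCC penalizes any departure from a straight line.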

Evaluation of the reference image
We selected six non-reference IQA algorithms, namely BIQI, BRISQUE, SSEQ, CNN, ASIQE and ENIQA, to evaluate the 100 reference images. These six non-reference methods have achieved good performance on several open image databases.
To evaluate the performance of the algorithms, we use two common indicators, PLCC and SROCC, whose absolute values lie in the range [0, 1]. The closer the value is to 1, the closer the performance of the algorithm is to human vision. Table 2 shows the results of each index obtained by each algorithm on the 100 reference images in the MP2020 database. Table 2 shows that the scores of these non-reference algorithms on the reference images in the MP2020 database are not ideal, which indicates that the existing algorithms are not well suited to macro photography images.

Evaluation of the distorted images
Twelve typical quality evaluation algorithms are selected in the experiment, including six full-reference algorithms (PSNR, FSIM, GMSD, RFSIM, PSIM, SUMMER), five shallow-learning non-reference algorithms (BIQI, BRISQUE, SSEQ, ASIQE, ENIQA), and one deep-learning non-reference algorithm (CNN_IQA). The experimental results on the MP2020 database are shown in Tables 3 and 4. Comparing the IQA algorithms in Tables 3 and 4 shows that these algorithms are not very effective at evaluating macro photography images. The accuracy of the full-reference algorithms is higher than that of the non-reference algorithms, mainly because the non-reference methods we chose are learning-based, and learning-based methods are limited by their generalization performance. FSIM and GMSD perform better than the other full-reference algorithms. Among the non-reference methods, BIQI and CNN perform relatively better, but even the CNN method does not perform well, which shows that the generalization performance of the deep-learning method is insufficient. None of the algorithms performs well on the GBLUR distortion type. This may be due to the characteristics of the macro photography image itself, which is a kind of image with a clear foreground and blurred background. This characteristic also affects human subjective scoring, so the subjective scores themselves may be inaccurate.
To make the comparison clearer, we also give the experimental results of five non-reference algorithms on the LIVE database; the values of SROCC and PLCC are shown in Table 5. Comparing Tables 3, 4 and 5 shows that the performance of the algorithms on the macro photography database needs to be further improved.

Some suggestions for developing IQA algorithms on MP2020
In view of the particularity of macro photography images, we offer the following suggestions for developing quality evaluation algorithms for such images:
1. A macro photography image should be evaluated as a whole, rather than evaluated block by block with the block scores averaged. To expand the training samples, some existing IQA algorithms, especially deep learning-based methods, usually split images into blocks, which is unlikely to suit macro photography images.
2. The attention mechanism should be incorporated into the evaluation of macro photography images. The basic idea of the attention mechanism in computer vision is to let the system learn to attend to important information and ignore irrelevant information, concentrating computation on selected elements. This is very similar to how humans attend to relevant information. Macro photography images have exactly this characteristic: people pay attention to the clear foreground objects and ignore the blurred background.
3. The evaluation of macro photography images should be combined with image aesthetic evaluation. Macro photography mainly aims at the aesthetic stimulation produced by aesthetic factors such as composition, colour, light, depth of field, and virtual reality. Therefore, the evaluation of macro photography images should be based not only on their distortion but also on the 'beauty' of the image, that is, it should incorporate aesthetic evaluation.
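The difference between plain averaging and attention-guided evaluation (suggestion 2) can be illustrated with a toy pooling sketch. Everything here is hypothetical: the function names, the idea of a per-region quality map, and the hand-written attention weights are assumptions for illustration, not a method proposed or tested in this paper.

```python
def uniform_pool(quality_map):
    """Average all local quality values equally, as block-averaging methods do."""
    values = [q for row in quality_map for q in row]
    return sum(values) / len(values)

def attention_pool(quality_map, attention_map):
    """Average local quality values weighted by how much attention each region receives."""
    num = sum(q * a
              for q_row, a_row in zip(quality_map, attention_map)
              for q, a in zip(q_row, a_row))
    den = sum(a for a_row in attention_map for a in a_row)
    if den == 0:
        raise ValueError("attention weights must not all be zero")
    return num / den

# Toy macro image: left column is the sharp foreground, right column the blurred background.
local_quality = [[0.9, 0.2],
                 [0.8, 0.1]]
foreground_attention = [[1.0, 0.0],
                        [1.0, 0.0]]
print(uniform_pool(local_quality))
print(attention_pool(local_quality, foreground_attention))
```

In this toy case the uniform average is pulled down by the intentionally blurred background, while the attention-weighted score reflects only the foreground that a viewer would look at.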

CONCLUSIONS
To promote research on the quality of macro photography, we have established a macro photography image quality database, MP2020, which contains 100 reference images and 800 distorted images. We evaluated 10 existing typical full-reference and non-reference algorithms and found that they are not well suited to the MP2020 database, so this database can help improve the performance of existing algorithms and support the development of new ones.
In future work, we will make the following improvements and upgrades to the MP2020 database: (1) expand the number of images and distortion types to meet the requirements of deep learning; (2) improve the evaluation and statistics of the GBLUR distortion type to make them more rigorous; and (3) develop a general non-reference quality evaluation algorithm that can handle both common distorted images and distorted macro photography images.