Angular-spatial analysis of factors affecting the performance of light field reconstruction

As a new VR multimedia format, the Light Field (LF) has received more and more attention. LF can support users moving freely within a certain range during the experience. However, the limitation of LF acquisition technology has become the main obstacle to its large-scale application. Since current hardware technology cannot produce lenses small enough to meet the needs of LF capture, the density of the collected LF image is usually insufficient. Although many software LF reconstruction algorithms have been proposed to make the captured result denser, they all face the problem of poor generalization in practical application. In this work, we attempt to locate the factors that affect the performance of LF reconstruction in both the angular and spatial domains. The analysis results show that performance is affected not only by the edges and textures of the LF content in the spatial domain, but also by the adjacent disparity and the reflection characteristics in the angular domain. An indicator to quantify the impact of these factors on reconstruction performance is also proposed, which will be helpful for designing a reconstruction method that can adjust automatically according to the input LF image.


INTRODUCTION
The difference between traditional media and VR media is that the latter can provide users with an immersive experience: users can freely choose the direction and position from which to observe the scene when viewing VR media. Ideal VR media must be able to support the user's viewing behavior in 6 Degrees of Freedom (6DoF). As shown in Figure 1(a), the user's viewing angle has 3 Degrees of Freedom (3DoF), usually defined as pitch, yaw and roll. For 6DoF, an additional three degrees of freedom (x, y and z) are required to determine the viewing position of the user, as illustrated in Figure 1(b).
The 360° video and the 3-Dimensional (3D) model are the most well-known VR media formats for now. The former is captured by a panoramic camera, but only supports a 3DoF viewing experience: the user's viewing position is fixed at the center of the camera. Although 3D models (e.g. point cloud and mesh models) allow users to explore freely within the virtual scene, how to accurately model an arbitrary scene in real time remains an unsolved challenge. Because of the limitations of both 360° videos and 3D models, the Light Field (LF) is seen as a new and promising VR format for a true 6DoF immersive experience. Furthermore, the LF has advantages over the other VR formats, including a lower computational requirement for rendering and a cost of interactive viewing that is independent of scene complexity [1,2].
The LF image records a scene with a set of images from sub-aperture lenses at different viewpoints, so it can provide more structural and textural detail of the scene. As shown in Figure 2, by switching between existing viewpoints or synthesizing new viewpoints, the user can move freely in both the vertical and horizontal directions. At the same time, the equally spaced view acquisition form is very beneficial for extracting the depth information of the scene, and thus can support the positional movement of users in the forward-backward direction. Each sub-aperture image stores a view of the scene captured by the LF camera from a different perspective. These sub-aperture images are the same as traditional images, and they represent the spatial domain of the LF. Therefore, the spatial resolution of the LF is the pixel resolution of each sub-aperture image. At the same time, each sub-aperture image corresponds to a different viewpoint of the shooting angle, and all the viewpoints constitute the angular domain of the LF. The denser the angular domain, the more scene details are captured and recorded.
The insufficient angular resolution of LF images is a common problem encountered in acquisition, mainly because the production technology of the acquisition lens cannot meet the actual needs. LF reconstruction is a solution to this problem, which can reconstruct high angular resolution LF images from low angular resolution input data. Although many reconstruction methods have been proposed, they all face performance generalization problems in actual use [9][10][11][12][13][14][15]. It is found that the performance of the same reconstruction algorithm varies greatly across datasets, and the reconstruction results of some datasets are seriously distorted. We analyze the LF data in both the angular domain and the spatial domain, and locate the main factors that interfere with the performance of LF reconstruction. At last, we establish indicators to quantify the impact of these factors. The contributions of this work are:
• We analyze the factors that affect the performance of LF reconstruction in the spatial domain and the angular domain, which can greatly help solve the problem of unstable performance of reconstruction algorithms.
• In the spatial domain, the performance of LF reconstruction is found to be poor in the edge and texture regions of the image. In the angular domain, the adjacent disparity and the reflection characteristics are also regarded as the main factors that interfere with LF reconstruction.
• An indicator is proposed to sufficiently evaluate the influence of interference factors in both the spatial and angular domains.

FIGURE 3 (a) Two-plane representation of the LF image; (b) camera array for LF acquisition
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 shows the performance test results of various reconstruction schemes on different LF images. Section 4 presents the analysis and conclusions of the test results in both the spatial and angular domains and proposes corresponding indicators to quantify the impact of these factors on reconstruction performance. Section 5 designs a comprehensive index to evaluate the complexity of LF scenes. Finally, Section 6 concludes the paper and discusses future work.

RELATED WORK
We first present an overview of the basic concepts of LF image processing. As shown in Figure 3(a), the LF can be represented by a 4D plenoptic function L(u, v, x, y), which characterizes the light rays between two parallel planes, called the uv camera plane and the xy image plane. The coordinate (u, v) corresponds to the location of the viewpoint in the microlens matrix, and the coordinate (x, y) corresponds to a pixel in the sub-aperture image of a single viewpoint. The xy plane contains the spatial information of the LF, while the uv plane forms the angular domain of the LF. L_{u,v}(x, y) represents a 2D spatial slice of the LF, which is actually the 2D sub-aperture image of one viewpoint. Similarly, L_{x,y}(u, v) represents an angular slice of the LF. Usually, camera arrays or lenslets [3,4] are used to collect the LF, as illustrated in Figure 3(b). Figure 2 shows an LF image used in this work. The camera array contains 17 × 17 sub-aperture lenses, each with a 1024 × 1024 spatial resolution. The scene is recorded with a set of cameras from different viewpoints. The image captured by each separate camera is called a sub-aperture image, and all the sub-aperture images together form an image matrix according to their corresponding camera positions. For example, (u_0, v_0) determines the position of the sub-aperture image shown by the red box, and (x_0, y_0) indicates the position of the pixel in the image, shown by the blue dot in the figure. For more details, readers can refer to LF tutorials such as [5].
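The two slice types can be read directly off a 4D array. A minimal NumPy sketch (array sizes and index names are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical 4D LF tensor indexed as L[u, v, x, y]:
# U x V viewpoints, each an X x Y (grayscale) sub-aperture image.
U, V, X, Y = 17, 17, 32, 32   # small spatial size for the sketch
L = np.random.rand(U, V, X, Y)

# Spatial slice L_{u,v}(x, y): the sub-aperture image of one viewpoint.
u0, v0 = 8, 8
sub_aperture = L[u0, v0]          # shape (X, Y)

# Angular slice L_{x,y}(u, v): one pixel position seen from all viewpoints.
x0, y0 = 16, 16
angular_slice = L[:, :, x0, y0]   # shape (U, V)
```

Fixing (u, v) walks the spatial domain of one view; fixing (x, y) walks the angular domain of one scene point.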
Since a single lens needs to occupy a certain area, it is impossible to deploy sub-aperture lenses infinitely densely on the camera plane. Therefore, the angular resolution of the LF that can be acquired is limited by the lens production process. In order to improve the angular resolution of LF through post-processing algorithms, many methods of LF reconstruction have been proposed. The LF reconstruction algorithm can be divided into two kinds: the one with depth information reference and the one without depth reference.
The first kind of scheme needs to collect a depth map of the scene when collecting the LF image, or estimate it from the LF's Epipolar Plane Image (EPI), and achieves the upsampling of the LF image with the help of the depth information. Bilevich et al. [6] propose a three-pass algorithm for light field interpolation in the EPI. [7] introduces a self-similarity based LF image compression scheme, compressing the light field image with commonly used image compression standards. Rizkallah et al. [8] proposed a CNN-based light field compression scheme, which uses entropy coding with a graph transform. LF reconstruction is also helpful for 3D modeling in Virtual Reality (VR). Vagharshakyan et al. [9] utilize a sparse representation of the EPI with a directionally sensitive transform. However, all these LF reconstruction schemes suffer from a certain degree of performance degradation when dealing with LF images containing complex scenes. By using locally linear embedding techniques, Lucas et al. [10] propose a new way to improve the performance of 3D modeling. With the help of a high-precision model, any virtual viewpoint can be rendered freely.
David et al. [11] cast the problem of upsampling camera arrays as directional super-resolution in 4D space. Nima et al. [12] introduce disparity and depth estimation using a learning-based approach and present an intermediate synthesis technique to reconstruct the LF image. Suren et al. [13] use the shearlet transform as the sought sparsifying transform and develop an effective reconstruction of the LF represented by EPIs. Jason et al. [14] combine band-limited filtering with wide-aperture reconstruction, which is essentially directional edge-preserving filtering. Bishop et al. [15] recover super-resolution for Lambertian scenes in the spatial or angular domain on the basis of the Lambertian assumption.
The second kind of scheme directly reconstructs the sub-aperture image matrix of the LF. Shi et al. [16] considered LF reconstruction as a sparsity optimization problem in the continuous Fourier domain; the Sparse Fast Fourier Transform (SFFT) is used to reconstruct the full LF images. Viola et al. [17] introduce graph theory to model LF images. They regard each sub-aperture image as a vertex of a graph, with weighted edges representing the correlations. The determination of the graph model is defined as an optimization problem, and a gradient descent algorithm is used to search for the optimal solution. Graph Convolution Networks (GCN) [18] can efficiently and accurately calculate the graph model of the LF, which can be used to reconstruct the high angular resolution LF image. Xu et al. [19] present a reconstruction algorithm that can restore the 4D light field from a portable light field camera without the need for calibration images. They propose a 4D demosaicing algorithm based on kernel regression to enhance the quality of the reconstructed sub-aperture images.
This work mainly analyzes the factors that affect the performance generalization of the second kind of LF reconstruction scheme, because this kind is more in line with actual application requirements: it needs no information beyond the collected LF image. To the best of our knowledge, this is the only exploration of this issue at present. There are, however, many works on quality metrics specially designed for LF images. In [20], Paudyal et al. present the SMART light field image quality dataset, which consists of source images, compressed images and annotated subjective quality scores to help test new processing tools or even assess the effectiveness of objective quality metrics. Murgia and Giusto [21] present an open-access database of light field images and corresponding views to help fairly compare results in light field imaging applications. Yuan Yan [22] explores the effects of light field photography on image quality by quantitatively evaluating some basic criteria for an imaging system. Kera et al. [23] mainly focus on visual content quality, aiming to measure the perceptual difference between light field reconstructions at different angular resolutions via a series of subjective image quality assessments.

PERFORMANCE TEST FOR LF RECONSTRUCTION
In this work, three typical reconstruction schemes are selected for the test: LF reconstruction based on SFFT [16], the Graph-Based Learning (GBL) reconstruction method [17] and GCN-based LF reconstruction [18]. These three reconstruction algorithms all directly perform signal processing and recovery on the LF sub-aperture matrix, instead of simply interpolating between viewpoints to achieve super-resolution in the angular domain. They can reconstruct the ground truth of LF images more accurately, and have been used in other applications [24][25][26].
All tested reconstruction methods are used to process six different groups of LF images (Stanford Crystal with large angular extent (Crystal-L), Stanford Crystal with small angular extent (Crystal-S), Bunny, Knight [24], Tower, Town) [27], and their reconstruction performance is tested, as illustrated in Figure 4. Among them, Crystal-L and Crystal-S capture the same scene, but the camera array used by Crystal-L has a larger spacing between adjacent sub-aperture lenses, that is, the adjacent disparity of Crystal-L is larger.
The reconstruction quality is measured by the average Peak Signal to Noise Ratio (PSNR) and the average Structural Similarity (SSIM). The PSNR is averaged over the uv plane, as shown in Equation (1):

PSNR_avg = (1 / (U · V)) Σ_{u=1}^{U} Σ_{v=1}^{V} PSNR(L_{u,v}, L̂_{u,v})    (1)

where L_{u,v} and L̂_{u,v} are the ground truth and the reconstructed sub-aperture image of viewpoint (u, v), and U × V is the angular resolution. In addition to objective quality metrics, we also test the subjective quality metric VMAF [28]. VMAF has been trained and tested with a large amount of user data on Netflix servers, and can reflect the subjective experience that the image quality brings to the user. VMAF scores range from 0 to 100, with 100 indicating the best subjective experience. The SSIM and VMAF results are also averaged over the uv plane, and the calculation method is similar to Equation (1). All LF images are down-sampled before reconstruction, and the raw data without preprocessing are used as the ground truth when evaluating the reconstruction performance. In Figure 4, each row represents the reconstruction quality results of the same method on different LF images. The upper left corner of the table also shows the spatial resolution and angular resolution of all LF images.
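Equation (1) amounts to averaging the per-view PSNR over the angular plane. A hedged sketch (the function names are ours; `peak=1.0` assumes intensities normalized to [0, 1]):

```python
import numpy as np

def view_psnr(ref, rec, peak=1.0):
    """PSNR of one sub-aperture image against its ground truth."""
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def average_psnr(lf_ref, lf_rec, peak=1.0):
    """Average PSNR over the uv plane, as in Equation (1).

    lf_ref, lf_rec: arrays of shape (U, V, X, Y).
    """
    U, V = lf_ref.shape[:2]
    return np.mean([view_psnr(lf_ref[u, v], lf_rec[u, v], peak)
                    for u in range(U) for v in range(V)])
```

For SSIM and VMAF, the per-view scores would be averaged over the uv plane in the same way.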
As shown in Figure 4, there is a significant difference between the reconstruction results of different LF images, and this difference appears in the results of all reconstruction methods. For example, the reconstruction results of Bunny and Town are better than those of the other LF images. Another valuable test result is that the output quality of Crystal-S is better than that of Crystal-L even though the collected scenes are identical. Next, we will analyze the test results from two aspects, the spatial domain and the angular domain, and focus on the causes of this poor performance.

Analysis in spatial domain
The spatial domain of the LF represents the imaging result of the scene when the viewpoint is fixed. As in traditional images, the high frequency components of the LF spatial domain correspond to the edges and textures of objects, while the low frequency components correspond to the smooth regions. In order to determine the specific regions in the spatial domain that adversely affect LF reconstruction, we make a more fine-grained quality analysis of the reconstruction results. As shown in Figure 5, we define the 16 × 16 pixel block around a target pixel P as U_R. By calculating the MSE between the reconstructed output and the ground truth over all pixels in U_R, we can evaluate the reconstruction quality of the local area around P. For pixels at the edge of the image, we only use the remaining pixels in U_R to calculate the MSE; points that fall outside the image boundary are not considered. Figure 6 presents the visualized result of the fine-grained quality analysis on Bunny, Tower, Knight and Crystal-L using the GCN-based reconstruction method. The darker the pixel, the worse the reconstruction quality in the local area. In Bunny, the background is quite pure and the quality of those areas is significantly higher than elsewhere. We also notice that for pixels in the edge and texture regions, the resulting quality of their surrounding regions is very poor. These phenomena are also obvious in the other three LF images. Therefore, we conclude that high frequency components such as edges and textures in the LF spatial domain are the main spatial factors interfering with reconstruction performance.
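The fine-grained map can be sketched as a sliding local MSE over U_R (block size 16 as in the paper; at the borders, only the pixels of U_R that exist are used, as described above):

```python
import numpy as np

def local_mse_map(ref, rec, block=16):
    """Per-pixel reconstruction quality: MSE over the block x block
    neighbourhood U_R around each pixel P, truncated at image borders."""
    H, W = ref.shape
    half = block // 2
    err = (ref - rec) ** 2
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - half), min(H, i + half)
            j0, j1 = max(0, j - half), min(W, j + half)
            out[i, j] = err[i0:i1, j0:j1].mean()
    return out
```

Rendering this map with a dark-to-light colormap reproduces the visualization style of Figure 6.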
To verify this conclusion, we calculate the high frequency ratio of all LF images in the spatial domain. The two-dimensional Discrete Cosine Transform (DCT) is used to analyze the frequency component distribution of each spatial slice L_{u,v}(x, y). Since the signal energy is mainly concentrated in the low-frequency region, we reorder the spectrum according to the energy level, and take the proportion of the energy of the last 90% of the frequency points to the total energy as the high-frequency ratio R_{H-spa}, as shown in Equation (2):

R_{H-spa} = Σ_{k ∈ H} E(k) / Σ_{k} E(k)    (2)

where E(k) is the energy of the k-th DCT coefficient after reordering by energy level, and H is the set of the last 90% of frequency points. However, because the value of R_{H-spa} is too small, it is not convenient for comparative observation. So we design the indicator I_spa to evaluate the impact of high frequency components in the spatial domain:

I_spa = C_spa · R_{H-spa}    (3)

where C_spa is just a scaling parameter, which can be set arbitrarily. In this section, C_spa is set to 100. Table 1 gives I_spa and the PSNR of the corresponding reconstruction results on four groups of LF images: Crystal-L, Knight, Crystal-S and Bunny. All results of SFFT, GBL and GCN are given. The last column is the I_spa of each group of LF images.
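A possible implementation of Equations (2) and (3), under the assumption that "the last 90%" means the coefficients left after removing the top 10% ranked by energy, and averaging the ratio over all sub-aperture images (SciPy's `dctn` provides the 2D DCT):

```python
import numpy as np
from scipy.fft import dctn

def high_freq_ratio(img, keep=0.10):
    """R_H: energy fraction of the frequency points left after removing
    the top `keep` fraction of DCT coefficients ranked by energy."""
    energy = np.sort((dctn(img, norm='ortho') ** 2).ravel())[::-1]
    n_top = int(np.ceil(keep * energy.size))
    return energy[n_top:].sum() / energy.sum()

def i_spa(lf, c_spa=100.0):
    """Spatial indicator of Equation (3): C_spa times the mean
    high-frequency ratio over all sub-aperture images L_{u,v}(x, y)."""
    U, V = lf.shape[:2]
    return c_spa * np.mean([high_freq_ratio(lf[u, v])
                            for u in range(U) for v in range(V)])
```

A perfectly smooth LF (constant sub-aperture images) gives I_spa = 0; heavily textured content drives it toward C_spa.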
In Table 1, it can be clearly observed that the lower I_spa is, the fewer spatial high frequency components the LF image has, and the better the reconstruction results are. Therefore, high-frequency components such as edges and textures are one of the factors interfering with reconstruction. At the same time, we also notice that I_spa cannot explain why Knight is reconstructed better than Crystal-L but worse than Crystal-S. So we believe that there are other interference factors besides those in the spatial domain.

Analysis in angular domain
We mentioned in Section 3 that the quality of the reconstruction results differs due to the difference in adjacent disparity between Crystal-S and Crystal-L. This shows that in addition to the interference factors inside a sub-aperture viewpoint, there should also be factors between the viewpoints. We notice that the results in Figure 6 seem to be related to the location of objects in the scene. For example, the quality of the tower area in the upper left corner of Tower is significantly better than that of other objects. We further compare the depth map of each LF image with its fine-grained quality analysis result. We use the GCN reconstruction algorithm on Tower to reconstruct the LF matrix from 5 × 5 to 9 × 9. Figure 7 presents the comparison result for Tower. The left is the reconstructed result; we transform it into a color temperature map for easy observation. The higher the temperature, the better the reconstruction quality, and vice versa. From Figure 7, we notice that the objects with better quality are all concentrated near the same depth plane. Apart from the very far background area, whether an object is closer or farther away than this plane, its reconstruction quality declines. This means that the depth plane of an object is related to its reconstruction quality. In order to verify this, we count the reconstructed PSNR of all pixels according to their depth plane; the result is shown in Figure 8. The horizontal axis represents the different depth planes, and the vertical axis represents the average reconstructed PSNR of all pixels on the same plane. The depth value is normalized to (0, 1) and divided into 100 planes.
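The per-plane statistic behind Figure 8 can be sketched as follows (the binning and normalization details are our assumptions; each plane's PSNR is computed from the mean squared error of the pixels falling in that bin):

```python
import numpy as np

def psnr_by_depth_plane(ref, rec, depth, n_planes=100, peak=1.0):
    """Average reconstruction quality per depth plane: the depth map is
    normalized to (0, 1), split into n_planes bins, and the PSNR of the
    pixels in each bin is computed from their mean squared error."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-12)
    bins = np.minimum((d * n_planes).astype(int), n_planes - 1)
    err = (ref - rec) ** 2
    out = np.full(n_planes, np.nan)   # NaN marks empty planes
    for p in range(n_planes):
        mask = bins == p
        if mask.any():
            mse = err[mask].mean()
            out[p] = 10.0 * np.log10(peak ** 2 / max(mse, 1e-12))
    return out
```

Plotting `out` against the plane index gives a curve of the same form as Figure 8.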
There is an obvious peak near the depth value of 0.51 in Figure 8. When moving away from the peak plane to either side, the quality of the results gradually decreases. However, when the depth value is close to 1, the resulting quality begins to pick up. This is mainly because the depth planes close to depth value 1 represent the background area of the scene, which is consistent with the results observed in Figure 7. In order to determine the special physical significance of the depth plane near the depth value 0.51, we investigated the acquisition camera array of Tower. We find that the camera array of Tower is not arranged in parallel, but in a horizontal arc surface around the center of the scene, as shown in Figure 9 (a top view of the arc surface layout of the camera array in the Tower LF image).
Therefore, the depth plane near 0.51 is likely to correspond to the rotation center of the horizontal arc surface of the camera array. By analyzing the sub-aperture image matrix of Tower, we find that the displacement of objects near 0.51 is the smallest across different viewpoints, and this plane is determined to be the horizontal rotation center of the camera array. To sum up, we confirm that the reconstruction quality of each object in the scene is related to its depth distance from the rotation center. As shown in Figure 9, the further an object is from the center of rotation, the larger its shift between different viewpoints, and the harder it is for the reconstruction algorithm to synthesize new viewpoints. At the same time, the closer the depth value is to 1, the closer the depth plane is to infinity. In this case, the camera array can be approximately seen as a parallel arrangement, the change of the background between different viewpoints is very small, and so the reconstruction quality picks up.
Through this analysis, we know that the depth distance increases the difference between viewpoints, which leads to reconstruction distortion. In addition, a larger interval between adjacent viewpoints also leads to a larger disparity. Shooting a scene with a camera array can be regarded as discrete sampling in the LF angular domain: the sparser the array, the more discrete the collected signal, and the harder it is to reconstruct the LF image. We believe that this is the reason for the difference in reconstruction performance between Crystal-L and Crystal-S. In the LF image dataset used in this work, only the crystal ball scene was captured with two different adjacent disparities (Crystal-L and Crystal-S). The rest of the scenes were captured with only one adjacent disparity, so reconstruction results with different disparities cannot be compared directly. To design a verification experiment, we therefore intentionally subsample the existing LF images to generate subsets of the same scene with different adjacent disparities. As shown in Figure 10, we design two sampling schemes: large interval and small interval. Both downsample the angular resolution of the LF matrix from 17 × 17 to 9 × 9. The small red boxes represent the viewpoints to be sampled, and the white ones represent the viewpoints to be discarded. On the left is the small interval sampling, on the right the large interval sampling. The angular resolution of the two schemes' output is the same, but their adjacent disparity is different.
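The two subsampling schemes can be expressed as index sets over the 17 × 17 viewpoint grid. The exact offsets are not given in the paper, so the choices below are illustrative; only the spacing matters:

```python
import numpy as np

# Small interval: 9 adjacent viewpoints around the centre
# -> original adjacent disparity is preserved.
small = np.arange(4, 13)          # 4, 5, ..., 12

# Large interval: every other viewpoint
# -> adjacent disparity is doubled.
large = np.arange(0, 17, 2)       # 0, 2, ..., 16

def subsample(lf, idx):
    """Keep only the viewpoints in idx along both angular axes of a
    4D LF array shaped (U, V, X, Y)."""
    return lf[np.ix_(idx, idx)]
```

Both index sets yield a 9 × 9 angular resolution, matching the set-up tested in Table 2.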
Because the original angular resolution of Tower and Town is too small (only 9 × 9), they are not suitable for down-sampling. So we choose the three groups of LF images Crystal-L, Knight and Bunny for the experiment. After applying both kinds of subsampling to these LF images, we use the GCN-based method to test the reconstruction performance on the output; the result is shown in Table 2. Table 2 verifies that even at the same angular resolution, the difference in adjacent disparity of the LF image affects the reconstruction quality.
The adjacent disparity and the depth distance from the rotation center are both causes of large differences between LF viewpoints, which are actually reflections of the high frequency components of the LF image in the angular domain, while the regions that tend to be invariant across viewpoints represent the low frequency angular components. Therefore, similarly to the spatial domain, we design an indicator in the angular domain to evaluate the impact of these factors. We analyze the frequency components of each angular slice L_{x,y}(u, v). By reformulating the LF image and applying the DCT, we obtain the energy distribution of the frequency points in the angular domain and define the high-frequency ratio and the indicator, as shown in Equations (4) and (5):

R_{H-ang} = Σ_{k ∈ H} E(k) / Σ_{k} E(k)    (4)

I_ang = C_ang · R_{H-ang}    (5)
where C_ang is also a scaling parameter, set to 50 because there are fewer high frequency components in the angular domain. The comparison between the reconstruction quality and I_ang is shown in Table 3. Tower and Town are reconstructed from 5 × 5 to 9 × 9; Crystal-L, Knight, Crystal-S and Bunny are reconstructed from 9 × 9 to 17 × 17. Overall, the lower I_ang is, the fewer angular high frequency components the LF image has, and the better the reconstruction results are. From Tables 1 and 3, we find that the reconstruction quality of an LF image is affected by both spatial and angular domain factors. For example, although the I_spa of Crystal-L and Crystal-S are similar, the difference in I_ang between them is obvious, which shows that the main reason for the performance difference between them lies in the angular domain. The huge difference between Knight and Tower can also be reflected by I_ang. But I_ang cannot explain why Town is reconstructed better than Tower, which shows that there the dominant factors are in the spatial domain.
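I_ang mirrors I_spa, with the DCT applied to angular slices L_{x,y}(u, v) instead of sub-aperture images. A self-contained sketch under the same assumption as the spatial indicator (that the high-frequency set is what remains after removing the top 10% of coefficients by energy):

```python
import numpy as np
from scipy.fft import dctn

def high_freq_ratio(slc, keep=0.10):
    """Energy fraction left after removing the top `keep` fraction of
    DCT coefficients ranked by energy (our reading of Equation (4))."""
    energy = np.sort((dctn(slc, norm='ortho') ** 2).ravel())[::-1]
    n_top = int(np.ceil(keep * energy.size))
    return energy[n_top:].sum() / energy.sum()

def i_ang(lf, c_ang=50.0):
    """Angular indicator of Equation (5): C_ang times the mean
    high-frequency ratio over all angular slices L_{x,y}(u, v)."""
    X, Y = lf.shape[2:]
    return c_ang * np.mean([high_freq_ratio(lf[:, :, x, y])
                            for x in range(X) for y in range(Y)])
```

An LF whose views are identical (no disparity at all) gives I_ang = 0; strong view-to-view variation drives it toward C_ang.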

A COMPREHENSIVE INDEX TO EVALUATE THE COMPLEXITY OF LF SCENES
As analyzed in the previous section, we believe that factors in the spatial domain and the angular domain together affect the performance of LF reconstruction. We want to build a comprehensive index from the weighted sum of I_spa and I_ang to describe the influence of all possible factors in the LF scene on reconstruction. In essence, this is a quantitative evaluation of LF Scene Complexity (LFSC). We define LFSC as:

LFSC = α · I_spa + β · I_ang    (6)

Table 4 shows the LFSC results for different α and β, and compares them with the actual reconstruction quality.
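With both indicators in hand, Equation (6) is a one-line combination:

```python
def lfsc(i_spa_val, i_ang_val, alpha=0.6, beta=0.4):
    """LF Scene Complexity: weighted sum of the spatial and angular
    indicators, as in Equation (6)."""
    return alpha * i_spa_val + beta * i_ang_val
```

The default weights are the pair reported to make reconstruction quality decrease monotonically with LFSC on the tested dataset.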
When α is 0.6 and β is 0.4, the quality of the reconstruction results decreases monotonically with LFSC. On the LF image dataset we used, LFSC can quantify and reflect the influence of all spatial and angular domain factors on LF reconstruction performance. In the future, we believe that LFSC, I_spa and I_ang can provide a good reference when designing a reconstruction method that adjusts automatically according to the input LF image. At the same time, this method of analyzing and evaluating LF images in the angular and spatial domains can also inspire further applications of LF scene information.

CONCLUSION AND FUTURE WORK
This work first introduces the performance generalization problem of LF reconstruction, which leads to the unpredictability of existing reconstruction schemes when processing different LF images. Then we identify the factors that may interfere with the reconstruction quality in the spatial domain and the angular domain, and establish the quantitative indicators I_spa and I_ang to measure the impact of these factors. To the best of our knowledge, this is the only exploration of this issue at present. Finally, we integrate I_spa and I_ang into the comprehensive index LFSC to evaluate the complexity of LF scenes. We hope that this index can play an important role in designing new LF reconstruction algorithms. Besides, this method of analyzing LF images in the spatial and angular domains can bring new ideas to other LF applications and technologies. Next, we will optimize existing reconstruction algorithms based on the indicators proposed in this work to improve the quality of reconstruction results. At the same time, we will consider applying these indicators to a layered transmission scheme for LF images/videos to improve the stability of LF delivery.
There are also some limitations in our work. In Section 5, our design of LFSC is relatively simple, and we did not further explore the mechanism by which spatial domain factors and angular domain factors act together. This is what we need to focus on in future work. At the same time, how to apply the results of this work to various LF application technologies is also a future focus. We hope that it can at least help to design a new LF reconstruction method that adjusts automatically according to the input LF image.