Machine-Learned Light-Field Camera that Reads Facial Expression from High-Contrast and Illumination Invariant 3D Facial Images


DOI: 10.1002/aisy.202100182

Facial expression conveys nonverbal communication information to help humans better perceive physical or psychophysical situations. Accurate 3D imaging provides stable topographic changes for reading facial expression. In particular, light-field cameras (LFCs) have high potential for constructing depth maps, thanks to a simple configuration of microlens arrays and an objective lens. Herein, a machine-learned NIR-based LFC (NIR-LFC) for facial expression reading, based on pairwise Euclidean distances between 3D facial landmarks, is reported. The NIR-LFC contains microlens arrays with an asymmetric Fabry–Pérot filter and an NIR bandpass filter on a CMOS image sensor, fully packaged with two vertical-cavity surface-emitting lasers. The NIR-LFC not only increases the image contrast by 2.1 times compared with conventional LFCs, but also reduces the reconstruction errors by up to 54%, regardless of ambient illumination conditions. A multilayer perceptron (MLP) classifies input vectors, consisting of 78 pairwise distances on the facial depth maps of happiness, anger, sadness, and disgust, and exhibits an exceptional average accuracy of 0.85 (p < 0.05). This LFC provides a new platform for quantitatively labeling facial expression and emotion in point-of-care biomedical, social perception, or human–machine interaction applications.
perspective views or in- and out-of-focus digital refocusing of light-field images at a local region of interest in the face, such as the ears, iris, or lower jaw. [27,28] More recently, face models have also been reconstructed model-free using trained 3D face networks. [29] However, these approaches still struggle to quantitatively extract 3D features for precise facial topography from the entire face.
Here, we report facial expression reading from a precise depth map using an NIR-based high-contrast and illumination invariant LFC (NIR-LFC) and machine learning classification. High-contrast and illumination invariant light-field images of the human face are captured by the NIR-LFC with an optical crosstalk-free MLA and an NIR bandpass filter (Figure 1a). The NIR-LFC exhibits a broad light absorption range and intentionally adopts a Keplerian configuration to achieve successful facial imaging. Under NIR illumination of the entire face, the light field passes through an objective lens, the NIR bandpass filter, and the MLA. The reversed and repeated microimages are finally captured on a single CMOS image sensor. The raw light-field image is reconstructed into a high-accuracy facial depth map. In addition, the pairwise Euclidean distances between key facial landmarks from the depth map serve as input vectors (d_pi-pj) to a multilayer perceptron (MLP) classifier (Figure 1b). The output layer finally returns one of four classes of facial expression: happiness, anger, sadness, or disgust.
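The construction of the input vector can be sketched as follows. This is a minimal Python illustration (the function name is ours, and the real landmark coordinates come from the reconstructed depth map):

```python
from itertools import combinations
import math

def pairwise_distances(landmarks):
    """Compute all pairwise Euclidean distances between 3D landmarks.

    landmarks: list of (x, y, z) tuples; for 13 landmarks this yields
    C(13, 2) = 78 distances, i.e., the input vector d_pi-pj to the MLP.
    """
    return [
        math.dist(p, q)  # Euclidean distance between one landmark pair
        for p, q in combinations(landmarks, 2)
    ]

# 13 placeholder landmarks (real values would come from the depth map)
pts = [(float(i), float(i % 3), float(i % 5)) for i in range(13)]
vec = pairwise_distances(pts)
assert len(vec) == 78  # 13 choose 2
```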

NIR-LFC Design and Microfabrication
The NIR-LFC is fully packaged with compact modules of NIR illumination and LFC (Figure 2a). The camera module comprises a single objective lens, an inverted MLA with asymmetric Fabry–Pérot filter (iMLA-AFF), a narrowband NIR filter, and a single CMOS image sensor. The illumination module includes vertical-cavity surface-emitting lasers (VCSEL, V102C121A-850, Osram, 850 nm in wavelength, 2 mW) and a 3D-printed housing. The NIR bandpass filter (SLB850, SHHO, 800–900 nm) allows illumination invariant 3D facial imaging under various ambient illumination conditions. The farthest object distance of the NIR-LFC, determined by considering the interpupillary distance of the human face and the size of the reconstructed light-field data, is 240 mm. The iMLA-AFF includes MLAs and an AFF of chromium (Cr)–silicon dioxide (SiO2)–Cr. The AFF strongly absorbs broadband illumination light from the visible (vis) to the NIR region and thus completely suppresses optical crosstalk between adjacent microlenses. The 3D-printed housing of heat-resistant ABS resin and a printed circuit board for the VCSEL sources are mounted on the LFC module (Figure 2b). The wafer-level microfabrication of the iMLA-AFF is done using Cr lift-off, plasma-enhanced chemical vapor deposition (PECVD) of SiO2, photolithography, and thermal reflow (Figure 2c; extended microfabrication steps and descriptions in Figure S1, Supporting Information). The scanning electron microscope (SEM) image of the iMLA-AFF shows hexagonally packed MLAs of 30 μm in diameter and 0.82 in fill factor (Figure 2d). The optical image of a fully packaged NIR-LFC shows physical dimensions of 8.4 mm × 14 mm × 5.6 mm in width, height, and thickness (Figure 2e).

Figure 1. Schematic illustration of NIR-LFC with high-contrast and illumination invariant 3D face imaging for reading facial expression. a) NIR light sources from two VCSELs irradiate the face regardless of the ambient light condition. Light fields from the face pass through a compact objective lens, NIR bandpass filter, and inverted MLAs with AFF (iMLA-AFF). The raw facial images are partially recorded on the CMOS image sensor. b) Facial expression reading of the 3D depth map using the MLP classifier. The MLP classifier consists of two hidden layers, with 15 neurons in the first layer and 10 neurons in the second. 78 pairwise Euclidean distances between 13 facial landmarks (13C2 = 78) are extracted from the 3D depth map and used as an input vector to the MLP classifier that reads the facial expression of each subject.

www.advancedsciencenews.com www.advintellsyst.com

High-Contrast Light-Field Imaging
The iMLA-AFF plays a crucial role in high-contrast light-field imaging. The thin Cr layer is optically lossy and weakly dispersive, suitable for both the top thin-film layer and the bottom reflector. The intermediate nondispersive SiO2 layer exhibits high transparency in the vis–NIR region. Thus, the iMLA-AFF exhibits broadband absorption thanks to the extremely low Q-factor of the cavity, based on the strong interference effect. [30,31] The calculated thicknesses of the 6 nm-thick Cr film, 90 nm-thick SiO2, and 140 nm-thick Cr film yield the maximum absorption over the whole vis–NIR region, based on the finite-difference time-domain (FDTD) method (see Figure S2, Supporting Information). The optical sectioning and the line intensity profiles are compared for the iMLA with and without AFF (25 μm in microlens diameter and 30 μm in pitch), using a confocal laser scanning microscope with an external laser diode (LM850103D6-CU, SHHO, λ0 = 850 nm, 5 mW) (Figure 3a). [32] The normalized line intensities clearly show that the iMLA-AFF (line A–A′) completely eliminates the optical crosstalk observed with the iMLA (line B–B′). The raw images of line pairs captured from a USAF-1951 target also show that the AFF substantially improves the image contrast after light-field reconstruction (Figure 3b). The Michelson contrast, that is, C = (Imax − Imin)/(Imax + Imin), of the reconstructed light-field data from the NIR-LFC with AFF (blue line) is 0.66, 2.1 times higher than that without AFF (red line).
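The Michelson contrast figure of merit can be computed directly; in the sketch below the two line profiles are hypothetical placeholders, not measured data:

```python
import numpy as np

def michelson_contrast(image):
    """Michelson contrast C = (Imax - Imin) / (Imax + Imin)."""
    i_max, i_min = float(np.max(image)), float(np.min(image))
    return (i_max - i_min) / (i_max + i_min)

# Hypothetical line profiles: a crisp reconstruction vs. a crosstalk-blurred one
sharp = np.array([0.1, 0.9, 0.1, 0.9])   # e.g., with AFF
blurry = np.array([0.4, 0.6, 0.4, 0.6])  # e.g., without AFF
assert michelson_contrast(sharp) > michelson_contrast(blurry)
```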
As the microlens diameter of the NIR-LFC decreases, the number of pixels inside the microimage decreases linearly (the angular resolution decreases), while the magnification of the LFC, that is, the reciprocal of the size of the object area projected onto a single pixel, decreases exponentially (Figure 3c). The MTF50 value of the subaperture images increases as the microlens diameter decreases, but decreases again for microlens diameters below 30 μm due to the low magnification (Figure 3d; more information on the MTF curves, obtained with the Quick MTF software, in Figure S3, Supporting Information). The black short-dashed line indicates a nonlinear Lorentzian fit to the measured MTF50 values (black squares). As a result, the microlens diameter of the NIR-LFC is set to 30 μm, which gives the maximum MTF50 value of 16.2 cycles mm−1 and an angular resolution of 27 × 27 pixels. The raw images and the corresponding reconstructed light fields of a human eye for microlens diameters of 25, 30, and 50 μm are compared in Figure 3e.

Illumination Invariant Facial Depth Map Estimation
The NIR-LFC captures the raw image of a human face on a CMOS image sensor (Sony IMX219, 2464 × 3280 pixels), from which the depth map is reconstructed after light-field image calibration (Figure 4a). The raw image I(s, t) is initially calibrated to assign the center position of each microimage and then converted to the 4D radiance array L(u, v, s, t) with the light-field imaging toolkit (LFIT v2.4). [33] The disparity map is estimated using a cost volume-based stereo matching algorithm, [16] which measures the similarity between the center view and the subaperture images and matches the costs of different disparity labels to estimate the stereo correspondences with sub-label precision.
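The raw-to-4D rearrangement can be illustrated with a toy example. The sketch below (function name ours) assumes a rectified raw image with square d × d microimages on a regular grid; the actual LFIT calibration additionally handles microimage center offsets and hexagonal packing:

```python
import numpy as np

def raw_to_4d(raw, d):
    """Rearrange a raw light-field image into a 4D radiance array L[u, v, s, t].

    (s, t) indexes the microlens (spatial position); (u, v) indexes the
    pixel (viewing direction) within each d x d microimage.
    """
    S, T = raw.shape[0] // d, raw.shape[1] // d
    L = raw[:S * d, :T * d].reshape(S, d, T, d)  # axes: (s, u, t, v)
    return L.transpose(1, 3, 0, 2)               # reorder to (u, v, s, t)

raw = np.arange(6 * 6, dtype=float).reshape(6, 6)  # toy 6x6 raw image
L = raw_to_4d(raw, d=3)                            # 3x3-pixel microimages
assert L.shape == (3, 3, 2, 2)
# Fixing (u, v) gives one subaperture view across all microlenses
subaperture = L[1, 1]
assert subaperture.shape == (2, 2)
```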
In addition, the camera matrices, that is, the intrinsic ([K]) and extrinsic ([R|t]) matrices, are calculated from a raw checkerboard image with a known pattern size using the geometric calibration algorithm [34] (more light-field reconstruction and depth map results in Figure S4, Supporting Information). Illumination invariant facial imaging is evaluated under different ambient illumination conditions using the NIR-LFC and the 3D depth reconstruction process. White LED light (400–800 nm in wavelength) illuminates a human face at illumination angles of 0°, 30°, and 60° (Figure 4b). Two VCSELs with a radiance angle of 60° × 45° also illuminate the entire face from the front, 180 mm away from the NIR-LFC. 13 facial landmarks (P0–P12) are extracted from the reconstructed light-field data using the OpenFace software, [35] and nine different 3D Euclidean distances (d_E) are then calculated from the depth map (more details of facial landmark extraction in Figure S5, Supporting Information). The facial landmarks of the nose, eyes, and mouth are set as the distinct feature points. Note that the facial areas of the cheek and forehead are easily saturated due to high specular reflection and are thus excluded from the feature points. The illumination invariance of the NIR-LFC with NIR bandpass filter (right [NIR], Figure 4c) at a 60° illumination angle is compared with both the NIR-LFC with NIR cut filter (left [vis], Figure 4c) and the NIR-LFC without any filter (center [vis+NIR], Figure 4c). The NIR-LFC with NIR bandpass filter captures uniform facial images regardless of the ambient illumination angle, and the reconstructed depth maps show clearly distinct features around the eyes, nose, and mouth. In contrast, the other configurations show severe shadows and shading on the right side of the face, resulting in saturated and blurry areas in both the reconstructed light fields and the depth maps. The reconstruction error of the different Euclidean distances at each illumination angle is calculated as the root mean square error (RMSE) from the actual measured distance (d_ref) of ten subjects (Figure 4d).

Figure 3 (caption, partial). Raw images of line pairs captured from a USAF-1951 target, showing the contribution of the AFF to image contrast after light-field reconstruction. Line intensity profiles of each reconstructed light field indicate that the NIR-LFC with AFF (blue line) improves the image contrast up to 2.1 times compared with that without AFF (red line). c) As the microlens diameter decreases, the number of pixels inside the microimage decreases linearly (the angular resolution decreases), while the size of the object projected onto a single pixel increases exponentially (the magnification decreases). d) The MTF50 value of the subaperture image increases as the microlens diameter decreases, but decreases again below a microlens diameter of 30 μm due to low magnification. The microlens diameter of the NIR-LFC is set to 30 μm, which exhibits the highest MTF50 value of 16.2 cycles mm−1 and an angular resolution of 27 × 27 pixels. e) NIR-illuminated raw images and reconstructed light-field data of a human eye for microlens diameters of 25, 30, and 50 μm. The reconstructed light field from the 30 μm microlens diameter exhibits the highest spatial resolution.
Note that the RMSE is defined as RMSE = √[(1/N) Σᵢ (d_E,i − d_ref,i)²], where N is the number of measured distances. The RMSE of the NIR-LFC with NIR bandpass filter remains nearly uniform at up to 6.7 mm, whereas the RMSEs of the other LFCs increase substantially with the illumination angle, reaching 12.8 mm or more at an illumination angle of 60°. The RMSE measurements clearly demonstrate that the depth map reconstructed by the NIR-LFC with NIR bandpass filter reduces the reconstruction error by up to 54%.
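The error metric can be sketched as follows; the distance values below are hypothetical, and only the formula follows the text:

```python
import math

def rmse(d_est, d_ref):
    """Root mean square error between estimated and reference distances (mm)."""
    assert len(d_est) == len(d_ref)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(d_est, d_ref)) / len(d_est))

# Hypothetical Euclidean distances (mm): depth-map estimates vs. references
est = [62.0, 35.5, 48.0]
ref = [60.0, 36.0, 50.0]
err = rmse(est, ref)
assert err < 6.7  # within the reported worst-case RMSE of the NIR bandpass setup
```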

Facial Expression Reading of NIR-LFC
Machine learning-based facial expression reading is implemented using the facial features extracted from the facial depth map and the reconstructed light fields. The NIR-LFC captures raw facial light-field images of 32 adult subjects. The subjects are asked to perform an acted (or trained) facial expression, that is, happiness, anger, sadness, or disgust, to minimize external variables such as gender, age, and ethnicity (Figure 5a). The total of 281 raw images consists of 38 images acquired by performing each facial expression once for 10 subjects and 243 images acquired by repeating each facial expression three times for 22 subjects. Note that some images are excluded from the training dataset due to different external conditions such as illumination intensity, incident angle, and face orientation. The 281 datasets containing 78 Euclidean distance features from the 32 subjects are augmented to 2248 datasets by adding random noise to the original value of each feature to improve the robustness of the training data for the MLP classifier. The classification accuracy of each expression is calculated from the distribution of values obtained by 40 repetitions of fivefold cross validation over the entire dataset (Figure 5b; visualization of the dataset distribution in Figure S6, Supporting Information). For all facial expressions, the average classification accuracy of the 3D facial features is 0.85, substantially higher than that of the 2D features, and the distributions of the 2D and 3D features differ significantly from each other (p < 0.05). The four most dominant features for each expression are selected using mutual information (MI) after the classification among the 78 facial features from both 3D and 2D facial images and expressed as radar distributions of 15 MI values (m_pi-pj), including one overlapping MI (m_p0-p6) (Figure 5c).
The MI quantifies the mutual dependency between the distribution of the 3D and 2D facial features (d_pi-pj) and the distribution of the expression classes (0 or 1) classified through the MLP. [36] The MI values from the 3D facial features exhibit a distinct and high mutual dependence for each facial expression, whereas those from the 2D features show an overlapping and low mutual dependence between facial expressions, except for sadness (detailed 2D and 3D radar distributions in Figure S7, Supporting Information). These distinct 3D features allow different weights to be attributed to the machine learning model for reading facial expressions. Therefore, the machine-learned NIR-LFC clearly demonstrates accurate facial expression reading using MLP classification trained on precise 3D facial features.
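As a minimal illustration of the MI computation over discretized features (the paper does not specify its exact estimator; this is the textbook discrete formula, and the toy data are ours):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete mutual information I(X; Y) in bits between two label sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy example: a binned feature perfectly aligned with a binary expression label
feature_bins = [0, 0, 1, 1]
labels = [0, 0, 1, 1]
assert abs(mutual_information(feature_bins, labels) - 1.0) < 1e-12  # 1 bit
```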

Discussion
The NIR-LFC acquires high-contrast and illumination invariant face images and reconstructs them into accurate 3D depth maps. The broadband light absorption of the AFF is mainly attributed to both the optical lossiness and the low dispersion of Cr. The interface reflection phase shifts deviate significantly from 0 and π when the refractive index of the top thin Cr film has an imaginary component (k) comparable to the real component (n), both between 3.0 and 3.5 over the whole vis–NIR region. [36,37] Therefore, the complex refractive index of Cr and the thicknesses of the Cr and oxide layers satisfying the critical coupling condition allow near-perfect absorption, in contrast to other metals such as Au, Ag, and Al, which do not exhibit broadband absorption characteristics (Figure S2, Supporting Information). Meanwhile, the slight difference between the measured absorption of the AFF and that calculated using the FDTD method is related to thin Cr film formation. The top Cr film, as thin as 6 nm, is deposited as a random mixture of film and nanoislands during microfabrication to minimize the surface energy with the substrate, unlike the uniform Cr film assumed in the calculation. This random distribution causes an absorption difference of less than 0.1 in the corresponding region compared with the FDTD-calculated spectra. The depth reconstruction algorithm as implemented takes 10 min for 4D radiance arrays around 250 MB in size, which still precludes real-time 3D imaging; however, this can be significantly improved by parallelizing on a GPU. Note that the computation in this process uses an Intel i5 3.60 GHz CPU and 16 GB RAM. In addition, even after the refinement process, including interpolation and filtering, specular reflection in areas without fine structure, such as the forehead and cheeks, still interferes with accurate depth reconstruction, as generally occurs in stereo correspondence-based 3D reconstruction methods. [16,38] An alternative method of applying external structured illumination, such as a Moiré pattern or dot projection, to the LFC could overcome these issues. Despite these limitations, the NIR-LFC obtains high-contrast and uniform 3D face images under various illumination conditions through an optimized optical design and the elimination of optical crosstalk from the vis to the NIR region.
The MLP classifier successfully reads facial expressions from 3D facial depth maps with high accuracy. The radar distribution and heat map from the MI values quantitatively evaluate the mutual dependency between the facial expression and the feature distribution. The results indicate that the specific features obtained from the 3D depth map play an important role in facial expression reading. Therefore, the 3D facial features from the NIR-LFC can further provide cognitive interaction information about weighted face positions in the processing of the human brain or machine learning when recognizing human expression.

Conclusion
In conclusion, this work has successfully demonstrated facial expression reading using a compact, high-contrast, and high-accuracy LFC and an MLP classifier. The NIR-LFC comprises an NIR illumination module and a camera module with a single objective lens, NIR bandpass filter, and iMLA-AFF on a CMOS image sensor. The total physical dimension is 8.4 × 14.0 × 5.6 mm³. The iMLA-AFF significantly increases the image contrast by 2.1 times, and the NIR illumination module reduces the reconstruction errors by up to 54% regardless of ambient illumination conditions. An MLP classifies the input vectors of 78 pairwise distances and clearly distinguishes four facial expressions (happiness, anger, sadness, and disgust) with an exceptional average accuracy of 0.85 (p < 0.05). This LFC provides a new platform for quantitatively labeling facial expression and emotion in point-of-care biomedical, social perception, or human–machine interaction applications.

Experimental Section
Optical Design and Packaging of NIR-LFC: The positions of the image sensor, iMLA-AFF, and single objective lens in the NIR-LFC were numerically calculated considering the minimum total track length. The NIR-LFC intentionally adopted the Keplerian configuration (f_MLA < B) due to the NIR bandpass filter (0.4 mm in thickness) between the image sensor and the objective lens. The objective lens was positioned 3.54 mm from the CMOS image sensor, considering the target depth-of-field range, the MLA position (B), and f_MLA (Figure S8, Supporting Information). Note that the objective lens focal length (f_OBJ), f_MLA, and B were 3.04 mm, 75 μm, and 100 μm, respectively. The theoretically calculated depth resolution of the NIR-LFC was 1.7 mm (Figure S9, Supporting Information). The objective lens was first mechanically separated from a Raspberry Pi V2 camera, and four spacer films were precisely attached to the edge of the image sensor and permanently bonded to the iMLA-AFF using a UV-curable adhesive. The VCSEL housing and PCB were designed to be 8.4 mm × 14 mm × 2.9 mm (W × H × D) and 8.4 mm × 4.7 mm (W × D), respectively, and fully combined with the LFC module.
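As a rough consistency check of the quoted geometry (our own calculation, not from the paper), the thin-lens equation confirms that the intermediate image of the farthest object falls in front of the MLA, as the Keplerian configuration requires:

```python
def thin_lens_image_distance(f, do):
    """Thin-lens equation 1/f = 1/do + 1/di, solved for the image distance di."""
    return 1.0 / (1.0 / f - 1.0 / do)

f_obj = 3.04   # objective focal length, mm
d_obj = 3.54   # objective-to-sensor distance, mm
B = 0.100      # MLA-to-sensor distance, mm
f_mla = 0.075  # microlens focal length, mm

di = thin_lens_image_distance(f_obj, do=240.0)  # farthest object distance, mm
assert f_mla < B           # Keplerian condition stated in the text
assert di < d_obj - B      # intermediate image forms in front of the MLA plane
```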
4D Light-Field Data Calibration and Depth Map Estimation: Raw microimages of the NIR-LFC consisted of 2464 × 3280 matrices. Using the calibration data file that recorded the center point of each microimage, indices (s, t) represented the position of each microlens and indices (u, v) represented the pixel within each microimage. The raw images were then reconstructed into 4D radiance arrays during the data calibration procedure according to the directional (u, v) and spatial (s, t) information using the LFIT. [33] The 3D depth estimation algorithm was based on the phase shift in the 2D Fourier domain for stereo matching between subaperture images with a very narrow baseline. [16] The disparity map was calculated by building the cost volume and matching the similarity between the center view and the subaperture images to estimate the stereo correspondences. Two complementary costs were used to match subaperture images, that is, the sum of absolute differences (SAD) and the sum of gradient differences (GRAD). The calculated disparity map was sequentially refined using multilabel optimization, an iterative refinement process, and weighted median filtering of the cost slices. The final disparity map was estimated from user-defined parameters such as the number of pixels under a single microlens, the lens pitch and diameter, and the pixel size. Simultaneously, a geometric calibration was conducted on subaperture images of a checkerboard with a known pattern size to extract the intrinsic ([K]) and extrinsic ([R|t]) matrices. These matrices contained the focal lengths in the x and y directions, the principal point offset, and the 3D rotation and translation information to elucidate the camera geometric parameters. Note that all the processes were fully implemented in MATLAB.
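The SAD cost idea behind the stereo matching can be illustrated in one dimension. This toy sketch omits the GRAD cost, the cost-volume refinement, and the sub-label precision of the actual algorithm, and uses a circular shift as a crude stand-in for view warping:

```python
import numpy as np

def sad_disparity(center, sub, max_disp):
    """Pick the disparity label minimizing the sum of absolute differences.

    For each candidate disparity, shift the subaperture row and accumulate
    absolute differences against the center view; the cheapest label wins.
    """
    costs = []
    for d in range(max_disp + 1):
        shifted = np.roll(sub, d)  # toy warp: circular shift by d pixels
        costs.append(np.abs(center - shifted).sum())
    return int(np.argmin(costs))

center = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
sub = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # same peak, shifted by 1 pixel
assert sad_disparity(center, sub, max_disp=3) == 1
```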
MLP and MI for Facial Expression Classification: An MLP, one of the representative feed-forward artificial neural networks, learned the dataset by updating the weights of the connections between neurons. An MLP consisting of two hidden layers with 15 and 10 neurons was applied to classify the facial expressions from the vectors acquired from the 3D facial depth maps. The rectified linear unit (ReLU) was used as the activation function in the classification model. The model predicted one of four expressions (happiness, anger, disgust, and sadness) in the output layer. The model was optimized with the Adam optimizer over 3000 iterations during training, and nonconvergent models were excluded from the task. The dominant features determining each facial expression were selected using MI. [39,40] The MI between a feature and a facial expression measured how much the feature contributed to the expression reading. The most highly scored features for each facial expression were selected according to their MI.
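The classifier architecture can be sketched as a forward pass. The paper's model was trained in MATLAB; the NumPy version below uses random placeholder weights purely to show the 78 → 15 → 10 → 4 structure with ReLU activations (a softmax output head is our addition for readable class probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Architecture from the paper: 78 input distances -> 15 -> 10 -> 4 expressions.
# Weights here are random placeholders; in practice they are learned with Adam.
W1, b1 = rng.normal(size=(15, 78)), np.zeros(15)
W2, b2 = rng.normal(size=(10, 15)), np.zeros(10)
W3, b3 = rng.normal(size=(4, 10)), np.zeros(4)

def mlp_forward(d):
    """Forward pass: pairwise-distance vector -> four expression probabilities."""
    h1 = relu(W1 @ d + b1)
    h2 = relu(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)

probs = mlp_forward(rng.random(78))
assert probs.shape == (4,) and abs(probs.sum() - 1.0) < 1e-9
```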
Training Dataset Acquisition and Accuracy Calculation: A total of 281 raw images was acquired from 32 adult participants. For the experiment with human research participants, written consent was obtained from all participants prior to the research, including consent for photos or reconstructed images of volunteers to be taken and used as part of the research paper. The KAIST Institutional Review Board (IRB) approved the consent procedure (approval number KH2020-105). The 3D coordinates of 13 key facial landmarks (P0–P12) were extracted from each image, and the pairwise distances (d_pi-pj) between the landmarks were calculated. The 78 pairwise distances (13C2 = 78) from one image were designated as a single data vector. Data augmentation was then performed to enhance the robustness of the dataset by adding random feature noise, which introduced variations around the original feature values. Note that the feature noise followed a standard normal distribution scaled by 1/10 of the average value of the corresponding feature in the original data. As a result, eight datasets were obtained per image, for a total of 2248 augmented datasets (281 × 8).
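The augmentation arithmetic can be sketched as follows. The text is ambiguous about whether the originals are retained in the final count; in this sketch each original is kept and seven noisy copies are added, which also yields 281 × 8 = 2248 rows, and the noise scaling (standard-normal noise times one-tenth of the per-feature mean) is our reading of the description:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(features, copies):
    """Return originals plus `copies` noisy versions of each sample.

    features: (n_samples, n_features) array of pairwise distances.
    Noise is standard normal, scaled by 1/10 of each feature's mean.
    """
    scale = features.mean(axis=0) / 10.0  # per-feature noise amplitude
    out = [features]
    for _ in range(copies):
        noise = rng.standard_normal(features.shape) * scale
        out.append(features + noise)
    return np.vstack(out)

X = rng.random((281, 78)) * 100.0  # 281 hypothetical distance vectors (mm)
X_aug = augment(X, copies=7)       # each original yields 8 rows in total
assert X_aug.shape == (2248, 78)   # 281 * 8 = 2248
```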
The classification accuracy of each expression was calculated from the distribution of values obtained by repeating fivefold cross validation 40 times on the 2248 datasets. A randomly selected 4/5 of the dataset was used as the training set for the MLP, and the remaining 1/5 was used as the test set to calculate accuracy. The accuracy measured how well the MLP classifier matched each facial expression, that is, the number of correctly classified images of an expression divided by the number of images of that actual expression.
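The evaluation protocol (40 repetitions of fivefold cross validation with per-expression accuracy) can be sketched as below. An oracle predictor stands in for the trained MLP, so the accuracies come out trivially at 1.0; only the splitting and metric logic is illustrated:

```python
import random

def fivefold_indices(n, rng):
    """Shuffle indices 0..n-1 and split them into five disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    k = n // 5
    return [idx[i * k:(i + 1) * k] for i in range(4)] + [idx[4 * k:]]

def per_class_accuracy(y_true, y_pred, cls):
    """Correctly classified images of one expression / images of that expression."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total

rng = random.Random(0)
labels = [i % 4 for i in range(2248)]  # four expression classes, balanced toy data
accs = []
for _ in range(40):                    # 40 repetitions of fivefold CV
    for test_fold in fivefold_indices(len(labels), rng):
        y_true = [labels[i] for i in test_fold]
        y_pred = list(y_true)          # oracle stand-in for the trained MLP
        accs.append(per_class_accuracy(y_true, y_pred, cls=0))
assert len(accs) == 200                # 40 repetitions x 5 folds
```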

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.