VID: Human identification through vein patterns captured from commodity depth cameras

Syed W. Shah, School of Computer Science and Engineering, The University of New South Wales, Randwick, NSW, Australia. Email: z5038389@zmail.unsw.edu.au

Abstract: Herein, a human identification system for smart spaces called Vein-ID (referred to as VID) is presented, which leverages the uniqueness of the vein patterns embedded in the dorsum of an individual's hand. VID extracts vein patterns using the depth information and infrared (IR) images, both obtained from a commodity depth camera. Two deep learning models (CNN and Stacked-Autoencoders) are presented for precisely identifying a target individual from a set of N enrolled users. VID also incorporates a strategy for identifying an intruder, that is, a person whose vein patterns are not included in the set of enrolled individuals. The performance of VID is evaluated on a comprehensive data set of approximately 17,500 images collected from 35 subjects. The tests reveal that VID can identify an individual with an average accuracy of over 99% from a group of up to 35 individuals. It is demonstrated that VID can detect intruders with an average accuracy of about 96%. The execution time for training and testing the two deep learning models on different hardware platforms is also investigated and the differences are reported.


| INTRODUCTION
The importance of accurate user identification is becoming apparent as we move towards a future where connected devices are increasingly being used in our smart spaces. Beyond the traditional use cases, for example, authenticating with a personal device (computer, phone, etc.) or an online service (web banking), these smart environments offer a range of personalised services which require knowing the identity of the person currently using the space. For example, a smart home may detect the presence of a child or an elderly person and disable access to risky home appliances (e.g. an oven) or impose restrictions on the content viewed on the TV and Web. The temperature and light settings in a room can be adjusted as per the preferences of the individual. Likewise, a smart office may restrict access to classified workplaces (e.g. a room with sensitive documents) to certain individuals. A number of person identification methods have been proposed for smart spaces with a particular emphasis on physical or behavioural biometrics, as they do not require individuals to carry specific devices. However, most methods that use physical biometrics have been shown to be vulnerable [1]. Fingerprints can easily be collected from surfaces that a victim may have touched and used to circumvent fingerprint-based authentication [1]. Similarly, there have been instances where a victim's photograph obtained from a simple web search has been used to bypass a face recognition system [2]. Likewise, iris scans can be subverted using the victim's image super-imposed on a contact lens [3]. The authors in Refs. [4] and [5] used an individual's gait (behavioural biometrics), manifested as a unique pattern in pervasive WiFi signals, for human identification in smart spaces. These mechanisms are not only non-intrusive but also do not require users to carry any sensor or hardware (e.g. a smart watch for collecting gait data). However, they expect users to walk along a long pre-defined path, which may not always be available in a constrained smart environment. In addition, these works have only been tested in fairly controlled settings, for example, when there are no other individuals in close proximity to the person being authenticated, which may not always be the case in the real world.
Herein, we present a human identification mechanism for smart spaces called Vein-ID (referred to as VID in the rest of the work), which uses the vein pattern (physical biometrics) of an individual's hand dorsum recorded using an off-the-shelf depth camera. Vein patterns (in the hand dorsum) show individuality and do not change appreciably for an adult unless subjected to major surgery [6]. In addition, unlike the other physical biometric identifiers (fingerprint, face and iris) described above, surreptitiously capturing a victim's vein pattern is difficult. For example, veins are located under the skin and leave no imprint on any surface when touched, and therefore cannot be gathered from surfaces that a victim may have touched (unlike fingerprints). Moreover, it is unlikely that vein patterns are readily available on the Internet (unlike photos), and neither can they be captured stealthily by an attacker.
VID uses a commodity depth camera (Intel RealSense) to capture images of the dorsal area of the fist of a hand. Making a fist forces the veins closer to the skin and greatly simplifies the process of extracting the vein patterns. Recent advances in miniaturization and camera technology have made it possible to embed depth cameras in a wide range of devices such as smartphones, computers and even autonomous drones and cars [7,8]. Conventional cameras convert the 3D world into a 2D image, significantly limiting the performance of many computer vision applications due to the loss of depth information [9]. In contrast, depth cameras can capture information in all three dimensions, which is closer to reality, thus enabling computer vision applications that would not otherwise be possible (e.g. avoiding collisions in autonomous robots/vehicles [9]). It is expected that this (depth camera) technology, and thus VID, will see a wider uptake in the near future. A typical depth camera (including the RealSense device used in our experiments) comprises a pair of depth sensors, an RGB sensor, and an infrared (IR) laser projector [10]. The presence of an IR projector makes it possible to detect the vein pattern in the IR image because the blood in the veins attenuates the IR light differently than other biological tissues. At the same time, the depth camera captures the 3D image of the fist by projecting a stream of IR dots onto the fist (not in the visible range). These dots spread differently on close and distant objects (relative to the camera), and the depth perception is calculated using this displacement. When a fist is made and the hand is placed on a flat surface, the hand dorsum will be closer to the camera lens (than the flat surface) and hence can be recognized using the depth information. Figure 1 shows the typical usage scenario of VID. We assume that in any smart environment, the depth camera used by VID is mounted at a certain height (see Section 4 for details on the setup) above a flat surface. Users are instructed to place their hand (for which samples have been enrolled, similar to fingerprint-based authentication) on the surface and form a fist. Note that Figure 1 illustrates a mobile phone as the image capture device, but more generally any device with an embedded depth camera can be used in its place. For extracting the vein patterns, we utilize both the IR image and the depth information, which can be captured simultaneously by most depth cameras. Using the depth information, we estimate the hand boundary, separating the background from the hand, and then, using the hand position, extract the veins, which show up differently in the IR image from the other parts of the hand. Once the vein pattern is estimated, we use image processing techniques to obtain a clearer version of the estimated pattern. The extracted vein patterns are fed to a deep learning model for precisely identifying the target individual from a set of N enrolled subjects for whom a set of training images has been collected a priori during an enrolment process (analogous to the aforementioned biometric methods). We consider two deep learning methods, namely Convolutional Neural Network (CNN) and Stacked-Autoencoders, as they have been used successfully in other classification tasks [11,12].
Vein-pattern-based recognition systems can be traced back to the early 2000s [13]. Prior works like Refs. [6] and [14] used specialized hardware for extracting the vein patterns. In contrast, VID is designed to work with commodity depth cameras embedded in off-the-shelf smart devices such as smartphones and laptops [8]. A work that is closely related to VID is VeinDeep [15], which used the Microsoft Kinect V2 (now officially discontinued) for extracting the vein patterns. However, VID is different in many ways. 1) VeinDeep sequentially compares the captured vein pattern with each and every enrolled pattern from the entire set of N identifiable individuals. This approach is not scalable, as the distance computation will need to be done each and every time a subject wants to establish his identity. In contrast, VID employs deep learning and is thus scalable: once the model is trained, the identification process is quick and independent of the number of users in the enrolled set. 2) The authors in Ref. [15] collected only six images per subject for their evaluations. In contrast, VID used 500 images per subject, which include images that are more representative of the normal usage of such a system and thus capture expected artefacts such as slight variations in the hand position (and orientation) relative to the camera when the images are recorded. Sequence-based matching methods such as VeinDeep are likely to produce false negatives when such variations are encountered (see details in Section 3.5). In contrast, VID utilizes deep learning, which is adept at learning these subtle variations that are also likely to be present in the training data, and still achieves high accuracy. 3) VID outperforms VeinDeep by approximately 9%, with an evaluation dataset that is 145× larger and more representative of the typical usage of such a system (detailed results are presented in Section 4.2).
The following are the main contributions.
• We present VID, an identification system that uses a commodity depth camera for capturing vein patterns and deep learning to precisely identify the target individual from an enrolled set (see Sections 3.1-3.6).
• We also devised a strategy that can detect an intruder whose vein pattern is not enrolled and who would otherwise be mis-classified as one of the enrolled individuals (see Section 3.7).
• We performed an extensive analysis by collecting a comprehensive dataset of approximately 17,500 images of the hand dorsum from 35 subjects using an Intel RealSense D415 depth camera. Evaluations revealed that both CNN- and Stacked-Autoencoder-based models can identify an individual from a set of up to 35 users with an average accuracy above 99%. The intruder detector achieved an average accuracy of 96% (see Section 4).
Herein, Section 2 contains the related work, whereas Section 3 expands on the detailed process involved in VID. Our experimental set-up and evaluations are discussed in Section 4, and finally, the concluding remarks appear in Section 5.

| RELATED WORK
Although vein-pattern-based authentication systems can be traced back to the early 2000s [13], we limit this section to a selection of papers that use vein patterns from the palm or hand dorsum. While there are mechanisms that use finger veins for identification, such approaches require placing the finger in a receptacle for extracting the veins and are therefore unlikely to work with an off-the-shelf depth camera. Authors in Ref. [14] presented a system that uses a specifically designed hardware assembly, comprising 48 infrared LEDs and a CCD camera, to capture a video of the user's palm placed at a distance of 24 cm. From the captured video, multiple images of the palm are extracted, which are then used to extract vein patterns, and SIFT features are used to perform authentication. Likewise, authors in Ref. [6] presented another system that uses an infrared camera to capture images of the fist and extract the vein patterns. Minutiae triangulation is used for feature extraction. While making an authentication decision, a similarity score is computed between the test pattern and the enrolled pattern of the same individual. Similarly, the authors in Ref. [16] extract the end points and crossing points in vein patterns and then compute the distance from an enrolled pattern. The authors in Ref. [17] achieved a promising recognition accuracy using the Radon transform and Hessian-phase-based features from the palm veins, computing a matching score against the enrolled template. All of these works not only require a special hardware assembly for extracting the vein patterns, but also compute a similarity score against the sequence of pre-enrolled patterns to identify the individual, which is onerous and may not be scalable. VID, on the other hand, leverages a commodity depth camera (which is ubiquitously available on computing devices) and utilizes deep learning, which is scalable, with an identification phase that is independent of the number of enrolled users. Recently, authors in Ref. [15] used the Microsoft Kinect V2 (now discontinued) to extract the vein patterns from the hand dorsum. The benefits of VID over this approach are described in the Introduction (see Section 1 for details).
LG recently introduced vein-based unlocking for mobile devices [18]. Unlike this work, given a vein pattern, VID endeavours to identify the corresponding subject from an enrolled set. A few works [19-21] have used deep learning for either verification or identification. However, they utilize publicly available finger-vein databases for evaluations. As described above, obtaining a vein pattern from a finger requires scanning the finger in a receptacle and, hence, it may not be feasible to acquire these patterns (finger veins) using off-the-shelf depth cameras. In contrast, VID can leverage the ubiquitously available depth cameras, and also includes a strategy for detecting individuals whose patterns are not enrolled. Likewise, authors in Ref. [22] used CNN models (e.g. AlexNet) for dorsal vein recognition. VID performs better than this approach for a training/test ratio of 2:3 while having a simpler network architecture (approximately 1% better accuracy). The authors in Ref. [23] utilized the blood vessel structure of the sclera for human identification. However, illumination of the eyes from a short distance may not be convenient for users. Other competing physical biometric identifiers such as iris and Face ID [8] also suffer from a similar issue, as they too involve IR illumination of the face from a short distance, resulting in concerns from users regarding potential damage to the retina and cornea [24,25]. In contrast, VID does not expose sensitive organs like the eyes to IR laser projections and thus does not pose such problems. Reference [26] proposes an identification method that leverages geometric features extracted from four fingers of the hand. However, as pointed out in Ref. [27], this mechanism is prone to spoofing. Likewise, authors in Ref. [28] presented a mechanism which requires the user to perform gestures in the air, which may be deemed strenuous by users. In Ref. [29], authors used a thermal tracer to capture images of the subject's hand and then extract vein patterns. The matching decision is made by computing the Hausdorff distance between a test and an enrolled pattern. Unlike the depth camera used in VID, thermal tracers are not ubiquitous, thus limiting the wide applicability of this approach.
As indicated in Section 1, vein-pattern-based identification offers significant benefits over other biometric identifiers, such as fingerprints, facial recognition and iris, which have been shown to be vulnerable to certain types of attacks [30] (see Section 1 for details). Voice-based identification systems can also be bypassed by recording the user's voice. Unlike these identifiers, furtively capturing the vein pattern of a victim is difficult. As the veins lie underneath the skin, they do not leave an imprint on any surface that the victim may have touched (unlike fingerprints). Moreover, they are not readily available on the Internet (unlike facial photographs) and are less likely to be stealthily recorded without being noticed by the victim (unlike voice and iris). Recently, an experiment [31] demonstrated a way to potentially subvert vein-pattern-based authentication systems. A modified SLR camera with the infrared filter removed was used to record the vein patterns of the victim's palm. These patterns were then overlaid on a wax hand mock-up to mimic the victim's hand and subvert the authentication system. However, such a hack may only work in a laboratory setting where the victim willingly allows her palm images to be captured. In the real world, furtively obtaining images of vein patterns (particularly where the subject must form a fist) is difficult without being noticed by the victim. This shows that VID may offer significant benefits over other state-of-the-art mechanisms and enables precise human identification by simply using commercially available depth cameras, which are now being increasingly integrated into computing devices (e.g. mobile phones, laptops, etc.).

FIGURE 2 VID workflow for identification

| VEIN-ID IDENTIFICATION WORK FLOW
In this section, we present the details of our proposed VID system. Figure 2 outlines the various steps which will be explained in detail in the subsequent sub-sections.

| Capturing images with the depth camera
The Intel RealSense D415 Depth Camera is used for capturing the images (of the hand dorsum) that are used for the identification process. As explained earlier in Section 1, this depth camera is equipped with a pair of depth sensors, an RGB sensor and an IR projector. We use a single IR image and a depth map for one identification check. Figures 3 and 4 show examples of the captured IR and depth image, respectively. In the IR image (see the example in Figure 3), the veins stand out distinctly from the other parts of the hand, because the blood in the veins attenuates the IR light differently than other biological tissues. It is important to note that this observation is consistent for all skin types and colours, thus highlighting the generality of VID. The depth map contains the information regarding the distance of different parts of the hand from the camera viewpoint. Recall from Section 1 that we capture images of the human fist, which forces the veins closer to the skin. Also, when an adult makes a fist and places it on a flat surface, the back of the hand is about 5-6 cm from the flat surface. Thus, distinct pixel values are attributed to the hand dorsum in the depth image. This helps in estimating the hand dorsum in the captured images and also facilitates the vein extraction process.
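To make this capture step concrete, the sketch below shows how a synchronized IR frame and depth map could be grabbed from a D415. Note that our implementation used MATLAB; this Python sketch using Intel's pyrealsense2 SDK is for illustration only, and the stream resolution and frame rate are assumed values.

```python
# Illustrative sketch: grab one synchronized IR frame and depth map from a
# RealSense D415 (assumed 640x480 @ 30 fps streams).
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
# Depth stream (16-bit) and the left IR imager (8-bit), at the same
# resolution so that pixels in the two images correspond directly.
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.infrared, 1, 640, 480, rs.format.y8, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    depth = np.asanyarray(frames.get_depth_frame().get_data())   # distances
    ir = np.asanyarray(frames.get_infrared_frame(1).get_data())  # vein contrast
finally:
    pipeline.stop()
```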

| Determining the ROI
As can be seen in Figures 3 and 4, a number of background objects are captured in both images. For segregating the user's hand (fist) from the rest of the objects in the image, we determine the Region of Interest (ROI), which should only capture the user's hand. The ROI is represented as I(x_i, y_i) for x_1 ≤ x_i ≤ x_2 and y_1 ≤ y_i ≤ y_2, where x_i and y_i refer to the row and column indices that approximately capture the subject's hand in both images (i.e. the IR and depth images, with position (0, 0) at the top-left corner), and I is used to generically represent both the IR and depth images. Note that, for a fixed deployment of VID (i.e. where the depth camera is at a fixed distance from the place where the user places the fist), x_i and y_i remain constant. In our experiments, we analysed a number of images and empirically set these values to constant indices that approximately capture the subject's hand, allowing some room for minor variations in the position of the hand. For a significant variation in the camera-to-hand distance, these indices would have to be recomputed (see Section 4.2.3 for details on the impact of camera-to-hand distance). However, we anticipate that most practical deployments of VID would be similar to our set-up in Figure 1, whereby users are instructed to position their hand at a specific position (relative to the camera). Figures 5 and 6 show the resultant images after setting the ROI for the actual images shown in Figures 3 and 4, respectively.

| Estimation of hand boundary
Although Figures 5 and 6 capture the subject's hand, an accurate estimation of the hand boundary is necessary to extract the vein patterns from the hand dorsum. This is because the presence of the background (which still appears in the image after determining the ROI) can impact the vein extraction process. To accurately estimate the hand boundary, we use the depth information, as shown in Figure 7 (same as Figure 6). It is evident that the back of the hand, being closer to the camera (than the flat surface), appears darker in the depth image (marked in Figure 7). Utilizing this information, we use a threshold filter to estimate the hand boundary. The specific filter used is defined as

D_th(i, j) = 1 if th_1 ≤ D(i, j) ≤ th_2, and 0 otherwise,

where D refers to the depth information, and th_1 and th_2, respectively, represent the lowest and highest pixel values in the depth image corresponding to the back of the hand (dark section). For determining the values of th_1 and th_2, we tested a number of images (≈10 images from five enrolled users) by varying these values. Based upon our analysis, we set these values to 5 and 35, respectively. We make allowances for some variations in the camera-to-hand distance (i.e. up to 4 cm, see Section 4.2 for details). Figure 8 shows the output image obtained by applying the threshold filter to the depth image shown in Figure 7. We are only interested in the white portion of the resultant image (Figure 8). Note that, with the help of the depth map, we can easily detect the hand boundary without requiring intensive computation, unlike some prior techniques that employ pre-processing steps to segregate the hand. As the depth image and the IR image have exactly the same dimensions and are recorded simultaneously, the hand boundary in the depth image can be readily mapped onto the IR image. The portion of the IR image that does not belong to the hand dorsum is eliminated by setting the corresponding pixel values to zero. This ensures that our vein extraction algorithm operates only on the hand dorsum and not on the background. Figure 9 shows the resultant image after identifying the hand boundary and removing the background from the IR image originally shown in Figure 5. We refer to this image as I_r in the subsequent discussion.
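As a minimal illustration of the ROI and hand-boundary steps just described (again in Python rather than our MATLAB implementation), the following sketch crops both images to the ROI and applies the threshold filter. The ROI indices X1, X2, Y1 and Y2 are hypothetical placeholders for the empirically chosen constants, and the depth map is assumed to be scaled to 8-bit pixel values so that th_1 = 5 and th_2 = 35 apply.

```python
import numpy as np

# Hypothetical ROI indices for a fixed deployment; in practice these are set
# empirically for the chosen camera mounting (0.6 m above the surface here).
X1, X2, Y1, Y2 = 100, 331, 150, 381

def estimate_hand(depth_img, ir_img, th1=5, th2=35):
    """Crop both images to the ROI, threshold the depth map into a binary
    hand-dorsum mask D_th, and zero out background pixels of the IR image."""
    D = depth_img[X1:X2, Y1:Y2].astype(np.float32)
    I = ir_img[X1:X2, Y1:Y2].astype(np.float32)
    D_th = ((D >= th1) & (D <= th2)).astype(np.uint8)  # 1 on the hand dorsum
    I_r = np.where(D_th == 1, I, 0.0)                  # background set to zero
    return D_th, I_r
```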

FIGURE 8 Estimated hand dorsum

| Extracting vein pattern from hand dorsum
For extracting the vein pattern from the captured images, we leverage the fact that the veins show up differently in the IR image than other portions of the hand, as the blood flowing through the veins attenuates IR light differently than other parts (as discussed earlier). Our vein extraction algorithm (detailed in Algorithm 1) takes both D_th (the hand estimate in the depth image) and I_r (the hand estimate in the IR image) as inputs. For every pixel index p(i, j) in I_r, the mean value of a square of area e × e pixels is computed (p(i, j) represents the top-left pixel of the square that we use to segregate the vein pixels from I_r). The precise value of e depends upon the thickness of the veins and the distance between adjacent veins. We analysed a number of images from different subjects (≈10 from five subjects) and empirically set the value of e to 18. The pixel value in the vein pattern V_p(i, j) equals 1 if the value of the corresponding pixel in I_r is less than the computed mean value and the depth value of the same pixel in D_th(i, j) is equal to 1. This is because the pixel values corresponding to the veins in the IR image are less than the values of adjacent pixels that do not contain veins (as the blood in the veins attenuates the IR light differently). The estimate of the hand boundary in the depth map is used to ensure that each pixel identified as part of the veins (i.e. having a value less than the mean of the e × e square) lies on the hand dorsum (and not in the background). Note that, like the other parameters (i.e. x_i, y_i, th_1, th_2 described above), the value of e is chosen such that slight variations in the camera-to-hand distance do not impact the outcome. However, as stated above (in the ROI discussion), these parameters may change for a significant variation in the camera-to-hand distance. Figure 10 shows the extracted vein pattern obtained by providing I_r (Figure 9) and D_th (Figure 8) as inputs to Algorithm 1. As can be observed, the vein-extraction algorithm returns a binary image representing the vein pattern V_p.

Algorithm 1 Vein Extraction Algorithm
Inputs: D_th, I_r, e            % e is the constant extraction parameter
Result: Vein pattern V_p
while i ≤ r and j ≤ c do        % r and c are the maximum row and column indices
    m ← mean of the e × e square in I_r with top-left pixel p(i, j)
    if I_r(i, j) < m and D_th(i, j) = 1 then
        V_p(i, j) ← 1
    else
        V_p(i, j) ← 0
    end
end
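A direct Python rendering of Algorithm 1 is sketched below, under the assumption that I_r and D_th are the cropped arrays produced in the previous steps; a production implementation would vectorize the local mean (e.g. with a uniform filter) rather than loop over pixels.

```python
import numpy as np

def extract_veins(I_r, D_th, e=18):
    """Algorithm 1: mark pixel (i, j) as vein if it is darker than the mean of
    the e x e square whose top-left corner it is, and it lies on the dorsum."""
    r, c = I_r.shape
    V_p = np.zeros((r, c), dtype=np.uint8)
    # Restrict the loop so the e x e window stays inside the image.
    for i in range(r - e + 1):
        for j in range(c - e + 1):
            local_mean = I_r[i:i + e, j:j + e].mean()
            if I_r[i, j] < local_mean and D_th[i, j] == 1:
                V_p[i, j] = 1
    return V_p
```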

| Connected components in vein pattern
It is evident from Figure 10 that the extracted vein pattern is noisy. To be able to extract distinctive features that can be used for identifying individuals, it is important to eliminate the effect of noise and obtain a clean version of the vein pattern. For this purpose, VID leverages the Connected Components (CC) algorithm [32]. We feed V_p (which is a binary image) to the CC algorithm, which first finds all non-zero pixels in the pattern (i.e. veins), and then uses the flood-fill algorithm to label each of them as part of a particular connected component (labelling refers to identifying which component a particular pixel in the binary image belongs to). Specifically, in our implementation, for labelling each pixel we look at the eight adjoining pixels (up, down, left, right and all corners) to identify conflicts with adjacent components, decide which component a particular pixel belongs to, and update the labels. We refer readers to Refs. [32] and [33] for further details on CC. Once all the non-zero pixels in V_p are labelled and categorised as parts of different CCs, we only retain the largest component and discard the rest. This is based on the assumption that the veins in V_p correspond to the largest connected component, while the other smaller (isolated) components are most likely noise. Figure 11 shows the clean vein pattern V_p′, obtained by applying the CC algorithm to V_p shown in Figure 10. It is evident from Figure 11 that most of the noise is removed while the vein pattern is retained (a short sketch of this filtering step is given at the end of this sub-section).

However, our analysis revealed that a slight variation in hand posture (e.g. a slight tilt in the orientation) while capturing the IR and depth images results in slight variations in the resultant vein pattern. This can be observed in Figures 15-18, which depict four different vein patterns of the same individual (Subject 1). The similarity amongst all the vein patterns is evident; however, some of the patterns do not capture the entire vein structure (e.g. Figures 16 and 17). Converting the vein pattern to a binary sequence of 0s and 1s and then computing a matching score with the enrolled sequence (as done in prior work [15]) is unlikely to provide accurate results in real-world settings where such variations are likely to arise from time to time. Thus, we chose to explore deep learning, which has a proven ability to perform accurate classification even on noisy data. Specifically, we considered two different deep learning approaches, namely CNN and Autoencoders (AE), as they are widely used for similar classification tasks (like digit and word classification [11,12,34]). Next, we briefly discuss the architectures of CNN and AE along with the parameters used for the evaluations.
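The CC-based noise removal referenced above (retain only the largest connected component of V_p) can be sketched as follows, assuming SciPy's ndimage.label as the CC implementation with the 8-connectivity described in the text.

```python
import numpy as np
from scipy import ndimage

def denoise_vein_pattern(V_p):
    """Label 8-connected components of the binary vein pattern and keep only
    the largest one; smaller isolated components are treated as noise."""
    eight_conn = np.ones((3, 3), dtype=int)   # consider all eight neighbours
    labels, n = ndimage.label(V_p, structure=eight_conn)
    if n == 0:
        return V_p                            # nothing detected
    sizes = ndimage.sum(V_p, labels, index=range(1, n + 1))
    return (labels == (int(np.argmax(sizes)) + 1)).astype(np.uint8)
```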

| Convolutional neural network
The architecture of the CNN that we have used is shown in Figure 19. Note that the CNN structure used in VID was obtained empirically by analysing the identification performance with a varying number of layers and values of other parameters. The model is stacked as follows: the input layer, first convolutional layer (C1), first rectified linear unit (ReLU) layer, first pooling layer (P1), second convolutional layer (C2), second ReLU layer, second pooling layer (P2), four fully connected layers (FC1-FC4), and an output layer. The input is the vein pattern (i.e. V_p′) extracted in accordance with the method described in the previous section. The dimensions of the vein patterns (i.e. the input) are [231, 231, 1], which are first passed through a set of convolutional filters (composed of neurons). These filters take as input a part of the vein pattern (generally referred to as the receptive field) and attempt to learn certain features from it. In this way, we scan through the entire vein pattern, learning the localized features of different sub-sections of the pattern. In the first convolutional layer (i.e. C1), we empirically used a total of 20 filters of size [4, 4] with a stride of [1, 1]. No padding was performed as a part of C1. After passing V_p′ through C1, the depth becomes 20, resulting in an output of dimensions [228, 228, 20]. The output of C1 is fed to a ReLU, which performs a threshold operation on the filter's output by setting the negative values to zero and maintaining the positive values. Next, we pass the output of the ReLU through a maximum pooling layer (i.e. P1), which is a downsampling layer and helps in reducing the parameters that need to be learnt. The max pooling layer scans through its input using a window of a particular size that strides with a fixed length and outputs the maximum value of every window. In our implementation, we set the window size to [4, 4] and the stride to [4, 4] in P1. Therefore, the maximum value of every window of size [4, 4] will be the output of P1. As the pooling layer does not change the depth of its input, the output of P1 is of order [57, 57, 20]. Likewise, in the second convolutional layer (C2), we used a total of 4 filters with a size of [5, 5] and a stride of [1, 1]. In C2, we used padding such that the output size is the same as that of the input (i.e. padding = 2). As a result, the output of C2 is of the form [57, 57, 4]. In P2 (i.e. the second max pooling layer), we set the window size to [3, 3] and the stride to [2, 2], leading to an output of dimensions [28, 28, 4].
The output of the second pooling layer (i.e. P2) represents the local features learnt from different parts of the input vein pattern. For identifying the corresponding subject given the vein pattern, these local features are unfolded into a flattened vector of size [1, 3136], which is fed to a sequence of fully connected layers for making the final identification decision. As the name suggests, all the neurons in a fully connected layer are connected to the neurons in the previous layer. It multiplies the input by a weight matrix and then adds a bias vector. We used a total of four fully connected layers, with each layer reducing the size of its input by half. For example, the first fully connected layer takes the flattened vector (i.e. [1, 3136]) and produces an output of size [1, 1568] (i.e. 3136/2 = 1568). Similarly, the final feature vector at the output of the third fully connected layer (i.e. FC3) is of order [1, 392], which is then fed to the last fully connected layer (i.e. FC4), whose output size is equal to the number of enrolled subjects (i.e. 35 in this case). Our analysis shows that a decremental reduction in the number of features in the fully connected layers is particularly helpful for our intruder detection strategy, which is based upon an SVM and works well with a smaller number of extracted features (see intruder detection in Section 3.7), while maintaining the promising performance of VID in a typical usage scenario (i.e. identifying the corresponding subject from the enrolled set given a vein pattern; see results in Section 4). The output of FC4 is fed to an output unit activation function (i.e. the softmax function) that computes the probability of each user in the enrolled set in accordance with the following equation:

y_ij = exp(a_ij) / Σ_{k=1}^{X} exp(a_ik),

where a_ij denotes the FC4 activation for vein pattern i and enrolled user j.
Finally, in the training phase, we take the softmax values and assign each input to one of the X mutually exclusive enrolled individuals by using the cross-entropy cost function (optimized using the SGDM algorithm) for the 1-of-X coding scheme [35]:

l = − Σ_{i=1}^{n} Σ_{j=1}^{X} I_ij ln(y_ij),

where l refers to the loss, n is the number of samples and X refers to the total number of enrolled individuals (i.e. total classes). I_ij is an indicator for the general situation where the i-th vein pattern belongs to user j, and y_ij is the value of the previous layer, that is, the softmax function, for the i-th vein pattern; in other words, it is the probability that the network associates vein pattern i with enrolled user j. In the test phase, the probabilities computed by the softmax function are used to make the final identification decision.
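For reference, the architecture and loss described above translate into the following PyTorch sketch (our implementation used MATLAB; the learning rate and momentum shown for SGDM are assumed values, as the text does not specify them).

```python
import torch
import torch.nn as nn

class VIDCNN(nn.Module):
    """CNN described above; dimensions in comments are (channels, height, width)."""
    def __init__(self, num_classes=35):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=4, stride=1),             # (20, 228, 228)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=4, stride=4),                 # (20, 57, 57)
            nn.Conv2d(20, 4, kernel_size=5, stride=1, padding=2),  # (4, 57, 57)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                 # (4, 28, 28)
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                 # 4 * 28 * 28 = 3136
            nn.Linear(3136, 1568),        # FC1
            nn.Linear(1568, 784),         # FC2
            nn.Linear(784, 392),          # FC3: features reused by the intruder SVM
            nn.Linear(392, num_classes),  # FC4
        )

    def forward(self, x):                  # x: (batch, 1, 231, 231)
        return self.fc(self.features(x))   # logits; loss below applies softmax

model = VIDCNN()
criterion = nn.CrossEntropyLoss()  # softmax + cross entropy for 1-of-X coding
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGDM (assumed hyperparameters)
```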

| Autoencoder
In its general form (illustrated in Figure 20), an AE is a neural network that learns efficient data codings in an unsupervised manner and reconstructs the input at the output with a minimal Mean Square Error (MSE) [11]. An AE comprises an encoder and a decoder, represented by the following transitions ζ and ϱ, respectively:

ζ : X → F and ϱ : F → X, trained such that ζ, ϱ = argmin ||x − ϱ(ζ(x))||²,

where X denotes the input space (vein patterns) and F the learned feature space.
If the hidden layer is smaller than the number of pixels in the input vein pattern, the AE can learn a compressed representation of the input vein pattern, which can serve as a feature vector for the target individual. The complexity is proportional to the number of neurons required for extracting distinctive features (the hidden layer), which in turn depends on the size of the input image. To balance this trade-off, we tested the impact of varying the size of the input vein patterns on the execution time and identification accuracy. Our experiments revealed that a size of 50 × 50 pixels achieved a reasonable balance. Moreover, vein patterns smaller than this size degraded the identification accuracy significantly. For feature learning, we tested different sizes for the hidden layer. Our analysis revealed that setting the size of the hidden layer to 1200 (i.e. approximately half the number of input pixels) results in better classification accuracy. Our experiments also showed that feeding these features to a secondary AE results in improved identification accuracy. Therefore, we designed a Stacked-Autoencoder model, with two autoencoders connected in series, as shown in Figure 21. Stacked-Autoencoders have also been used in prior work [36] on user identification and achieved good accuracy. The second AE takes the features learned by the first AE as input and further compresses them. The size of the hidden layer in the second AE is analytically set to one-sixth (i.e. 200) of the size of the features learned by the first AE. These features are then passed to a softmax function for classifying the individual. Figure 21 shows the architecture of the Stacked-Autoencoder model that we have used in our implementation. A complete list of the parameters (for both CNN and Stacked-Autoencoders) used in our implementation is presented in Section 4.1.
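A minimal PyTorch sketch of the stacked model, with the layer sizes given above (2500 → 1200 → 200 → 35), is shown below. The sigmoid activations are an assumption (a common default for this kind of autoencoder); in our pipeline, each AE is first trained to reconstruct its input with minimal MSE before the encoders and the softmax layer are stacked.

```python
import torch
import torch.nn as nn

class StackedAE(nn.Module):
    """Stacked-autoencoder identifier: each AE is trained greedily to
    reconstruct its input (minimizing MSE via dec1/dec2), after which the
    two encoders and the softmax head are stacked for classification."""
    def __init__(self, num_classes=35):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(2500, 1200), nn.Sigmoid())  # AE 1 encoder
        self.dec1 = nn.Sequential(nn.Linear(1200, 2500), nn.Sigmoid())  # AE 1 decoder
        self.enc2 = nn.Sequential(nn.Linear(1200, 200), nn.Sigmoid())   # AE 2 encoder
        self.dec2 = nn.Sequential(nn.Linear(200, 1200), nn.Sigmoid())   # AE 2 decoder
        self.classifier = nn.Linear(200, num_classes)                   # softmax head

    def forward(self, x):             # x: (batch, 2500) flattened 50x50 pattern
        return self.classifier(self.enc2(self.enc1(x)))  # logits
```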

| Identification of the individual not in the enrolled list (i.e. intruder detection)
Given a vein pattern V_p′, VID attempts to identify the corresponding individual from a list of enrolled individuals. However, if a given vein pattern does not belong to any of the enrolled individuals, the trained model will still associate it with one of the individuals, namely the one whose vein pattern has the closest resemblance to the given vein pattern V_p′. In this section, we present a strategy that can help VID overcome this artefact and correctly detect an intruder. Assume that VID associates a given vein pattern V_p′ with the user U_ex from the list of X enrolled users, U_e1, U_e2, …, U_eX. To distinguish between the instance when V_p′ genuinely belongs to person U_ex as opposed to an intruder, we design a secondary binary SVM classifier. This classifier is trained for two classes with t patterns each (t refers to the number of patterns used for training the original model). The first class refers to the user U_ex (referred to as true-data), and the features of the t patterns belonging to this user that are already learned by the deep learning model (e.g. CNN) are used for the training. The second class refers to intruders (referred to as false-data) and contains the features of all other X − 1 users excluding U_ex, with t/(X − 1) patterns from each of these users. This secondary classifier is pre-trained for each enrolled user. If the given vein pattern V_p′ is associated with an enrolled user U_ex by the deep learning model, the secondary classifier corresponding to this user U_ex is selected and V_p′ is fed to it. If the secondary classifier also associates the given vein pattern V_p′ with the user U_ex, then it implies that the person being identified is genuinely U_ex. Otherwise, the person is assumed to be an intruder and authentication is unsuccessful.
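A sketch of the per-user secondary classifier follows, using scikit-learn's SVC on the FC3 feature vectors; the RBF kernel and the helper names are assumptions, as the text does not specify the SVM configuration.

```python
import numpy as np
from sklearn.svm import SVC

def train_intruder_svm(fc3_features, labels, target_user, t):
    """Train the per-user binary classifier: 'true' = t FC3 feature vectors of
    target_user, 'false' = t vectors drawn evenly from the other enrolled users."""
    true_X = fc3_features[labels == target_user][:t]
    others = np.unique(labels[labels != target_user])
    per_user = max(1, t // len(others))      # t/(X - 1) patterns per other user
    false_X = np.vstack([fc3_features[labels == u][:per_user] for u in others])
    X = np.vstack([true_X, false_X])
    y = np.concatenate([np.ones(len(true_X)), np.zeros(len(false_X))])
    return SVC(kernel="rbf").fit(X, y)       # kernel choice assumed

# At test time: if the deep model outputs user u for pattern V_p', accept only
# when svm[u].predict(fc3(V_p')) == 1; otherwise flag the person as an intruder.
```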

| EXPERIMENTAL EVALUATION
In this section, we present the evaluation set-up, experimental methodology and performance of VID.

| Evaluation setup and experimental methodology
We used an off-the-shelf Intel RealSense Depth Camera D415 (laser wavelength of 850 nm ± 10 nm [37]) to capture the images of the subjects' hand dorsum. This camera has a five-lane MIPI Camera Serial Interface to compute the real-time depth map, along with an active infrared projector to illuminate the hand dorsum and enhance depth perception [10]. We interfaced this camera with an HP Folio 9480m laptop with a Core i5 processor. Figure 22 shows the experimental setup of the VID system. We mounted the camera at a height of 0.6 m above a flat surface upon which the users place their hand during the data collection process. Note that we mounted the camera at a height to facilitate the data collection; in practice, any other orientation may also work (e.g. mounted vertically on a wall facing the user). We also varied the distance between the camera lens and the hand to analyse the effect on the accuracy of VID (see Section 4.2). We explicitly marked a square of dimensions 12 × 12 cm on the flat surface to serve as a guide for the participants to place their hand. This helps in determining the ROI for extracting the hand from the captured images. We recruited a total of 35 volunteers as test subjects (27 male and 8 female, with countries of origin including Australia, Austria, China, Brazil, Mexico, India, Pakistan, Indonesia, Iran, Sri Lanka, Spain and Saudi Arabia). All of these volunteers were Ph.D. students or postdocs aged 23-40 years. The volunteers were told to use one of their hands (as per their preference) for all experiments. We instructed volunteers to place their hand by making a fist at the centre of the marked square, as shown in Figure 22. Recall from Section 3.1 that the fist gesture is particularly well suited for detecting the vein patterns in the captured images. No other specific guidelines were given regarding the orientation of the hand relative to the camera. Thus, the images collected include natural variations in orientation (due to the users changing the hand orientation intermittently during the data collection) and are representative of real-world usage of such a system. We collected approximately 500 (precisely 490 ± 10) images from each of the 35 volunteers. Once the data collection was completed, we first extracted the vein patterns from the collected images (using the steps outlined in Section 3). These vein patterns were then used for training and testing the deep learning models. The parameters for the CNN and Stacked-Autoencoders used in our implementation are shown in Tables 1 and 2, respectively. Out of a total of 500 (i.e. 490 ± 10) patterns from each volunteer, 300 were used for training the models (approximately 10,500 images in total), while the remaining 200 (i.e. 190 ± 10) patterns per volunteer (approximately 7,000 images in total) were used for testing the performance of the VID system. We also varied the number of training patterns to analyse the impact of the training data upon the performance (see Section 4.2). The vein patterns of an individual not included in the enrolled list are detected using the strategy described in Section 3.7. For evaluating the performance of the intruder detector, we used the vein patterns of four subjects who were not included in the list of (35) enrolled individuals (see details in Section 4.2).
The vein extraction algorithm was implemented in MATLAB R2019a running on the HP Folio laptop (with a 2 GHz processor and 8 GB RAM), whereas the deep learning-based user identification models were implemented on two different hardware platforms: (i) the same Folio laptop referenced above, using MATLAB R2019a, and (ii) a GPU-accelerated computing server, using MATLAB R2017b (with 2 × 12-core/24-thread processors and an NVIDIA TITAN X Pascal GPU), available at CSE, UNSW Sydney [38]. This allowed us to evaluate the time required for training and testing the deep learning models on different hardware (see Practical Considerations in Section 4.2).

| Evaluation results
The results for both CNN and Stacked-Autoencoders are very similar. The accuracy for identifying an individual from the 35 test subjects is around 99.8%. The precision and recall were both around 99.7%, resulting in an F_1 score of 99%. Note that the F_1 score of VID is 9% better than the state-of-the-art [15]. Tables 3 and 4 show the confusion matrices for the CNN and Stacked-Autoencoders models, respectively. These matrices show that both models can accurately identify the target individuals from a set of 35 enrolled users. It is noteworthy that the total population size evaluated herein is approximately 7 times that used in Refs. [4] and [5] (which used human gait information manifested in pervasive WiFi signals for identification in smart spaces) and Ref. [39] (which uses the cardiopulmonary activity of an individual for identification in smart spaces), while also outperforming them by approximately 15%. This shows the feasibility of VID for both small (three to six inhabitants) and medium-large (6-35 inhabitants) smart spaces. Recent statistics reveal that in OECD countries, an average household has between two and six inhabitants [40], while a typical micro-enterprise has between 7 and 35 employees. Figures 23 and 24 show the ROC curves, which graphically illustrate the diagnostic ability of VID for identification across 35 enrolled users at varying thresholds by plotting the FPR versus the TPR. All curves cluster around the top-left corner, which confirms the accurate classification capabilities of VID (note that Class 'i' in Figures 23 and 24 corresponds to User 'i').

| Identification of an intruder
For evaluating the ability of VID to detect an intruder, we trained the secondary classifier described in Section 3.7 for all enrolled users (i.e. U_e1, U_e2, …, U_e35) with 50 patterns for each of the two classes (i.e. true-data and false-data, see Section 3.7 for details). The first class (true-data) contains the features of 50 patterns of a user U_ex (i.e. the user to whom the classification model associates a given vein pattern), whereas the second class (false-data) contains the features of 10 randomly selected other users (5 patterns from each, resulting in 50 patterns). For testing, we used a dataset comprising 20 vein patterns belonging to four non-enrolled users (i.e. not in the enrolled list U_e1, U_e2, …, U_e35) and five patterns belonging to the user U_ex. Table 5 shows the performance of the intruder detection methodology (with features learnt by the CNN at FC3). The accuracy of VID in detecting intruders was high for most enrolled users; however, it was lower for five users, U_e2, U_e6, U_e10, U_e11 and U_e28 (80%, 76%, 80%, 80%, and 76%, respectively). This may be attributed to the fact that the SVM is unable to discriminate the features of some of the intruder's patterns from those of these five enrolled users. However, even for these five subjects, the majority of the intruder's patterns were correctly identified, resulting in a high overall accuracy. This demonstrates the efficacy of our approach in detecting intruders.

| Impact of number of training patterns
Recall from Section 4.1 that we used 300 vein patterns per user to train both models. In this sub-section, we analyse the impact of varying the size of the training data upon the performance. We varied the number of training patterns from 100 to 300 in increments of 50 samples in five different experiments, with a total of 35 enrolled users as before. Figure 25 presents the results. It is evident that the percentage accuracy of both CNN and Stacked-Autoencoder improves slightly with an increase in the size of the training set. The improved accuracy of the CNN can be attributed to the fact that it is better at learning localized features with more training samples. In contrast, the accuracy of the AE is almost invariant beyond 200 training samples, which may be attributed to the fact that it learns a compressed representation of the input pattern which can be transformed back to the pattern with minimum MSE. Nevertheless, both models achieved high accuracy with even a small training set. For example, the accuracy is 99.3% for the CNN and 99.6% for the AE with only 100 training samples. This suggests that VID is particularly well-suited for real-world deployment. As mentioned in Section 1, VID requires users to undertake a one-time enrolment procedure. A smaller number of training samples implies a swifter enrolment process.

| Impact of distance between camera and hand
Recall from Section 4.1 that the distance between the camera and the hand was fixed at 0.6 m. As discussed earlier, the camera-to-hand distance affects the values of a few parameters (i.e. th_1, th_2 and e, detailed in Section 3) used for extracting the vein patterns. Herein, we analyse the impact on accuracy of varying the camera-to-hand distance during testing, assuming that the aforementioned thresholds were derived with this distance set to 0.6 m. For this purpose, we collected six test images from a subject U_x at each camera-to-hand distance from 48 to 56 cm in increments of 4 cm (resulting in 18 images), whereas the training samples (i.e. 50) were collected at a distance of 60 cm. Figure 26 shows the accuracy of VID at varying distances. It is evident that the accuracy decreases as the position of the hand is moved further away from the position used to collect the training samples. For example, all the test samples collected at a distance of 56 cm were correctly classified (i.e. 4 cm from the enrolment position). However, the accuracy decreased to 83% at 52 cm, which further dropped to 50% at 48 cm. This demonstrates that although VID accommodates some minor variations in the camera-to-hand distance (up to 4 cm), a readjustment of some empirical parameters may be required to accomplish high accuracy when the camera-to-hand distance is significantly changed (compared with the enrolment phase).

| Practical considerations
Herein, we evaluate the time required for training (with 100 samples) and testing the deep learning models, considering the two different hardware platforms described earlier in Section 4.1. Figures 27 and 28 report the measured training and testing times, respectively, on the two platforms.

| Comparison with other human identification approaches
In this sub-section, we compare VID with other popular human identification systems crafted specifically for smart spaces. Table 6 presents a summary of this comparison. As VID leverages a depth camera, which is increasingly appearing on numerous smart devices and is thus likely to be ubiquitous in smart spaces, a comparison with other vein-pattern-based approaches that make use of purpose-built hardware (e.g. a matrix of LEDs to illuminate the hand for extracting vein patterns) is irrelevant. In addition, VID makes use of both the depth map and the IR image captured simultaneously by an off-the-shelf depth camera. To the best of our knowledge, no other publicly available dataset has these two types of images; thus a cross-comparison with other vein-pattern-based approaches is both infeasible and unnecessary. One exception to this is VeinDeep [15], which makes use of the Microsoft Kinect V2 (officially discontinued) to capture both a depth image and an IR image for identifying the corresponding subject. As indicated in Sections 1 and 2, VID outperforms VeinDeep with a 9% better F_1 score, and is evaluated with almost 1.75× more users (i.e. 35 vs. 20). In view of this elaboration, the rest of this sub-section is focused on comparing the performance of VID with other popular human identification approaches proposed for their potential applications in smart spaces. One popular approach for human identification in smart spaces leverages WiFi signals to capture human gait patterns and utilizes the related perturbations to establish the identity of the corresponding person. Although this approach is non-intrusive, i.e. it does not necessitate users carrying any special hardware for identity establishment, it still requires some deliberate actions for accomplishing the identification. For example, the approaches presented in Refs. [4], [5], and [36] require the user to walk along a long pre-defined path, which may not be available in constrained smart spaces. In addition, all of these approaches are evaluated for group sizes of 2-11 users, i.e. only 31% of VID, and all have considerably lower accuracy compared with VID (i.e. Refs. [4] and [5] have less than 80% accuracy for a group of only five to six users, and Ref. [36] has around 94% accuracy for 11 users). Similarly, authors in Ref. [39] used WiFi signals to capture the cardiopulmonary activity of a person and used it for identification. However, this approach is also evaluated for small groups only, that is, two to five users, and has an accuracy of around 70% for an enrolled group of five users. All of the aforementioned approaches require some deliberate action from the user (i.e. walking in Refs. [4], [5], and [36], and sitting in front of WiFi devices in Ref. [39]), which is akin to VID (i.e. it requires the user to place the hand dorsum on a flat surface). However, as described earlier, VID is evaluated for a much bigger enrolled set and achieves better accuracy, and is thus deemed better than the other state-of-the-art approaches. Similarly, other approaches leverage smart wearables (e.g. smart watches) to accomplish human identification in smart spaces. For example, authors in Ref. [42] used wrist-worn inertial sensors to capture data from daily activities and used them to identify the corresponding person from a group of 29 subjects, demonstrating an accuracy of around 71-88% (i.e. the evaluated group size is smaller than VID's and the accuracy is lower). Likewise, authors in Ref. [43] used floor-embedded sensors to identify the subjects. Not only does this approach appear cumbersome, unlike VID which relies on a depth camera that can easily be embedded in smart devices such as a smart TV or refrigerator, it is also only evaluated on 11 subjects with an accuracy of around 88% (unlike VID, whose accuracy is around 99% for a much bigger enrolled set, i.e. 35). This discussion suggests that VID outperforms the other state-of-the-art mechanisms proposed for human identification in smart spaces.

| CONCLUSION
An investigation is undertaken into the use of the vein patterns embedded in the hand's dorsum for uniquely identifying an individual. We utilize a commodity depth camera to capture the vein patterns by leveraging the depth map and IR image, and then design two different deep learning models, namely CNN and Stacked-Autoencoders, to identify an individual from an enrolled set. We also crafted a methodology for identifying intruders whose patterns are not enrolled. Our extensive evaluations revealed that VID can identify an individual with an average accuracy above 99% from a group of up to 35 subjects. Our intruder detection strategy achieved an accuracy of 96%. We implemented VID on two different hardware platforms, and our observations indicate that the deep learning model(s) can be trained on a high-performance server, while the identification can readily be performed on an embedded platform. In the future, we plan to conduct a comparative analysis of VID under multiple scenarios, such as different orientations of patterns, hand-boundary detection without the depth map, the impact of lighting, and performance comparison with other state-of-the-art techniques. In addition, we also plan to test VID on much bigger group sizes (i.e. exceeding 50) so as to analyse its performance in real-world situations.

TABLE 6 Comparison of VID with state-of-the-art human identification approaches for smart spaces

Approach | Sensing hardware | Number of users | Comparison
VeinDeep [15] | Microsoft Kinect V2 (depth and IR images) | 20 | VID has a 9% better F_1 score with 1.75× more users
WiFi-ID [4] | WiFi signal (gait pattern) | 2-6 | Accuracy is less than 80% for 6 users
WiWho-ID [5] | WiFi signal (gait pattern) | 2-6 | Accuracy is around 80% for 6 users
Smart user authentication [36] | WiFi signal (gait pattern) | 2-11 | Accuracy is around 94% for 11 users
CP-ID [39] | WiFi signal (cardiopulmonary activity) | 2-5 | Accuracy is around 70% for 5 users
Continuous user identification [42] | Wrist-worn inertial sensor | 29 | Accuracy is around 71%-88%
Person identity [43] | Floor-embedded sensors | 11 | Accuracy is around 88%
VID (this work) | Commodity depth camera | 35 | Accuracy of around 99%