Recent Advances in Artificial Intelligence Sensors

The development and deployment of artificial intelligence (AI) are witnessing significant growth. Driven by the great versatility of emerging computer science and materials science, various AI sensors provide cost-effective approaches for a wide range of monitoring applications toward the realization of smart homes and personal healthcare. Advanced AI sensors combine multiple sensing elements capable of detecting multidimensional information with human-brain-like computation devices for data processing. Herein, this review outlines recent advances in the development of AI sensors. It first introduces the materials, fabrication methods, and algorithms of current AI sensors and their applications, i.e., complementary metal oxide semiconductor image sensors for computer vision, microelectromechanical systems microphone sensors for voice recognition, and wearable sensors for gesture recognition. Then, recent advances in AI wearable sensors and self-powered sensor systems are highlighted. Next, current developments in neuromorphic computing systems, multimodality, and digital twins are reviewed. Last, a perspective on future directions for further research is provided. In summary, the trend in advanced AI sensors is toward complementarity between edge computing and cloud computing, which shows great potential in applications such as smart buildings, individual healthcare, and the Internet of Things.


Introduction
With the significant advancement of the Internet of Things (IoT) and artificial intelligence (AI), the Fourth Industrial Revolution (Industry 4.0) is revolutionizing the way companies upgrade and transform manufacturing technologies to realize smart digital factories. [1][2][3][4][5] Automatic machines and robotics play a vital role in our daily lives and are essential components for manufacturers, which leverage AI to perform cloud computing and data analytics in their production facilities and throughout their operations. [6][7][8][9] The fast development of AI started in 2012, benefiting from improvements in computation power and speed, in which graphics processing units (GPUs) accelerated the design of various deep learning algorithms. Deep learning has made great achievements in speech, image, and other fields, and different architectures continuously emerge to achieve better accuracy on specific target tasks. In 2012, AlexNet [10] was designed to separate the deep neural network into two parts to make use of the computing power of multiple GPUs. With AlexNet, the accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 was about 84.6%. In 2014, VGGNet [11] achieved a higher accuracy of 92.7% owing to the deeper depth of its neural network. At the same time, generative adversarial networks (GANs) [12] were presented as a powerful subclass of generative models for image generation tasks. Then, in 2015, ResNet [13] reached an accuracy of 96.4%, which is beyond the average human level. In addition to the achievements in image processing, several fundamental deep learning frameworks, such as the deep neural network (DNN), [14] convolutional neural network (CNN), [15,16] transformer, and recurrent neural network (RNN), [17] have been applied in the fields of visual art processing, automatic speech recognition, natural language processing, healthcare informatics, etc. [18][19][20][21][22][23] The burst of sensor technology builds up the other cornerstone of Industry 4.0. Sensor devices are equipped with the function of responding to physical stimuli (such as heat, light, sound, pressure, magnetism, motion, etc.) or chemical variations (such as humidity, relative humidity, volatile organic compounds, etc.) and transmitting a resulting impulse to a measurement and control system. [24][25][26][27][28][29] In addition to the most common complementary metal oxide semiconductor (CMOS)-based image sensor for computer vision and the microelectromechanical systems (MEMS)-based microphone for voice recognition, other MEMS sensors (such as accelerometers, gyroscopes, pressure sensors, tactile sensors, biosensors, etc.) have been widely applied in many applications, [30][31][32][33][34] especially wearable electronics, due to their advantages of small size, low power consumption, low cost, high reliability, and robustness. In addition, piezoelectric- and triboelectric-based sensors are investigated as self-powered sensors that can harvest energy from the ambient environment and thereby eliminate the external power supplies required for peripheral electronics and user interfaces. This feature has accelerated the development of wearable sensors using piezoelectric and triboelectric materials. [35][36][37][38][39][40][41][42]
Wearable sensors combined with AI data analytics can capture signals of muscle deformation, joint bending, temperature change, heartbeat frequency, etc., and such information is crucial and widely applied in healthcare, environmental monitoring, human-machine interaction (HMI), and plant monitoring applications (Figure 1).
Since the tremendous number of sensors will generate a large amount of data and bring a heavy load to networks, some IoT applications may not be supported by cloud computing under the current IoT framework. [43][44][45][46][47] In addition, sending all the data to the cloud prolongs the response time. In the 1990s, Akamai launched the content delivery network (CDN), the original form of edge computing, which introduces nodes at locations geographically closer to the end user for the delivery of cached content such as images and videos. [48,49] Nowadays, edge computing, as one of the top technologies, is being developed for building sensor networks with shorter response times, more efficient computing, and smaller transmission power consumption, while the current processors are often constrained to be compact, mobile, and battery powered. [50][51][52] Hence, it is crucial to develop new computation units and AI algorithms that compute on compact edge devices operated at low power consumption. Therefore, neuromorphic computing is presented to provide a new computing architecture based on the biological brain, since the average human brain contains between 80 and 100 billion neurons, each of which works highly efficiently and asynchronously to provide massively parallel processing. [53][54][55][56][57][58] Spiking neural networks (SNNs) are one of the most promising platforms to emulate biological neurons, i.e., a neuron fires and transmits a signal to other neurons when its membrane potential reaches the threshold. [59][60][61][62] On the other hand, due to the limited functionality that a single sensor can have, multimodality becomes a solution to this problem by combining different sensors to achieve functional diversity in a system. Human biological sensory systems can obtain more comprehensive information and make more reliable reactions from multiple signal sources than from a single signal. [63][64][65] Therefore, the artificial sensory neuron is a promising research direction for mimicking the human brain, ranging from synaptic systems on the hardware side to multimodality algorithms on the software side.
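The threshold-and-fire behavior that SNNs emulate can be summarized by the canonical leaky integrate-and-fire (LIF) model. The following minimal Python sketch, with illustrative parameter values not taken from any cited work, shows a membrane potential integrating an input current and emitting a spike whenever the threshold is reached.

```python
import numpy as np

def lif_neuron(input_current, dt=1e-3, tau=20e-3, v_rest=0.0,
               v_thresh=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron; return spike times (s)."""
    v = v_rest
    spikes = []
    for t, i_in in enumerate(input_current):
        # Leaky integration of the membrane potential toward the input.
        v += dt / tau * (v_rest - v) + i_in * dt / tau
        if v >= v_thresh:          # fire when the threshold is reached
            spikes.append(t * dt)  # emit a spike to downstream neurons
            v = v_reset            # reset the membrane potential
    return spikes

# A constant suprathreshold input produces a regular spike train.
current = np.full(1000, 1.5)       # 1 s of input sampled at dt = 1 ms
spikes = lif_neuron(current)
print(len(spikes), spikes[:3])
```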
Based on the development of edge computing sensor networks integrated with advanced AI algorithms and sensor technologies, the digital twin (DT) is presented for improving work productivity, interactive learning environments, and convenient virtual communication. [66,67] DTs make product design a more digitalized process by emulating physical systems in virtual space and reflecting their real-time status in augmented reality (AR)/virtual reality (VR), which helps users better control and monitor the smart system. [68][69][70] Recently, the DT concept has been extended from satellites to manufacturing, smart homes/buildings, and smart farming. In addition to applying DTs to a diversified range of applications including entertainment, healthcare, industrial design, communication, etc., there are problems of resource waste and environmental pollution in agricultural production that need the help of DTs. [71][72][73][74] These problems urgently demand a major change in the agricultural production mode, and the combination of DTs and agriculture can pave a feasible way to solve them. The integration of smart farming and DTs can achieve more comprehensive data analytics to continuously monitor growth status and the localized microclimate in real time. [75][76][77][78] The emergence of DTs enables users to identify problems in advance and schedule predictive maintenance of urban farming at the right time.
This review focuses on the development of advanced AI sensors. In Section 2, we introduce advanced image sensors and cutting-edge AI analysis. In Section 3, microphone sensors that can realize voice recognition are introduced. In Section 4, other types of sensors are presented. In Section 5, advanced wearable sensors integrated with AI data analytics are reviewed. In Section 6, advances in self-powered sensing systems are provided. In Section 7, neuromorphic computing as the core technology of edge computing is introduced. In Section 8, various multimodality systems are presented. In Section 9, DTs and their applications are shown as a future perspective. Lastly, a short conclusion is provided in Section 10.

Advanced Image Sensors
CMOS image sensors were developed between the early and late 1970s. Although the performance of early CMOS devices was unacceptable, advances in CMOS design from the 1990s yielded chips with smaller pixel sizes, reduced noise, more capable image processing algorithms, and larger imaging arrays. [79][80][81] Recently, the rapid development of CMOS image sensors and AI algorithms for computer vision has enabled computers/machines to process visual data like humans. [82][83][84][85] Among various artificial neural networks (ANNs), the CNN plays an increasingly important role in computer vision because of the powerful feature extraction capability of the convolutional layer. [86,87] However, the numerous multiply-accumulate (MAC) operations and the huge storage space for a CNN's parameters make it difficult for terminals to complete the computation independently. Therefore, the current data analysis of image sensors is mainly realized in cloud computers even though the data acquisition is carried out by the image sensor, which leads to long delays, high power consumption, and wasted communication bandwidth and storage memory. [88] A promising solution for advanced CMOS image sensors is to integrate the computing units and the sensing units together to overcome the aforementioned issues. [89] For example, as shown in Figure 2a, Song et al. proposed a processing-in-pixel (PIP) architecture for CMOS image sensors that executes the convolution operations before the column readout circuit, [90] which improves the resource utilization of the deep learning accelerator (DLA) to significantly reduce the overall power consumption. The fill factor and computing efficiency are improved by highly efficient convolution operations in pixels, while the MAC operations are achieved by pulse width modulation (PWM) and charge redistribution, and the convolution operations are realized by reconfigurable switching for parallel computing and low-level feature extraction. Finally, the proposed PIP architecture can realize massively parallel convolution operations to generate one output feature map with a filter size of 3 × 3 × 3 and supports 60 frames and 128 × 128 resolution with an output channel size of 64. The computational efficiency can be as high as 3.37 TOPS/W at the 8-bit weight configuration.

Figure 1. Overview of AI sensors and their applications. Image sensor: Reproduced with permission. [79] Copyright 2005, IEEE. MEMS sensor: Reproduced according to the terms of the CC BY license. [122] Copyright 2021, The Authors, published by MDPI. Wearable sensor: Reproduced with permission. [141] Copyright 2019, SAGE Publications; Reproduced with permission. [38] Copyright 2018, American Chemical Society; Reproduced with permission. [142] Copyright 2017, American Association for the Advancement of Science; Reproduced with permission. [198] Copyright 2020, American Chemical Society. Deep learning: Reproduced with permission. [222] Copyright 2015, Springer Nature. Digital twin: Reproduced with permission. [197] Copyright 2020, Springer Nature; Reproduced under terms of the CC-BY license. [199] Copyright 2022, The Authors, published by arXiv; Reproduced with permission. [200] Copyright 2022, Elsevier.

Figure 2. Advanced image sensor hardware. a) A processing-in-pixel (PIP) architecture for CMOS image sensors. Reproduced with permission. [90] Copyright 2022, IEEE. b) A hybrid architecture based on a 2D memristor crossbar array and a CMOS integrated circuit. Reproduced with permission. [91] Copyright 2022, Springer Nature. c) A multifunctional infrared (IR) image sensor based on an array of black phosphorus programmable phototransistors (bP-PPT). Reproduced with permission. [95] Copyright 2022, Springer Nature. d) An integrated end-to-end photonic deep neural network (PDNN). Reproduced with permission. [96] Copyright 2022, Springer Nature.
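To make the in-pixel workload of the PIP architecture concrete, the sketch below reproduces in plain numpy the multiply-accumulate (MAC) pattern that the pixel array executes in the analog domain via PWM and charge redistribution. The frame size here is deliberately tiny and illustrative; only the 3 × 3 × 3 filter shape and the 64 output channels follow the configuration quoted above.

```python
import numpy as np

def conv2d_valid(image, filters):
    """Naive multichannel 2D convolution (valid padding), i.e., the MAC
    workload that a processing-in-pixel array executes per output pixel."""
    h, w, c = image.shape
    n, kh, kw, kc = filters.shape
    assert c == kc
    out = np.zeros((h - kh + 1, w - kw + 1, n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]   # local receptive field
            # One multiply-accumulate per filter tap, per output channel.
            out[i, j, :] = np.tensordot(filters, patch,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

# Illustrative sizes only: a tiny 8x8 RGB frame and 64 filters of 3x3x3.
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
kernels = rng.random((64, 3, 3, 3))
print(conv2d_valid(frame, kernels).shape)   # -> (6, 6, 64)
```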
Additionally, Kumar et al. designed a hybrid architecture based on a 2D memristor crossbar array and a CMOS integrated circuit (Figure 2b), [91] which shows potential for many classification tasks on audio, image, video, and so on. The proposed hybrid system architecture leverages emerging memristive technologies made of 2D materials for edge computing and consists of a CMOS encoder chip followed by a memristor decoder chip. The two units, based on the local receptive field-based extreme learning machine (LRF-ELM) algorithm, work in tandem while being isolated from one another. Among them, the CMOS encoder chip implements the ELM algorithm and comprises an ELM encoder, a row-select encode unit, a bias generator unit, and a control unit. The memristor decoder chip performs the MAC operation between the input feature map and the stored weights and consists of a row-select decoder, a memristor crossbar array, and a mixed-signal interface unit. As a result, the proposed hybrid architectural framework achieves nonlinearity with a minimum transistor count. In addition, the advantages of small footprint and low power allow it to be used as a multistate memristive device for edge computing.
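The decoder chip's MAC step relies on a standard property of crossbar arrays: with weights stored as conductances, Ohm's and Kirchhoff's laws compute a vector-matrix product in a single analog step. A minimal numpy sketch of this idea is shown below, with illustrative values and a differential conductance pair for signed weights (a common scheme, not necessarily the one used in ref. [91]).

```python
import numpy as np

# A crossbar stores weights as conductances G (siemens); applying input
# voltages v to the rows yields column currents i = G.T @ v, i.e., an
# analog MAC in one step. Values are illustrative only.
rng = np.random.default_rng(1)
G_pos = rng.uniform(1e-6, 1e-4, size=(16, 8))   # conductances for + weights
G_neg = rng.uniform(1e-6, 1e-4, size=(16, 8))   # conductances for - weights
v = rng.uniform(0.0, 0.2, size=16)              # input feature map as row voltages

i_out = G_pos.T @ v - G_neg.T @ v               # differential column currents
print(i_out)                                     # one MAC result per column
```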
In addition to using electrical circuits to realize edge computing, optical in-sensor computing is emerging thanks to its speed-of-light latency and low energy consumption. Among candidate materials, 2D semiconductors have advantages for realizing such AI image sensors, including tunable electrical and optical properties and amenability to heterogeneous integration. [92][93][94] Therefore, Lee et al. reported a multifunctional infrared (IR) image sensor based on an array of black phosphorus programmable phototransistors (bP-PPT) for broadband optoelectronic edge computing (Figure 2c). [95] The proposed bP-PPT array is programmed electrically and optically by utilizing the stored charges in the gate dielectric stack, achieving a programming precision with a resolution higher than 5 bit based on the charge-trapping mechanism. Therefore, the bP-PPT array combines the functions of multispectral imaging and analog in-memory computing for image recognition. The programmability of the bP-PPT array enables edge detection in optical images over a broad infrared band. The demonstrated 5-bit programming precision makes the proposed devices well suited for edge computing due to their low power consumption and low latency; the approach can be extended to a broader infrared range and further improved by heterogeneous integration of bP with other 2D materials, or optimized for a specific spectral range by varying the bP thickness. Overall, the demonstrated optoelectronic bP-PPT array can realize more complex DNNs for image sensors distributed with edge computing, which is promising for distributed and remote sensing applications in industry, smart homes/buildings, and smart farming.
Another example was proposed by Ashtiani et al., as shown in Figure 2d. [96] The authors reported an integrated on-chip photonic deep neural network (PDNN) for end-to-end image classification. First, an array of grating couplers receives the target images as input pixels and couples the optical waves into the corresponding nanophotonic waveguides. The light then propagates through the neurons of different layers on the PDNN chip. Within each neuron, the linear computation is performed optically and the nonlinear activation function is realized optoelectronically, which enables the classification of a single image in under 570 ps. Because the supply light is uniformly distributed into the same optical output range for each neuron, the proposed PDNN has good scalability for enlarging the CNN structure. Finally, the PDNN chip achieves accuracies higher than 93.8% and 89.8% for two-class and four-class classification of handwritten letters, respectively. Implementing the entire optical CNN on a single chip eliminates the challenges of conventional image recognition, which requires analog-to-digital conversion and large memory modules, providing a new platform for the next generation of faster, more energy-efficient neural networks.
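Functionally, each photonic neuron performs the same weighted-sum-plus-nonlinearity operation as an electronic one, only with optical powers and attenuations instead of voltages and conductances. The toy sketch below is a loose functional analogy under that assumption, not a model of the device physics in ref. [96].

```python
import numpy as np

def photonic_neuron(optical_powers, attenuations, bias=0.0):
    """One neuron of a PDNN-style layer, sketched functionally: weights are
    applied as optical attenuation, a photodetector sums the powers into a
    photocurrent (the linear step), then a nonlinearity is applied."""
    photocurrent = np.dot(attenuations, optical_powers) + bias  # linear step
    return max(photocurrent, 0.0)  # stand-in for the optoelectronic activation

# Four input pixels coupled in as optical powers (illustrative units).
pixels = np.array([0.1, 0.8, 0.3, 0.5])
weights = np.array([0.9, 0.2, 0.7, 0.4])   # attenuation coefficients in [0, 1]
print(photonic_neuron(pixels, weights))
```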
For the development of AI image sensors on the software side, thanks to the high-speed parallel computing capabilities of GPUs, computer vision is experiencing rapid development and is widely implemented in many applications including image recognition, object detection, image generation, image translation, 3D modeling, etc. [97][98][99][100][101][102] With the advanced growth in machine translation and object detection, many interesting works have been inspired to develop conversion between image and text. [103][104][105] For example, as shown in Figure 3a, Karpathy et al. presented a deep visual-semantic alignment model for generating natural language descriptions of images. [106] The alignment model uses CNNs to realize object detection for generating image regions and bidirectional RNNs for segmenting sentences. A structured objective then aligns the two modalities through a multimodal embedding to generate descriptions of image regions. In addition, caption generation is a very important challenge in the machine learning field because it requires complicated computation to mimic the human brain in compressing huge amounts of salient visual information into descriptive language. Therefore, as shown in Figure 3b, Xu et al. proposed an attention-based model to describe the content of images. [107] This model uses a lower convolutional layer as the encoder to extract a set of feature vectors and a long short-term memory (LSTM) layer as the decoder to produce a caption by generating one word at every time step. Most recently, Wang et al. developed a Generative Image-to-text Transformer (GIT) (Figure 3c), [108] which consists of an image encoder based on a contrastive pretrained model and a text decoder based on a transformer module to predict the text description. The overall network structure is simple, since the pretraining task is just to map the input image to the entire associated text description, and it consists only of one image encoder and one text decoder. In addition, the proposed model achieves good results after scaling up the pretraining data and the model size. Moreover, video-to-text transformers are a popular topic because, although current image-to-text approaches parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time. Therefore, Wu et al. designed a Memory-Augmented Multiscale Vision Transformer (MeMViT) to overcome this challenge. [109] As shown in Figure 3d, MeMViT is used for long-term video recognition and is based on a memory-augmented multiscale vision transformer network. The core of MeMViT is to cut a long video into a sequence of short clips and process them sequentially, in which the memory obtained from earlier iterations is cached. The advantages of MeMViT are high accuracy, efficient scale-up, and easy integration with other transformer-based video models. Finally, it can handle videos 30 times longer than existing models with only a 4.5% increase in computation.
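The soft-attention step at the heart of such captioning decoders can be written in a few lines. The numpy sketch below shows one attention step under simplified assumptions (a single bilinear scoring matrix, no learned gating); the dimensions mirror a common 14 × 14 CNN feature grid but are otherwise illustrative.

```python
import numpy as np

def attention_step(features, hidden, W_att):
    """One soft-attention step of an attention-based captioner: score each
    spatial feature against the decoder state, softmax the scores, and
    return the weighted context vector fed to the LSTM decoder."""
    scores = features @ W_att @ hidden            # (L,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over image regions
    context = weights @ features                  # (D,) attended context vector
    return context, weights

rng = np.random.default_rng(2)
feats = rng.random((196, 512))    # 14x14 grid of CNN encoder features
h = rng.random(256)               # current LSTM decoder hidden state
W = rng.random((512, 256)) * 0.01
ctx, alpha = attention_step(feats, h, W)
print(ctx.shape, alpha.argmax())  # context vector and most-attended region
```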
In addition to image-to-text conversion, Hong et al. designed a zero-shot text-driven 3D avatar generation and animation framework named Avatar Contrastive Language-Image Pre-Training (AvatarCLIP). [110] As shown in Figure 3e, AvatarCLIP can create a customized 3D avatar following the user's desired shape and texture and make the avatar follow predefined motions specified by text. Specifically, the generated 3D human geometry is initialized from shapes driven by natural language descriptions through a variational autoencoder (VAE) network. The generated 3D human shape further facilitates geometry sculpting and texturing through a volume rendering model. Moreover, by leveraging the priors learned in the motion VAE, a Contrastive Language-Image Pre-Training (CLIP)-guided reference-based motion synthesis method is proposed for animating the generated 3D avatar.
Another popular application in computer vision is the estimation of 3D objects from 2D images/videos, which enables free-viewpoint rendering of a moving subject, rotating 360° around the performer for each frame of a video of human activity. Previous rendering methods are mostly based on multiview input, but a solution based on a monocular camera has been a longstanding research challenge. [111,112] As shown in Figure 3f, Weng et al. presented the Human Neural Radiance Fields (HumanNeRF) model, which uses a single video of a moving person as input. [113] HumanNeRF optimizes the 3D human model for a canonical, volumetric T-pose of the human after per-frame, off-the-shelf segmentation and automatic pose estimation. The proposed approach is data-driven, with the canonical volume and motion fields derived from the video itself and optimized for large body deformations, and it is trained end-to-end, including 3D pose refinement, without template models. Finally, their solution can pause at any frame in the video and, conditioned on the pose in that frame, render the resulting volumetric representation from any viewpoint.

Figure 3. AI algorithms for computer vision. a) A deep visual-semantic alignment model for generating natural language descriptions of images. Reproduced with permission. [106] Copyright 2015, IEEE. b) An attention-based model to describe the content of images. Reproduced under terms of the CC-BY license. [107] Copyright 2015, The Authors, published by arXiv. c) A Generative Image-to-text Transformer (GIT) for predicting text descriptions. Reproduced under terms of the CC-BY license. [108] Copyright 2022, The Authors, published by arXiv. d) A Memory-Augmented Multiscale Vision Transformer (MeMViT) for long-term video recognition. Reproduced with permission. [109] Copyright 2022, IEEE. e) Avatar Contrastive Language-Image Pre-Training (AvatarCLIP) for zero-shot text-driven 3D avatar generation and animation. Reproduced with permission. [110] Copyright 2022, ACM. f) A Human Neural Radiance Fields (HumanNeRF) model for 3D pose estimation. Reproduced with permission. [113] Copyright 2022, IEEE.

Acoustic Sensor and Voice Recognition
In addition to image sensors, the acoustic sensor is one of the most intuitive tools to build bilateral communication between humans and robots/machines. [114][115][116] The most common acoustic sensors are MEMS microphones operating within the frequency range of 20 to 20 000 Hz. Traditional MEMS microphone sensors are based on a capacitive sensing mechanism, i.e., the condenser microphone, which induces a voltage variation between a flexible membrane and a back plate under a bias voltage. [117,118] With the development of fabrication processes and speech-related algorithms, MEMS-based microphones have a large market and wide applications in consumer electronics and AI voice assistants. [119,120] For example, Lo et al. proposed a silicon-on-insulator (SOI) condenser MEMS microphone consisting of planar interdigitated sensing electrodes, a deformable diaphragm, and a back chamber, as shown in Figure 4a. [121] The no-back-plate structure with a 600 μm diameter diaphragm had 42 pairs of planar interdigitated electrodes (IDTs) in the same plane as the diaphragm. The sensitivity of the proposed microphone is −60.1 dB at 1 kHz; such high performance, with linear output and less damping noise, is achieved by the out-of-plane capacitance change and the no-back-plate design. Shubham et al. proposed a MEMS capacitive microphone with a suspended polysilicon diaphragm featuring a center post and flexible springs supported by eight peripheral protrusions extending from the backplate, as shown in Figure 4b. [122] With an applied bias, this microphone has a sensitivity of −38 dB (ref. 1 V/Pa at 1 kHz) and a signal-to-noise ratio (SNR) of 67 dBA measured in a 3.25 mm × 1.9 mm × 0.9 mm package including an analog application-specific integrated circuit (ASIC). Recently, the advance of human-computer interaction in wearable sensors has provided thin-film, flexible, lightweight, and robust solutions for loudspeakers and microphones. [123][124][125][126] As shown in Figure 4c, Han et al. reported a flexible piezoelectric acoustic sensor (f-PAS) with a highly sensitive multiresonant frequency band for speaker recognition. [127] A flexible piezoelectric membrane fabricated by an inorganic-based laser lift-off (ILLO) process was employed to realize the basilar membrane (BM)-inspired f-PAS.

Figure 4. Acoustic sensors and voice recognition. a) A silicon-on-insulator (SOI) condenser MEMS microphone. Reproduced with permission. [121] Copyright 2015, The Authors, published by arXiv. b) A schematic cross-sectional view of a conventional capacitive MEMS microphone package. Reproduced with permission. [122] Copyright 2021, MDPI (Basel, Switzerland). c) A flexible piezoelectric acoustic sensor (f-PAS) for speaker recognition. Reproduced with permission. [127] Copyright 2018, Elsevier. d) A framework, wav2vec 2.0, for self-supervised learning of speech representations. Reproduced under terms of the CC-BY license. [132] Copyright 2020, The Authors, published by arXiv. e) A Hidden-Unit BERT (HuBERT) architecture for self-supervised speech representation learning. Reproduced under terms of the CC-BY license. [134] Copyright 2021, The Authors, published by arXiv.
The f-PAS acquires abundant voice information from multichannel sound inputs. Incoming voice information from the TIDIGITS dataset was recorded by the f-PAS and converted to frequency components using a fast Fourier transform (FFT) and a short-time Fourier transform (STFT) to acquire the frequency characteristics of the human voice. Then, a Gaussian mixture model (GMM) is utilized to classify the different speakers from the multichannel outputs, exhibiting an outstanding speaker recognition rate of 97.5% with an error rate reduction of 75% compared to commercial MEMS microphones.
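The GMM-based speaker identification scheme follows a standard recipe: fit one mixture model per enrolled speaker on feature frames and score a test utterance against each. A minimal sketch with synthetic features is given below; a real pipeline would use the FFT/STFT features described above, and the speaker names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per enrolled speaker, trained on spectral feature frames; a test
# utterance is assigned to the speaker whose GMM gives the highest total
# log-likelihood. The data here is synthetic and purely illustrative.
rng = np.random.default_rng(3)
train = {  # speaker -> (n_frames, n_features) feature matrix
    "alice": rng.normal(0.0, 1.0, (500, 8)),
    "bob":   rng.normal(2.0, 1.0, (500, 8)),
}
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(X)
          for spk, X in train.items()}

test_utterance = rng.normal(2.0, 1.0, (100, 8))      # frames from "bob"
scores = {spk: gm.score_samples(test_utterance).sum()
          for spk, gm in models.items()}
print(max(scores, key=scores.get))                    # -> "bob"
```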
Benefiting from the advances in microphones and various acoustic sensors, AI algorithms for speech recognition have been applied widely in various applications. [128][129][130][131] For example, Figure 4d shows a DL framework named wav2vec 2.0, proposed by Baevski et al. for self-supervised learning of speech representations from raw audio data, [132] which encodes speech audio via a multilayer CNN. Wav2vec 2.0 aims to learn powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, which is conceptually simpler than and can outperform the best semi-supervised methods. The method masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations. Wav2vec 2.0 achieves a word error rate (WER) of 4.8/8.2 on the clean/other test sets of Librispeech when using only 10 min of labeled data, which shows the feasibility of speech recognition with limited amounts of labeled data. When using all 960 h of labeled data, wav2vec 2.0 achieves 1.8/3.3 WER.
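The core contrastive objective can be sketched compactly: the transformer output at a masked time step must identify the true quantized latent among sampled distractors. The numpy toy below implements an InfoNCE-style loss matching that description; the vector sizes and temperature are illustrative, not the paper's settings.

```python
import numpy as np

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style loss as used in wav2vec 2.0 pretraining: the context
    vector at a masked step must score the true quantized latent higher
    than distractors sampled from other time steps."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(context, positive)] +
                      [cos(context, d) for d in distractors]) / temperature
    # Log-softmax over (positive + distractors); loss is -log p(positive).
    log_probs = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    return -log_probs[0]

rng = np.random.default_rng(4)
pos = rng.normal(size=64)
ctx = pos + 0.1 * rng.normal(size=64)          # context close to its target
neg = rng.normal(size=(10, 64))                # 10 sampled distractors
print(contrastive_loss(ctx, pos, neg))         # small loss for a good match
```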
Self-supervised approaches for speech representation learning are limited by the presence of multiple sound units in each input fragment, the variable lengths of sound units, and the lack of explicit segmentation.
With the introduction of Bidirectional Encoder Representations from Transformers (BERT), pretrained representations reduced the need for many heavily engineered task-specific architectures; BERT achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. [133] Based on BERT and related algorithms, Hsu et al. proposed the Hidden-Unit BERT (HuBERT) for self-supervised speech representation learning (Figure 4e). [134] HuBERT utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. This approach forces the model to learn a combined acoustic and language model over the continuous inputs, benefiting from applying the prediction loss over the masked regions only and from the consistency of the unsupervised clustering step. As a result, the HuBERT model matches or improves upon the performance of wav2vec 2.0 on the Librispeech (960 h) and Libri-light (60 000 h) benchmarks with 10 min, 1 h, 10 h, 100 h, and 960 h fine-tuning subsets, and shows up to 19% and 13% relative WER reduction when pretrained on Libri-Light 60k h.
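HuBERT's offline clustering step is easy to sketch: k-means over acoustic feature frames yields discrete pseudo-labels, which become the prediction targets at masked positions. The snippet below illustrates this with synthetic MFCC-like frames; the cluster count and masking span are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# The offline clustering step, sketched: k-means over acoustic feature
# frames (e.g., MFCCs in a first iteration) produces discrete pseudo-labels
# that a BERT-like model is then trained to predict at masked positions.
rng = np.random.default_rng(5)
frames = rng.normal(size=(2000, 39))                # synthetic MFCC-like frames
labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(frames)

# Mask a span of frames; the masked-prediction targets are the cluster ids.
mask = np.zeros(len(frames), dtype=bool)
mask[100:120] = True
print(labels[mask])   # targets the model must recover from unmasked context
```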

Wearable Sensors
In addition to image sensors and acoustic sensors, wearable sensors have gradually become a popular research field. Wearable sensors are widely adopted to analyze motion and sensory information due to their versatility and unique advantages. [135][136][137][138] As one of the cutting-edge technologies of wearable sensors, e-skin has experienced rapid progress toward being more stretchable, integrated, and bionic owing to the development of materials and manufacturing techniques. [139,140] For human motion decoding, the glove HMI has unique advantages compared with e-skin. In addition, e-skin can also be used for eye blink detection and translation. Therefore, in this section, we introduce recent advances in wearable sensors, from glove HMIs to smart glasses sensors. In Figure 5a, Song et al. explored a flexible glove including piezoelectric sensors for collecting hand gesture signals and a soft pneumatic actuator for providing tactile feedback. The pneumatic soft actuator is activated by an electrostatic force and can be divided into a ring part, where the electrostatic attractive force acts, and a center part serving as the contact part. The pneumatic soft actuator can be attached to the glove to generate effective tactile feedback. When the user holds a virtual object, the actuator is maintained in the on state, and when the virtual object is released, the actuator is switched to the off state. When the air space is reduced by the electrostatic attraction, the central part expands, generating the haptic feedback. In particular, the actuator showed larger movement as the actuation period became faster and the applied voltage became larger. In addition, the designed monolithic silicone glove was able to detect the movement of fingers with the PVDF sensors and transmit data via Bluetooth. The voltage output by the deformation of a piezoelectric sensor provides finger motion information. In order to distinguish the moving state from the received information, a threshold value was specified, and the gain value was obtained through an initial calibration, which could be used in several ways by linking with various VR software.
In addition, wearable sensors can be employed on the human body to monitor heartbeat frequency, eye blinks, gestures, etc. As shown in Figure 5b, Adams et al. applied motion tracking sensors with a wearable glove orthosis to enable stroke patients to interact in a virtual world through functional upper extremity movements. [141] The glove orthosis consists of wrist and finger motion sensors around the hand and an electronics enclosure attached on the palmar side. Among them, the wrist and finger motion sensors are used to monitor the finger bending angles when grasp-release interactions happen in VR, while the electronics enclosure is used to process the sensor data and transmit information. Data from the wearable sensors are processed by the motion capture algorithm to realize real-time estimation of arm, wrist, and finger joint angles. In addition, algorithms for tracking the shoulder and elbow have been applied to realize 27 degrees of freedom in the hand articulation model. By tracking the angles and positions of joints, angular rates, and velocities for the different body locations, i.e., fingers, wrist, elbow, and shoulder, the kinematic pose of the patient can be reconstructed in real time and established in VR. With the aid of the wearable glove and the motion tracking sensors, stroke patients can realize typical interactions including reaching for and grasping objects in VR. Chen et al. introduced a sensor patch that can be attached on the arm, wrist, and fingers, as shown in Figure 5c. [38] Owing to its self-powered, flexible, triboelectric characteristics, the sensor (SFTS) is used for finger trajectory sensing and robotic control. There are two kinds of patch sensors: the 2D SFTS and the 1D SFTS. The 2D-SFTS patch consists of a stretchable silicone rubber substrate, a resin grid layer, and four electrodes fabricated with a starch-based hydrogel-PDMS elastomer (HPE). The 1D-SFTS patch includes a stretchable silicone rubber substrate and two electrodes. As a demonstration, the SFTS is used for 3D robotic movement control, in which the 2D-SFTS patch handles in-plane control and the 1D-SFTS patch handles out-of-plane control. The position of a fingertip engaged on the tactile sensor surface is determined by the output ratio from two pairs of electrodes, which can eliminate external interference. Combining these two patches, trajectory, displacement, and velocity detection can be achieved. To further achieve 3D signal detection and control, the 2D-SFTS and 1D-SFTS are combined. The SFTS patches are instrumented with a signal acquisition system, a computer, a drive system, and a robot to enable the system to control the velocity, trajectory, and 3D motion of the robotic manipulator. Similarly, as shown in Figure 5d, Shi et al. presented a flexible four-electrode triboelectric patch that can probe human-machine interactions such as tapping, sliding, etc. [40] The four electrodes are in a ring configuration. In addition, there are eight electrode points for detecting operations and nine additional points outside the electrode areas for more advanced detection through pattern recognition of the generated output voltages. The patch is formed from a polyethylene terephthalate (PET) substrate, patterned aluminum (Al) electrodes, and a polytetrafluoroethylene (PTFE) friction layer.
The electrode points on the patch can be divided into individual electrode points and common electrode points. When tapping and sliding on the patch, the fingers act as the positive layer while the PTFE surface acts as the negative layer. Output signals are generated only on the corresponding electrode for operations on individual electrode points, and on both adjacent electrodes for operations on common electrode points. The working mechanism based on the triboelectric voltage ratio endows the proposed flexible patch with various applications, including a writing pad interface for real-time detection of finger writing traces, an identification code interface for security, door access, and autonomous express delivery, and a control interface for gaming, entertainment, and robotics.

Figure 5. Wearable sensors. a) A flexible glove with piezoelectric sensors and a soft pneumatic actuator. Reproduced with permission. [198] Copyright 2020, American Chemical Society. b) A wearable glove orthosis and interactions in virtual space. Reproduced with permission. [141] Copyright 2019, SAGE Publications. c) Schematic of the sensor patch. Reproduced with permission. [38] Copyright 2018, American Chemical Society. d) Schematic of the four-electrode triboelectric patch for probing human-machine interactions. Reproduced with permission. [40] Copyright 2019, Elsevier. e) Schematic of the triboelectric nanogenerator (TENG)-based micromotion sensor. Reproduced with permission. [142] Copyright 2017, American Association for the Advancement of Science. f) Schematic of the wearable non-attached electrode-dielectric triboelectric sensor (NEDTS). Reproduced with permission. [143] Copyright 2020, Elsevier.
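The ratiometric readout shared by these triboelectric patches can be sketched in a few lines: taking the ratio of opposing electrode outputs cancels the common-mode amplitude, so the inferred position is insensitive to how hard the finger presses. The sketch below is a simplified illustration of this principle, not the calibration used in the cited works.

```python
import numpy as np

def ratiometric_position(v_left, v_right, v_top, v_bottom, length=1.0):
    """Estimate a fingertip's (x, y) position from the output ratios of two
    opposing electrode pairs. Ratios cancel the common-mode amplitude, so
    variations in touch force do not shift the estimated position."""
    x = length * v_right / (v_left + v_right)
    y = length * v_top / (v_top + v_bottom)
    return x, y

# A touch nearer the right electrode yields a larger signal on that side;
# doubling all four outputs (a harder press) leaves the position unchanged.
print(ratiometric_position(1.0, 3.0, 2.0, 2.0))   # -> (0.75, 0.5)
print(ratiometric_position(2.0, 6.0, 4.0, 4.0))   # -> (0.75, 0.5)
```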
In Figure 5e, Pu et al. proposed a triboelectric nanogenerator (TENG)-based micromotion sensor fixed on a pair of glasses. This mechnosensational TENG (msTENG) sensor allows real-time eye blink detection and translates eye blinks into command control. [142] The msTENG is based on the single-electrode mode and consists of five layers: one PET layer as the substrate, one fluorinated ethylene propylene (FEP) thin film with one indium tin oxide (ITO) electrode as the electrification layer, one natural latex film as the skin-contacting layer, and an acrylic thin annulus between the natural latex film and the FEP film as an air spacer to allow charge generation and transfer. To further enhance triboelectrification, vertically aligned polymer nanowires are formed on the surface of the FEP film. With the micromotion sensor, the proposed glasses show great potential for smart home applications, such as controlling a light switch with an eye blink, and for acting as an extra hand in daily life, such as answering the phone while driving or ringing a doorbell while both hands are occupied. In addition, as shown in Figure 5f, a wearable non-attached electrode-dielectric triboelectric sensor (NEDTS) was introduced by Anaya et al.; it is placed directly on one side of the eye and senses the orbicularis oculi muscle motion to monitor voluntary and involuntary eye blinks. [143] Benefiting from the topology of the sensor, the NEDTS can generate a voltage in a separate conductor by noncontact electrostatic induction, which enables near-field remote sensing. The designed TENG sensor consists of a poly(3,4-ethylenedioxythiophene) polystyrene sulfonate (PEDOT:PSS) film, an Ecoflex film, and a metal electrode plate placed on the lateral temple of the eyeglasses. When the eyes blink, the orbicularis oculi muscle around the eyes moves, which causes deformation and rubbing of the PEDOT:PSS film and the Ecoflex film. Charge transfer is generated between the two layers and a voltage is induced in the metal electrode on the eyeglasses. Finally, the NEDTS realizes eye monitoring and cursor control, which can help the disabled access the web and computers, and shows great application potential for the disabled in car control, drone control, and driver fatigue monitoring.

Advances in AI-Enhanced Wearable Sensors
In addition to the advanced development of wearable sensors, the integration of wearable sensors and AI technology leverages AI data analytics to handle more complex functions. Advanced AI wearable sensors can execute more complicated pattern recognition or regression tasks, such as hand motion recognition, object recognition, lip motion recognition, and so on. [144][145][146][147] In this section, we highlight recent works on AI wearable sensors. As shown in Figure 6a, Sundaram et al. proposed a low-cost and scalable tactile glove (STAG) that can identify objects, estimate the weight of objects, [148] and recognize hand poses. The STAG consists of a sensing sleeve with 548 sensors, a knitted glove, and readout electronics, and uses a ResNet-18-based CNN architecture to train the model, finally recognizing 26 kinds of objects. The AI-enhanced glove sensor can thus realize more complicated recognition tasks than before, i.e., object recognition. In addition, another smart glove sensor from Wen et al., shown in Figure 6b, [149] uses a low-cost and self-powered superhydrophobic textile and realizes various gesture recognition in real time with the help of machine learning. Benefiting from AI algorithms, the smart glove also achieves VR/AR controls. The authors trained a CNN model to realize gesture recognition with an accuracy of 99.167%. Such gestures can be utilized to control a virtual hand; for example, three different gestures represent a palm ball, a curved ball, and a knuckle ball, respectively. Moreover, Wen et al. also presented triboelectric smart gloves, as shown in Figure 6c, [146] which can achieve sign language recognition and realize bidirectional communication in VR. They used two methods to achieve recognition: a nonsegmentation method and a segmentation method. For word and in-dataset sentence signals, a 1D CNN is applied, and the recognition accuracy for 50 words and sentences is 91.3% and 95%, respectively. They used the segmentation method to realize the recognition of new/never-seen sentences, in which the sentences are divided into intact word signals, incomplete word signals, and background signals by sliding windows. Then a hierarchical CNN classifier is used to reconstruct and recognize sentences with an accuracy of 85.58%. Finally, the authors demonstrate bidirectional communication in a virtual social scene, in which sign language is translated into text and video by the proposed DL structure; the information is then transmitted to the nonsigner-controlled server, and the nonsigner can type to respond to the speech-impaired user.
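The segmentation method for never-seen sentences rests on a simple sliding-window decomposition, which the sketch below illustrates; the window and hop sizes are illustrative assumptions, and in the actual system each window would be passed to the hierarchical CNN classifier.

```python
import numpy as np

def sliding_windows(signal, win, hop):
    """Cut a continuous multichannel glove signal into overlapping windows,
    mirroring the segmentation approach for never-seen sentences: each
    window is then labeled (intact word, partial word, or background) and
    the word sequence is reconstructed from the labels."""
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

rng = np.random.default_rng(6)
stream = rng.normal(size=(1000, 15))          # 15-channel triboelectric stream
windows = sliding_windows(stream, win=200, hop=50)
print(windows.shape)                           # (17, 200, 15) windows to classify
```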
Moreover, Zhu et al. introduced a soft modular glove with multimodal sensing and augmented haptic feedback (Figure 6d). [150] With the help of machine learning, the proposed glove can realize real-time detection of dexterous hand motion, intelligent object recognition, and augmented feedback. The soft modular glove consists of two modules: a finger module and a palm module. With the TENG tactile and bending sensing of the finger module, the glove can detect hand motion and test surfaces. With the TENG tactile sensing of the palm module, the glove can achieve object detection with an accuracy of about 96%. Pneumatic actuators on the finger and palm modules provide tactile haptic feedback; for example, user A can directly sense the contact points from user B. To deliver temperature information and realize thermal haptic feedback, nichrome rings are applied with two connectors that turn into a heater when connected to a power supply. User B can perceive the thermal feedback through the Tactile+ units on the palm module when a hot water cup contacts user A's palm module. Their system can realize multimodal communication among three spaces: two real spaces and a virtual space.
In addition to gloves, lip recognition is currently a hot topic in intelligent wearable sensors because of the more comprehensive predictive capability of advanced AI algorithms compared to traditional methods based on mathematics and statistics. In Figure 6e, Lu et al. proposed a mask with triboelectric sensors to realize lip-language decoding. [151] The system consists of fixing masks, readout electronics, neural network classifiers, and a self-powered, low-cost, contact, and flexible TENG. There is a sponge in the middle layer of the triboelectric sensors with a rectangular hole in the middle, which is used for charge transfer. When the user makes lip motions, electrical signals are generated as the muscles on both sides of the mouth squeeze the triboelectric sensors. When the mouth is closed or opened to the maximum, there is no current. When the mouth is gradually opened or gradually closed, the current flows from PVC to nylon and from nylon to PVC, respectively. With the help of deep learning, lip language recognition is achieved with a test accuracy of 94.5%. A dilated recurrent neural network model based on a prototype learning approach was developed, since prototype learning converges faster and achieves higher accuracy than a softmax classification layer. The proposed lip-language interpretation system achieves identity recognition, opening the door for the host but not for a guest. Such a lip-language recognition system helps users communicate with others using lip motions, which are decoded into words or sentences and translated into voice and text. In Figure 6f, Wang et al. introduced a silent speech recognition system (SSRS) that can achieve all-weather, natural interactions. [125] The SSRS consists of four-channel tattoo-like electronics, a wireless data acquisition (DAQ) module, and a terminal display of silent speech recognition. When the user speaks silently, the four-channel tattoo-like electronics and the wireless DAQ module capture and process the real-time sEMG signals and then wirelessly transmit the corresponding sEMG signals to the cloud. A server-based machine-learning algorithm classifies the sEMG signals into the corresponding recognized speech information, and the audio can be played on the phone via a Bluetooth connection. In addition, the tattoo-like electrodes can stably capture the signal features during long-term use under room temperature changes, running, and dining. To achieve silent speech recognition, the linear discriminant analysis (LDA) algorithm is applied. Compared with the support vector machine (SVM) and the naive Bayesian model (NBM), LDA shows a high classification accuracy of 92.64% over 110 classes and a high prediction speed, which ensures natural communication for silent speech users. The SSRS shows good stability when the users are in different states and different environments and can be used naturally in noisy or quiet-required environments, performing well whether the user is working, exercising, or dining, and despite body shaking, mouth deformation, and muscle fatigue.

Figure 6. Advanced AI wearable sensors. a) Schematic of the scalable tactile glove (STAG). Reproduced with permission. [148] Copyright 2019, Springer Nature. b) Demonstration of the baseball game scenario with the smart gloves. Reproduced with permission. [149] Copyright 2020, Wiley-VCH. c) A smart glove for sign language recognition. Reproduced with permission. [146] Copyright 2021, Springer Nature. d) A soft modular glove for multimodal sensing and augmented haptic feedback. Reproduced with permission. [150] Copyright 2022, American Chemical Society. e) A lip-language decoding system. Reproduced with permission. [151] Copyright 2022, Springer Nature. f) A silent speech recognition system. Reproduced with permission. [125] Copyright 2021, Springer Nature.
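The LDA classification step used by the SSRS above is a textbook procedure; the sketch below shows the shape of such a pipeline on synthetic sEMG-like feature vectors (the word labels and feature dimensions are hypothetical, chosen only for illustration).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA for silent speech classification, sketched: fit on labeled sEMG
# feature vectors, then classify new frames. Real features would be
# extracted from the four tattoo-electrode channels.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(mu, 1.0, (200, 12)) for mu in (0.0, 2.0, 4.0)])
y = np.repeat(["hello", "yes", "no"], 200)     # hypothetical word classes

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(rng.normal(4.0, 1.0, (1, 12))))   # -> ['no']
```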

Advances in Self-Powered Sensor Systems
Due to their variety of functions, AI sensors are undergoing rapid advancement. Self-powered approaches, including triboelectricity and piezoelectricity for stimulus detection, and thermoelectricity and pyroelectricity for temperature sensing, have shown great advantages in realizing long-term sustainable IoT intelligent systems, because these sensors generate electrical signals without external electrical bias, i.e., zero power consumption at the sensor itself, and are made with low-cost fabrication technology. These advantages are indispensable for enabling massive sensor nodes to collect multimodal sensory information for the future ubiquitous IoT framework. Highly popular self-powered sensors used as AI sensors play increasingly important roles in sensor systems for smart home/building applications. [35][36][37][152][153][154][155] First of all, robotics equipped with AI sensors has great potential for unmanned shops or factories. In Figure 7a, Zhu et al. proposed an exoskeleton manipulator that can monitor all the movable joints of the human upper limbs and project them onto a robotic arm or into virtual space. [156] The exoskeleton manipulator uses a triboelectric bidirectional (TBD) sensor, which makes it universal, low cost, and low power. There are two kinds of TBD sensors: the rotational TBD (RTBD) sensor and the linear TBD sensor. Based on the triboelectric sliding sensing mode, the TBD sensor can detect the motions of the human upper limb. The single-arm exoskeleton includes the glove, the forearm, the upper arm, the shoulder module, and the back supporter. With proper programming, the actions of the real player can be recorded and mapped to the corresponding commands in VR. Additionally, as shown in Figure 7b, Jin et al. presented a smart soft-robotic gripper system based on TENGs, [157] which can capture the continuous motion and tactile information of the soft gripper. There are two kinds of TENG sensors in the system for sensing both self-actuation and external stimuli: the patterned-electrode tactile TENG (T-TENG) and the length TENG (L-TENG), both working in single-electrode mode. For the T-TENG, a PET substrate carries five patterned Ni-fabric electrodes, and the negative triboelectric layer is the silicone rubber layer. For the L-TENG, the positive triboelectric layer is a gear coated with nickel-fabric conductive textile, and the negative triboelectric layer is a PTFE film. The T-TENG is used for detecting sliding, contact position, and the gripping mode of the soft gripper, while the L-TENG measures the bending angle of the soft actuators. A tri-actuator soft gripper with integrated TENG sensors is used to realize the feedback function. With the help of machine learning, i.e., PCA for feature extraction and multiclass SVM classification, the proposed soft-robotic gripper system can perceive the gripping status and realize object identification, reaching a classification accuracy of 98.125% on 15-channel data for 16 objects. The smart soft-robotic gripper system was then established for a DT-based unmanned warehouse system, which can realize real-time object recognition by the trained SVM model and be projected into the virtual space.
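The gripper's recognition pipeline, PCA for feature extraction followed by a multiclass SVM, can be sketched directly with scikit-learn; the data below are synthetic stand-ins for the 15-channel triboelectric features, with 16 object classes as in the text, and the component count is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# PCA compresses the 15-channel triboelectric features and a multiclass
# SVM identifies the grasped object, mirroring the pipeline described above.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(c, 1.0, (40, 15)) for c in range(16)])
y = np.repeat(np.arange(16), 40)               # 16 object classes

clf = make_pipeline(PCA(n_components=8), SVC(kernel="rbf")).fit(X, y)
print(clf.predict(rng.normal(5, 1.0, (1, 15))))   # -> [5]
```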
In addition to the smart soft-robotic gripper system, floor mats, as commonly used furniture, are also commonly applied in smart buildings/homes with the help of AI sensors. [158][159][160][161][162][163] The current generation of smart floor mats focuses on solving the problem of privacy protection and can be applied for human activity recognition, gait recognition, etc. [164][165][166][167][168][169] In Figure 7c, a smart floor sensor system based on self-powered deep learning-enabled smart mats (DLES-mats) is proposed by Shi et al. to form a smart floor monitoring system. [170] The DLES-mats are based on the triboelectric mechanism and fabricated by screen printing. The mats have three layers: a PET friction layer, a silver (Ag) electrode layer, and a PVC substrate layer. Each DLES-mat has a distinct electrode pattern with a varying coverage rate: 20%, 40%, 60%, and 80%. The number of generated electrons associated with a foot step is proportional to the electrode coverage area; with the help of this design, the system can determine walking positions according to the different magnitudes of the triboelectric signals generated by the mats. A DLES-mat with a higher electrode coverage rate generates voltage pulses with a higher peak-to-peak magnitude. The DLES-mats can not only determine position but also distinguish different activities: slow walking, normal walking, fast walking, running, and jumping. In addition, individual recognition can be realized, since different people have different walking gait patterns, leading to individual output signals. The authors then used a CNN with a high accuracy of 96% for individual recognition to show the real-time position and identity of a person walking on the DLES-array in virtual space. In addition, Shi et al. presented another reliable and smart floor system with a robust 4 × 4 floor mat array in Figure 7d. [171] These smart mats are screen-printed with one mask, which is more convenient and cost-effective than the previous approach in Figure 7c. To obtain reliable output characteristics, the floor mats are designed with a universal and innovative electrode pattern: the Ag electrode is designed as four individual electrodes consisting of a reference electrode, two coding electrodes, and a sheet electrode. The two coding electrodes can realize a quaternary coding scheme (0, 1, 2, 3), and distinctive output characteristics are obtained by taking the ratios of the two coding electrodes with respect to the reference electrode. This smart floor system can realize real-time position sensing with time-domain analysis, identity recognition with a deep learning model, and energy harvesting. A 1D convolutional neural network (1D CNN) is applied to realize identity recognition with an accuracy of 85.67% for 20 users. Finally, they built a VR scene to mimic real-time usage scenarios in a real-life smart home environment.

Figure 7. Advanced self-powered sensor systems. a) An exoskeleton manipulator. Reproduced with permission. [156] Copyright 2021, Springer Nature. b) A smart soft-robotic gripper system. Reproduced with permission. [157] Copyright 2020, Springer Nature. c) Self-powered deep learning-enabled smart mats. Reproduced with permission. [170] Copyright 2020, Springer Nature. d) A smart floor system. Reproduced with permission. [171] Copyright 2021, American Chemical Society.
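The quaternary coding readout described above can be illustrated with a short decoder: each coding electrode's output, normalized by the reference electrode, is quantized to one of four levels, and the two digits form a base-4 mat index. The level values below are assumptions for illustration, not the calibrated ratios of ref. [171].

```python
import numpy as np

def decode_mat_id(v_ref, v_code1, v_code2, levels=(0.25, 0.5, 0.75, 1.0)):
    """Decode a floor mat's identity from a quaternary coding scheme: each
    coding electrode's output, normalized by the reference electrode, is
    quantized to one of four levels (digits 0-3); the two digits form a
    base-4 mat index. Level values are illustrative."""
    def digit(v):
        ratio = v / v_ref
        return int(np.argmin([abs(ratio - lvl) for lvl in levels]))
    return 4 * digit(v_code1) + digit(v_code2)

# A step producing ratios ~0.5 and ~1.0 decodes as digits (1, 3) -> mat 7.
print(decode_mat_id(v_ref=2.0, v_code1=1.05, v_code2=1.9))   # -> 7
```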

Neuromorphic Computing
With the aid of AI and sensor technology, datasets have exploded as never before. The currently used computing system is mainly based on the classical von Neumann architecture, composed of separate central processing units (CPUs) and memory units, where computing relies on centralized and sequential operations determined by a clock. It faces limitations when dealing with vast numbers of datasets. Neuromorphic computing systems, proposed in the late 1980s in contrast to classical computing systems, have structures and functions inspired by brains and are composed of neurons and synapses. [172] Both processing and memory are governed by the neurons and synapses in neuromorphic computing systems. The biological nervous system of humans can be divided into the central nervous system (CNS) and the peripheral nervous system (PNS). [173][174][175][176][177] The CNS performs the computing, learning, and memorizing activities. The PNS perceives and responds to stimuli such as pressure, temperature, light, and sound. The goal of neuromorphic computing is to extract what is known of the brain's structure and operation for use in a practical computing system, i.e., to mimic partial functions of the brain. In terms of hardware, oxide-based memristors, spintronic memories, threshold switches, and transistors, among others, are designed and implemented as artificial neurons. On the software side, analog, digital, and mixed-mode analog/digital VLSI focus on realizing perception, motor control, and multisensory integration similar to the neural systems of the human brain. As shown in Figure 8a, a biological afferent nerve is stimulated by pressures applied onto mechanoreceptors, which change the receptor potential of each mechanoreceptor. [178,179] The receptor potentials combine and initiate action potentials at the end of a myelinated segment of a neuron. Action potentials from multiple nerve fibers combine for bioinformation processing through synapses formed by the nerve fibers with interneurons in the spinal cord. The bioinspired artificial afferent nerve (Figure 8b) emulates the functions of biological slowly adapting type I (SA-I) afferent (sensory) nerves by collecting data from multiple tactile receptors. An artificial afferent nerve consists of three core components: pressure sensors, an organic ring oscillator, and a synaptic transistor. The pressure sensors convert external tactile stimuli into voltage pulses and transmit them to an artificial nerve fiber (an organic ring oscillator). Then the voltage signals from multiple artificial nerve fibers are integrated and converted into postsynaptic currents by a synaptic transistor. A complete monosynaptic reflex arc is achieved by using synaptic transistors as biological efferent neural interfaces.
Therefore, many neuronal systems analyze multiple sensory cues efficiently; mimicking the human sensory neuron allows an accurate depiction of the environment to be established. Figure 8c shows a bimodal artificial sensory neuron implementing sensory fusion processes. [180] Wan et al. developed a bimodal artificial sensory neuron (BASE) based on ionic/electronic hybrid neuromorphic electronics to implement visual-haptic fusion. A BASE unit is made of a resistive pressure sensor, a perovskite-based photodetector, a hydrogel-based ionic cable, and a synaptic transistor. The photodetector and pressure sensor are used as receptors to convert external haptic and visual stimuli into electrical signals. The electrical signals are subsequently transmitted through the ionic cable to the synaptic transistor for integration of the bimodal potential changes, and are then converted into a transient channel current, analogous to the biological excitatory postsynaptic currents (EPSC) in the synaptic transistor. The changes in EPSC intensity can determine the extent of synchronization of the bimodal stimuli, thereby controlling biohybrid neuromuscular junctions or manipulators and mimicking the process of "perception for action." In addition, a matrix of BASEs can be used as the feature extraction layer of a neural network for the recognition of multitransparency alphabetic patterns. Overall, the proposed BASE has profound implications for neurorobotics, cyborg systems, and autonomous AI. Moreover, as shown in Figure 8d, John et al. presented comprehensive synaptic behaviors in 2D MoS2 three-terminal devices with a synergistic multigate architecture exhibiting three modes of operation, i.e., electronic, ionotronic, and photoactive. [181] The famous classical conditioning of Pavlov's dog experiments was demonstrated to emulate associative learning in the human brain. The proposed synaptic devices implement classical conditioning in a thin-film transistor by coupling optical and electrical pulses. In contrast to Pavlovian memristors with temporal dependence of activation signals, the proposed neural network architectures use light as a global gate, in which light pulses emulate the food/unconditioned stimulus to activate the salivation/unconditioned response from the postsynaptic terminal, while voltage pulses applied at the back gate emulate the bell/conditioned stimulus to activate the conditioned response.
In addition to the development of artificial synapses, hardware elements capable of sensing and computing multiple physical signals are also highly desired. [60,182,183] Yuan et al. proposed a spike-based neuromorphic perception system consisting of calibratable artificial sensory neurons (CASN) based on epitaxial VO2, where the high crystalline quality of VO2 leads to significantly improved cycle-to-cycle uniformity (Figure 8e). [184] The CASN uses a calibration resistor to optimize device-to-device consistency and to adapt the VO2 neuron to the varied resistance levels of different sensors. After further integrating a scaling resistor, the cross-sensory neuromorphic perception component can encode illuminance, temperature, pressure, and curvature signals into spikes. The authors realized a CASN-based neuromorphic perception system for highly efficient multisensory neurorobotics by arranging perception neurons into a three-layer spiking neural network (SNN), achieving an accuracy of 90.33% on MNIST-based pressure-image classification. Lastly, Yeon et al. proposed a memristor-based artificial synapse for emerging neuromorphic computing applications, as shown in Figure 8f. [185] Exploiting the high mobility of metal ions in the Si switching medium, they designed an electrochemical metallization (ECM) memory based on Si with an alloyed-metal strategy that operates across a wide range of conductance levels. Compared with current memristors, the conduction channels formed by Ag alloyed with Cu in a Si medium enhance memristive performance, exhibiting uniform gradual switching, reliable retention at multilevel conductance states, and enhanced symmetry and linearity of analogue conductance updates. Benefiting from these characteristics, a 32 × 32 memristor crossbar array with 100% yield was further developed for reliable operation and was programmed with convolutional kernels to perform inference tasks.
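The inference step of such a crossbar can be pictured as a single analog matrix-vector multiplication. The sketch below (Python; conductance values are random stand-ins, not data from ref. [185]) shows the read operation of a 32 × 32 array via Ohm's and Kirchhoff's laws.

    import numpy as np

    rng = np.random.default_rng(0)
    # Programmed conductances (siemens) standing in for convolutional kernels
    G = rng.uniform(1e-6, 1e-4, size=(32, 32))
    v_in = rng.uniform(0.0, 0.2, size=32)   # read voltages applied to rows

    # Each column current is a weighted sum of row voltages: I = G^T @ V,
    # i.e., one multiply-accumulate result per column in a single step.
    i_out = G.T @ v_in
    print(i_out.shape)                      # (32,)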

Multimodality
It is more straightforward to use an image sensor to describe visual information that may not be obvious from a microphone sensor. For videos, however, both image sensors and microphone sensors are important. Thus, when dealing with multimodal data, it is important to use a model that can jointly represent the information so that it captures the correlation structure between different modalities. Sensors with a single function are no longer enough to meet the diverse functional requirements of a smart sensing system, which has inspired the emerging field of multimodality. Human biological sensory systems can simultaneously analyze multiple cues and make corresponding reactions. The development of multimodality for mimicking the human brain mainly includes the hardware side of various sensor receptors and the software side of data-fusion algorithms. In Figure 9a, Wang et al. presented a method fusing the somatosensory data of a strain sensor and the visual data of a camera sensor to realize human gesture recognition (HGR). [186] Multimodal fusion is applied to improve performance, since the accuracy of detection or recognition is often degraded by environmental factors such as noise, lighting, and blur. The strain sensor consists of three parts: single-walled carbon nanotubes (SWCNTs) as the sensing component, a stretchable polydimethylsiloxane (PDMS) layer, and an adhesive poly(acrylic acid) (PAA) hydrogel layer. This sensor has 100% stretchability to monitor somatosensory signals from the human hand, with good durability and reproducibility. In addition, the strain sensor shows a stable and regular resistance response with an almost constant base resistance.

Figure 8. Advanced neuromorphic computing systems. a) A schematic of a biological afferent nerve, which is stimulated by pressure. Reproduced with permission. [179] Copyright 2018, AAAS. b) A schematic of an artificial afferent nerve made of pressure sensors, an organic ring oscillator, and a synaptic transistor. Reproduced with permission. [179] Copyright 2018, AAAS. c) A hybrid neuromorphic electronics to implement the visual-haptic fusion. Reproduced with permission. [180] Copyright 2020, Springer Nature. d) A multigated architecture of analogous artificial MoS2 synapses. Reproduced with permission. [181] Copyright 2018, Wiley-VCH. e) A spike-based neuromorphic perception system consisting of calibratable artificial sensory neurons (CASN) based on epitaxial VO2. Reproduced with permission. [184] Copyright 2022, Springer Nature. f) A silicon memristor as an artificial synapse for emerging neuromorphic computing applications. Reproduced with permission. [185] Copyright 2020, Springer Nature.

Figure 9. Multimodality systems. a) A multimodal human gesture recognition system. Reproduced with permission. [186] Copyright 2020, Springer Nature. b) A multimodal smart soft robotic manipulator. Reproduced with permission. [187] Copyright 2021, Wiley-VCH. c) A multimodal noncontact interaction interface. Reproduced with permission. [188] Copyright 2022, Wiley-VCH. d) A wearable multimodal sensing system. Reproduced with permission. [189] Copyright 2023, Wiley-VCH. e) A multimodal sensing system. Reproduced with permission. [190] Copyright 2022, Wiley-VCH. f) A wearable multimodal sensing system. Reproduced under the terms of the CC BY license. [191] Copyright 2022, The Authors, published by MDPI.
The stretchable strain sensor reliably collects somatosensory data without affecting the visual data, which ensures the performance of the bioinspired somatosensory-visual (BSV) fusion. The 5D strain vector is concatenated with transferable semantic features learned from the visual hand-gesture data by an AlexNet CNN, forming a 53D vector. A five-layer sparse neural network performs the final learning; in this way, the generalization ability of the BSV architecture is enhanced. They achieved 100% accuracy in hand gesture recognition, and the HGR system performs well even in dark environments, with an accuracy of 96.7%. This work can be applied in human-machine interaction, e.g., using gestures to control robots.
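The fusion step itself reduces to a feature concatenation followed by a small classifier. A minimal PyTorch sketch is shown below; the visual feature width (48D, so that 5 + 48 = 53) and the layer sizes are assumptions for illustration, not the exact architecture of ref. [186].

    import torch
    import torch.nn as nn

    class BSVFusionHead(nn.Module):
        # Concatenate the 5D strain vector with visual semantic features
        # and classify with a small MLP standing in for the five-layer
        # sparse network; widths and class count are hypothetical.
        def __init__(self, strain_dim=5, visual_dim=48, n_classes=10):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(strain_dim + visual_dim, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, n_classes),
            )

        def forward(self, strain_feat, visual_feat):
            fused = torch.cat([strain_feat, visual_feat], dim=-1)  # 53D
            return self.mlp(fused)

    logits = BSVFusionHead()(torch.randn(1, 5), torch.randn(1, 48))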
One kind of multimodality method is to combine different kinds of sensors into one system to realize various functions. In Figure 9b, a smart soft robotic manipulator [187] integrating TENG sensors for strain and tactile sensing and a PVDF sensor for temperature sensing is introduced by Sun et al. By fusing the data from the different sensors, they realize automatic recognition of the grasped object. The TENG sensor has six components: a gear with a layer of Ni-fabric conductive textile, a strip, a disc spring, a sponge layer, a Cu layer, and a PTFE thin film. The T-TENG has three layers: a silicone layer, a Ni-fabric layer, and a TPU substrate with four distributed electrodes. The PVDF temperature sensor is made of three components: a poled PVDF film, an Ag electrode, and a PET thin film. After integrating all the sensors into a tri-finger pneumatic gripper, a smart robotic manipulator system is realized that collects grasping data and detects grasped objects with an accuracy of 96.111% using a three-layer 1D CNN. In addition, they built a DT-based virtual shop system that can show the grasped items and their temperatures to customers in VR.
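As a rough picture of how the grasping signals are classified, the following PyTorch sketch builds a three-layer 1D CNN over multichannel sensor time series; the channel counts, kernel sizes, sequence length, and class count are hypothetical, not taken from ref. [187].

    import torch
    import torch.nn as nn

    # Three convolutional layers over fused TENG/PVDF time series,
    # followed by global pooling and a linear classifier.
    model = nn.Sequential(
        nn.Conv1d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
        nn.Conv1d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
        nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(64, 12),                  # e.g., 12 grasped-object classes
    )
    logits = model(torch.randn(1, 8, 200))  # 8 sensor channels, 200 samples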
In Figure 9c, Le et al. fused the information of a microelectromechanical system (MEMS) humidity sensor and a triboelectric sensor to build a multimodal noncontact interaction interface. [188] The humidity sensor consists of a two-port aluminum nitride (AlN) bulk wave resonator and a graphene oxide (GO) film layer and is used to detect an approaching finger. The facile triboelectric sensor consists of two annular aluminum electrodes and can recognize multidirectional finger movements. Combined, the output signals of the two sensors provide enough information to achieve noncontact human-machine interfaces (HMIs). Finger-motion signals detected by the triboelectric sensor and finger-humidity signals detected by the humidity sensor are fused to control the motion of a car, such as its speed and direction. In addition, they built a noncontact 3D password input interface by combining the signals of the two sensors: signals of the triboelectric sensor are used to select the panel site, while signals of the humidity sensor are used to achieve site-height control and panel switching.
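The decision-level fusion behind this 3D password input can be summarized by a small dispatch rule; the event names and thresholds in the Python sketch below are hypothetical, illustrating only how the two channels divide the interaction.

    # Triboelectric channel selects the panel site (XY); humidity channel
    # controls height (Z) or switches panels. Thresholds are hypothetical.
    def decode(tribo_direction, humidity_level, high=0.7, low=0.3):
        if tribo_direction in ("up", "down", "left", "right"):
            return ("select_site", tribo_direction)
        if humidity_level > high:
            return ("raise_height", None)   # finger hovering close
        if humidity_level < low:
            return ("switch_panel", None)   # finger withdrawn
        return ("hold", None)

    print(decode("left", 0.5))   # ('select_site', 'left')
    print(decode(None, 0.9))     # ('raise_height', None)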
In Figure 9d, Yang et al. proposed a multimodality-enhanced all-TENG-based information mat (InfoMat), [189] which can be applied in a digital-twin smart home and can mirror two users simultaneously into VR space. The InfoMat includes an in-home mat array of 24 pixels (4 sets × 6 pixels) and an entry mat. Each in-home array mat pixel consists of three components: poly(ethylene terephthalate) (PET) as the triboelectric material, silver (Ag) paste as the electrode material, and polyvinyl chloride (PVC) as the substrate. Each pixel has a different interdigital electrode (IDE) ratio, generating a distinct ratio of triboelectric output that distinguishes the six pixels in a set. The in-home array mat realizes position sensing and walking-trajectory monitoring. The entry mat realizes weight sensing with higher sensitivity and a larger continuous linear sensing range by using a pyramid-and-sphere mixed TENG-based sensor. In addition, they realized identity recognition using a convolutional neural network (CNN). By fusing the signals of the two kinds of sensor (the bottom hierarchical mat and the upper single-electrode mat) at the data-feature level, the accuracy improved from 93% and 94%, respectively, to 99%. Activities on the InfoMat can be projected in real time into virtual space, creating a digital-twin smart home.
In Figure 9e, Sun et al. presented a triboelectric nanogenerator/piezoresistive sensor made from chitosan for use in a multimodal sensing system. [190] The chitosan-based composites can modulate ionic and interfacial polarization, and corona charge injection is introduced to enhance the surface charge of the friction layer. Chitosan-based devices can be applied in a multimodal system on the human body. The TENG and the humidity sensor use a free-standing film and a casting layer made by a casting method; such sensors can detect human stepping, elbow bending, and nose/mouth breathing. The piezoresistive sensor and a special four-direction sensor are made of sponge soaked in a chitosan conductive solution and then dried; they can detect knee bending and neck movement in four directions. A highly conductive paper-based electrode produced by screen printing serves as the capacitive element of a touchpad unit, with which real-time patterns generated by finger touch can be projected on a computer. With these varied sensors, the multimodal system can monitor the human body more comprehensively.
In Figure 9f, Sanchez-Perez et al. proposed a wearable multimodal sensing system to track changes in cardiopulmonary status. [191] The system consists of an audio board and a main board. The audio board collects sample data from four audio channels simultaneously. The main board records data from two inertial measurement units, two temperature sensors, and two pairs of electrical bioimpedance electrode wires. The system supports two modes: a continuous mode and a spectroscopy mode. In continuous mode, the system concurrently captures the multifrequency impedance pneumography (IP) signal at four frequencies (5, 50, 100, and 150 kHz), while in spectroscopy mode, the system measures bioimpedance spectroscopy (BIS) across a logarithmically distributed range of 32 excitation frequencies from 5 to 150 kHz. With this multimodal sensing system, lung sounds from the chest and IP-derived respiratory waveforms can be captured, and BIS-based fluid, kinematic, and temperature data can be measured; thus, cardiopulmonary health status can be assessed from diverse data. The multimodal sensing system shows great potential in the cardiopulmonary healthcare area.
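The logarithmically distributed excitation sweep of the spectroscopy mode is straightforward to reproduce; the Python sketch below generates 32 log-spaced frequencies between 5 and 150 kHz (the exact spacing used in the device firmware may differ).

    import numpy as np

    # 32 logarithmically spaced excitation frequencies from 5 to 150 kHz
    freqs_hz = np.logspace(np.log10(5e3), np.log10(150e3), num=32)
    print(freqs_hz[0], freqs_hz[-1])   # 5000.0 ... 150000.0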
In addition to the fusion of multiple signal sources at sensor capture, multimodality also has many dedicated algorithms for obtaining more accurate digital information. In this regard, vision-and-language (V+L) tasks have emerged to bridge the semantic gap between visual and textual cues in images and text by using joint multimodal embeddings. As shown in Figure 10a, Chen et al. introduced UNiversal Image-TExt Representation (UNITER), a large-scale pretrained model for joint multimodal embedding that can be used as a universal image-text representation for all V+L tasks. [192] The UNITER model, built on the Transformer and leveraging its self-attention mechanism, encodes image regions (visual features and bounding-box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder. UNITER is pretrained with conditional masking on four tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA, based on Optimal Transport (OT)). With an optimal combination of pretraining tasks, UNITER achieves good results on many V+L tasks, including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and Natural Language for Visual Reasoning. In Figure 10b, Li et al. proposed a unified VidL framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pretraining and downstream tasks. [193] MLM overcomes the limits of conventional video-language (VidL) models, which require task-specific designs in model architecture and training objectives for each task. This unification simplifies the model architecture, replacing the parameter-heavy task-specific decoders otherwise needed on top of the multimodal encoder. Experimentally, the unified framework achieves competitive performance on video question answering, text-to-video retrieval, and video captioning. LAVENDER offers the advantages of supporting all downstream tasks with a single set of parameter values, few-shot generalization on various downstream tasks, and zero-shot evaluation on video question answering tasks.
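The core of the MLM interface is simply random masking with reconstruction targets. The Python sketch below shows this masking step in its plainest form; real pipelines (e.g., BERT-style masking, as used in UNITER and LAVENDER) additionally replace some selected tokens with random tokens or keep them unchanged.

    import random

    def mask_tokens(tokens, p=0.15, mask="[MASK]", seed=0):
        # Replace each token with [MASK] with probability p; the model is
        # trained to predict the original token at the masked positions.
        rng = random.Random(seed)
        masked, labels = [], []
        for tok in tokens:
            if rng.random() < p:
                masked.append(mask)
                labels.append(tok)      # reconstruction target
            else:
                masked.append(tok)
                labels.append(None)     # ignored by the loss
        return masked, labels

    masked, labels = mask_tokens("a man with his dog on a couch".split())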
Image fusion enables the creation of novel predictive imaging modalities by combining the strengths of different sensor types, providing insights that cannot be obtained from a single sensor source. Biological research often integrates results from different measurement principles by modeling cross-modal relationships to obtain more comprehensive predictions. In Figure 10c, Bhargava et al. developed VascuViz, a multimodality and multiscale imaging and visualization pipeline for vascular systems biology, addressing the fact that blood-vessel data are often available only from a single modality and cross-modality data are difficult to integrate. [194] VascuViz enables straightforward multimodality and multiscale 3D imaging of the vasculature in intact, unsectioned tissues using standard sample-preparation protocols and commercially available reagents. It combines a water-soluble computed tomography (CT) contrast agent with a fluorescently labeled magnetic resonance imaging (MRI) contrast agent to obtain a compound that makes the macro- and microvasculature simultaneously visible in high-resolution imaging with MRI, CT, and optical techniques. In addition, Van de Plas et al. described a predictive imaging modality created by fusing imaging mass spectrometry (IMS) and microscopy, as shown in Figure 10d. [195] IMS-generated molecular maps, rich in chemical information but of coarse spatial resolution, are combined with optical microscopy maps, which have high spatial resolution but relatively low chemical specificity. Such a combination enables a more precise prediction of a molecular distribution with the advantages of both technologies, i.e., the high spatial resolution of microscopy and the high chemical specificity of IMS. Multivariate regression is used to model variables in one technology using variables from the other; the authors use partial least-squares (PLS) regression to predict the abundance of an experimentally measured ion from microscopy variables. The results show the potential of image fusion in a variety of predictive scenarios, with several advantages: ion distributions predicted at ten times or greater spatial resolution than measured ion images by using microscopy measurements, more precise prediction of ion distributions in tissue areas, and enrichment of biological signals with attenuation of instrumental artifacts compared with using microscopy or IMS alone. Similarly, as shown in Figure 10e, Parkins et al. presented a multimodality imaging model using MRI and bioluminescence imaging (BLI) to provide a more holistic view of metastatic cancer-cell fate in mice. [196] BLI can measure tumor burden in the brain, but BLI alone is limited in cancer metastasis models because it is difficult to collect information on the number, size, or distribution of tumors within the region of interest. In contrast, MRI can detect individual iron-loaded cells and provide information on tumor 3D location and size but cannot differentiate between dead and viable cells. Therefore, the authors combined MRI and BLI to overcome the limitations of each modality used independently. The experimental results successfully demonstrate direct longitudinal measures of whole-brain single-cell arrest, tumor burden, and cancer-cell viability in the brain.
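The PLS-based fusion of IMS and microscopy described above can be sketched as an ordinary regression problem; the data below are synthetic stand-ins (pixel-wise microscopy features predicting an ion intensity), not the measurements of ref. [195].

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.random((500, 8))   # microscopy features per pixel (synthetic)
    y = X @ rng.random(8) + 0.05 * rng.standard_normal(500)  # ion intensity

    # Fit PLS on co-registered pixels, then predict the ion distribution
    # at microscopy resolution from microscopy features alone.
    pls = PLSRegression(n_components=3).fit(X, y)
    y_hat = pls.predict(X)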

Digital Twin Application
With the aid of AI sensors, machines can automatically make more reliable decisions regarding operating status and fault detection. Recently, the concept of the metaverse has drawn wide interest in the virtual world, and the design and application of various sensors provide the means to realize it. DTs build a bridge between the real world and the virtual world and can be widely used in the smart home and the smart farm. A DT realizes a closed-loop process by building a model in virtual space through AI and feeding the digital model back to physical space. This process can continuously collect and accumulate life-cycle data and knowledge of physical products through sensors, which will bring a more convenient lifestyle in the future. For example, as shown in Figure 11a, Zhang et al. established a DT system that mimics human activities in a smart home and a smart classroom in virtual space by using intelligent socks. [197] The sock is based on a T-TENG pressure sensor consisting of six layers: a silicone rubber film in the middle for pressure sensing, a nitrile thin film on top of it, two conductive textile films for charge collection, and two nonconductive textile films on the outer sides of the sensor. The T-TENG works in the contact-separation mode and harvests energy from human body motions to transmit data from the wireless sensor.

Figure 10. Image-based multimodality AI algorithms. a) A large-scale pretrained model, UNiversal Image-TExt Representation (UNITER), for joint multimodal embedding as a universal image-text representation. Reproduced with permission. [192] Copyright 2020, Springer Nature Switzerland AG. b) A unified VidL framework using Masked Language Modeling as the common interface for all pretraining and downstream tasks. Reproduced under the terms of the CC-BY license. [193] Copyright 2022, The Authors, published by arXiv. c) An easy-to-use method, VascuViz, enabling multimodality and multiscale 3D imaging of the vasculature in intact tissues. Reproduced with permission. [194] Copyright 2022, Springer Nature. d) A predictive imaging modality created by "fusing" two distinct technologies: imaging mass spectrometry (IMS) and microscopy. Reproduced with permission. [195] Copyright 2015, Springer Nature. e) A multimodality imaging model using both magnetic resonance imaging (MRI) and bioluminescence imaging (BLI) to overcome the limitations of each modality used independently. Reproduced with permission. [196] Copyright 2016, Springer Nature.

Figure 11. DT applications. a) DT application of the intelligent socks. Reproduced with permission. [197] Copyright 2020, Springer Nature. b) Schematic of the integrated wearable plant sensors. Reproduced with permission. [198] Copyright 2020, American Chemical Society. c) Concept map of a DT smart farm. Reproduced under the terms of the CC-BY license. [199] Copyright 2022, The Authors, published by arXiv. d) DTs in a smart farm. Reproduced with permission. [200] Copyright 2022, Elsevier. e) Architecture of a DT smart farm. Reproduced with permission. [201] Copyright 2022, ICIC International.
Then they used a neural network containing four convolutional layers, four max-pooling layers, and one fully connected layer for gait identification. With the help of deep learning, the socks can recognize the gait patterns of 13 participants with an accuracy of 93.54%, and real-time human-activity recognition is realized with an accuracy of 96.67%. A DT-based smart classroom is realized by mapping real-time student identification, activities, and body-temperature monitoring of students in the classroom into virtual space via the sock sensor and a temperature sensor. In this way, the DT system can help to better monitor the students in the virtual space. The new trend of DTs with advanced AI sensors is being applied to smart farming, owing to the rapid development of plant sensors from rigid silicon to flexible forms. As shown in Figure 11b, Lu et al. proposed wearable, flexible plant sensors for monitoring plant growth status. [198] The system includes two humidity sensors, one for real-time room-humidity monitoring and another for real-time leaf-humidity monitoring, together with an optical sensor for real-time ambient-light detection and a temperature sensor for real-time ambient-temperature monitoring. To ensure that the system is stable and durable enough for humidity monitoring and sensitive enough for light response, functional ZnIn2S4 (ZIS) nanosheets are used as the major sensing media. The sensors consist of a thin polyimide (PI) film as the substrate and laser-induced graphene as the interconnected electrodes. Since the sensors are flexible and lightweight, they can be directly attached to the plant; when the parameters of the plant change, the resistance of the sensors changes. Based on this wearable plant-sensor system, plant growth conditions can be ensured by detecting external abiotic stresses such as temperature, light, and humidity, and plant health status can be monitored over relatively long periods. Such a multimodal wearable sensor system has a diversified range of applications in plant-growth-status detection and environmental monitoring on smart farms.
In Figure 11c, Zhao et al. proposed a DT smart farm to address resource waste, environmental pollution, and food safety. [199] In the DT smart farm, AI is applied to predict plant growth, virtual reality is applied to build a 3D farm, and blockchain is applied to manage supply chains. To predict plant growth, different kinds of sensors are used to monitor soil parameters such as soil moisture and soil pH, as well as environmental data such as wind direction and light. With these real-time monitoring data, plant growth can be predicted accurately and early warnings of natural disasters can be issued, so that effective protection can be provided for plants as soon as possible to ensure their healthy growth. To help farmers observe plant growth status and perform timely, automated operations, they built a 3D farm to simulate the real-time status of plant growth and the real-time living environment. Based on the real-time monitoring data of the plants and the environment, insects and weeds can be precisely controlled remotely, irrigation and fertilization can be performed, and protection for the plants can be provided in time. To better track demand for the farm's products, they use blockchain to manage the production chain.
For plant monitoring on the farm, the DT smart farm plays an important role: it mirrors the real farm in virtual space as a 3D farm, so auto-monitoring can be realized without physical proximity. In Figure 11d, Ariesen-Verschuur et al. propose that the DT consists of four layers: the device layer, the network layer, the integration layer, and the application layer. [200] In the device layer, there are various sensors for monitoring, such as soil sensors that detect the humidity and pH of the soil; actuators are also needed to enable remote operations such as automatic irrigation. The network layer is used for data transmission, for example to the cloud. In the integration layer, data from the sensors and the applications are processed and integrated to create virtual representations and build the virtual DT farm, which is continuously updated with real-time information from the physical objects in the real farm. In the application layer, users access the DT smart-farm system; this layer also provides data analysis, machine learning, etc. The DT smart farm thus provides a platform for users to monitor and control the farm anywhere and anytime, and it helps to analyze data from different sensors and provide rich information that cannot be observed by human senses. With such a DT smart farm, people can realize all-round automated monitoring of the farm in real space. By analyzing historical data, future market demand can also be predicted, helping to maximize income; a minimal sketch of the four-layer data flow follows.
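In the Python sketch below, in-process function calls stand in for real field buses, cloud services, and dashboards; all names and thresholds are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class SoilReading:              # device layer: one raw sensor sample
        moisture: float             # volumetric water content, %
        ph: float

    def network_transmit(r):        # network layer: serialize for the cloud
        return {"moisture": r.moisture, "ph": r.ph}

    def integrate(payloads):        # integration layer: build virtual state
        n = len(payloads)
        return {k: sum(p[k] for p in payloads) / n for k in ("moisture", "ph")}

    def application(state):         # application layer: rule drives actuator
        return "irrigate" if state["moisture"] < 25.0 else "idle"

    state = integrate([network_transmit(SoilReading(22.0, 6.5)),
                       network_transmit(SoilReading(24.0, 6.8))])
    print(application(state))       # -> "irrigate"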
Moreover, in Figure 11e, Sung et al. propose that the DT contains three layers: the physical world, the communication protocol, and the cyber world. [201] In the physical world, there are environmental sensors, camera sensors, and chemical sensors for data collection. On a farm, soil, light, temperature, air, water, and nourishment are all key parameters for the healthy growth of the plants. Environmental sensors can monitor temperature, humidity, light, etc.; camera sensors can monitor plant growth and farm status; and chemical sensors can monitor the usage of CO2 and energy. After real-time data acquisition in the physical world, data integration and data analysis are realized via the communication protocol. In the cyber world, data from the communication protocol are processed through data classification, data forecasting, data reasoning, and data optimization. These three layers form a cycle: data from the physical world are sent to the cyber world for processing via the communication protocol, and results from the cyber world are fed back to the physical layer to adapt the sensors and other devices in the real farm. The DT smart farm thus becomes a core means of farm management, enabling more efficient and scientific management.

Conclusion and Perspective
In summary, this review has systematically introduced the recent rapid progress of image sensors, acoustic sensors, wearable sensors, and their respective AI algorithms. Regarding the advances of image sensors, we focused our discussion on edge computing that integrates the sensing element and the computation element, which provides an emerging platform with short response time, low power consumption, and high transmission efficiency for IoT compared with conventional cloud computing. Meanwhile, as another key trend in the development of image sensors, various AI algorithms have driven great growth in applications such as object recognition, object detection, video tracking, motion estimation, and visual 3D scene modeling. As for voice systems, we introduced the mechanism of common MEMS microphones and the advancement of speech-recognition systems. With the integration of sensors and AI algorithms, more advanced sensors have emerged, ranging from wearable devices to ambient devices, such as gloves for gesture recognition, masks for lip-language recognition, robotic manipulators for object recognition, and floor mats for gait recognition. Accordingly, we have reviewed in detail their design considerations, working mechanisms, application scenarios, and potential for security, automation, healthcare, environmental monitoring, and human-computer interaction in the smart home and smart farming.
We summarize the various AI sensors and their applications in Table 1 (excerpt: water quality sensors based on pH-sensitive polymers, ion-selective electrodes, and oxygen-sensitive dyes, fabricated by 3D printing, injection molding, or screen printing, for water quality monitoring [220,221]). Despite the significant achievements so far, there are still research opportunities for the future development of AI sensors. First of all, the current computation unit based on the classical von Neumann architecture faces limitations in dealing with tremendous datasets, which constrains the development of more complicated AI algorithms. In addition, edge computing requires a new computation architecture to realize on-chip computation with low power consumption, fast transmission speed, and high mobility. Neuromorphic computing is one solution that emulates the biological human brain and provides a more efficient platform for massive parallel processing; in this regard, it enables fully connected and rapid responses in smart-home systems. Hence, neuromorphic computing devices need to be further optimized for integration with sensor elements on a single chip, especially for the development of SNNs, which mimic natural neural networks more closely and are equipped with more powerful computational capability than traditional artificial neurons. Next, owing to the limited functionality a single sensor can offer, multimodality becomes a solution by combining different sensors to achieve functional diversity in one system. Human biological sensory systems obtain more comprehensive information and make more reliable reactions from multiple signal sources than from a single signal. Even with advanced AI sensors, single functionality remains an inevitable bottleneck in a smart-home system; multimodal fusion provides a promising solution, leading to multimodality systems such as multimodal fusion of data for gesture recognition and multimodal fusion of different sensors for healthcare systems. In addition, the prosperous development of VR and AR has facilitated the emergence of a brand-new research area, namely DTs, with broad applications in the DT smart home and DT smart farm. Benefiting from DTs and various sensors, indoor activity detection, human-body status monitoring, plant-growth status monitoring, and plant-growth environment monitoring can all be automatically detected and controlled without physical proximity, anytime and anywhere. Last but not least, how to seamlessly combine various advanced AI sensors, such as image sensors, microphone sensors, wearable sensors, and edge-computing units, into an integrated system should be carefully considered. All in all, with continuous technology innovation, we can envision the realization of DT systems that achieve smart homes/smart farms with better environmental monitoring, more immersive interaction, and more comprehensive healthcare in the future.