Dark‐Mode Human–Machine Communication Realized by Persistent Luminescence and Deep Learning

Increasingly ubiquitous collaborative intelligence between humans and machines requires human–machine communication (HMC) that is more human and less machine-like to accomplish given tasks. Although speech signals are considered the best mode of communication in HMC, background noise often interferes with these signals. Therefore, research on integrating lip-reading technology into HMC has gained significant attention. However, lip reading functions effectively only in well-lit environments, whereas HMC may routinely occur in dark environments owing to potential energy shortages, increased exploration in darkness, nighttime emergencies, etc. Herein, a possible method for HMC in the dark mode is presented, realized by deep learning of the motion patterns of persistent luminescence (PL) on the skin surrounding the lips. An ultrasoft PL-polymer composite patch is used to record the motion pattern of the skin during speech in the dark. It is found that a Visual Geometry Group network (VGG-5) and a residual neural network (ResNet-34) could predict spoken words in darkness with test accuracies of 98.5% and 98.75%, respectively. Furthermore, these models could effectively distinguish similar-sounding words such as "around" and "ground." Dark-mode communication can allow a wide range of people, including disabled people with limited dexterity and voice tremors, to communicate with artificial intelligence machines.

of deep oceans, deep space, tunnels, and caves; a novel HMC model integrated into a multimodal communication system could prove useful in facilitating human–machine interaction in such dark environments. [12,13] Furthermore, a possible energy crisis in the future would also require dark-mode HMCs. Therefore, to enable humans to communicate with intelligent machines in dark environments, this study presents a novel HMC model that can be used by anyone (regardless of ability) and is not affected by background noise.
HMC in complete darkness may be inspired by bioluminescent communication, in which deep-sea and cave creatures generate bioluminescence (a form of cold light) to transmit and receive large amounts of information within a species and sometimes between species. [14] The application of light sources to modulate robotic manipulation has recently been studied using flexible photodetector textiles, introducing a new interactive method of optical mechanical communication. [15] A form of cold light could be used to set up dark-mode HMC via visualization of lip movements during speech, with deep learning algorithms decoding the speech in a manner similar to lip-reading technology. Thus, no external source of illumination, such as a light-emitting diode (LED), is required to visualize the lips; a simple form of cold light placed around the lips could facilitate realization of dark-mode HMC. The possible source of cold luminescence to track lip movement could be either persistent luminescence (PL) or mechanoluminescence (ML). [16][17][18] PL, also known as long-lasting phosphorescence, is light emission from certain materials that persists for an extended time after the excitation source, such as UV, has ceased. Following the pioneering work of Matsuzawa et al., who discovered ultralong green PL lasting over 24 h in SrAl2O4:Eu,Dy (SAO) phosphors, several PL materials with different peak emission wavelengths have been discovered and widely used over the past few decades for emergency signs, indicators, in vivo optical imaging, and toys. [19] In contrast, ML is the emission of light when a solid material is subjected to mechanical stimuli such as stress, pressure, friction, and vibration. Among the various ML materials, self-reproducible ML materials, in which ML is generated and maintained under repeated mechanical stimuli without the need for UV preirradiation, may be ideal for visualizing lip movements.
Materials exhibiting obvious self-reproducible ML include ZnS:Cu/Mn2+, CaZnOS:Ln3+, SrZnOS:Ln3+, and SrZn2S2O:Ln3+. Various applications, such as display systems, light sources, heartbeat detection, hybrid sensors, and motion-driven ML fibers, have been proposed. [20] To prove the concept of dark-mode HMC, PL is chosen in this study to decode speech from the skin surrounding the lips using convolutional neural network (CNN) models. Ultrasoft PL-polymer composite patches were used to record the movement pattern of the skin during speech in a dark environment. Ten different classes, including seven words, two letters, and a smiling gesture, were considered to evaluate the performance of CNN models with different feature-extraction layers. The present work will help all people, including disabled people with limited dexterity, voice tremors, or poor vision, to communicate with artificial intelligence machines in a dark environment.

Visualization of Lip Movements in a Dark Environment
Highly flexible, superstretchable, and ultrasoft PL patches were fabricated from Ecoflex elastomer blended with SAO microparticles, which glow green in the dark after irradiation with excitation sources such as UV light, fluorescent room light, or sunlight (Figure S1, Supporting Information). The emission spectrum of the PL patch is illustrated in Figure S2a, Supporting Information, along with the spectrum of the fluorescent room light used for excitation; it shows a characteristic broad Eu2+-based emission band with a maximum at 530 nm resulting from the characteristic 4f65d1 → 4f7 transition of the Eu2+ ion. [19] The afterglow lifetime of the PL patch at room temperature, measured by a photomultiplier tube (PMT) after charging for 20 min under fluorescent room light, was found to exceed 12 h, as illustrated in Figure S2b, Supporting Information. At 12 h, the afterglow of the PL patch was still clearly observable with the naked eye under dark conditions; photographs of the afterglow of SAO over a 12-h period are presented in the study by Hu et al. [21] The afterglow lifetime is influenced by the excitation source, charging period, and external temperature. For example, as the temperature decreases, the rate of detrapping of trapped electrons/holes is reduced, which extends the decay period. Decay curves of SAO for different variables are well documented in other studies. [19,22] Importantly, the afterglow of SAO is recoverable after re-irradiation with excitation sources.
Initially, eight ultrasoft and flexible SAO-Ecoflex composite patches were applied directly to the lips to visualize lip movements in response to spoken words in a dark environment. However, we found that there was frequent contact between the tongue and the pasted PL patches on the lips. Therefore, in view of possible health problems, we selected the skin above the orbicularis oris muscle to apply the PL patches. The orbicularis oris muscle surrounds the lips and controls the shape of the lips during speech. Therefore, the skin above the orbicularis oris muscle could be an ideal location to decode speech based on the movement patterns of the PL patches. On this basis, eight PL patches were applied to the skin with biocompatible soft glue, as shown in Figure 1a. In a dark environment, the luminous PL patches can be seen surrounding the lips, as schematically presented in Figure 1a. It should be noted that the PL patches were naturally charged under fluorescent room light. Thus, no special UV source was used to charge the PL patches.
Because this study is mainly focused on improving the decoding of speech in dark environments, the number of distinguishable classes was limited to ten. Considering a small-scale database allows a comprehensive presentation of the model performance for each class, which should aid understanding among general readers. Furthermore, care has been taken in the selection of classes to include short words, long words, like-sounding words, visually similar words, visually different words, and gestures ( Figure 1a).
The video for a particular class was recorded continuously while the word was uttered repeatedly 20 times with a short interval in between. In this manner, 40 video clips, representing four participants and ten classes (4 × 10), were obtained. A portion of a video clip for each class is included in the Supporting Information, showing the movements of the PL fields in response to speech. To facilitate the extraction of the corresponding frames for each repetition, the short interval between pronunciations was marked by blocking the camera with a sheet of A4 paper. Ten representative frames from each repetition were then selected for further analysis and feature extraction for deep learning. To extract the ten frames per repetition, each video clip was first converted into images of size 210 × 210. Then, the maximum intensity of each frame was determined and plotted against the frame number, as shown in Figure 1b. The plot shows sudden cyclic increases and decreases in intensity, with the maximum intensity decreasing continuously over time. The sudden rises and falls result from exposure of the PL patches and blocking of the camera lens, whereas the gradually falling intensity reflects the decaying afterglow of SAO, as mentioned earlier. [19] Figure 1b was further processed by defining a threshold intensity and converting the plot into a binary signal, as shown in the inset of Figure 1b. Based on the binary data, the first and last frames of each cycle were determined, representing the beginning and end of the exposure. Ten frames were then extracted for each cycle at equal sampling intervals, as shown in Figure 1c. Using this strategy, 800 frames (frames per repetition × repetitions × participants = 10 × 20 × 4) were extracted for each class, yielding 8000 frames for the ten classes.
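The thresholding-and-sampling procedure above can be sketched as follows; `extract_frames` and its arguments are illustrative names, since the paper does not publish its extraction script:

```python
import numpy as np

def extract_frames(max_intensity, threshold, n_frames=10):
    """Select n_frames evenly spaced frames from each exposure cycle.

    max_intensity: 1D sequence of per-frame maximum pixel intensities.
    threshold: intensity separating exposure (patches visible) from
               blocked-camera intervals.
    """
    binary = (np.asarray(max_intensity) > threshold).astype(int)
    edges = np.diff(binary)
    starts = np.where(edges == 1)[0] + 1   # first frame of each exposure
    ends = np.where(edges == -1)[0]        # last frame of each exposure
    if binary[0] == 1:                     # clip begins mid-exposure
        starts = np.insert(starts, 0, 0)
    if binary[-1] == 1:                    # clip ends mid-exposure
        ends = np.append(ends, len(binary) - 1)
    cycles = []
    for s, e in zip(starts, ends):
        # equal sampling intervals across the cycle
        idx = np.linspace(s, e, n_frames).round().astype(int)
        cycles.append(idx)
    return cycles
```

With 20 repetitions per clip, this returns 20 index arrays of 10 frame numbers each, matching the 10 × 20 × 4 = 800 frames per class described above.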
We investigated whether the number of frames considered for each pronunciation (i.e., 10) retains sufficient movement patterns. Increasing the number of frames would certainly increase the information but requires more computational effort in training the deep learning model. To visualize the motion patterns, ten frames were merged into a single image, as shown in Figure 2a.

Figure 1. Illustration of data generation. a) Schematic illustration of video recording with a mobile phone in a dark environment while speaking or gesturing a specific set of words, with eight PL composite patches on the skin surrounding the lips (the cartoon woman's head has been reproduced with permission from TurboSquid). b) A framework to extract ten frames for each repetitive action in a video. c) A sequence of ten frames obtained using the method shown in (b).

www.advancedsciencenews.com www.advintellsyst.com
In some classes, such as "around," one can clearly see multiple shifts in the position of the patches. In others, such as "strain," these shifts are difficult to see owing to the small movement of the PL patches. For a closer look, two regions of interest (ROIs) were selected in the merged image, labeled ROI-1 and ROI-2 in Figure 2b. Both ROIs show a clear representation of patch motion for most of the classes. Even similar words such as "around" and "ground" show different patterns in both ROIs. Where one patch shows similarity, the other patches still provide a unique representation of lip movement; for instance, "smile" shows a pattern similar to "strain" and "fracture" at ROI-1 but a distinct pattern at ROI-2. This suggests that there is local variation in the deformation of the skin surrounding the lips and that these deformations are highly correlated with the words spoken. Because the PL patches reveal these local variations in skin deformation, the question is how effectively deep learning algorithms can learn the patterns and decode them into speech.
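The paper does not state how the ten frames were combined into the merged image; a maximum-intensity projection is one plausible approach, since it places the bright patch positions from all ten frames into a single picture of the motion trail. The frame stack and ROI coordinates below are illustrative:

```python
import numpy as np

# Hypothetical stack of ten 210x210 grayscale frames from one repetition.
frames = np.random.randint(0, 256, size=(10, 210, 210), dtype=np.uint8)

# Maximum-intensity projection over the frame axis: each output pixel keeps
# the brightest value it attained across the ten frames.
merged = frames.max(axis=0)

# Crop a region of interest from the merged image (coordinates made up here).
roi_1 = merged[40:90, 60:110]
```

Because the SAO afterglow decays over a recording, a real pipeline might first normalize each frame's intensity before projecting, so that later frames are not systematically dimmer.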

Decoding Motion Patterns of PL Patches into Spoken Words
In this study, a CNN was used to learn the connection between the motion patterns of the PL patches and the spoken words. A CNN is a type of deep learning model for processing data with a grid-like topology (such as images) and is designed to automatically and adaptively learn spatial hierarchies of features, from low- to high-level patterns. CNNs are typically composed of three distinct operational layers: a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer applies a convolution operation to the input data using a convolutional filter (kernel) to create a feature map. The result of the convolution operation is passed through a nonlinear activation function, such as a rectified linear unit (ReLU), before being passed to the pooling layer; the nonlinear activation function gives the CNN the ability to learn particularly complicated objects. The pooling layer, in contrast, is mainly responsible for reducing the in-plane dimensionality of the feature map, which reduces computational requirements and combats overfitting when training the network. The output feature maps from the last convolutional or pooling layer are flattened into a 1D vector and connected to one or more fully connected (dense) layers, in which each neuron in one layer is connected to each neuron in the next layer by a learnable weight. The output of each neuron in the fully connected layers is passed through a nonlinear activation function, such as ReLU. The last fully connected layer usually has as many output nodes as there are classes, and its activation function typically differs from the others: for a multiclass classification task, a softmax function normalizes the real-valued outputs of the last fully connected layer into target class probabilities, where each value lies between 0 and 1 and all values sum to 1.
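As a minimal illustration of the final layer's behavior, the softmax function can be written in a few lines of NumPy; the logit values here are arbitrary:

```python
import numpy as np

def softmax(logits):
    """Normalize real-valued logits into class probabilities."""
    # Subtracting the maximum avoids overflow in exp() and leaves the
    # result unchanged, since softmax is shift-invariant.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Three hypothetical class logits from a final fully connected layer.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

Each entry of `probs` lies strictly between 0 and 1, the entries sum to 1, and the largest logit yields the largest probability, which is why the predicted class is taken as the argmax of the softmax output.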
The number of convolutional and pooling layers in a CNN architecture can be increased depending on the complexity of the images; a deeper CNN can more effectively capture low-level details, but at the cost of higher computational requirements. In recent years, owing to significant increases in computing power, many deep CNNs have been developed, such as AlexNet, GoogleNet, DenseNet, VGGNet, ResNet, and MobileNet. [23] In the present study, different CNN models inspired by VGGNet were tested with increasing depth of the feature-extracting layers, while the fully connected layers were fixed in all models. The depth was increased by simply repeating the smallest unit of the feature-extraction stage, which consisted of two convolutional layers and a max-pooling layer. The fully connected stage consists of three layers: the first and second fully connected layers contained 256 and 128 neurons, respectively, each with a ReLU activation function followed by a 50% dropout layer; the dropout layers were included to minimize overfitting. The final fully connected layer contained ten neurons for the ten classes with a softmax activation function. Each convolutional layer used a 3 × 3 kernel, 1 × 1 stride, and ReLU activation, whereas the number of filters varied with the position of the feature-extraction layer, as shown in Figure 3a. The models were trained using the Adam optimizer to minimize the categorical cross-entropy loss function with a learning rate of 0.0001. The model consisting of one feature-extracting layer is referred to as VGG-1; similarly, models with two, three, four, and five feature-extraction layers are called VGG-2, VGG-3, VGG-4, and VGG-5, respectively. The architecture of VGG-5 is illustrated in Figure 3a.
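The effect of stacking "conv-conv-maxpool" blocks on the spatial dimensions can be traced with simple arithmetic. This sketch assumes "same" padding for the 3 × 3, stride-1 convolutions (so only the 2 × 2 max pooling changes the size), a detail the paper does not state explicitly:

```python
def vgg_feature_map_size(input_size=210, n_blocks=1):
    """Spatial size of the feature map after n feature-extraction blocks.

    Assumption: each block is two 3x3 stride-1 convolutions with 'same'
    padding (size unchanged) followed by 2x2 max pooling with stride 2
    (size halved, floor division).
    """
    size = input_size
    for _ in range(n_blocks):
        size //= 2  # only the max-pooling layer shrinks the map
    return size

# Feature-map sizes for the hypothetical VGG-1 through VGG-5 variants.
sizes = [vgg_feature_map_size(210, n) for n in range(1, 6)]
```

Under these assumptions, the 210 × 210 input shrinks to 105, 52, 26, 13, and finally 6 pixels per side, which shows why each added block cuts the size of the flattened vector feeding the fully connected layers roughly fourfold.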
Moreover, a deep CNN model has been implemented using the ResNet-34 architecture to validate the performance of the VGGNet-based models. The architecture of ResNet-34 is illustrated in Figure S3, Supporting Information.
To facilitate feature extraction from the ten representative frames containing the spatial and temporal information of lip movements, the frames were superimposed after conversion to grayscale, as shown in Figure 3b, which provided a total of 800 data points (classes × participants × repetitions = 10 × 4 × 20) with a channel dimension of 10. Augmentation was applied to artificially increase the number of data points fivefold to reduce overfitting, resulting in 4000 augmented data points. Augmentation was performed by random rotation (−20° to 20°), random scaling (0.9-1.1), horizontal flipping, and random translation (horizontal = ±10 pixels, vertical = ±10 pixels) on each stacked image, as shown in Figure 3b. All augmented data were used for training the model, whereas the 800 original data points were divided equally into validation and test datasets. The split was made by class, so that each class was represented equally in the validation and test datasets.
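The dataset assembly and class-balanced split described above can be sketched as follows; the within-class split policy (first half to validation, second half to test) is our assumption, since the paper only states that each class is represented equally:

```python
import numpy as np

n_classes, n_participants, n_reps = 10, 4, 20
frames_per_rep, h, w = 10, 210, 210

# One data point per repetition: ten grayscale frames stacked as channels
# (placeholder zeros stand in for the real superimposed images).
data = np.zeros((n_classes * n_participants * n_reps, frames_per_rep, h, w))
labels = np.repeat(np.arange(n_classes), n_participants * n_reps)

# Class-balanced 50/50 split of the 800 originals into validation and test.
val_idx, test_idx = [], []
for c in range(n_classes):
    idx = np.where(labels == c)[0]
    val_idx.extend(idx[: len(idx) // 2])
    test_idx.extend(idx[len(idx) // 2:])
```

This yields 400 validation and 400 test points with 40 per class in each, while the 4000 augmented copies (not shown) would be reserved exclusively for training, so no augmented variant of a held-out original leaks into the training set.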
The results of the five VGGNet-based models and ResNet-34 are presented in Table 1, which lists the train, validation, and test accuracies. VGG-1 showed a training accuracy of ≈90% and validation and test accuracies of 72% and 74%, respectively. The lower validation and test accuracies relative to the training accuracy clearly indicate overfitting, which stems from the shallow feature-extraction layer. Overfitting was somewhat reduced in VGG-2, where the validation and test accuracies remarkably increased to 85.5% and 90%, respectively. Moreover, the additional feature-extracting layer in VGG-2 increased the training accuracy by 6% compared with VGG-1. This trend of increasing training accuracy and decreasing overfitting with each additional feature-extracting layer continues up to VGG-5. ResNet-34 showed a training accuracy of 100% with validation and test accuracies of 99% and 98.75%, respectively, quite similar to the performance of VGG-5 (training accuracy = 99.02%, validation accuracy = 99.02%, and test accuracy = 98.75%). Thus, it can be clearly concluded that the deeper the feature-extraction layers, the better the model performs in increasing accuracy and combating overfitting.
It is interesting to observe the performance of some models on individual classes with the test datasets using the confusion matrix plots shown in Figure 4. The confusion matrix of VGG-1 shows that the model is most frequently confused between "strain" and "stress" and between "t" and "b." Similarly, the model shows some confusion in distinguishing between "around" and "ground." VGG-1 predicted "smile" with a low error rate of 2.5%, whereas its accuracy in predicting "fracture" was the worst, at only 27.5%, because true "fracture" samples were frequently assigned to other classes. VGG-2 significantly improved the prediction accuracy for each class compared with VGG-1. For example, the confusion between "strain" and "stress," "b" and "t," and "around" and "ground" was significantly reduced; the accuracy of predicting "smile" reached 100%, and the accuracy for "fracture" improved to 72.5%. The confusion matrix of VGG-5 is excellent, as shown in Figure 4. Mirroring the overall accuracies from VGG-1 to VGG-5, the per-class accuracy also increased with depth, with VGG-5 showing the best performance. The model predicted "around," "b," "ground," "mechanics," "smile," "stress," and "t" with 100% accuracy, while "fracture" and "science" were predicted with 97.5% and 95% accuracy, respectively. VGG-5 occasionally predicted "stress" for the true class "strain," which reduced the accuracy for "strain" to 92.5%; interestingly, VGG-5 predicted all true "stress" samples correctly. Furthermore, the ResNet-34 confusion matrix shows a similar tendency of confusion for certain classes. For example, the confusion of the VGG models between "strain" and "stress" is also observed in ResNet-34.
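The per-class accuracies quoted here correspond to the row-normalized diagonal of the confusion matrix (i.e., per-class recall). A minimal sketch with hypothetical labels:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """Confusion matrix and per-class accuracy (row-normalized diagonal)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows = true class, columns = predicted class
    return cm, cm.diagonal() / cm.sum(axis=1)

# Toy example: class 0 ("strain") is sometimes predicted as class 1
# ("stress"), while class 1 is always predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]
cm, acc = per_class_accuracy(y_true, y_pred, 2)
```

In this toy case the accuracy for class 0 drops to 75% while class 1 stays at 100%, mirroring the one-directional "strain" → "stress" confusion observed for VGG-5.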
Moreover, both VGG-5 and ResNet-34 showed a prediction accuracy of 100% for the classes "around," "b," "t," "ground," "mechanics," and "smile." Thus, it can be concluded that the two models differ minimally in their tendency of confusion for a particular class, despite their different architectural principles. All models classified "smile" with the highest accuracy, and among the VGG-based models, 100% accuracy was achieved from VGG-2 onward. This is likely because the "smile" gesture stretches the lips, producing distinct motion patterns in the majority of patches; the spoken classes do not involve such stretching, so even shallow CNN models can distinguish the gesture. The confusion matrices also show that the models are confounded by similarity of lip deformation rather than by pronunciation duration. For example, "mechanics" takes the longest to pronounce compared with "b" and "t," but the VGG models performed well regardless of pronunciation duration. In contrast, the two ROIs examined more closely for "stress" and "strain" in Figure 2b clearly show some degree of similarity in the motion patterns of the PL patches, and some of the remaining patches likely show similar behavior owing to the similar lip movements when these two words are pronounced. As a result of these similarities, even the deep CNN (DCNN) tended to be confused. One way to improve the discriminability of features between classes could be to increase the spatial resolution of the skin deformation by adding more PL patches around the lips; this could increase the number of distinguishable patterns, allowing the CNN to classify with higher precision. Another way to improve performance is to improve the CNN model itself.
Thus, in addition to the VGGNet and ResNet architectures, other deep CNN models, such as AlexNet, GoogleNet, DenseNet, and MobileNet, can be used to determine the best complementary model for this study.

Discussion
To the best of our knowledge, no previous study has attempted to decipher speech from the movement of the skin surrounding the lips. The present work demonstrates that it is possible to decode speech from the skin near the lips simply by classifying the motion patterns of the PL patches with a deep learning algorithm. Conventional lip-reading technologies require well-lit images of lip deformation to generate features for deep learning, unlike the present study, where deep learning does not consider the geometric features of the lip; instead, the deformation of the orbicularis oris muscle during speech is transferred to the PL patches. More importantly, speech is decoded in a dark environment. Therefore, the approach can be useful for many people, including disabled people with limited dexterity, vocal tremors, or poor vision. This dark-mode communication can be used in various fields of science and technology, as shown in Figure 5. Recently, voice-activated vehicle control systems have been developed for automobiles. [24] However, these systems must contend with background engine noise, and passengers have demanded improved performance through lip-reading technology. The requirement for illumination has limited lip reading to daylight hours, as driving at night with the interior light on is not safe; we believe that our findings will help solve this problem. Dark-mode communication can also be integrated with Google Assistant, Siri, and Amazon Alexa to provide secure private interaction in the dark, and speech-impaired individuals can use it without restriction. Furthermore, dark-mode communication can be used in combination with physical biometrics such as fingerprints, palm prints, iris shape, etc.
or alone in physical access control systems (e.g., to open doors to private or business properties) by setting up a lip motion password. [25] Dark-mode communication between humans and robots in the darkness of deep oceans, deep space, and tunnels could be very promising for building a robust multimodal communication system. Cellphone applications developed for dark-mode communication can be used in nighttime emergencies such as when a person loses their voice during a fire incident or when a house is attacked by intruders. In reality, people move their heads while talking. In the present work, however, head positions were fixed as much as possible when facing the camera. Because the deep learning models in the present work were trained considering random image rotation, image translation, and scaling, a certain degree of flexibility was provided for head movements when facing the camera. However, to accommodate head rotation away from the camera, the models must be trained with new experimental datasets that consider such head motions. Some studies have considered multiple cameras to record multiple views of the speaker while speaking to train the model. [26] Applying a multiview database in the present framework may be of great help in solving problems stemming from the face angle of speakers.
Current trends in lip-reading technology are mainly focused on examining the performance of deep learning models on the available databases. [26] These databases are either produced in controlled laboratory environments or extracted from wild environments such as TV programs; in both, illumination is highly important. Although applying a lip-reading system in dark environments has several benefits, few studies have attempted it. Those studies considered light-insensitive thermal infrared (IR) technology for decoding speech in dark environments. [27,28] Despite the successful use of thermal images for facial recognition, the studies concluded that visual speech recognition using thermal images performs poorly owing to the low resolution of IR cameras and higher image noise; selecting ROIs is also a technically challenging and time-consuming process in low-contrast thermal image processing. Even though the resolution of IR cameras improves every year, they remain prohibitively expensive compared with standard cameras. In this regard, utilizing luminescent material to decode speech in lip-reading systems is a novel approach and might open the door to exploiting several functional materials in advancing visual speech recognition systems. Furthermore, the present work demonstrates that the skin surrounding the lips can be used to decipher speech, which avoids the inconvenience that would otherwise result from PL patches adhered directly to the lips. Moreover, the PL patches used in this study are biocompatible and ultrasoft, further increasing the convenience of using them. The solution suggested in the present work is designed to be helpful in harsh environments such as deep oceans and deep space, and the visible-range wavelength of a PL patch could be a better signal to measure than an IR signal, as it can penetrate the protective glass shields of helmets.
In this study, PL has been used to establish the concept of dark-mode HMC. However, PL materials require regular recharging by an external energy source such as UV or daylight. To overcome this drawback, ML materials are an ideal substitute. Various studies have reported intense emissions from ultrasoft ML composites, such as ZnS:Cu-polydimethylsiloxane (PDMS) and ZnS:Mn-PDMS, under the weak stimuli of moving facial skin. [29] Decoding speech using skin-driven ML will therefore be the subject of our future work. Our future work will also consider a larger number of distinguishable words, as large-scale databases with multiple speakers play a major role in the development of lip-reading technology, and will extend to more tonal languages such as Mandarin.

Conclusion
In this study, we present a possible method of communication between humans and artificial intelligence machines based on deep learning of PL motion patterns of the skin surrounding the lips. PL sources of ultrasoft and flexible SAO-Ecoflex composite patches were considered to record lip motion in a darkroom using a cell phone camera for ten different classes. Based on the motion patterns, it was found that there are local variations in the deformation of the skin surrounding the lips and that the local variation in deformation is strongly correlated with the respective class. Various VGG-based CNN models were tested by considering the feature extraction layer as a variable. It has been shown that the deeper the feature extraction layers, the better the model performs to increase accuracy and combat overfitting problems. A ResNet-34 model, which is significantly different from the VGG-based model, was also implemented to verify the performance of the VGG-based models. It was found that the VGG-5 and ResNet-34 models did not differ significantly in terms of the tendency of confusion for a given class and model accuracy percentage. All models classified "smile" with the highest accuracy due to the unique motion patterns for the majority of PL patches. Dark-mode communication can be useful for all people, including the disabled with limited dexterity, vocal tremors, and poor vision. It can also be used to control automobiles, converse with robots, and for physical access control systems such as opening doors.

Experimental Section
Preparation of PL Composite Patches: PL-polymer composite patches were prepared from a homogeneous mixture of SAO microparticles and Ecoflex at a weight ratio of 20:80. SAO was purchased from Nemoto and Co., Japan, whereas Ecoflex was purchased from Smooth-On, Inc., USA. Ecoflex was selected for embedding the SAO microparticles because its mechanical compliance is as high as that of human skin. Mixing was performed using a planetary mixer for ≈2 min at 450 RPM. To improve homogeneity and avoid agglomeration, ZrO2 balls with a diameter of 10 mm were added to the container before mixing. The well-mixed composite was then degassed for 5 min to remove trapped air bubbles. Thereafter, the composite was poured onto the central part of one surface of a 3D-printed plate bearing several circular molds with a diameter of 2 mm and a thickness of 0.25 mm; the plate was printed using CARIMA-IM-96. A spin coater was then used to spread the composite for 20 s at 1000 RPM, and the surface was wiped with a sharp knife, leaving the composite only in the molds. Subsequently, the plate was transferred to an oven and allowed to solidify for 1 h at 60 °C. Finally, the solidified composite patches were extracted. The entire process is shown in Figure S4, Supporting Information. An optical microscope (OM) image illustrating the distribution of SAO in a patch is shown in Figure S1, Supporting Information.
Experimental Setup: Video recordings were made in a dark room, as shown in Figure 1a, using a cell phone camera (ISO 3200) at a frame rate of 30 fps. Four individuals whose native language was not English participated in the voice recording: three Koreans and one Nepalese. During the recording, the participant's head was kept fixed as much as possible while facing the camera. The speech included ten different classes: two English letters (b and t) and eight English words (fracture, mechanics, science, strain, stress, ground, around, and smile). It should be noted that "smile" was chosen to show a gesture performed by stretching the lips rather than a pronunciation.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.