CNN‐SCNet: A CNN net‐based deep learning framework for infant cry detection in household setting

Infants are vulnerable to several health problems and cannot express their needs clearly. Whenever they are in a state of urgency and require immediate attention, they cry; crying is their primary form of communication. Parents therefore need to stay alert and supervise their infants continuously, but they cannot monitor them all the time. An infant monitoring system could determine when an infant is crying and notify the parents immediately. Although many such systems are available, most cannot detect infant cries. Some systems include cry detection mechanisms, but these are not very accurate because they rely on obsolete approaches or on machine learning (ML) models that cannot identify infant cries in noisy household settings. To address this limitation, this research developed and analyzed different conventional and hybrid ML models in detail to find the best model for detecting infant cries in a household setting. A stacked classifier built from several state-of-the-art techniques is proposed, and it outperformed all other developed models. The proposed CNN-SCNet's (CNN-Stacked Classifier Network) precision, recall, and f1-score were found to be 98.72%, 98.05%, and 98.39%, respectively. Infant monitoring systems can use this classifier to detect infant cries in noisy household settings.


INTRODUCTION
An infant is a child aged between birth and one year. During this time, infants grow and develop at a remarkable rate. They learn to smile, wave, clap, pick up objects, roll over, sit, crawl, and babble. Some even begin to pronounce a few words. They grow to trust and form bonds with their carers and frequently comprehend more than they can communicate. 1 Cries are an infant's first verbal expression. A cry sends a signal of urgency or difficulty and prompts adults to tend to the infant. Crying is a typical response in infants, who cry for various reasons such as pain or hunger. 2 It is necessary to monitor infants continually because it is never known when they will start crying.

Several approaches have been proposed to detect infant cries in household settings automatically. Later, several convolutional neural network (CNN)-based techniques were created, which performed better. 13 Osmani et al. 14 used ensemble learning techniques, including bagging and boosting, to recognize infant cries following a feature selection step. Lavner et al. 15 presented logistic regression (LR) and CNN models to detect infant cries automatically in a home setting. Xie et al. 16 used a CNN for audio-based continuous newborn cry monitoring at home. Mel-frequency cepstral coefficients (MFCC), pitch, and formants were extracted from the recordings and used to train these algorithms. Liu et al. 17 used linear predictive coding (LPC) to extract several features, including MFCC, short-time energy (STE), and others, and used KNN and artificial neural networks to detect infant cries.
Later, other neural networks and deep learning models were applied to detect infant cries, which improved performance. Chang and Tsai 18 presented a CNN-based technique for infant cry identification and recognition by transforming an infant's crying signal into a two-dimensional spectrogram. Cohen et al. 19 compared deep learning architectures with conventional approaches for detecting infant cries. Different CNN architectures were designed and assessed, and their performances were compared to conventional ML methods, including LR and support vector machines. Ji et al. 20 employed SVM and artificial neural networks to identify asphyxiated infant screams using weighted prosodic and auditory features. Ting et al. 21 classified asphyxiated infant cries using hybrid speech features and a CNN. Joshi et al. 22 proposed a multistage heterogeneous ensemble model for augmented infant cry classification. Initially, the mel-frequency cepstral coefficients algorithm was used to generate the spectrograms and to analyze the varying feature vectors. Then, a heterogeneous ensemble model combining boosting algorithms and a CNN was designed to classify infant cries. The proposed model achieved a classification accuracy of 93.7%.
A few concerns emerge from this literature review: (a) most existing studies have yet to compare their proposed approach with other systems already in use; (b) the accuracy rates reported in most studies do not satisfy "gold standard" requirements; and (c) none of the prior studies applied the stacked-classifier architecture, a new state-of-the-art technique, or compared it with existing approaches. Therefore, additional research is required to close these knowledge gaps.

DESCRIBING AN INFANT CRY SIGNAL
Air pulses are produced repeatedly by the vibration of the vocal cords. The fundamental frequency (pitch) of these pulses generally ranges between 250 and 600 Hz in healthy infants. The vocal tract creates resonant frequencies, known as formants, that shape the cry signal. The first two formants typically occur at 1100 and 3300 Hz, respectively. 23 Cry signal detection is often carried out by extracting distinctive features from various audio signal segments. 25,26 An example of an infant cry signal waveform is shown in Figure 1.
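To make the pitch range above concrete, here is a minimal cepstral F0-estimation sketch in Python. The 44.1 kHz sample rate, the frame length, and the synthetic pulse train (standing in for the glottal air pulses) are illustrative assumptions, not values taken from this paper:

```python
import numpy as np

FS = 44_100  # assumed sample rate for this sketch

def cepstral_pitch(frame, fmin=250.0, fmax=600.0):
    """Estimate F0 from the cepstral peak inside the healthy-infant pitch range."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    lo, hi = int(FS / fmax), int(FS / fmin)  # quefrency band for 250-600 Hz
    peak = lo + np.argmax(cepstrum[lo:hi])
    return FS / peak

# synthetic pulse train: one pulse every 110 samples, i.e. F0 ≈ 44100/110 ≈ 401 Hz
frame = np.zeros(2048)
frame[::110] = 1.0
f0 = cepstral_pitch(frame)
print(round(f0, 1))
```

The quefrency search band directly encodes the 250–600 Hz pitch range stated above, so spurious low- or high-quefrency peaks are excluded by construction.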

METHODOLOGY
The workflow diagram of this research is shown in Figure 2. First, an open-access dataset was collected from Kaggle to train the models. The dataset contained 432 audio records, 108 of which were of infant cries. The remaining audio records included sounds from different acoustic events, such as speech, laughter, environmental sounds, traffic noises like horns and vehicle engines, animal sounds, and quiet sounds. Based on the types of audio records available, the entire dataset was labeled into four sound classes, namely "cry," "laugh," "noise," and "silence." After this, the dataset was synthesized and preprocessed for model training. The entire dataset was split into train and test data, with 20% of the samples going into the test data. The training dataset was again split in an 80:20 ratio, with 20% forming the validation dataset used to improve model performance. As this split created a class imbalance, data augmentation was carried out to bring all classes to the same number of records in the train, test, and validation data. Then, the audio recordings were segmented into consecutive, overlapping segments of 4096 samples (roughly 93 ms) with a 50% overlap. These segments were further separated into 16-ms frames with an 8-ms step size. Each frame is passed to a pitch detector, 27 which first estimates the pitch from peaks in the cepstral domain and then refines this value using cross-correlation in the time domain. Owing to the anticipated infant cry pitch period, possible pitch period durations were constrained to 1.6-3.3 ms. Finally, the data were ready for input into the models.
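The segmentation scheme described above can be sketched as follows. A 44.1 kHz sample rate is assumed (at that rate, 4096 samples ≈ 93 ms), and random noise stands in for a real recording:

```python
import numpy as np

FS = 44_100                    # sample rate assumed so that 4096 samples ≈ 93 ms
SEG_LEN, SEG_HOP = 4096, 2048  # ~93-ms segments with 50% overlap
FRAME_LEN = int(0.016 * FS)    # 16-ms frames (705 samples)
FRAME_HOP = int(0.008 * FS)    # 8-ms step (352 samples)

def sliding_windows(x, length, hop):
    """Split a 1-D signal x into overlapping windows of the given length."""
    n = 1 + (len(x) - length) // hop
    return np.stack([x[i * hop:i * hop + length] for i in range(n)])

# one second of noise standing in for an audio recording
signal = np.random.default_rng(0).standard_normal(FS)

segments = sliding_windows(signal, SEG_LEN, SEG_HOP)         # shape (20, 4096)
frames = sliding_windows(segments[0], FRAME_LEN, FRAME_HOP)  # shape (10, 705)
print(segments.shape, frames.shape)
```

Each 705-sample frame would then be handed to the pitch detector, with candidate pitch periods restricted to 1.6-3.3 ms as stated above.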
Then, the prediction models were developed in four phases. Firstly, various conventional ML models were developed to find the algorithms that best detect infant cries in a household setting. The algorithms were selected based on the literature review of previous works. The selected algorithms include support vector

F I G U R E 2
The workflow diagram of this study.

machine (SVM), 28 LR, 29 random forest (RF), 30 decision tree (DT), 31 Gaussian naive Bayes (GNB), 32 K-nearest neighbors (KNN), 33 and multi-layer perceptron (MLP). 34 The models were developed with the help of scikit-learn, 35 a Python library that implements various machine-learning algorithms. To get the best performance out of the models, hyperparameter tuning 36 was carried out using the grid search algorithm. 37 The parameter settings are given in Table 1.
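A minimal sketch of the grid-search tuning step with scikit-learn; the synthetic data and the parameter grid are illustrative only, the paper's actual settings being those listed in its Table 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic four-class data standing in for the extracted audio features
X, y = make_classification(n_samples=200, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# illustrative grid: every combination is cross-validated and the best kept
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.score(X_te, y_te), 2))
```

The same pattern applies to each of the seven conventional ML models, swapping in the estimator and its grid.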
Secondly, 11 different deep-learning models were developed. Among them, five were conventional deep-learning models: CNN 38 and various variants of RNN, namely long short-term memory (LSTM), 39 bidirectional LSTM, gated recurrent units (GRU), 40 and bidirectional GRU. 41 The other six were state-of-the-art deep-learning technologies: deep belief network (DBN), 42 patch-based DBN (Pa-DBN), 43,44 OP-ConvNet, 45 Gaussian-weight initialization of CNN (Ga-CNN), 46 stacked sparse autoencoder (SSAE), 47-49 and deep stacked autoencoder (DSAE). 50 Considerable time was spent determining the architecture of each model that gives the best performance. Among these models, CNN was found to perform best. The hyperparameter settings of the deep-learning models are shown in Table 2.

TA B L E 1
Hyperparameter settings of the developed ML models.

Model Hyperparameter Values

The architecture of the proposed CNN model is shown in Figure 3. The architecture has a total of 17 layers. First comes the input layer, which takes the sample data as input in the shape of an image. Then comes the 1D convolutional layer with 64 filters and a kernel size of 1×6. In a convolutional layer, a filter is applied to the input data, removing unnecessary details while keeping the relevant information. This layer uses the ReLU activation function. Then comes the batch normalization layer, which speeds up neural network training by normalizing its inputs. The fourth layer is the max pooling layer, with a pool size of 1×3. The following nine layers repeat the convolutional, batch normalization, and max pooling layers. The fourteenth layer is the flatten layer, which converts the pooled feature map into a single column that passes into a fully connected (dense) layer. The following two layers are dense layers with the ReLU activation function. Finally comes the output layer, which gives the prediction probability of whether a particular sound is an infant cry or not. The output layer uses a softmax activation function.
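A sketch of a Keras model following this description; the filter counts beyond the first block and the dense-layer widths are assumptions where the text does not give exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4  # "cry", "laugh", "noise", "silence"

def build_model(input_len=4096):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_len, 1)))
    # four conv / batch-norm / max-pool blocks: the first block plus the
    # "following nine layers" of repetition described above
    for _ in range(4):
        model.add(layers.Conv1D(64, 6, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(3))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))   # width 64 is an assumption
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

model = build_model()
print(model.output_shape)
```

Counting the input layer, this matches the 17 layers described: input, 4 × (conv, batch norm, max pool), flatten, two ReLU dense layers, and the softmax output.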

Thirdly, four hybrid models were developed to detect infant cries. Features were extracted from the top-performing deep-learning model (CNN) and used as input for the hybrid models. The algorithms used in this phase include AdaBoost (AB), 51 CatBoost (CB), 52 XGBoost (XGB), 53 and gradient boosting (GB). 54
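The hybrid pattern, deep features in and boosted trees out, can be sketched with scikit-learn. The random matrix below merely stands in for the CNN's penultimate-layer activations, and scikit-learn's GradientBoostingClassifier stands in for the boosting libraries named above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for 64-dimensional penultimate-layer CNN activations with 4 class
# labels; real features would come from the trained network, not random numbers.
cnn_features = rng.standard_normal((400, 64))
labels = rng.integers(0, 4, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(cnn_features, labels,
                                          test_size=0.2, random_state=0)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb.fit(X_tr, y_tr)         # boosted trees trained on the deep features
preds = gb.predict(X_te)
print(preds[:10])
```

Because the labels here are random, the accuracy is near chance; the point is the wiring, not the score.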

F I G U R E 3
The architecture of the proposed CNN model.

F I G U R E 4
The architecture of the proposed stack classifier.
Fourthly, a three-layer stacked classifier was built on the extracted features, incorporating the seven best-performing models among the conventional ML, deep learning, and hybrid models. The architecture of the proposed stacked classifier is shown in Figure 4. This architecture was found after trying various combinations of models at different positions and layers and evaluating the performance of each resulting stacked classifier. Four models, namely XGB, MLP, CB, and GB, were placed in the first layer of the SC model, whereas the RF and DT models were placed in the second layer. Finally, the third layer holds the LR model. The original samples of the dataset are fed into each model in the first layer, and a verdict is obtained from all four models for each instance in the dataset. In the second layer, the verdicts from XGB and MLP were fed as input, along with the actual values of those instances, to the RF model, while the verdicts from CB and GB were fed into the DT model. Finally, the verdicts from the RF and DT models are provided as input to the third-layer LR model, which produces the final judgment.
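Under stated substitutions (scikit-learn's GradientBoostingClassifier in place of XGB and CB, synthetic data in place of the extracted CNN features), the three-layer stacking scheme can be sketched as:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Layer 1 (GradientBoosting stands in for XGB, CB, and GB in this sketch)
layer1 = [GradientBoostingClassifier(n_estimators=50, random_state=0),  # "XGB"
          MLPClassifier(max_iter=1000, random_state=0),                 # MLP
          GradientBoostingClassifier(n_estimators=50, random_state=1),  # "CB"
          GradientBoostingClassifier(n_estimators=50, random_state=2)]  # "GB"
for m in layer1:
    m.fit(X_tr, y_tr)

def l1_verdicts(X_in):
    return [m.predict(X_in).reshape(-1, 1) for m in layer1]

v = l1_verdicts(X_tr)
# Layer 2: RF combines the XGB+MLP verdicts, DT the CB+GB verdicts
rf = RandomForestClassifier(random_state=0).fit(np.hstack(v[:2]), y_tr)
dt = DecisionTreeClassifier(random_state=0).fit(np.hstack(v[2:]), y_tr)

def l2_verdicts(X_in):
    w = l1_verdicts(X_in)
    return np.hstack([rf.predict(np.hstack(w[:2])).reshape(-1, 1),
                      dt.predict(np.hstack(w[2:])).reshape(-1, 1)])

# Layer 3: LR turns the layer-2 verdicts into the final judgment
lr = LogisticRegression(max_iter=1000).fit(l2_verdicts(X_tr), y_tr)
final = lr.predict(l2_verdicts(X_te))
print(final[:10])
```

Unlike scikit-learn's ready-made StackingClassifier, this sketch wires specific first-layer pairs to specific second-layer models, mirroring the asymmetric topology of Figure 4.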
For the final phase of the research, a different dataset, the ESC dataset created by Piczak, 55 was used to train the hybrid models, and their performances were evaluated in terms of precision, recall, and f1-score. This step was carried out to demonstrate the effectiveness of the proposed hybrid model on a different dataset.

RESULTS
Each prediction model's performance was assessed based on precision, recall, and f1-scores using both the train and test data. Each evaluation parameter was obtained by macro averaging over the actual and model-predicted class labels, which calculates the parameter for each class label and takes the unweighted mean. Accuracy was not used as a performance metric for evaluating the classifiers since a resampling method was utilized to balance the classes; as shown in, 56 accuracy is inappropriate in such cases. The performance of each of the conventional ML algorithms on the train and test data is presented in Figure 5A and 5B, respectively. The performance metrics of these models are also given in Table 3. RF, DT, and MLP obtained the best performance on the train data, followed by LR, KNN, GNB, and SVM (Figure 5A). However, the best performance on the test data was obtained by RF, followed by MLP, DT, LR, SVM, KNN, and GNB (Figure 5B). Of the selected algorithms, RF, MLP, and DT achieved 100% train performance on all metrics, including precision, recall, and f1-score. RF obtained 92.72% precision, 93.05% recall, and a 92.79% f1-score on the test dataset. MLP got a precision, recall, and f1-score of 91.302%, 92.013%, and 91.51%, respectively, on the test dataset. DT had a precision, recall, and f1-score of 90.34%, 90.625%, and 90.38%, respectively, on the test dataset. The LR algorithm achieved 95.15% precision, 95.07% recall, and a 95.101% f1-score on the training dataset. The test performance of LR was 90.27% for all performance metrics. For the KNN model, the precision, recall, and f1-score on the training dataset were 93.65%, 93.21%, and 92.92%, respectively, whereas on the test data they were 86.37%, 86.45%, and 85.64%, respectively. The train performance of the GNB algorithm was 91.22% precision, 89.79% recall, and a 90.109% f1-score. GNB had the lowest test score, with precision, recall, and f1-score of 82.5%, 81.52%, and 81.24%, respectively. For the SVM algorithm, the precision, recall, and f1-score on the training dataset were 88.69%, 88.53%, and 88.21%, respectively, the lowest among all algorithms, whereas on the test dataset they were 87.61%, 88.26%, and 87.71%, respectively. It can be observed that the lowest f1-score on the test data was obtained by the GNB algorithm and the highest by the RF algorithm.
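The macro averaging described above can be reproduced directly with scikit-learn, here on a toy set of labels over the four sound classes:

```python
from sklearn.metrics import precision_recall_fscore_support

# toy true/predicted labels over the four sound classes
y_true = ["cry", "laugh", "noise", "silence", "cry", "noise"]
y_pred = ["cry", "laugh", "cry",   "silence", "cry", "noise"]

# average="macro": compute each metric per class, then take the unweighted mean
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(round(prec, 3), round(rec, 3), round(f1, 3))  # → 0.917 0.875 0.867
```

Because every class contributes equally to the mean, rare classes weigh as much as common ones, which is why this averaging is preferred over accuracy on resampled data.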

TA B L E 3
The evaluation results for the deep learning-based models on the train and test data are presented in Figure 6A,B, respectively. The performance metrics of these models are shown in Table 3. It can be observed from Figure 6A,B that the CNN model obtained the best performance on both the train and the test dataset. The precision, recall, and f1-score of the CNN model on the training dataset were all 100%, while the precision, recall, and f1-score on the test dataset were 95.402%, 96.51%, and 95.45%, respectively. The accuracy and loss curves of the model over the epochs are shown in Figure 7.
Again, the performance of the hybrid models on the train and test data is presented in Table 4. AB had the lowest performance on both the train and the test data; all the other models scored 100% on all performance metrics for the train data. Based on the results on the test dataset, it is evident that the stacked classifier (SC) achieved the best score and outperformed the others. The precision, recall, and f1-score for AB were 78.31%, 78.95%, and 78.58%, respectively. The performance measures for GB were 92.708% for all metrics. The precision, recall, and f1-score of XGB were 96.52%, 96.18%, and 96.32%, respectively. Furthermore, CB had a precision, recall, and f1-score of 95.13% each. The best performance was obtained by SC, with precision, recall, and f1-scores of 98.727%, 98.05%, and 98.39%, respectively.
Finally, the hybrid models were trained again on a new dataset. It is observed from Table 5 that the proposed stacked classifier achieves a precision, recall, and f1-score of 100% on the train data, and 98.294% precision, 98.704% recall, and a 98.484% f1-score on the test data. Thus, even with a different dataset, the proposed stacked classifier network performs remarkably well.

DISCUSSIONS AND CONCLUSIONS
Four different outcomes were obtained from this research. Firstly, the labeled dataset was subjected to various classical algorithms, including SVM, RF, DT, GNB, LR, MLP, and KNN, to determine the best-performing ML models for infant cry detection. RF was found to perform better than the other ML algorithms (92.79% f1-score). Secondly, among the developed DL models, a CNN with seventeen fine-tuned layers achieved the best performance (f1-score: 95.45%) for detecting infant cries. Thirdly, five distinct hybrid models were proposed by extracting features from the second-to-last layer of the CNN model; the proposed hybrid stacked classifier (SC) model performed the best of all the models. Fourthly, the hybrid stacked classifier proved effective even on a different dataset. In addition, a thorough review and comparison of related research findings using other models was compiled to place state-of-the-art technologies and their performance metrics for detecting infant cries in a household setting in context (Table 6). This research had the following limitations. First, this study did not examine classification methods based on signal processing. Second, the dataset used in this study was not very large. Third, this work did not examine transfer learning models, commonly referred to as pretrained CNN models, because these models require image data. Fourth, the computational complexity of the developed models, in both time and space, was not studied. As a result, future research may focus on (a) developing signal processing techniques with more samples for detecting infant cries in a household setting, (b) training other models that have not been tested and observing their performances, (c) using a much larger dataset for training the models, and (d) examining the time and space complexity of the proposed models.

CONCLUSION
Infant monitoring plays a crucial role in ensuring the well-being and safety of infants, particularly in a household setting. As a result, a rise in interest in infant cry detection studies has been seen recently. With the vulnerability of infants to various diseases and their limited ability to communicate, continuous monitoring becomes necessary. Researchers have recognized the significance of detecting infant cries as an effective means of understanding their needs and alerting parents or caregivers promptly. By leveraging advancements in technology and ML, efforts are being made to develop infant monitoring systems that can accurately detect and interpret infant cries within a household setting. Hence, an effective deep learning-based SC model named CNN-SCNet is proposed in this research, which can detect infant cries in a household setting. The f1-score attained by the proposed classifier reflects the effectiveness of the approach.

F I G U R E 1
An example of an infant cry signal waveform.

F I G U R E 5
Performance metrics of selected ML algorithms on train and test data. (A) Train data and (B) test data.

F I G U R E 6
Performance metrics of selected DL algorithms on train and test data. (A) Train data and (B) test data.

F I G U R E 7
Accuracy and loss of CNN on each epoch. (A) Accuracy vs. epoch curve and (B) loss vs. epoch curve.
TA B L E 2
Hyperparameter settings of the deep-learning models.
Performance measures of the ML and DL models.

Precision (train) Recall (train) F1-score (train) Precision (test) Recall (test) F1-score (test)
TA B L E 4
Results obtained for hybrid models.

TA B L E 5
Performance metrics of the hybrid models on a different dataset.

TA B L E 6
Comparison of the previous ML approaches carried out for infant cry detection with the proposed system.