A deep neural network using audio files for detection of aortic stenosis

Abstract

Background: Although aortic stenosis (AS) is the most common valvular heart disease in the Western world, many affected patients remain undiagnosed. Auscultation is a readily available screening tool for AS. However, it requires a high level of professional expertise.

Hypothesis: An AI algorithm can detect AS using audio files with the same accuracy as experienced cardiologists.

Methods: A deep neural network (DNN) was trained on preprocessed audio files of 100 patients with AS and 100 controls. The DNN's performance was evaluated with a test data set of 40 patients. The primary outcome measures were sensitivity, specificity, and F1-score. Results of the DNN were compared with the performance of cardiologists, residents, and medical students.

Results: Eighteen percent of patients without AS and 22% of patients with AS showed an additional moderate or severe mitral regurgitation. The DNN showed a sensitivity of 0.90 (0.81–0.99), a specificity of 1, and an F1-score of 0.95 (0.89–1.0) for the detection of AS. In comparison, we calculated an F1-score of 0.94 (0.86–1.0) for cardiologists, 0.88 (0.78–0.98) for residents, and 0.88 (0.78–0.98) for students.

Conclusions: The present study shows that deep learning-guided auscultation predicts significant AS with similar accuracy as cardiologists. The results of this pilot study suggest that AI-assisted auscultation may help general practitioners without special cardiology training in daily practice.


| METHODS
The study consists of two parts. In the first part, we trained a neural network to classify auscultation findings of patients as having significant AS or not. In the second part, we compared the performance of the trained DNN with the auscultatory skills of 10 experienced cardiologists, 10 residents, and 10 medical students, using a test data set that consisted of a completely disjoint set of patients.
For training, we used auscultation audio files from 100 patients with significant AS and 100 patients without AS. The ground truth was defined by echocardiography. Significant AS was defined as a V max of >3.5 m/s measured by continuous-wave Doppler. Although a consensus definition of high-grade AS has not yet been reached, we chose this cut-off value because these patients require close monitoring. Patients admitted for suspected coronary artery disease or other cardiac diseases served as the control group.
We used an electronic stethoscope (Eko) connected to a smartphone interface via Bluetooth for auscultation. Auscultation was performed at the aortic auscultation point (second intercostal space, right sternal border) and the mitral auscultation point (fifth intercostal space, midclavicular line). Thus, we included two auscultation files from each patient. At each auscultation point, a 15 s audio file was recorded at a sampling rate of 40 kHz.
The audio files were recorded as part of the clinical routine in a tertiary teaching hospital with a large valve unit specialized in transcatheter aortic valve implantation (TAVI). Only patients from this database who underwent echocardiography within 7 days before or after auscultation were included. Since the study was retrospective, an explicit ethics vote was not necessary according to the regulations of the responsible ethics committee.
We preprocessed the data in our study before using them to train the network. In the first step, the 15 s sound files were divided into three equal parts of 5 s. This was done to overcome the risk that small portions of an auscultation file falsified by respiration [11] contribute disproportionately to the training of the whole network. Consequently, six sound files per patient (three files for each of the two auscultation points) contributed to the network's training. In the second step, we performed a Mel frequency cepstral coefficients (MFCC) transformation of the audio files. This transformation approximates the frequency perception of human hearing and has been proposed for the analysis of heart and lung auscultation audio data. [12][13][14] Subsequently, we trained the DNN with 1200 processed audio files (6 files per patient, 100 patients with AS, 100 patients without AS).
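The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' code: `split_recording` and `extract_mfcc` are hypothetical names, and the 40 kHz sampling rate, 5 s segment length, 13 coefficients, and hop length of 10 are the values reported in this paper (librosa's `feature.mfcc` is assumed as the MFCC implementation):

```python
import numpy as np

SR = 40_000          # sampling rate reported for the electronic stethoscope
SEGMENT_SECONDS = 5  # each 15 s recording is cut into three 5 s parts

def split_recording(signal: np.ndarray, sr: int = SR,
                    seconds: int = SEGMENT_SECONDS) -> list:
    """Cut a recording into equal, non-overlapping segments."""
    step = sr * seconds
    n_segments = len(signal) // step
    return [signal[i * step:(i + 1) * step] for i in range(n_segments)]

def extract_mfcc(segment: np.ndarray, sr: int = SR) -> np.ndarray:
    """MFCC transformation of one 5 s segment (requires librosa)."""
    import librosa  # imported lazily so the splitting step has no extra dependency
    return librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13, hop_length=10)

# A 15 s recording yields three 5 s segments, i.e. six files per patient
# when both auscultation points are used.
recording = np.zeros(SR * 15, dtype=np.float32)
segments = split_recording(recording)
```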
We developed a CNN for classification (Figure 1), which takes MFCCs as input. The sequential model has three two-dimensional convolutional layers and one max pooling layer. As activation function, we used ReLU. Before each convolutional layer, we applied batch normalization. To combat overfitting, we used a dropout layer between the convolutional layers that randomly sets a portion of the weights to 0 with a probability of 0.2. Thereby the network has to learn different aspects of the data each time. [15] Hyperparameter tuning was done iteratively for learning rate, batch size, number of epochs, number of kernels, and grid size of the convolutional layers. Model comparison was made using K-fold cross-validation.
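Based on the description above, a minimal Keras sketch of such an architecture might look as follows. This is an assumption-laden illustration, not the authors' implementation: the input frame count, the width of the dense layer, and the configuration of the third convolutional layer are not fully specified here, so placeholder values are used.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(13, 500, 1)):
    """Sequential CNN sketch: (n_mfcc, frames, channels) input.

    The frame count (500) and dense width (64) are assumptions.
    """
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(3):
        model.add(layers.BatchNormalization())        # applied before each conv layer
        model.add(layers.Conv2D(32, (3, 3), activation="relu", padding="same"))
        model.add(layers.Dropout(0.2))                # randomly zeroes activations, p = 0.2
    model.add(layers.MaxPooling2D((2, 2)))            # single max pooling layer
    model.add(layers.Flatten())                       # flatten to a one-dimensional tensor
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(2, activation="softmax"))  # probability of AS vs. no AS
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```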
After training, the network was applied to a test set that the model had not seen before. The test set consisted of 20 patients with AS and 20 without AS. Accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curves, and F1-score were calculated. The F1-score was calculated using the following formula: F1-score = 2 × (recall × precision)/(recall + precision). The same test set was then classified by 10 experienced cardiologists, 10 residents, and 10 final-year medical students. The performance parameters were averaged within the groups of cardiologists, residents, and students.
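These metrics follow directly from the confusion-matrix counts. As a worked check, the counts implied by the reported sensitivity of 0.90 and specificity of 1 on 20 + 20 test patients (18 true positives, 2 false negatives, 20 true negatives, 0 false positives) reproduce the F1-score of about 0.95 from the abstract:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity (recall), specificity, precision, F1, and accuracy from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * (sensitivity * precision) / (sensitivity + precision)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "accuracy": accuracy}

# Counts implied by the reported sensitivity of 0.90 and specificity of 1
m = classification_metrics(tp=18, fp=0, tn=20, fn=2)
```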
Audio file processing, training of the DNN, and prediction were implemented in the general-purpose programming language Python. Preprocessing of the audio files was done with the audio analysis library Librosa. We generated Mel frequency cepstral coefficients (MFCC) with a hop length of 10 and 13 coefficients. [16] The CNN was implemented using the Keras framework with a TensorFlow (Google) backend. [17] Continuous baseline characteristics are given as mean ± SD.
Continuous variables were compared using the t-test, and categorical variables were compared using the chi-squared test. Accuracy, sensitivity, specificity, and F1-score for cardiologists, residents, and students are given as mean with a 95% confidence interval.
The confidence intervals for the performance parameters of the DNN were calculated with the formula CI = parameter ± z × sqrt((parameter × (1 − parameter))/n), where z is the z-value corresponding to the 95% confidence level (1.96) and n is the size of the test sample, 40 in the present case. Inter-rater reliability was assessed by calculating Fleiss' kappa. [18]
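This Wald interval can be sketched in a few lines; with the reported sensitivity of 0.90 and n = 40 it reproduces the 0.81–0.99 interval stated in the abstract:

```python
from math import sqrt

def wald_ci(p: float, n: int, z: float = 1.96) -> tuple:
    """Wald confidence interval for a proportion p estimated on n samples."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Sensitivity of the DNN on the 40-patient test set
lo, hi = wald_ci(0.90, 40)
```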

| RESULTS
Data from 120 patients with and 120 patients without AS were taken for the present study. From each group, audio files from 100 patients were allocated to the training group and 20 to the test group, respectively.
Of the 120 patients with AS, 99 underwent femoral TAVI, 5 transapical aortic valve implantation, 13 open-heart surgery, and 1 valvuloplasty only. Two patients were treated conservatively.
FIGURE 1 Data processing and analysis were done principally in two steps. In the first step, MFCC feature extraction was performed. In the second step, the preprocessed data were fed to the convolutional part of the DNN. After the convolutional layers, the output is flattened to a one-dimensional tensor. Data are then fed to a fully connected layer using the ReLU (rectified linear unit) activation function. To overcome overfitting, which means that the network is overly adapted to the training data set, regularizer and dropout techniques were applied. In the softmax function, the input values are transformed into a probability distribution that, in the present case, gives the probability of AS or no AS. AS, aortic valve stenosis; MFCC, Mel frequency cepstral coefficients.

| DNN's diagnostic accuracy
Hyperparameter tuning was done using K-fold cross-validation. The training data were split into four folds, and while iterating through the folds, each iteration used one fold as the validation set. Using this approach, the optimized DNN consists of three convolutional layers with 32 kernels with a grid size of 3 × 3 (convolutional layers 1 + 2) and
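The cross-validation scheme described above can be sketched without any framework support. The index-based split below is a minimal illustration, not the authors' implementation:

```python
import numpy as np

def kfold_indices(n_samples: int, k: int = 4, seed: int = 0):
    """Yield (train_idx, val_idx) pairs; each fold serves once as the validation set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 1200 preprocessed audio files, four folds as in the study
splits = list(kfold_indices(1200, k=4))
```

In practice one would usually split at the patient level, so that segments from the same patient never appear in both training and validation folds; whether the study split by file or by patient is not stated here.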

| Diagnostic accuracy of the CNN versus cardiologists, residents, and students
Ten students, 10 residents in an advanced stage of training, and 10 consultant cardiologists participated in the study. Participants were blinded to the results of the DNN. They were asked to classify whether each patient had AS, using the two audio files recorded for each patient.
For inter-rater reliability, Fleiss' kappa was 0.69 in students, 0.64 in residents, and 0.84 in cardiologists. This shows that agreement in the group of cardiologists was much higher than in the groups of residents and students. The F1-score is a parameter for comparing the performance of different models or rater groups when seeking a balance between precision and recall. In Figure 2, ROC curves for the deep learning model, students, residents, and cardiologists are shown. The DNN showed a higher F1-score than the mean scores of cardiologists, residents, and students. Values for accuracy, sensitivity, specificity, and F1-score are given in Table 2.
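Fleiss' kappa for a fixed number of raters can be computed directly from a subjects × categories matrix of rating counts. A minimal numpy sketch (not the authors' implementation) is:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa from an (n_subjects, n_categories) matrix of rating counts.

    Every row must sum to the same number of raters (here: 10 per group).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()  # raters per subject
    # Per-subject agreement P_i and its mean
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()
    # Expected agreement from the marginal category proportions
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Example: 4 patients rated by 10 raters into {AS, no AS};
# perfect agreement yields kappa = 1.
perfect = np.array([[10, 0], [0, 10], [10, 0], [0, 10]])
```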

| DISCUSSION
The present study shows that deep learning-guided auscultation predicts significant AS with similar accuracy as board-certified cardiologists. These results suggest that artificial intelligence-assisted auscultation may help general practitioners without special cardiology training.
Auscultation is one of the pillars of clinical investigation. It is readily available, and no sophisticated technical requirements are necessary. On the other hand, a high level of expertise is essential, and the skills, once acquired, need to be used continuously. These circumstances may explain auscultation error rates between 20% and 80% in residents, primary care physicians, and cardiologists. [19,20] For this reason, computer-assisted auscultation was proposed for clinical use as early as the beginning of this century. The algorithms developed decompose the cyclical heart sound, and hand-engineered processing is applied for classification. In young patients with hypertrophic cardiomyopathy and in children with congenital heart disease, sufficient sensitivity and specificity could be achieved with these techniques. [21] However, these studies were done in young patients without conditions that complicate auscultation, such as adiposity and lung emphysema.
Deep neural networks (DNNs) take a completely different direction. A DNN is a machine learning algorithm whose architecture is modeled on the brain. Early applications focused on standard diagnostic techniques that doctors use daily but that cannot always be provided at an expert level. [22] Deep learning systems were developed for ECG interpretation, [23] skin cancer identification, [24] and papilledema detection. [25] In this context, it has been recognized that AI may also be a valuable tool to support doctors in identifying valvular heart disease.
In a recently published study by Chorba et al., physicians assigned 5878 auscultation findings to the labels "heart murmur", "no heart murmur", or "inadequate signal". The DNN was trained with these data using an end-to-end (E2E) network design. In a second step, the DNN was validated on a test data set of 1774 recordings annotated by separate expert clinicians. [26] In contrast, in the present study the ground truth was defined not by physician assignment but by echocardiography. By using this gold standard, the well-known erroneous annotation of auscultation findings by physicians was avoided. This is a crucial point, as machine learning is based on detecting subtle patterns in data. Only with high-quality training data can the noise that masks these patterns be sufficiently reduced. Furthermore, the data in our study were preprocessed before they were used to train the DNN. In this preprocessing, attributes of the audio data were isolated that have been shown to be essential for pattern recognition in audio files. [27,28] With this hybrid approach, our DNN showed similar accuracy as board-certified cardiologists. The precisely defined ground truth, in conjunction with the preprocessing of the audio data, compensates for the comparatively low patient number.
A potential limitation of our study is that we included only a few patients with moderate aortic stenosis. This is because the study was conducted with data from patients admitted to a tertiary teaching hospital for specialized valve therapy. Moreover, patients with only moderate valve disease are challenging to identify because

FIGURE 2 ROC curve (orange line) achieved by the model in comparison to students (A), residents (B), and cardiologists (C). Individual rater performance is indicated by the black crosses, and averaged cardiologist performance is indicated by the red dot.
VOIGT ET AL.