The validity of the Acoustic Breathiness Index in the evaluation of breathy voice quality: A Meta‐Analysis

Evaluating voice quality with acoustic measurements helps to objectify the diagnostic process. Breathiness in particular has been evaluated extensively, and the Acoustic Breathiness Index (ABI) appears to have promising features.


| INTRODUCTION
The assessment of voice quality is a crucial part of voice diagnostics, complicated by the fact that voice quality is an imprecisely defined term. 1 Voice quality is generally described as a perceptually multidimensional construct that cannot be measured by a single parameter. 2 Various diagnostic protocols use different terms in the perceptual evaluation of voice quality, such as grade or overall severity, hoarseness, roughness, breathiness, asthenia and strain. [3][4][5][6][7] For the major subtypes of perceived abnormal voice quality, the terms breathiness, roughness and strain have received wide acceptance. 8 Breathiness can be defined as turbulent noise that is excessively high in frequency during phonation, resulting from air leakage during glottal closure. 9,10 Roughness is the result of irregular vocal fold vibrations characterised by frequent, rapid and random changes in the regular vocal fold movement patterns. 3,[11][12][13][14][15][16] Strain, also described as pressedness, means increased vocal effort or "hyperfunction" and can be attributed to an unintentional excessive contraction of the phonation, breathing, articulation and neck muscles during phonation, leading to increased subglottal pressure and an increased degree of vocal fold adduction. 17 Next to overall voice quality, these dimensions of voice quality reached the highest intra- and inter-rater reliability, both within and among single raters and within rater panels. 13,[18][19][20] However, there is still notable variability in intra- and inter-rater reliability for all auditory-perceptual features, caused by different factors. [21][22][23] In addition and as an alternative to auditory-perceptual judgement, acoustic measurements of the voice signal have a high potential in the assessment of voice quality because they are based on objective measurements and are therefore less biased by subjective factors.
Moreover, they are the most frequently used diagnostic tool in voice research for identifying voice anomalies. 24 Furthermore, the pitfalls described earlier in the judgement of perceived auditory features can be overcome with acoustic methods, thus increasing the reliability and validity of voice assessments. Acoustic methods used so far analyse the voice signal in the time, frequency, amplitude and quefrency domains. However, recent research has shown that the combination of several acoustic parameters in one model or index provides better diagnostic accuracy and higher concurrent validity than single metrics in the evaluation of perceived features of voice quality. 2,15,[25][26][27] Yet, it remains difficult to develop a highly valid acoustic equivalent for roughness and strain. Because roughness is quite complex in nature, it is difficult to capture all variants of the rough vocal sound (e.g. multiplophonia, acoustic irregularities or vocal fry) with a single model. 16 The correlations between perceived strain and spectral-cepstral and other acoustic measures were moderate to high in some studies, 17,28 but an evaluation of the diagnostic accuracy of these methods and of their validity for perceived strain is still lacking. Hence, there is currently no acoustic index or model that can accurately assess perceived strain for clinical or research purposes.
For breathiness, an adequate model seems to have been developed. The Acoustic Breathiness Index (ABI) is a multiparametric nine-variable acoustic measure that quantifies the degree of breathiness with a single score based on concatenated samples of continuous speech and the sustained vowel /a/. 29 This index has been developed by the first author. Both speech tasks, that is, three seconds of voiced segments of continuous speech and three seconds of an intermediate segment of the sustained vowel /a/, are necessary for a high ecological validity of the voice quality assessment. 30 The initial study of ABI showed a strong correlation between ABI scores and perceived breathiness judgements, which was confirmed with cross-correlation statistics. 29 Furthermore, the diagnostic accuracy (ability of the measure to discriminate between the target condition and health) was high both in terms of sensitivity and in terms of specificity.

The pooled specificity was 0.92 (95% CI, 0.89-0.94). The area under the SROC curve of this analysis showed an excellent value of 0.94. The weighted ABI threshold was determined at 3.40 (sensitivity: 0.86, 95% CI, 0.84-0.87; specificity: 0.90, 95% CI, 0.88-0.92).

Conclusions:
The results confirm the ABI as a robust and valid objective measure for evaluating breathiness.

Keypoints
• ABI is a valid diagnostic tool to objectify breathiness of phonation across languages.

Neither age nor gender nor roughness significantly affects ABI in the evaluation of natural voices. 31 In addition, the ABI is highly sensitive to therapy-related voice quality changes. 31,32 The ABI score ranges from 0 to 10; the higher the ABI score, the more severe the breathiness. The initial study of ABI investigated a Dutch-speaking population, and the reported threshold was 3.44. 29 Because continuous speech is part of the ABI, inter-language phonetic differences may influence the outcome of an acoustic-based voice quality measurement. Therefore, cross-validation studies are needed to investigate the ABI's level of validity for different languages. In particular, the diagnostic accuracy of ABI, including its sensitivity and specificity, is of primary importance in determining its language-related thresholds.
Thus, the aim of the meta-analysis presented here was (a) to estimate the diagnostic accuracy of this tool together with its sensitivity and specificity levels, based on cross-validation studies and (b) to calculate an overall weighted ABI threshold of all included studies.

| Search strategy
Firstly, a systematic literature search was performed to identify studies in electronic databases. We searched MEDLINE, Google Scholar and Science Citation Index from the first publication of the ABI in 2017 to February 2020.
Secondly, a manual search of grey literature sources was carried out.
The hand search for relevant scientific reports was done in seven languages (English, German, Dutch, Japanese, Korean, Spanish and Brazilian Portuguese). Apart from finding grey and non-English literature, a manual search was performed because electronic databases do not always include relevant search terms in the titles or abstracts, or publications are not indexed with terms that allow them to be easily identified as relevant works for the present study. 33 The Acoustic Breathiness Index was used as the only search term, because it has existed as a proper term without overlapping meanings since 2017.

| Inclusion and exclusion of studies
Studies were included if they used ABI analyses following the same processing methodology as the initial ABI study 29: equal proportion of continuous speech and sustained vowel segments in acoustic analyses, 34 recording hardware meeting a sufficient standard for voice signal analyses, 35 use of the software Praat for signal processing and of the customised Praat script for ABI analysis, 29 and consideration of two groups of subjects (a vocally healthy group and a heterogeneous voice-disordered group). Studies eligible for inclusion in the present meta-analysis had, as one of their objectives, the analysis of the validity of ABI. Because the present study focused on the diagnostic accuracy of the ABI, other validity aspects such as concurrent validity and internal validity were excluded.
Inter-study differences in data acquisition were tolerated in the present meta-analysis. Each study has unique acoustical settings and configurations in terms of hardware, limiting the comparability between studies. However, many recent studies that used acoustic measurements for their investigations followed recommendations for hardware and recording circumstances to increase comparability of differences in room acoustics, 14 microphone type 35

| Critical appraisal of included studies and data abstraction
Data were abstracted, and the quality of each study was appraised using a customised data abstraction form. The QUADAS-2 tool 38 was applied to assess the risk of bias and applicability to the research question. With regard to the assessment of applicability, each article was evaluated for low and high concern for applicability to the research question. Using the patient selection, index test and reference standard domains, we defined low applicability concern as follows: (a) patient selection, there is an acceptable range between vocally healthy and voice-disordered subjects; (b) index test, the ABI was performed according to the standards of the initial study that developed the ABI 29; and (c) reference standard, the diagnosis of voice-disordered subjects was based on multidimensional voice assessments from otolaryngology.
We reported our methods and findings in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards 39 regarding the diagnostic test accuracy studies guidelines. 40

| Statistical analysis
Statistical analyses were completed using SAS software. The heterogeneity of studies was calculated using the chi-squared heterogeneity test and the I² index. Heterogeneity was confirmed for P-values ≤.05. An I² index value between 0% and 25% represents insignificant heterogeneity; >25%-50% low heterogeneity; >50%-75% moderate heterogeneity; and >75% high heterogeneity. 42 For the pooling method, the random-effects model was used, 43 because this model takes heterogeneity between the studies into account. To assess the diagnostic accuracy of the pooled analysis, the summary receiver operating characteristic (SROC) curve with its area under the curve (AUC) and index Q* was computed. A model with poor diagnostic accuracy has an AUC of 0.5 or lower. 44

Third, a weighted threshold of ABI with its sensitivity and specificity was calculated based on the results of all included studies using SAS software. 47 The weighting procedure is based on the Youden Index, which yields the best threshold as the one that maximises sensitivity + specificity − 1.

Figure 1 shows the details of exclusion and inclusion of studies using a flow chart according to the PRISMA statement. 39 We screened a total of 34 unique citations and reviewed 10 full-text articles, of which 6 studies were included (Table 1). The evaluation of the risk of bias of the included studies is shown in Table 2. The included studies were judged to be at low risk of bias, although, with regard to applicability concerns for patient selection, some confounding was possible in one study because the proportion of vocally healthy voices according to the reference standard was high. 49
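The I² categories given in the statistical analysis can be illustrated with a short sketch. It assumes the common definition of I² from Cochran's Q statistic, I² = max(0, (Q − df)/Q) × 100 with df = number of studies − 1; the function names and the example Q values are hypothetical and are not taken from the SAS analysis of the included studies.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Return I^2 in percent from Cochran's Q (assumed standard definition)."""
    df = n_studies - 1
    if df <= 0 or q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0


def heterogeneity_label(i2_percent: float) -> str:
    """Map I^2 to the categories quoted in the text (0-25, >25-50, >50-75, >75)."""
    if i2_percent <= 25.0:
        return "insignificant"
    if i2_percent <= 50.0:
        return "low"
    if i2_percent <= 75.0:
        return "moderate"
    return "high"


if __name__ == "__main__":
    # Hypothetical Q values for six studies (df = 5), for illustration only.
    for q in (3.0, 8.0, 20.0, 40.0):
        i2 = i_squared(q, n_studies=6)
        print(f"Q = {q:5.1f}  I2 = {i2:5.1f}%  {heterogeneity_label(i2)}")
```

The boundaries are treated as inclusive on the upper side, matching the ">25%-50%" style of the quoted categories.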

| DISCUSSION
The objective assessment of breathiness with acoustic measurements such as ABI seems to be sufficient compared with auditory-perceptual judgements. The literature search yielded six studies that evaluated the validity of ABI in recent years. For this purpose, the present meta-analysis investigated the diagnostic accuracy of ABI. The pooled sensitivity of the six included studies was acceptable, and the specificity was quite high. Although heterogeneity was high for pooled sensitivity and homogeneity was acceptable for pooled specificity, the diagnostic accuracy of ABI was sufficient to high. This meta-analysis with more than 3600 included voice samples verifies that the ABI, as an objective acoustic measure, identifies pathological voices with high diagnostic accuracy.
Its pooled sensitivity was good at 0.84, and its pooled specificity was very good at 0.92. The high specificity of the ABI means that almost all subjects without breathiness are correctly classified as such. The vast majority, but not all, cases of voice disorders are also detected by the ABI; complete detection cannot be expected from a single measure. For an even higher diagnostic accuracy, further measures are required, optimally as a combination of different dimensions, which also meets the requirements of the protocol for the diagnosis of voice disorders. 4,6,7,35 Nevertheless, given the low correlation between perceptual assessment of voice quality by the

FIGURE 2 Forest plots from the meta-analysis of the six included studies of the ABI with pooled sensitivity and specificity next to their 95% confidence intervals, and heterogeneity statistics

The study included voice samples with 43% normal, 41% mild, 11% moderate and 5% severe breathiness (high applicability concerns of patient selection). In other studies, the levels of breathiness were more balanced, especially for abnormal breathiness. For the validation of ABI as a voice diagnostic instrument, the full spectrum of abnormal breathiness should be represented in a balanced way to allow a clear distinction between normal and abnormal breathiness levels.
The present analysis covered four different language groups (Germanic, Romance, Altaic and Japonic), mostly with two representatives for each language group. According to the present results, the ABI appears to be relatively robust to phonetic inter-language differences in the continuous speech part, such as stress timing, syllable timing, mixed rhythm, intonation and complex syllable structure. 55 Compared with the ABI studies in other languages, the Korean study included the largest sample of voices from various laryngeal diseases. As demonstrated in the forest plots of the present meta-analysis, this study also had the largest power. The power of each study included in the meta-analysis reflects the magnitude or intensity of the parameters of interest in that study. The SROC analysis, which represents the performance of a diagnostic test based on data from a meta-analysis, 44 confirms that the ABI provides highly accurate results in identifying breathy voice quality. The Q* index is useful to summarise the accuracy of screening tests and defines test thresholds. In our analysis, both the AUC and the Q* index indicate the high diagnostic power of the ABI.
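As a rough illustration of what the Q* index summarises, the sketch below assumes the Moses-Littenberg formulation of the SROC curve, which may differ from the model fitted in the present analysis. For each study, D = logit(sensitivity) − logit(1 − specificity) is regressed on S = logit(sensitivity) + logit(1 − specificity); Q* is the point on the fitted curve where sensitivity equals specificity, which gives Q* = 1 / (1 + exp(−a/2)) with a the regression intercept. The function names and inputs are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def q_star(sensitivities, specificities) -> float:
    """Q* from an unweighted Moses-Littenberg fit (assumed model, see lead-in).

    Inputs must lie strictly between 0 and 1; no continuity correction is applied.
    """
    d_values, s_values = [], []
    for sens, spec in zip(sensitivities, specificities):
        fpr = 1.0 - spec
        d_values.append(logit(sens) - logit(fpr))  # log diagnostic odds ratio
        s_values.append(logit(sens) + logit(fpr))  # proxy for the test threshold
    n = len(d_values)
    mean_d = sum(d_values) / n
    mean_s = sum(s_values) / n
    sxx = sum((s - mean_s) ** 2 for s in s_values)
    if sxx < 1e-12:
        slope = 0.0  # no spread in S: fall back to a symmetric curve
    else:
        sxy = sum((s - mean_s) * (d - mean_d) for s, d in zip(s_values, d_values))
        slope = sxy / sxx
    intercept = mean_d - slope * mean_s
    # At Q*, sensitivity = specificity, so S = 0 and logit(sensitivity) = intercept / 2.
    return 1.0 / (1.0 + math.exp(-intercept / 2.0))


if __name__ == "__main__":
    # Hypothetical per-study values, not the data of the included studies.
    print(round(q_star([0.80, 0.85, 0.90], [0.95, 0.92, 0.88]), 3))
```

A Q* close to 1 indicates that sensitivity and specificity can both be high at the same operating point; a value of 0.5 corresponds to an uninformative test.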
The present results confirm that breathiness can be validly quantified with acoustic measurements. Overall voice quality can be evaluated by other acoustic indices, such as the cepstral spectral index of dysphonia (CSID) and the acoustic voice quality index (AVQI), which require the recording of continuous speech and sustained vowel phonations. The CSID consists of two multiple regression-based mathematical estimates of the severity of dysphonia, with separate analysis of sustained phonation and continuous speech, using several cepstral and spectral-based measures in commercial software from PENTAX Medical. 15,56 It has proven to be a valid instrument for acoustic voice changes associated with voice disorders such as vocal fold paralysis, adductor spasmodic dysphonia, primary muscle tension dysphonia, benign vocal fold lesions, presbylaryngis and mutational falsetto. 56 AVQI is a multivariate model in the freeware program Praat that includes six acoustic parameters and has demonstrated high validity in the evaluation of overall voice quality in heterogeneous voice disorders. 30,[48][49][50][51] The recording procedure, overall signal processing procedure and scale of AVQI are similar to those of ABI. Unlike CSID and AVQI, which assess overall voice quality, ABI is particularly suitable as an acoustic measure for the evaluation of breathy voices, for example in cases of benign vocal fold lesions that are dominantly characterised by breathiness, such as medium-sized or large nodules, paralysis or paresis of the recurrent laryngeal nerve, and vocal fold bowing associated with presbyphonia. 29

| CONCLUSION
In summary, our results confirm the ABI as a robust and valid objective measure for evaluating breathiness. Across six different validation studies, its diagnostic accuracy showed high to very high sensitivity and specificity. A weighted threshold of ABI to discriminate categorically between breathy and non-breathy voice quality of a subject's voice was calculated at 3.40.
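How a categorical ABI threshold relates to sensitivity, specificity and the Youden Index can be sketched as follows. This is a single-sample illustration with made-up scores, not the weighted multi-study procedure used to obtain 3.40; the function names, and the assumption that a score at or above the threshold counts as breathy, are hypothetical.

```python
def sens_spec(scores, disordered, threshold):
    """Sensitivity and specificity when score >= threshold is called breathy (assumption)."""
    tp = fn = tn = fp = 0
    for score, is_disordered in zip(scores, disordered):
        positive = score >= threshold
        if is_disordered and positive:
            tp += 1
        elif is_disordered:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(scores, disordered):
    """Return (threshold, J, sensitivity, specificity) maximising J = sens + spec - 1."""
    best = None
    for candidate in sorted(set(scores)):
        sens, spec = sens_spec(scores, disordered, candidate)
        j = sens + spec - 1.0
        if best is None or j > best[1]:
            best = (candidate, j, sens, spec)
    return best


if __name__ == "__main__":
    # Made-up ABI scores (0-10 scale): five vocally healthy, four voice-disordered.
    scores = [1.2, 2.0, 2.9, 3.1, 4.5, 2.5, 3.6, 5.8, 7.0]
    disordered = [False] * 5 + [True] * 4
    print(sens_spec(scores, disordered, threshold=3.40))
    print(youden_threshold(scores, disordered))
```

Only observed scores are tried as candidate thresholds, which is sufficient because sensitivity and specificity change only at those values.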

CONFLICT OF INTEREST
None to declare.

ACKNOWLEDGEMENTS
Open access funding enabled and organized by Projekt DEAL.