Mobile audiometry for hearing threshold assessment: A systematic review and meta‐analysis

Technological advancements in mobile audiometry (MA) have enabled hearing assessment using tablets and smartphones. This systematic review (PROSPERO ID: CRD42021274761) aimed to identify MA options available to health providers, assess their accuracy in measuring hearing thresholds, and explore factors that might influence their accuracy.

Conclusions: MA compares favourably to CA in measuring hearing thresholds and has a role in providing access to hearing assessment in situations where CA is not available or feasible. Future studies should prioritise the integration of pure-tone threshold assessment with additional tests, such as Speech Recognition and Digits-in-Noise, for a more rounded evaluation of hearing ability, and should assess the acceptability, feasibility and cost-effectiveness of MA in non-specialist settings.

MA is typically used to screen for a pre-determined level of hearing loss (i.e., pass/fail). Participants who do not pass screening are then referred for a full audiological assessment with conventional pure tone audiometry (the 'gold standard' for hearing threshold assessment).
Recent technological advancements have enabled threshold-level hearing assessment at individual frequencies using MA.
MA as a hearing screening tool providing a binary outcome (pass/fail) has been extensively studied. However, there is a paucity of evidence surrounding the use of MA for accurate assessment of hearing thresholds. This systematic review aimed to identify mobile audiometry available to health providers, with a focus on those that have had their threshold measurements assessed against conventional audiometry (CA). This review will summarise published literature, outline evidence regarding the accuracy of MA in hearing threshold assessment and highlight implications for clinical practice.

| Objectives
1. What are the available mobile audiometry systems capable of measuring hearing thresholds at individual frequencies that have peer-reviewed published validation data?
2. How accurate is mobile audiometry at measuring hearing thresholds when compared with conventional audiometry?

| METHODS
This systematic review was registered with the International Prospective Register of Systematic Reviews (PROSPERO) on 20 September 2021 (ID: CRD42021274761). 1 An amendment to the review was made to clarify that the comparator would be conventional sound booth audiometry. The review was conducted in accordance with PRISMA guidelines. 2 The template data collection form, data extracted from included studies and data used for all analyses are available from the corresponding author.

| Ethical considerations
Ethical approval was not required for this review as it is a review of existing literature. No external funding was provided for this review.
The authors have no competing interests to declare.

| Eligibility criteria
The authors included full articles of randomised controlled trials, observational studies (cross-sectional, cohort, case series, case studies/reports) and pilot and feasibility studies that reported the use of a mobile hearing testing device/software capable of determining hearing thresholds and compared these to pure tone audiometry in a sound booth. Articles about hearing screening devices/software (i.e., reporting just pass/fail) or clinical audiometers, journal abstracts and conference proceedings were excluded. Studies included in the meta-analysis had to report enough detail to enable the authors to determine the mean difference between thresholds measured using MA and CA or the mean difference between MA and CA thresholds measured at 0.5, 1, 2 and 4 kHz.

| Study selection
One reviewer conducted the searches (Babatunde Oremule). Two reviewers (Babatunde Oremule and Jonathan Abbas) independently screened titles and abstracts, and assessed studies against the eligibility criteria. Disagreements were discussed between the two reviewers and consensus agreement reached. No automation tools were used for this process.

| Data extraction
Two independent reviewers (Babatunde Oremule and Jonathan Abbas) extracted data from all included studies using a standardised form recorded in Excel (Microsoft Corporation). The data collection tool and data extracted are available from the corresponding author.
Data extracted included: authors, year of publication, country, use case, device type, device, operating system, software and version, headphones, number of participants, ears tested, ages, age ranges, males (%), participants, location of test, use of ambient noise monitoring, masking procedure and measured hearing thresholds. Where studies did not report all the data points, or the information provided was unclear, this was recorded. To ensure the inclusion of relevant studies and obtain necessary data for meta-analysis, we contacted the authors of studies that lacked sufficient information and asked them to provide raw data.

| Study risk of bias assessment
Two reviewers (Babatunde Oremule and Jonathan Abbas) each independently assessed the quality of included studies using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. 3 The QUADAS-2 assessment tool is recommended for systematic reviews to evaluate the risk of bias and applicability of primary diagnostic accuracy studies. It assesses the risk of bias, and the wider applicability of the results, in four domains: patient selection, the index test (the new test being evaluated, MA), the reference standard (the current gold standard, CA) and the study flow and timing. Domains are rated 'low', 'unclear' or 'high' for both categories. Disagreements were discussed between the two reviewers and consensus agreement reached.

| Effect measures
A meta-analysis was performed to obtain the mean difference between thresholds measured with MA and CA in dB HL.

| Synthesis methods
Study characteristics extracted using the data extraction tool were tabulated to display the results of each study and compared to the eligibility criteria. Statistical analysis was performed according to Cochrane methodology using Review Manager 5. 4,5 The outcome measures were the mean difference between thresholds measured using MA and CA, and the mean difference between MA and CA thresholds at each of 0.5, 1, 2 and 4 kHz frequencies.
First, the combined mean difference between MA and CA thresholds, and mean differences at 0.5, 1, 2 and 4 kHz were obtained for each study. Groups were combined using Cochrane combining groups methodology. 5 To avoid masking any potential interaural differences within participants that could be obscured by combining groups, we analysed left and right ears independently for studies that reported them separately. To account for the fact that the two ears of a participant are likely to be correlated and may influence each other's hearing thresholds, each ear was treated as a separate observation, each participant was considered a 'cluster', and the analysis was conducted analogously to a cluster randomised controlled trial.
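The group-combining step can be illustrated with a short sketch. The Python function below applies the Cochrane Handbook formula for merging two subgroups (sample size, mean, SD) into a single group. The function name and the example numbers are hypothetical and are not taken from any included study.

```python
import math


def combine_groups(n1, m1, sd1, n2, m2, sd2):
    """Merge two subgroups into one (n, mean, sd).

    Uses the Cochrane Handbook formula for combining groups; the
    result equals the mean and sample SD of the pooled raw data.
    """
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    variance = (
        (n1 - 1) * sd1 ** 2
        + (n2 - 1) * sd2 ** 2
        + (n1 * n2 / n) * (m1 - m2) ** 2
    ) / (n - 1)
    return n, mean, math.sqrt(variance)


# Hypothetical MA minus CA differences (dB) for two age subgroups.
n, mean, sd = combine_groups(30, 1.2, 5.0, 20, 2.5, 6.5)
print(f"combined n={n}, mean={mean:.2f} dB, SD={sd:.2f} dB")
```

When a study reports more than two subgroups, the same function can be applied repeatedly, merging one subgroup at a time.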
We calculated a representative intraclass correlation coefficient (ICC) of 0.178 (95% CI: 0.07-0.35) using data from Whitton et al., the only study for which raw data were obtained. 6 This ICC value was then used to calculate the inflated standard error for each study. The inflated standard errors were subsequently used as a basis for meta-analysis. We conducted a meta-analysis of the mean difference between measured MA and CA thresholds using the generic inverse variance method with a random effects model, acknowledging the heterogeneity in study populations, methods and measured outcomes. Subgroup analysis was conducted to investigate potential sources of heterogeneity. A sensitivity analysis was conducted using the upper (0.35) and lower bounds of the ICC (0.07).
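As a rough illustration of the clustering adjustment and pooling described above, the sketch below makes two assumptions that the review does not spell out: the Cochrane Handbook design effect, 1 + (M - 1) × ICC, with M = 2 ears per participant, and a DerSimonian-Laird estimate of between-study variance for the random effects model. Function names and study values are hypothetical.

```python
import math


def inflate_se(se, icc, cluster_size=2):
    """Inflate a standard error by the square root of the design effect.

    Assumes design effect = 1 + (M - 1) * ICC, with M = 2 ears per
    participant by default (an assumption, not stated in the review).
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return se * math.sqrt(design_effect)


def random_effects_pool(effects, ses):
    """Generic inverse-variance pooling with DerSimonian-Laird tau^2."""
    w = [1 / s ** 2 for s in ses]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (s ** 2 + tau2) for s in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2


# Hypothetical per-study mean differences (dB) and unadjusted SEs.
diffs = [0.5, 2.1, -0.8, 3.4]
raw_ses = [0.6, 0.9, 0.7, 1.2]

# Sensitivity analysis over the ICC point estimate and its CI bounds.
for icc in (0.07, 0.178, 0.35):
    adj = [inflate_se(s, icc) for s in raw_ses]
    pooled, se, tau2 = random_effects_pool(diffs, adj)
    lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
    print(f"ICC={icc}: {pooled:.2f} dB (95% CI {lo:.2f} to {hi:.2f})")
```

Under these assumptions, an ICC of 0.178 widens each standard error by a factor of about 1.09, so the adjustment mainly affects confidence interval width rather than the pooled estimate itself.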

| Study selection
A total of 858 articles were identified from database searches, of which 67 full texts were retrieved. Authors of 11 studies were contacted for further information via the listed corresponding author's e-mail on two occasions, 1 month apart. A response was received from one author, who replied with additional relevant information. 6 Following review, 17 studies met the inclusion criteria. The reasons for exclusion of studies that appeared to meet the inclusion criteria but were excluded after full-text review are included in the PRISMA flow diagram (Figure 1).
Eleven studies involved adult participants, 6-8,10,12,14,16,18-21 three studies involved both adults and children 11,13,17 and three studies involved children alone. 9,15,22 Four studies enrolled participants with specific conditions, including otitis media with effusion, 9 occupational hearing loss, 18 extended-high frequency hearing loss 19 and suspected sudden sensorineural hearing loss. 20

Tablet computers were most frequently used (10/17), followed by smartphones (6/17) and then computers (1/17). The Apple iPad™ was used in over half the included studies (9/17). The most frequently studied software application was SHOEBOX™ (6/17), followed by Hearing Test™ (3/17) and then HearTest™ (2/17). SHOEBOX (Clear Water Association) is a custom iPad-based software for professional use, available for download on the Apple app store on payment of a subscription fee. Hearing Test is a consumer app available on the Google Play Store as a free application. HearTest is a professional app available on the Google Play Store. Custom software was developed in two studies. 6,18

A variety of transducers were utilised, including supra-aural and circum-aural headphones, in-ear headphones, Apple EarPods and participants' own bundled headphones. Eleven studies reported American National Standards Institute (ANSI) compliant calibration, 6,12-15,17-22 two studies reported using biological calibration, 10,16 three studies did not report performing calibration 7-9 and the method of calibration was unclear in one study. 11 Sixteen studies reported air conduction thresholds only, while one study reported both air and bone conduction thresholds. 8

Figure 1. PRISMA diagram: identification of studies via databases and registers.

Masking was reported in five studies. 7,8,10,12,14 Masking in one study was continuously applied during mobile bone-conduction testing, in contrast to conventional manual bone conduction testing. 8 One study applied masking with threshold differences of ≥35 dB, 14 two at ≥40 dB 10,12 and the masking procedure was unclear in the final study. 7

MA was most frequently conducted in a quiet room (8/17) or in a sound booth (6/17), and in both a quiet room and a sound booth in one study. 6,18 Automated (self-administered) audiometry was used in 14 studies. Audiologists conducted the testing in two studies. The final study used untrained assessors. 18 These were administrative staff at an industrial worksite who were given written instructions and then observed construction workers doing the test. Ambient noise measurements were taken in 16/17 studies, but it was unclear whether ambient noise was measured in one study. 16 In five studies, 13-15,18,21 ambient noise testing was conducted during hearing assessment, while in the remaining studies, it was either not performed or not reported.
The most frequently reported outcome measure in the included studies was the mean difference between MA and CA thresholds, along with their corresponding standard deviations.This was reported in 9 out of the 17 studies.The second most frequently reported outcome was the mean threshold levels measured by MA and CA at individual frequencies, which was reported in 7 out of the 17 studies.The mean pure tone average (the average of thresholds at 0.5, 1, 2 and 4 kHz) was reported for both MA and CA in only 3 out of the 17 studies, while the mean absolute difference between MA and CA thresholds at each frequency was reported in 2 out of the 17 studies.
Fourteen studies reported accuracy in terms of the percentage of MA thresholds falling within 10 dB of CA. A <10 dB variation in thresholds between test and retest for hearing assessments is generally considered a sub-clinical difference. This is because a pure-tone threshold measurement at a single frequency has a 90% chance of being repeated within ±10 dB compared to the first measurement, assuming no real change in hearing thresholds has occurred. 23 Likewise, the Occupational Safety and Health Administration (OSHA) defines a standard threshold shift as a change in hearing threshold relative to the baseline audiogram of an average of 10 dB or more at 2000, 3000 and 4000 Hz in either ear, with correction for age relative to the baseline audiogram. 24 Using this metric, all studies reported an accuracy of >80% for MA in identifying hearing loss >30 dB HL, except for one study that reported accuracy of 67%. 15
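The two criteria described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and threshold values are hypothetical, the 10 dB boundary is treated as inclusive (an assumption), and the standard threshold shift check omits the age correction.

```python
STS_FREQS_HZ = (2000, 3000, 4000)


def fraction_within(ma, ca, tolerance_db=10):
    """Fraction of paired thresholds where MA is within tolerance of CA."""
    if len(ma) != len(ca) or not ma:
        raise ValueError("ma and ca must be non-empty and equal in length")
    hits = sum(1 for m, c in zip(ma, ca) if abs(m - c) <= tolerance_db)
    return hits / len(ma)


def is_standard_threshold_shift(baseline, current):
    """True if the mean shift at 2000, 3000 and 4000 Hz is 10 dB or more.

    baseline and current map frequency (Hz) to threshold (dB HL) for
    one ear. No age correction is applied in this sketch.
    """
    shifts = [current[f] - baseline[f] for f in STS_FREQS_HZ]
    return sum(shifts) / len(shifts) >= 10


# Hypothetical paired thresholds (dB HL) for four test frequencies.
ma = [20, 35, 50, 15]
ca = [25, 20, 45, 15]
print(fraction_within(ma, ca))  # 0.75

# Hypothetical baseline and follow-up audiograms for one ear.
baseline = {2000: 10, 3000: 15, 4000: 20}
follow_up = {2000: 20, 3000: 25, 4000: 35}
print(is_standard_threshold_shift(baseline, follow_up))  # True
```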

| Risk of bias in studies
A summary of the authors' assessment of risk of bias is presented in Table 2 and a detailed assessment is presented in Appendix S2.
Overall, the risk of bias in included studies was unclear or high. Two studies presented a low risk of bias in all four domains. 12,22 [...] 16,21 Seven studies presented an unclear or high risk of bias in two domains, 6,7,9-11,13,17,19 and one study in all four domains. 20

The most common area of potential risk of bias was due to unclear reporting of the methods of participant selection (14/17), 6-11,13,15,17-21 with one study assessed as having a high risk of bias. 18 There was an unclear risk of bias in the index test domain in three studies. In these studies, MA was conducted after PTA with the results known or it was unclear whether the assessor was blinded. 16,17,20 An unclear risk of bias was presented in one study as it was unclear whether the CA results were interpreted without knowledge of the MA results. 20

There was a high risk of bias in the 'flow and timing' of four studies as not all the thresholds for all the participants were included in the analyses. In one study, 17 out of 300 MA thresholds were not reported as responses could not be obtained at the maximum intensities with MA but were obtained with CA. 13 Two participants with confirmed profound hearing loss were excluded from analysis in one study due to more than six absent responses during MA testing. 11 Out of range MA measurements were discarded from further analysis in one study, equating to: 13 of 980 (1.3%), 14 of 966 (1.4%) and 10 of 980 (1.0%) in the case of test, retest and pure-tone audiometry, respectively. 10 Finally, one study only reported 249 out of a possible 276 MA thresholds. 20

Table 2. Authors' assessment of risk of bias.

Overall, concerns for wider applicability were low. [...] 20 Participants were compensated for taking part in two studies, introducing a risk of participant self-selection bias. 18,21 One study only included participants with a mobile phone, which may have skewed participants to a younger population. 10
A sensitivity analysis using the upper and lower bounds of the ICC (0.07, 0.

| Reporting biases
Eikelboom et al. had a low risk of bias due to missing results, as they excluded participants who did not complete the automated test or had poor test reliability. 8 Yeung et al. had a low risk of bias as they excluded only one participant due to earbuds not fitting. 22 Masalski et al. had a low risk of bias as they discarded a small proportion of out-of-range measurements from their analysis. 10 Lubner et al. had a high risk of bias due to missing results, as they did not report values for several air conduction thresholds without a clear justification. 20

| Certainty of evidence
Overall, the results suggest that the studies generally had an unclear or high risk of bias, and low applicability concerns, in most domains.
However, there were a few studies with an unclear or high risk of bias in some domains and high applicability concerns. This should be considered when interpreting the results of the meta-analysis.

| Principal findings
This systematic review aimed to identify MA options available to health providers and compare their threshold measurements to conventional audiometry (the current 'gold standard'). This review found that there were several MA devices capable of measuring hearing thresholds. The most frequently used was Shoebox™. MA has been used for a variety of clinical indications including assessing patients with suspected sudden sensorineural hearing loss, Extended-High Frequency (EHF) hearing loss, occupational hearing loss and children with otitis media with effusion.
Although MA was most frequently completed in a quiet room with ambient noise monitoring, these rooms were located in multiple locations, including an emergency department, a manufacturing site, an audiology department and in patients' homes.This highlights the utility of MA in widening access to hearing assessment to areas where CA is not readily available, and the possibility of using MA for remote diagnosis and follow-up in non-specialty locations.

| Potential impact of youth and cognitive ability on results
Overall, both children and adults were able to understand and follow instructions for completing automated hearing testing. Two studies were outliers in significantly overestimating hearing loss overall. 9,22 Both studies included children as the only participants. Ambient noise levels were measured whilst the clinic room was empty, but the actual study environment was potentially significantly louder as other adults were in the room during testing. Pereira et al. was the third study that solely included children as participants and studied Shoebox™. Participants in this study were divided into three groups based on age: 6 years, 7-9 years and 10-12 years. Investigators found that there may be an age effect for younger children (<7 years), with 52% of thresholds falling within 10 dB in 6-year-olds, but 75% for 7-9- and 10-12-year-olds; this difference was statistically significant (χ²(2) = 19.3, p < .05). 15 MA thresholds were within 10 dB of CA thresholds in 68% of cases for children in this study.
The investigators hypothesised that younger children may not have fully comprehended the instructions as they did not yet have the cognitive maturity to do so. This differs from reported accuracy in adults where over 80% of thresholds fell within 10 dB HL. 7,17,20 Inattention may be a factor affecting the results in children. Three studies 15,17,22 utilised SHOEBOX™, which incorporates an animated, game-based, forced-choice testing technique validated for use in children aged 3 years and older. 25 The youngest participants in two studies were aged 12 and 16 years, respectively, suggesting the impact of inattention may not be as significant compared with the younger children included in the previously mentioned studies. 11,13 In the final study, 9 it was unclear how self-testing was administered for children and how much support the audiologist provided during testing; nevertheless, the authors hypothesised that the young age of the participants may have impacted their findings.
These findings raise a concern that MA may be unreliable in young children, particularly those under 7 years of age. Overestimation of hearing loss with MA could lead to the unintended consequence of over-referral to audiology clinics. The findings also raise an important question as to whether automated MA is appropriate in other patient populations with impaired cognitive function, such as in patients with dementia, and participants prone to inattention.
Appropriate patient selection for MA is essential to ensure patients benefit.

| Calibration and ambient noise
Calibration is a critical feature in audiometry as the sound level emitted may differ between devices, and could also be affected by the transducers (headphones/earphones) used, which could have an impact on the results. Despite the increasing use of consumer devices and transducers for hearing threshold assessment, there is currently no universally accepted or recommended method for calibrating them. This lack of standardisation likely contributes to the varying approaches to calibration observed in the studies included in the analysis. Interestingly, our analysis found no significant difference between MA and CA thresholds in studies that used calibrated devices and those that did not.

Ambient noise appeared to impact low frequency thresholds in four studies. Using Shoebox™, Yeung et al. found significantly elevated thresholds at 500 Hz, with a similar phenomenon observed to a milder degree in two other studies. 17,20 Whitton et al. found that low frequency thresholds (≤250 Hz) were slightly but consistently elevated when measured at home. 6 [...] 28 One study employed noise cancellation combining passive (insert headphones and circum-aural muffs) and active (circum-aural muffs) approaches, which the authors stated resulted in passive attenuation for high-frequency noise and active management of low-frequency noise. 14 It is important to note that while ambient noise measurement and a quiet room may be sufficient for performing MA, there is still a need for standardised calibration procedures to ensure accurate and reliable results. Adding remote calibration functionality and real-time ambient noise monitoring to MA devices could potentially improve the accuracy of frequency threshold measurements, but further research is needed to determine the effectiveness of these approaches.

| Accuracy of mobile audiometry in threshold assessment
The mean difference between MA and CA thresholds of 1.36 dB is not a clinically significant finding, and although there was a statistically significant trend for MA to overestimate hearing loss in children, overall the observed difference would be considered sub-clinical (8.44 dB). It is important to note that there are scenarios where a difference of less than 10 dB may be clinically significant, such as in young children who are at a key stage of speech and language development. However, their threshold measures are more variable, so bigger differences are more likely to be observed.
The ability of MA to detect mixed and conductive hearing loss appears to be limited, with only one study reporting assessment of bone conduction thresholds. 8 Earlier research with the same device indicated bone conduction threshold measurement may be less precise than air conduction threshold measurement due to issues related to conductor placement. 29 Self-administered bone conduction assessment may be more difficult from a technical standpoint, which could be one reason for the limitations in MA's ability to detect mixed and conductive hearing loss. Although not as reliable as the gold-standard method with a trained administrator, it is still able to address the clinical question of whether mixed or conductive hearing loss is present. However, appropriate support is needed to ensure accurate placement, which may limit the feasibility of routine bone conduction threshold assessment using MA.

| Context of other evidence
This review found that the accuracy of MA threshold measurements appears to vary with age, with younger children tending to have less accurate results. Interestingly, Chen et al. reported similar findings in their meta-analysis of mobile hearing screeners (pass-fail devices only). 30 These findings suggest that it may be prudent to limit the use of MA to older children and adults until there is more evidence for use in young children.
Mahomed et al. conducted a systematic review and meta-analysis of automated threshold audiometry and found a mean difference between manual and automated air conduction audiometry of 0.4 dB (6.1 SD), a similar finding to this review. 31 Automated audiometry that can be fully self-administered has the potential to improve the accessibility of accurate hearing threshold assessment without the need for a trained observer. However, based on the findings of this review, the reliability of self-administered MA in young children may be limited. Therefore, the current utility of this technology in this age group is unclear. Further research is needed to evaluate the effectiveness and reliability of self-administered MA in young children, and to determine whether additional support or modifications to the technology may be necessary to improve the accuracy of its measurements.

| Limitations of evidence in review
Despite exhibiting good overall accuracy, multiple studies reveal considerable variability in MA performance at the individual patient level.
This variability is evidenced by substantial heterogeneity across the studies and a wide standard error in many instances. The heterogeneity of the studies included the participant characteristics, equipment used, calibration methods and testing environments. While the use of a standardised outcome measure (dB HL) enabled a meta-analysis, the methodological differences between studies may have impacted the review's overall findings.
Another limitation is the varying outcomes reported by the studies, which highlights the need for standardised reporting in future studies to enable meaningful comparisons and meta-analysis. The authors recommend that at a minimum, all future studies should report the mean and SD for MA and CA for each recorded frequency, the mean difference at each frequency with SD, and the overall mean difference with SD to support future meta-analysis.
Additionally, many of the included studies had an unclear or high risk of bias, which could potentially impact the accuracy of the review's conclusions. These limitations suggest that further research is needed to understand the benefits and limitations of MA for hearing threshold assessment more clearly.
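The minimum reporting set recommended above could be produced from paired threshold data along the following lines. This is a minimal sketch: the data layout (a dictionary from frequency in Hz to a list of thresholds in dB HL, paired by position), the function name and the example values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def summarise(ma, ca):
    """Per-frequency and overall summary of paired MA and CA thresholds.

    ma and ca map frequency (Hz) to lists of thresholds (dB HL); the
    i-th value in each list is assumed to come from the same ear.
    """
    per_freq = {}
    all_diffs = []
    for freq in sorted(ma):
        diffs = [m - c for m, c in zip(ma[freq], ca[freq])]
        all_diffs.extend(diffs)
        per_freq[freq] = {
            "ma_mean": mean(ma[freq]),
            "ma_sd": stdev(ma[freq]),
            "ca_mean": mean(ca[freq]),
            "ca_sd": stdev(ca[freq]),
            "diff_mean": mean(diffs),
            "diff_sd": stdev(diffs),
        }
    overall = {"diff_mean": mean(all_diffs), "diff_sd": stdev(all_diffs)}
    return per_freq, overall


# Hypothetical thresholds for three ears at two frequencies.
ma = {500: [20, 30, 40], 1000: [10, 20, 30]}
ca = {500: [15, 30, 35], 1000: [10, 15, 30]}
per_freq, overall = summarise(ma, ca)
for freq, row in per_freq.items():
    print(freq, {k: round(v, 2) for k, v in row.items()})
print("overall", {k: round(v, 2) for k, v in overall.items()})
```

Reporting the SD of the paired differences, rather than only the SDs of MA and CA separately, matters because the standard error of a mean difference cannot be recovered from the separate SDs without knowing the correlation between the two measurements.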

| Limitations of review process
The review relied solely on published data, and attempts to obtain individual participant data for a more comprehensive analysis were largely unsuccessful. Furthermore, all meta-analyses are subject to publication bias. In this review, studies showing favourable results of MA may have been more likely to be published than studies showing an unfavourable comparison to CA. This bias can lead to overestimation of the treatment effect and affect the validity of the meta-analysis. These limitations should be considered when interpreting the findings in this review.

| Implications for practice, policy and future research
Audiometric testing is a fundamental part of an ear and hearing assessment. CA relies on the presence of a trained audiologist, clinical equipment and a sound booth. This requires a dedicated space that may not always be available, for example in a busy emergency department, and is comparatively costly (cost of training, equipment, estate costs, etc.), limiting access for patients in low-income or rural settings.
MA could be used in low-resource areas to widen access to specialist hearing healthcare, to assist in areas of high demand, and where service redesign is being considered to optimise service provision using telemedicine. Successful pilot studies in the NHS using MA in GP practices and in dedicated community audiology clinics point towards a future where hearing healthcare can increasingly be provided in the community, reducing healthcare costs and the carbon emissions associated with delivering healthcare. 32,33 This review has found that MA can accurately determine hearing thresholds in adults and children when conducted in a quiet room or a sound booth, using self-administered or patient-supported testing.
Studies included in this review were found to have an unclear or high risk of bias, which could impact on the review's findings. Concerns for wider applicability were low, as evidenced by the variety of testing locations. A quiet room appears to be an appropriate venue in which to conduct MA assessments, although testing in home environments, with different headphones and potential noise, is not adequately represented in many of the studies. Therefore, there may be concerns regarding the broader applicability of these findings.
Recent advances in remote audiological assessment have expanded the scope beyond pure tone assessment. Studies have explored remote assessment of supra-threshold hearing deficits, 34 and speech recognition via internet tests and the telephone. 35,36 Based on this review's findings, future work should prioritise combining pure tone threshold assessment with other tests, such as Speech Recognition and Digits-in-Noise, for a more rounded evaluation of hearing ability. Head-to-head comparative studies may be used to identify the most accurate technology, or highlight whether some applications perform better in some settings than others. Furthermore, a cost-effectiveness analysis would assess the economic impact of incorporating MA into existing clinical workstreams.

| CONCLUSION
This study provides a concise overview of the current mobile audiometry options and offers empirical evidence demonstrating the accuracy of MA for hearing threshold assessment. Overall, while MA shows promise as a tool for hearing assessment, its limitations in detecting mixed and conductive hearing loss and potential for overestimating hearing loss in children suggest that it should not be used as a replacement for traditional CA in all situations. However, in certain contexts such as remote or low-resource settings, where CA may not be feasible or accessible, MA may provide a valuable alternative for screening purposes. Further research is needed to better understand the optimal use of MA and to improve its accuracy and reliability.

KEYWORDS: audiometry, hearing loss, hearing test, meta-analysis, mobile application, mobile devices, tablet computer

| INTRODUCTION

| Rationale
Mobile audiometry (MA) involves the use of tablet computers and smartphones, with associated application software, to assess hearing.

Key Points
1. Mobile audiometry, which involves using tablets and smartphones, has typically been utilised for hearing screening purposes.
2. Technological advances have enabled the use of mobile audiometry to measure frequency-specific hearing thresholds.
3. Mobile audiometry is accurate, well-tolerated and has been used in a variety of settings.
4. There is a tendency for mobile audiometry to overestimate the degree of hearing loss in children; however, this difference is below the clinically relevant threshold of 10 dB.
5. Future research should explore the integration of pure-tone hearing assessment with additional audiometric tests to achieve a more comprehensive evaluation of hearing function.
Durgut et al. studied Hearing Test™ in children aged 5-14 years (mean age 8.2 years) and found no statistical correlation between MA and CA, suggesting that environmental noise and child participants may have affected the results. 9 Yeung et al. studied Shoebox™ in children 5-17 years (mean 9.5 years) and hypothesised that the adult-sized, consumer-level headphones used in children may have impacted the level of ambient noise and hence affected the lower frequencies.
Characteristics of included studies.