Consumer sleep technology for the screening of obstructive sleep apnea and snoring: current status and a protocol for a systematic review and meta‐analysis of diagnostic test accuracy

There are concerns about the validation and accuracy of currently available consumer sleep technology for sleep‐disordered breathing. The present report provides a background review of existing consumer sleep technologies and discloses the methods and procedures for a systematic review and meta‐analysis of diagnostic test accuracy of these devices and apps for the detection of obstructive sleep apnea and snoring in comparison with polysomnography. The search will be performed in four databases (PubMed, Scopus, Web of Science, and the Cochrane Library). Studies will be selected in two steps, first by an analysis of abstracts followed by full‐text analysis, and two independent reviewers will perform both phases. Primary outcomes include apnea–hypopnea index, respiratory disturbance index, respiratory event index, oxygen desaturation index, and snoring duration for both index and reference tests, as well as the number of true positives, false positives, true negatives, and false negatives for each threshold, as well as for epoch‐by‐epoch and event‐by‐event results, which will be considered for the calculation of surrogate measures (including sensitivity, specificity, and accuracy). Diagnostic test accuracy meta‐analyses will be performed using the Chu and Cole bivariate binomial model. Mean difference meta‐analysis will be performed for continuous outcomes using the DerSimonian and Laird random‐effects model. Analyses will be performed independently for each outcome. Subgroup and sensitivity analyses will evaluate the effects of the types (wearables, nearables, bed sensors, smartphone applications), technologies (e.g., oximeter, microphone, arterial tonometry, accelerometer), the role of manufacturers, and the representativeness of the samples.


1 | INTRODUCTION
Sleep-disordered breathing (SDB), especially obstructive sleep apnea (OSA), is extremely prevalent and has been estimated to affect up to 1 billion adults worldwide (Benjafield et al., 2019). The traditional diagnostic approach has been overnight polysomnography (PSG) in a sleep laboratory, although there has been an increasing focus in recent years on ambulatory studies, often using simplified recording techniques (Arnardottir, Islind, & Óskarsdóttir, 2021). Furthermore, there has been increasing interest in consumer sleep technology (CST), which refers to any kind of technology or equipment (usually wearables, nearables, bed sensors, or mobile applications running on smartphones [apps]) that is marketed directly to consumers, without a need for prescription by a health professional, allowing individuals to self-monitor or track their sleep, or to manage or improve a certain sleep-related condition (Khosla et al., 2018; Schutte-Rodin et al., 2021). A large portion of these technologies have not been designed with clinical use in mind (Schmitz et al., 2022). Several CSTs are related to the screening of SDB, and the number of devices and apps available for this purpose has been increasing consistently in recent years.
These CSTs are usually directed at detecting either snoring or OSA, and their sensors and presentations vary considerably (O'Mahony, Garvey, & McNicholas, 2020). The variables they evaluate include pulse oximetry, heart rate and heart rate variability, respiratory rate, breathing-related sounds, and body movements, among others (Perez-Pozuelo et al., 2020). The present paper provides a background review of existing CST devices related to OSA and snoring, and proposes a protocol for a systematic review and meta-analysis of existing CST devices regarding their diagnostic test accuracy (DTA).
2 | BACKGROUND

2.1 | The rise of CST for SDB screening
The increasing use of innovative technologies for SDB screening (mainly OSA and snoring) might be explained by several factors, including epidemiological, diagnostic, and commercial aspects. From an epidemiological perspective, OSA is an increasingly prevalent condition, ranging from 9% to 42% (Senaratna et al., 2017) and affecting approximately 936 million people worldwide (Benjafield et al., 2019). From a diagnostic perspective, the 'gold-standard' diagnosis of OSA is an in-laboratory PSG recording, which has important limitations related to its costs, availability, data variability, and patient experience (Box 1). Also, the diagnosis is limited to a few diagnostic metrics, such as the apnea-hypopnea index (AHI) and the respiratory disturbance index (RDI), which often fail to capture all the aspects of OSA severity (Pevernagie et al., 2020). Finally, the market for consumer-oriented sleep products has been growing considerably, reaching approximately 2 billion American dollars and growing 18.5% per year (Nester, 2019).
In short, the high prevalence of OSA and snoring creates the demand for screening and diagnosis, while the limitations of the traditional PSG diagnosis raise the importance of focusing not only on new screening tools, but also on patient or user experience (including comfort, usability, availability, and affordability). These factors led the sleep-related market to grow, which stimulates companies to invest in, innovate, design, develop, and improve sleep technologies marketed directly to consumers, bypassing the role of health professionals, and in some cases even patients, in the screening process.

2.2 | Overview of technologies available
Type-I PSG is considered to be the gold-standard method for OSA diagnosis (Kapur et al., 2017). It consists of an in-laboratory overnight sleep study, which is prepared and monitored by a sleep technologist.
A type-I PSG requires the acquisition and analysis of at least eight biological signals: an electroencephalogram (EEG), electro-oculogram (EOG), chin and legs electromyogram (EMG), airflow signals, respiratory effort, oxygen saturation, body position, and electrocardiogram, as described in the regularly updated American Academy of Sleep Medicine (AASM) manual (Berry et al., 2020).
The scoring of respiratory events in regular type-I PSG depends on two fundamental aspects: (i) sleep staging and (ii) detection and quantification of respiratory events. Sleep staging is important to prevent scoring respiratory events during wakefulness and to assess whether they are associated with a specific sleep stage. In addition, for some respiratory events, the EEG is required to be able to detect associated arousals or sleep fragmentation. Regarding the scoring of respiratory events, current guidelines (Berry et al., 2020) require four different types of sensors: an oronasal thermal sensor, a nasal pressure transducer, pulse oximetry, and some measure of respiratory effort (usually thoracoabdominal belts).
The need for all these sensors, which require expert setup, reinforces the limitations of type-I PSG (as disclosed in Box 1). Therefore, alternative ways to screen for SDB have been used and proposed, including the advent of home sleep apnea tests (HSATs), which encompass sleep study types II-IV. A type-II PSG uses the same sensors and montage as a type-I PSG but is intended to be performed unattended, without real-time supervision by a healthcare professional, therefore allowing the sleep study to be performed outside of a medical facility (Kapur et al., 2017). Although it might represent some improvement in the patient experience in comparison with type-I PSG, the need to maintain all sensors still represents an important limitation.
A common strategy to overcome these issues is reducing the number of sensors, such as in portable cardiorespiratory sleep monitors (including type-III and type-IV sleep studies) (Kapur et al., 2017).
These devices are usually restricted to the monitoring of cardiorespiratory variables, typically not including EEG, EOG, or EMG sensors, and therefore do not allow sleep staging. Although reducing the number of sensors has practical benefits, it might lead to a reduction in diagnostic sensitivity, reducing the ability to rule out OSA (Caples, Anderson, Calero, Howell, & Hashmi, 2021). It also precludes the possibility of evaluating other sleep disorders (such as periodic limb movement disorder), thus impairing a proper differential diagnosis. Therefore, simplified sleep studies are useful in cases of well-grounded clinical suspicion of moderate-to-severe OSA, with no comorbid medical disorders or risk of other sleep disorders (Collop et al., 2007; Kapur et al., 2017). Also, just as for any sleep study, their results alone (i.e., without proper clinical evaluation by a health provider) are not sufficient for diagnosis, evaluation of clinical efficacy, and treatment decisions (Rosen et al., 2018).
CST measurements for OSA and snoring screening also embrace the idea of reducing the number of sensors to the minimum necessary for accurate results. Of note, as CSTs are marketed directly to the consumers, bypassing the role of the medical professional and patients, it is more appropriate to consider them as screening devices, rather than as diagnostic tools, at this particular time. In any case, the evaluation of the accuracy of CSTs in comparison to proper diagnostic tests is important, in order to assure their reliability.
The first widespread CST options for SDB screening were probably smartphone apps for snoring detection. Their technology is simple, especially for the apps that have no intention to diagnose OSA or to correlate snoring with OSA severity, as their functions are usually restricted to the use of a microphone. The apps using a microphone as the main sensor appear to perform well in the detection of snoring and provide stable data with overall good accuracy (Camacho et al., 2015; Chiang et al., 2022; Figueras-Alvarez et al., 2020; Klaus, Stummer, & Ruf, 2021). However, the specificity might be low in a real-world scenario, as the apps might confound snoring from the bed partner, other respiratory sounds from the user, and background noise with actual snoring sounds from the user (Camacho et al., 2015; Stippig, Hübers, & Emerich, 2015).
Although snoring detection has some clinical usefulness (Camacho et al., 2015), many companies have tried to improve the screening capabilities of their CST by estimating OSA based on the snore events. Respiratory sounds and movements have also been used for this purpose, employing more refined data analyses (such as spectral analysis of respiratory sounds or using the smartphone as a sonar for detecting respiratory movements). The respiratory flow or pattern is estimated from these signals, and changes to background patterns are interpreted as possible obstructive events. Although they perform well in some cases, sensitivity and specificity are usually <90%, being as low as 60% in some cases (Nakano et al., 2014; Narayan et al., 2019; Tiron et al., 2020).
BOX 1 Limitations of polysomnography-based diagnosis of sleep disorders.

Availability
PSG beds might not be available in many medical centres, especially outside big cities and in rural areas, as PSG requires specialised healthcare professionals and an adequate laboratory setting. The high prevalence of sleep disorders, combined with the unavailability of sleep medicine centres, increases the likelihood of underdiagnosis of OSA.

Costs
Even when PSG is available, it might not be affordable to many patients. It is usually an expensive medical examination, as its price must encompass costs related to devices, health professionals, and sleep laboratory maintenance. Cost-related concerns also explain the limited availability of PSG in public health systems and healthcare insurance plans.

Data variability
OSA is subject to considerable night-to-night variability. The variation in the AHI is >10 events/h in 65% of the individuals undergoing PSGs on sequential nights (Bittencourt et al., 2001). As the diagnosis of OSA is usually performed with a single-night PSG, there is a risk of misclassification due to data variability, which might affect diagnosis, treatment, and prevalence estimates.

Limitations related to manual analysis
Manual analysis of a PSG recording is still the 'gold standard' method to analyse and score it. It requires a sleep technologist to review the whole recording to perform sleep staging and to score other sleep-associated events (e.g., respiratory events, arousals, leg movements). This process has three main limitations:
1. Time-consuming: the manual analysis of a PSG usually requires approximately 1.5 h of work from a sleep technologist (Fischer et al., 2012).
2. Prone to human error: although good agreement rates among experienced sleep technologists have been reported (Kuna et al., 2013; Lee et al., 2022; Magalang et al., 2016), the manual analysis might be subject to a significant amount of imprecision, especially among inexperienced scorers.
3. Costs: the need for sleep technologists to score the PSG increases its costs, contributing to its limited affordability.
These limitations could be overcome by improved semi-automatic analysis or by automatic algorithms (as used by many wearable/nearable devices).

Patient experience
Sleeping at a laboratory under constant monitoring might be an uncomfortable experience for many patients. Among the several aspects that might impair the patient experience while undergoing a PSG are:
1. Sleeping outside their own room, with a bed and pillows they are not used to.
2. Being subjected to environmental conditions different from those they are familiar with (including light, noise, and companion).
3. Being unable to follow their usual pre-sleep routine.
4. Going to bed and waking up at different times than usual.
5. Dealing with the discomfort that the PSG devices might cause.
6. Being monitored by healthcare professionals at a medical facility.
All these conditions might lead to altered sleep patterns, which are caused by environmental conditions rather than by a sleep disorder. These effects are especially observed in a first PSG ('first-night effect', Ding, Chen, Dai, & Li, 2022), contributing to the data variability often seen in PSGs.

Several other physiological measurements are currently being used for the portable assessment of OSA, some of them being included in CSTs. These include ultrasound and radiofrequency sensors, airflow analysis, pulse oximetry, arterial tonometry, photoplethysmography, and heart rate variability, among others (Behar, Roebuck, Domingos, Gederi, & Clifford, 2013; Penzel, Dietz-Terjung, Woehrle, & Schöbel, 2021; Uddin, Chow, & Su, 2018). Airflow and pulse oximetry seem to be the most logical variables to be analysed in CST OSA monitoring (Uddin et al., 2018), as they are more closely related to OSA pathophysiology. While devices based on direct airflow analyses are not often seen, oximetry-based analyses became more common with the advent of fitness trackers, smartwatches, and rings. The oximeters embedded into wearable devices seem to be accurate in detecting hypoxia in multiple conditions, including during daily life activities, during sleep, and in experimentally induced hypoxia (Marinari et al., 2022; Santos et al., 2022; Zhao et al., 2022).
The incorporation of additional data into pulse oximetry analyses appears to increase the accuracy, with movement, sound, and heart rate being the most commonly used parameters. Taken together, the prevailing view is that single-signal OSA detection is less accurate, being only able to differentiate between the presence or absence of OSA, while multi-signal detection is more accurate, being useful for detecting different levels of disease severity (Uddin et al., 2018). However, this might change as technology and data analysis evolve. As an example, a recent study using artificial neural network analysis of oxygen saturation (SpO2) led to a median absolute error in the estimation of the AHI of approximately 1 event/h (Nikkonen, Afara, Leppänen, & Töyräs, 2020).
For consumer-based OSA screening, the most common formats are smartphone apps and wearables. The usefulness of smartphone apps depends on a combination of the apps' characteristics (and the algorithms embedded in them) and the sensors embedded in the smartphone (or tablet), whose quality varies depending on the model. There are hundreds of smartphone apps available for OSA detection, but only approximately 3% of them provide proper validation studies (Baptista et al., 2022). A recent meta-analysis (Kim, Kim, & Hwang, 2022) (Rosa, Bellardi, Viana, Ma, & Capasso, 2018). Another characteristic of these devices is the improvement of the sensors used, both in their technology and the position where they are located. Such innovation seems to arise from a concomitant concern related to inventiveness, patentability, and diagnostic accuracy.
Regarding the position of these devices, fingertip oximeters and rings are among the most common (Gu et al., 2020; Zhao et al., 2022), but they also include devices based on neck collars, mandibular movement monitors (Pepin et al., 2022; Pépin et al., 2020), and surface acoustic wave sensors (Jin et al., 2017).

2.3 | Problems and concerns regarding CST for SDB screening
Although a user-centred approach has benefits, there are several concerns regarding the validation and accuracy of CSTs. The first concern regards the reduction in the number of sensors. (Khosla et al., 2018; Schutte-Rodin et al., 2021). The lack of standards on the validation, proposal, and registration of CSTs causes a large accuracy variability, as well as uncertainties about their actual usefulness (Baptista et al., 2022; Fino & Mazzetti, 2019). It has been argued that some CSTs perform poorly in clinical samples in comparison to healthy populations (Baron et al., 2018). Considering all these problems, limitations, and uncertainties, a comprehensive reassessment of the accuracy of CSTs for the screening of OSA and snoring is needed, and it could be achieved by means of a systematic review and meta-analysis. This approach would help to understand the actual accuracy of new CSTs, and to identify which sensors and outcome variables are the most suitable for proper screening of OSA and snoring. Therefore, the present protocol discloses the methods and procedures for a systematic review and meta-analysis of DTA of CST for the screening of OSA and snoring.

| Reporting and registration standards
This protocol was prepared according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol extension (PRISMA-P) (Moher et al., 2015) and was registered at the International Prospective Register of Systematic Reviews (PROSPERO: CRD42022362186). The final report will be written according to the PRISMA-DTA (Salameh et al., 2020).

| Research question and basic definitions
The basic definitions of the included articles are specified according to the PI(R)T strategy (Participants, Index test, Reference test, Target condition), an adapted version of PICO (Population, Intervention, Comparison, Outcome) for DTA studies (Leeflang, Davenport, & Bossuyt, 2022). These four strategy items are defined below and more details on how each of these items will be addressed and analysed can be found in the 'Inclusion and exclusion criteria' section.
• Participants: individuals aged ≥18 years, regardless of diagnosis or suspicion of OSA, other sleep disorders, and co-morbidities.
• Index test: consumer-based technology (including devices or apps) for screening of OSA and/or snoring according to the classification proposed by the AASM (Schutte-Rodin et al., 2021).
• Reference test: full night type-I or type-II PSG, performed according to the AASM recommendations (Berry et al., 2020) or equivalent guidelines.
Based on these definitions, a list of PI(R)T questions was prepared by combining the target conditions and index tests of interest, considering two different meta-analytical approaches: DTA and mean difference meta-analyses (as explained in the 'Data synthesis and analyses' section). The list of PI(R)T research questions is presented in Table 1.
Of note, although we acknowledge that CSTs cannot be considered proper diagnostic tools, but rather OSA screening devices, we prefer to keep using the term 'diagnosis' on the research questions and on the statistical analyses. 'DTA meta-analysis' (DTAMA) is an established meta-analytical approach that has been used whenever the performance of an index and a reference test are compared.
The same is true for other statistical terms used throughout the manuscript, including 'diagnostic threshold' and 'diagnostic odds ratio' (DOR).
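As a brief reminder of how the DOR relates to the counts of a 2 × 2 table, the standard definition can be written as below. The symbols TP, FP, TN, and FN (true positives, false positives, true negatives, and false negatives), Se (sensitivity), and Sp (specificity) are notational assumptions introduced here for illustration and are not taken from the protocol.

```latex
\[
\mathrm{DOR} = \frac{TP \times TN}{FP \times FN}
             = \frac{Se / (1 - Se)}{(1 - Sp) / Sp},
\qquad
Se = \frac{TP}{TP + FN},
\quad
Sp = \frac{TN}{TN + FP}
\]
```

A DOR of 1 indicates that the index test does not discriminate between individuals with and without the target condition, and higher values indicate better discrimination.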

| Search strategy
A bibliographic search will be performed in four different databases: PubMed, Scopus, Web of Science, and the Cochrane Library. The primary search strategy was developed for PubMed and will be adapted to the syntax and search engines of the other databases.
Notes to Table 1: All research questions in the 1.1 level will consider six independent diagnostic thresholds (AHI or RDI ≥5, ≥15, and ≥30 events/h). Although merged into the research questions, epoch-by-epoch and event-by-event analyses will be performed independently whenever possible.
information on the 'Analysis plan, subgroup analysis, and sensitivity analysis' section). Regarding searching grey literature, which encompasses literature available out of the main databases and often in nonfinal format, we will search ClinicalTrials.gov and Google Scholar (first 200 records).
All sources of data and the subsequent procedures, including data screening, extraction, and analysis, are disclosed in Figure 1.

| Evaluation and selection strategies
The records retrieved from the search in the four primary databases (PubMed, Scopus, Web of Science, and the Cochrane Library) will be exported to Covidence, where deduplication and eligibility analysis will be performed. Duplicated records will be automatically excluded.
Two independent reviewers will evaluate all non-duplicated records in a two-step process. The first step consists of reviewing titles and abstracts only, while the second encompasses full-text analysis. In each step, an article is considered eligible or not based on agreement between the two reviewers' decisions. Disagreements between reviewers will be resolved by consensus and, if the discordance persists, a third reviewer (GNP) will be consulted. Both reviewing steps will be conducted considering the inclusion and exclusion criteria disclosed below.
At the beginning of the evaluation process, a calibration round of the eligibility analysis will be performed, and the agreement between each reviewer and a senior reviewer (GNP) will be measured. Both the senior reviewer and each of the independent reviewers will analyse the titles and abstracts of 200 records.
Cohen's kappa between each reviewer and the senior reviewer will be calculated, and the analysis will continue only if a kappa value of ≥0.8 is reached (considered as a strong agreement). If this threshold is not reached, meetings to discuss the inclusion and exclusion criteria will be held and a new calibration round will be performed, until a kappa of ≥0.8 is achieved.
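The calibration criterion above can be illustrated with a minimal sketch, assuming that each reviewer's title-and-abstract decisions are stored as a list of labels in the same record order. The function names `cohens_kappa` and `calibration_passed` are hypothetical and are not part of Covidence or of the protocol itself.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Observed agreement: proportion of records with identical decisions.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def calibration_passed(reviewer, senior, threshold=0.8):
    """True when the reviewer reaches the kappa threshold against the senior reviewer."""
    return cohens_kappa(reviewer, senior) >= threshold
```

In this sketch, a reviewer whose kappa against the senior reviewer falls below 0.8 would trigger a new calibration round, as described in the protocol.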

| Inclusion and exclusion criteria
Only studies evaluating the accuracy of consumer-based technologies related to OSA and snoring screening will be considered eligible, based on smartphone apps, wearables, nearables, or bed sensors.
There will be no restrictions regarding publication date and language.
The authors are able to handle records published in English, Portuguese, Spanish, Finnish, and Icelandic. If studies in other languages are retrieved, Google Translate will be used for abstract screening and native speakers will be contacted for assistance with full-text analysis.
FIGURE 1 Flow diagram disclosing the sources of data, screening procedures, and data analysis. Both title and abstract screening and full-text analysis will be performed by two independent reviewers. The meta-analyses include both diagnostic test accuracy and mean-difference analyses (including the respective subgroup and sensitivity analyses). CST, consumer sleep technology; GRADE, Grading of Recommendations Assessment, Development and Evaluation.

The eligibility analysis will be based on the criteria below:

• Abstracts
Inclusion: articles that present an abstract, regardless of the language.
Exclusion: articles that do not present an abstract.
• Source type
Inclusion: full original/primary studies, reports, and datasets. Studies published in non-peer-reviewed sources will also be considered eligible if the data collection and analysis are complete and precisely described. Incomplete data sources (e.g., congress abstracts, patents, and protocols) will be considered for the systematic review, but not for meta-analyses. In these cases, the authors will be contacted and asked about the availability of the full study report or a complete dataset. In the case of redundant publications (i.e., secondary or derivative studies coming from the same dataset or source study), only the one with the largest sample size will be considered eligible.
Exclusion: theoretical articles (including editorials, narrative reviews, and letters to the editors), systematic reviews, meta-analyses, and meta-epidemiological studies.

• Participants
Inclusion: individuals aged ≥18 years, with no restriction regarding sex and presence of concurrent sleep disorders or comorbidities. Studies encompassing paediatric samples will be considered eligible if it is possible to extract data from a subgroup of those aged ≥18 years in an independent and unbiased way.
Exclusion: studies with a sample of participants aged <18 years, or in cases in which it is not possible to dissociate a subgroup of those aged ≥18 years.
• Index test Intervention studies in which the participants are subject to any type of intervention (including for OSA treatment, e.g., continuous positive airway pressure or intraoral devices).

| Data extraction
Data extraction will be performed using Covidence by two independent reviewers and checked by a senior reviewer (GNP). Before actual data extraction, both reviewers will undergo a training session with the senior reviewer. Disagreements in the data extraction between the two reviewers will be resolved by consensus and, if discordance persists, the senior reviewer (GNP) will be consulted. If the senior reviewer is not able to resolve the discordance, the article's authors might be contacted.
Numeric outcome data will be extracted as mean ± standard deviation (SD). When a study discloses the standard error of the mean (SEM) rather than SD, SD will be calculated by multiplying the SEM by the square root of the sample size. When it is not possible to determine if data dispersion is displayed in SEM or SD, we will assume them as SD.
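The conversion described above corresponds to SD = SEM × √n. A minimal sketch follows, assuming `n` is the sample size on which the reported SEM was based; the function name `sd_from_sem` is hypothetical and used only for illustration.

```python
import math


def sd_from_sem(sem, n):
    """Convert a standard error of the mean (SEM) into a standard deviation (SD)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    if sem < 0:
        raise ValueError("SEM cannot be negative")
    # SEM = SD / sqrt(n), therefore SD = SEM * sqrt(n).
    return sem * math.sqrt(n)
```

For example, a study reporting an SEM of 1.5 events/h in 36 participants would be extracted with an SD of 9 events/h.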
The only mandatory items for the analyses are metadata, sample size, type of device/app, and at least one main outcome. When needed, data will be extracted from graphs using a digital ruler (Plot Digitizer, plotdigitizer.sourceforge.net/). In case of missing data not extractable from charts or doubts regarding any specific result or methodological aspect, authors will be contacted and will be asked to provide information about their protocol, results, or raw data. Two contact attempts will be made and if no successful response is obtained, the article might be excluded from the sample (when they fail to provide one of the mandatory items) or from a specific subgroup analysis.
In the case of longitudinal studies, data from all available nights will be extracted, provided that both the reference and the index tests were used concomitantly. The unit of analysis in this systematic review is not the articles, but studies within the articles. Therefore, when an article has two or more separate and fully independent studies, data may be extracted more than once from the same article. In studies evaluating two or more index tests, each of them will be considered as a separate study, even considering that they are compared with the same reference test group (data corrections might be needed for continuous outcomes).
The following variables will be extracted:

Descriptive information
Metadata: full reference string, including first author, title, publication year, and publication source (journal).
Country: defined as the country in which the sample was recruited.
In international multicentric studies, the proportion of participants from each country will be extracted. In these cases, and for descriptive purposes, the study will be considered as belonging to the country that contributed the most to the sample.

Participants
Sample size: considering only the final sample, composed of those participants subjected to both tests.
In the case of 'both', the proportion of each sex in the final sample will be extracted.
Age: both the recruitment age range and the mean age (±SD) will be extracted.
Body mass index.
Self-reported ethnicity and Fitzpatrick skin colour scale. The proportion of each ethnicity and phototype will be extracted, whenever available.
Exclusion criteria: every criterion used to consider a potential participant as ineligible.
Concurrent health conditions: any health condition or descriptive characteristic considered as part of the population description in a study (i.e., only individuals presenting a specific disease were included). For example, studies evaluating the accuracy of OSA apps among individuals with hypertension or with morbid obesity.
Pre-test assessment of OSA risk: average score and the number of individuals considered as having a high risk of OSA according to screening questionnaires for OSA risk (Małolepsza et al., 2021), namely the Berlin (Chung et al., 2008b), the STOP-BANG (Chung et al., 2008a), or the NoSAS (Marti-Soler et al., 2016) questionnaires.

Study design and description
Sample representativeness: population-based study, probabilistic sampling, non-clinical convenience sample, or clinical convenience sample.

Index test:
Commercial name (when available).

Manufacturer (when available).
Version of the device/app (when available).
Type of device/app: it could be filled as 'wearable', 'nearable', 'bed sensors', 'app', or 'other'. Wearable devices are defined as any device that is worn or used in close and conditional contact with the participant's body. For technologies considered wearables, additional information about their mode of use and presentation will be extracted (e.g., ring, wristband, smartwatch, fitness tracker, headband, or chest belt). Nearables are defined as any device positioned close to the participant but with no contact with their body, being neither worn nor part of the bed and bed linen. For devices considered as nearables, information regarding their position will be extracted (e.g., on the nightstand, below the bed, or the headboard). Bed sensor devices include all devices integrated into the linen and other fabric materials that are not worn by the participants, but that compose the sleeping environment. Devices considered bed sensors will be categorised according to their use (e.g., mattresses, pillows and pillowcases, linen, or blanket).
Apps are defined as software integrated and conditionally used in a smartphone or tablet. Although it might be seen as a nearable (as the smartphone or tablet should be positioned near the individual), its main feature is the software, while the nearables have the hardware as their main feature. Any other device will be categorised as 'other', and a further explanation will be provided. New categories not foreseen in this protocol might be considered depending on the characteristics of the retrieved studies.
Variables detected to measure OSA or snoring: any physiological or environmental variable used to detect OSA and snoring (and to measure sleep time or sleep stage when those data are integrated into the detecting algorithm). Examples include oxygen saturation, heart rate variability, respiratory rate, body temperature, movements, sound, and chest movements.
Equipment used to measure the intended variables: the embedded technology or sensors that are primarily responsible for data acquisition. Examples include an oximeter, microphone, arterial tonometry, and accelerometer.
Diagnostic threshold: any diagnostic threshold using any variable to diagnose OSA or categorise its severity. More than one diagnostic threshold can be extracted per study.

Reference test:
Diagnostic criteria for apneas: the exact definition of the event (e.g., ≥90% reduction in oronasal thermal sensor amplitude, lasting at least 10 s) and/or the scoring guideline considered (e.g., the AASM manual 2020).
Diagnostic criteria for hypopneas: the exact definition of the event (e.g., ≥30% reduction in nasal pressure amplitude, lasting at least 10 s and associated with either a ≥3% desaturation or an arousal) and/or the scoring guideline considered (e.g., the AASM manual version 2.6 recommended criteria).
Diagnostic threshold: any diagnostic threshold using any variable to diagnose OSA and snoring, or categorise its severity. More than one diagnostic threshold can be extracted per study. For the PSG-based diagnosis, the expected thresholds include AHI or RDI of ≥5, ≥15, or ≥30 events/h, but less commonly used thresholds (such as AHI ≥20 events/h) or diagnoses based on variables other than AHI or RDI will also be extracted.
PSG scoring approach: although the AASM recommendations require manual scoring, the use of automatic approaches is becoming increasingly common. The PSG scoring approach will be extracted, being categorised as manual or automatic.
Number of technologists scoring the PSG recording: the accuracy of a diagnostic test depends not only upon the sensitivity and specificity of the index test but also on the precision of the reference test. PSGs are always subject to inter-rater variability (Kuna et al., 2013; Lee, Lee, Cho, & Choi, 2022; Magalang et al., 2016), and the greater it is, the harder it will be for an index test to reach high accuracy. Having more than one technologist scoring each PSG is a strategy to increase diagnostic accuracy within the reference test.
Main outcomes:  (Dinnes, Deeks, Leeflang, & Li, 2021). When these data are provided in contingency tables larger than 2 × 2 (such as disclosing all OSA severity groups), data will be grouped in a way that 2 × 2 tables can be built for each diagnostic threshold (as demonstrated in Figure 2).

Number of TPs, FPs, TNs, and FNs for epoch-by-epoch and event-by-event detection of obstructive events, apneas, hypopneas, and snoring.
AHI for both index and reference tests (mean ± SD).
RDI for both index and reference tests (mean ± SD).
REI for both index and reference tests (mean ± SD).
ODI for both index and reference tests (mean ± SD).

Snoring duration for both index and reference tests (mean duration [s] ± SD).

Snoring frequency for both index and reference tests (mean [s] ± SD).

Secondary outcomes:
Any other sleep-related respiratory variable reported in the article, including but not limited to absolute number and indices of respiratory effort-related arousals (RERA), apneas, hypopneas, central events, obstructive events; time spent with SpO2 >90% (mean ± SD).

Role of sponsors:
Study commissioned or sponsored by a device/app manufacturer (yes/no).
One or more authors directly affiliated with the device/app manufacturer (yes/no).

| Publication bias
Publication bias will be assessed using Deeks' test, which was specifically designed for systematic reviews of DTA (Deeks, Macaskill, & Irwig, 2005) and performs better than the Begg and Egger tests in these cases (van Enst, Ochodo, Scholten, Hooft, & Leeflang, 2014). It is based on plotting the natural logarithm of the DOR (lnDOR) against the inverse of the square root of the effective sample size.
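As an illustration of how Deeks' test is usually formulated, the following minimal Python sketch regresses lnDOR on the inverse square root of the effective sample size (ESS), using the ESS as regression weights. It is not part of the protocol's analysis pipeline: the function names, the 0.5 continuity correction for zero cells, and the returned fields are assumptions of this sketch. The p-value would be obtained from a t distribution with the returned degrees of freedom.

```python
import math


def effective_sample_size(n_diseased, n_non_diseased):
    """Effective sample size (ESS) of one study."""
    return 4 * n_diseased * n_non_diseased / (n_diseased + n_non_diseased)


def deeks_test(tables):
    """Weighted regression of lnDOR on 1/sqrt(ESS), with ESS as weights.

    tables: list of (tp, fp, fn, tn) counts, one tuple per study.
    Returns the slope, its standard error, the t statistic and the
    degrees of freedom; a slope far from zero suggests asymmetry.
    """
    k = len(tables)
    if k < 3:
        raise ValueError("at least three studies are needed")
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        # Assumption of this sketch: add 0.5 to every cell if any cell is zero.
        if 0 in (tp, fp, fn, tn):
            tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
        ess = effective_sample_size(tp + fn, fp + tn)
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))  # lnDOR
        ws.append(ess)

    # Weighted least squares for y = intercept + slope * x.
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum(w * (y - intercept - slope * x) ** 2
              for w, x, y in zip(ws, xs, ys))
    se = math.sqrt(rss / (k - 2) / sxx)
    t_stat = slope / se if se > 0 else float("nan")
    return {"slope": slope, "se": se, "t": t_stat, "df": k - 2}
```

In this sketch, a positive slope means that smaller studies (larger 1/sqrt(ESS)) tend to report larger DORs, which is the pattern expected under publication bias.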

| Quality assessment and risk of bias
The quality of the included studies will be evaluated using the revised version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) (Whiting et al., 2011). This tool was specifically designed for quality assessment of diagnostic accuracy studies and is the most recommended option for risk assessment in systematic reviews. It consists of four domains, related to participant selection, index test, reference standard, and flow and timing. Each of these domains is evaluated in two independent ways: risk of bias and concerns regarding applicability (except for 'flow and timing', which is judged only for risk of bias). Each item is judged as having low risk, high risk, or unclear risk. As is common for quality and bias assessment, no summary statistics or final scores are provided. The results will be displayed in tables, disclosing the results of the assessment for each included article, and in charts, disclosing the percentage of low, high, and unclear risk for each item.
Additionally, the procedures used for the validation of point-of-care OSA screening devices in each included article will be evaluated by using the rating system proposed by Tangudu et al. (2021). This method evaluates 11 aspects related to validation studies: the number of PSG readers/scorers, subject conditions (SDB diagnosis), subject data (patient history), caffeine and alcohol restrictions, daytime sleepiness evaluation, instructions for self-application of home-based devices, sleep metrics under analysis, methods for data extraction, methods for quantitative analysis, methods for qualitative analysis, and data protection and security issues. Each item is rated from 1 to 3 based on the appropriateness of the methods employed in each study, leading to a final score ranging from 1 to 33.

| Data synthesis and analyses
Meta-analyses will be performed whenever three or more studies can be grouped using the same outcome measure. Two different types of meta-analysis will be performed in this study: DTA meta-analyses and mean difference meta-analyses.

| Diagnostic test accuracy meta-analyses
The DTA meta-analyses will be performed for outcomes for which 2 × 2 contingency tables were extracted. This includes both studies assessing diagnostic accuracy (PI(R)T questions #1) and studies employing epoch-by-epoch or event-by-event analyses (PI(R)T questions #2). Independent analyses will be performed for each diagnostic threshold, and no analyses will be performed by adding data from different diagnostic thresholds. For each study, the number of TP, FN, FP, and TN will be used, and summary statistics will be calculated using the random-effects bivariate binomial model of Chu and Cole (Chu & Cole, 2006). This model allows the calculation of a summary estimate for both sensitivity and specificity among the whole sample. The estimated sensitivity and specificity (and their 95% confidence interval [95% CI]) for each included study will be displayed in forest plots, and the summary estimate for the whole sample will be displayed in a summary receiver operating characteristics (ROC) curve (SROC curve), plotted using the sensitivity against the FP rate (1 − specificity). No heterogeneity tests will be performed (such as the I² index), as they do not perform well in DTA meta-analyses (Leeflang, 2014). All DTA meta-analyses will be performed using the MetaDTA application (Freeman et al., 2019; Patel, Cooper, Freeman, & Sutton, 2021).
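The per-study inputs to these analyses can be illustrated with a minimal Python sketch that collapses a contingency table larger than 2 × 2 into one 2 × 2 table per diagnostic threshold and derives the study-level sensitivity and specificity. The function names and the example counts are hypothetical and serve only as illustration; the sketch does not implement the bivariate binomial model itself, which in this protocol is fitted with MetaDTA.

```python
def collapse_to_2x2(table, cut):
    """Collapse a severity-by-severity table into (tp, fp, fn, tn).

    table[i][j] is the number of participants placed in category i by the
    reference test and in category j by the index test, with categories
    ordered from least to most severe. Categories with position >= cut
    are counted as positive.
    """
    tp = fp = fn = tn = 0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            ref_positive = i >= cut
            index_positive = j >= cut
            if ref_positive and index_positive:
                tp += n
            elif ref_positive:
                fn += n
            elif index_positive:
                fp += n
            else:
                tn += n
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Study-level sensitivity and specificity from a 2 x 2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical study; rows are PSG categories and columns are device
# categories: none (<5), mild (5 to <15), moderate (15 to <30),
# severe (>=30 events/h).
example = [
    [20, 5, 0, 0],
    [4, 15, 6, 0],
    [0, 5, 18, 2],
    [0, 1, 4, 20],
]

for label, cut in (("AHI >= 5", 1), ("AHI >= 15", 2), ("AHI >= 30", 3)):
    counts = collapse_to_2x2(example, cut)
    sens, spec = sensitivity_specificity(*counts)
    print(label, counts, round(sens, 3), round(spec, 3))
```

Each threshold yields its own 2 × 2 table from the same participants, which is why the thresholds are analysed independently rather than pooled together.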

| Mean difference meta-analysis
Mean difference meta-analysis will be performed for continuous outcomes (mean ± SD), in cases in which both the index and the reference tests provide results in the outcome and use the same unit of measurement (PI(R)T questions #3). These meta-analyses are not commonly performed in systematic reviews of diagnostic accuracy, but in the present study, they will be used to explore the concordance of the index and the reference tests in detecting a continuous numeric variable used for screening purposes, regardless of the diagnostic threshold. For each included study, the mean difference (M index test − M reference test) will be calculated. Meta-analyses will be performed using the DerSimonian and Laird random-effects model. Heterogeneity will be assessed using both the I² index and Cochran's Q test. Data will be presented as effect size ± 95% CI in forest plots. Statistically significant results (p ≤ 0.05) with effect size ± 95% CI greater than zero will be interpreted as cases in which the index test overestimates the reference test measure, while effect size ± 95% CI less than zero and p ≤ 0.05 will be interpreted as an underestimation of the index test in comparison with the reference test. Non-significant results (p > 0.05) with effect size ± 95% CI crossing the zero line will be interpreted as an equivalence between the index and the reference tests for a given outcome.
All mean difference meta-analyses will be performed using the Comprehensive Meta-Analysis software.
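For readers unfamiliar with the DerSimonian and Laird estimator, the following minimal Python sketch shows the pooling steps, together with Cochran's Q and the I² index. It is only an illustration and not the Comprehensive Meta-Analysis implementation. The function names are hypothetical, and the variance formula in `mean_difference` is an assumption of this sketch: it ignores the within-subject correlation between index and reference tests, which is rarely reported.

```python
import math


def mean_difference(m_index, sd_index, m_ref, sd_ref, n):
    """Mean difference (index minus reference) and its variance.

    Assumption of this sketch: the two measurements are treated as
    uncorrelated, so the variance is (sd_index^2 + sd_ref^2) / n.
    """
    return m_index - m_ref, (sd_index ** 2 + sd_ref ** 2) / n


def dersimonian_laird(effects, variances, z=1.96):
    """Random-effects pooling with the DerSimonian and Laird tau^2."""
    k = len(effects)
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci_low": pooled - z * se,
        "ci_high": pooled + z * se,
        "tau2": tau2,
        "q": q,
        "df": df,
        "i2": i2,
    }
```

Under the interpretation rule described above, a pooled estimate whose 95% CI lies entirely above zero would indicate overestimation by the index test, and one entirely below zero would indicate underestimation.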
3.9.3 | Analysis plan, subgroup analysis, and sensitivity analysis
The primary-level analysis will include all possible studies for each given outcome, regardless of the device/app category and technology used. Although it is likely to result in highly heterogeneous analyses, it will be useful for drawing conclusions about the general accuracy of OSA and snoring screening devices and apps.
Second-to-fourth-level analyses correspond to subgroup analysis with an increasing level of methodological homogeneity. Second-level analyses will group studies by the type of devices/app (wearables, nearables, bed sensors, smartphone apps). Third-level analyses will group studies by the technology and equipment used to detect OSA or snoring (e.g., oximeter, microphone, arterial tonometry, and accelerometer). Fourth-level analyses will group studies by the exact commercial presentation (including commercial name and manufacturer).
To evaluate whether the accuracies of CSTs have increased over time, a stratified analysis will be performed according to the publica-
3.10 | Grading of Recommendations Assessment, Development and Evaluation (GRADE) assessment
The GRADE system is a methodology increasingly used to assess the certainty of evidence and to decide about the strength of recommendations in systematic reviews and guideline development (Guyatt et al., 2008), especially when related to therapeutic questions. Initial methodological suggestions have been made to adapt the GRADE system to questions related to DTA (Brozek et al., 2009; Schünemann et al., 2008). However, the use of the GRADE system has proven challenging, mostly due to the lack of proper and explicit guidance on how to perform it (Gopalakrishna et al., 2014). More recent guidelines are being implemented, which will be used in this systematic review (Schünemann et al., 2019; Schünemann et al., 2020a, 2020b).
The GRADE assessment in this review will apply only to the categorical outcomes assessed in terms of their accuracy (including TP, FN, FP, and TN), as the continuous outcomes are analysed from a more exploratory perspective. For each question, sensitivity (TP and FN grouped) and specificity (TN and FP grouped) outcomes will be assessed independently. The certainty of the evidence for each outcome can be considered as high, moderate, low, or very low. Cross-sectional within-subject paired studies will initially be considered high-certainty evidence, as this design can be considered appropriate to assess test accuracy (Schünemann et al., 2020a). Based on this initial assessment, the level-of-evidence certainty can be decreased based on five criteria (risk of bias, indirectness, inconsistency, imprecision, and publication bias) (Schünemann et al., 2020b). Certainty of evidence can also increase based on three criteria (consistent sensitivity-specificity relationship, large estimates of test accuracy, and minimal plausible bias and confounding). However, rating up the certainty of the evidence is discouraged for test accuracy outcomes, as there is no consensus regarding this procedure for DTA systematic reviews and it still warrants further methodological development. All GRADE assessments will be performed with GRADEpro GDT (https://www.gradepro.org/).

| DISCUSSION AND EXPECTED RESULTS
Several CSTs related to OSA and snoring screening are commercially available and are becoming increasingly popular. As these technologies are designed to be used directly by consumers, usually without supervision or assistance from medical professionals, it is important to ensure that their results and reports are reliable and accurate. One of the main concerns is sensitivity, as FN results could deter a user from seeking professional assistance and treatment when it is needed.
The present protocol describes the methods and procedures for a systematic review and meta-analysis, which will evaluate the accuracy
Although we acknowledge an overlap between these two previous meta-analyses and the present protocol, we understand they can be complementary. In addition, the present systematic review improves the knowledge by addressing points that were not addressed in the previous studies:
• Detailed strategy to analyse data: as the technology on these tools and devices varies considerably, a solid strategy to analyse data is needed to encompass the most important sources of variability and heterogeneity. In our analysis plan, we have included analyses related to different outcome measures (including but not limited to AHI, RDI, REI, ODI, snoring duration, and snoring frequency), categories of devices (wearables, nearables, bed sensors, and smartphone apps), technologies (e.g., oximeter, microphone, arterial tonometry, and accelerometer), sample representativeness, and the role of manufacturer, among others.
• Concerns regarding sponsorship: most research on new sleep technologies is directly sponsored or even primarily performed by the manufacturers. This is an important source of potential bias, increasing the likelihood of publication bias and selective outcome reporting. For this reason, we intend to broaden our search strategy by contacting manufacturers directly. Also, the role of the manufacturers will be included in subgroup analyses.
• Specific methodology for DTA meta-analysis: this is a very specific type of meta-analysis, for which the methodology is under constant improvement. The previous meta-analyses have encompassed some of the DTA-specific methodologies, but a few aspects might have been overlooked. The present protocol aims to encompass the most recent methodology for DTA meta-analysis.
• Sample size: the previous meta-analyses have been performed with a limited number of studies, especially when subgroup analyses are considered. We believe that with our enlarged search strategy and by contacting manufacturers and companies directly, we might have a larger sample of studies, therefore increasing our external validity and conclusive potential.
As currently understood, CSTs are not intended to be used for diagnosis, but rather for screening of sleep disorders, as most of them are marketed to be used autonomously by a consumer/user without medical prescription or supervision. However, two movements in the development of new CSTs have been observed. First, their overall diagnostic accuracy seems to be increasing, as new technologies are used and more refined algorithms are implemented. Second, transitional technologies, which lie somewhere between CSTs and clinical-grade devices, are becoming increasingly common. Our results will help to assess whether the diagnostic accuracy of OSA-related CSTs is adequate, therefore bringing screening and diagnosis closer to one another. However, it should be kept in mind that diagnosis is not restricted to the measurement of certain diagnostic measures (such as AHI, RDI, or ODI), and it might involve a proper differential diagnosis or the assessment of comorbidities and other concomitant conditions. Both issues are highly dependent on a thorough clinical evaluation, and are therefore overlooked when a device is used directly by the consumer without the involvement of a medical professional.
The meta-analyses resulting from this protocol will help to direct future technologies, assisting in the process of continuous technological development in the field of sleep medicine. This growth should be achieved by focusing on both consumer needs and data reliability. However, meta-analyses such as the one proposed here are limited by the fact that they analyse previously published data, therefore serving as a post hoc appraisal tool. The present review also focuses specifically on diagnostic accuracy, not intending to evaluate other aspects related to the reliability of CSTs, such as personal data security, data protection, and data storage. The knowledge arising from the present and other meta-analyses, as well as from the mutual collaboration between manufacturers, healthcare professionals, and sleep researchers, should be translated into practical achievements and definitions that should be implemented before new devices become commercially available, impacting the way they are designed, developed, registered, and evaluated by health agencies, and promoted to the general public.

AUTHOR CONTRIBUTION
Gabriel