Validity, reliability, and responsiveness of daily monitoring visual analog scales in MASK‐air®

Abstract
Background: MASK-air® is an app that supports allergic rhinitis patients in disease control. Users register daily allergy symptoms and their impact on activities using visual analog scales (VASs). We aimed to assess the concurrent validity, reliability, and responsiveness of these daily VASs.
Methods: Daily monitoring VAS data were assessed in MASK-air® users with allergic rhinitis. Concurrent validity was assessed by correlating daily VAS values with those of the EuroQol-5 Dimensions (EQ-5D) VAS, the Control of Allergic Rhinitis and Asthma Test (CARAT) score, and the Work Productivity and Activity Impairment Allergic Specific (WPAI-AS) Questionnaire (work and activity impairment scores). Intra-rater reliability was assessed in users providing multiple daily VASs within the same day. Test-retest reliability was tested in clinically stable users, as defined by the EQ-5D VAS, CARAT, or "VAS Work" (i.e., VAS assessing the impact of allergy on work). Responsiveness was determined in users with two consecutive measurements of EQ-5D-VAS or "VAS Work" indicating clinical change.
Results: A total of 17,780 MASK-air® users, with 317,176 VAS days, were assessed. Concurrent validity was moderate-high (Spearman correlation coefficient range: 0.437-0.716). Intra-rater reliability intraclass correlation coefficients (ICCs) ranged between 0.870 (VAS assessing global allergy symptoms) and 0.937 (VAS assessing allergy symptoms on sleep). Test-retest reliability ICCs ranged between 0.604 and 0.878, with "VAS Work" and "VAS asthma" presenting the highest ICCs. Moderate/large responsiveness effect sizes were observed: the sleep VAS was associated with lower responsiveness, while the global allergy symptoms VAS demonstrated higher responsiveness.
Conclusion: In MASK-air®, daily monitoring VASs have high intra-rater reliability and moderate-high validity, reliability, and responsiveness, pointing to a reliable measure of symptom loads.


KEYWORDS
allergic rhinitis, mobile health, reliability, responsiveness, visual analog scales

| INTRODUCTION
Allergic rhinitis is a burdensome condition contributing to a substantial loss of work and school productivity, as well as to decreased quality of life. 1,2 While there have been important advances in the treatment of allergic rhinitis, many patients remain poorly controlled. 3 Mobile health-based approaches may contribute to addressing this problem. 4 As an example, MASK-air® is a mobile app available in 25 countries, comprising a daily monitoring questionnaire assessing the impact that allergic symptoms have on the user each day. 3,5-10 In MASK-air®, visual analog scales (VASs) are used for several questions on daily monitoring. Such VASs range from "not at all bothersome" (0) to "extremely bothersome" (100) and indicate the degree to which nose, eye, or asthma symptoms bother users during the day (or specifically impact their work or sleep activities). While the concurrent validity of some of these VASs has already been assessed, 5 it has not yet been established for all VASs, particularly when taking all validated comparators into account. In addition, their reliability and responsiveness have not yet been evaluated.
SOUSA-PINTO ET AL.
In fact, reliability estimates are needed to provide information on the stability of measures/inputs obtained at different times of the day from the same users (intra-rater reliability) or on different days in users considered clinically stable (test-retest reliability). 11 On the other hand, responsiveness estimates inform on the ability of daily monitoring VASs to change over a specific period of time in cases where changes in a reference measure of health status have occurred. 12 An assessment of such properties is essential for determining whether daily monitoring VASs can actually be used as a reliable tool for measuring rhinitis control. Therefore, this study aimed to assess concurrent validity, intra-rater and test-retest reliability, as well as the responsiveness of MASK-air® daily monitoring VASs.

| Study design
We assessed the concurrent validity, reliability, and responsiveness of each daily monitoring VAS of MASK-air®. Reliability was evaluated by assessing intra-rater reliability (assessing the agreement between multiple values provided by the same users within the same day) and test-retest reliability (assessing the agreement between different daily VAS results provided by clinically stable patients). The main analysis concerned all countries in which MASK-air® is available. Sub-analyses using only data from European users were also performed.

| Setting and participants
MASK-air® has been available since 2015. It is currently being used in 25 countries and 19 languages (www.mask-air.com). We included the daily monitoring data of MASK-air® users aged 16-90 years and with a self-reported diagnosis of allergic rhinitis.
MASK-air® is used by people who find it on the Internet (namely on the Apple App Store or Google Play). Some of the users are patients who were asked by their physicians to use it. However, due to privacy rules, it is impossible to determine how patients come to use the app.

| Ethics
MASK-air® is CE1-registered but was not considered by the Ethical Committee of the Cologne Hospital (2017) as a medical device, given that it does not provide any recommendations concerning treatment or diagnosis. MASK-air® follows GDPR regulations. An independent review board approval was not required for this specific study, as all data had been anonymized prior to the study (including geolocation-related data) using k-anonymity, and users agreed to have their data analyzed in the terms of use (translated into all languages and first customized according to the legislation of each country, allowing the use of the results for research purposes). 13,14

| Data sources and variables
We analyzed the MASK-air® daily monitoring data up to December 6, 2020. The daily monitoring of symptoms comprises six mandatory questions (addressing the period of 1 day), whose responses are provided by means of VASs:
i. How much the overall allergic symptoms bothered the user on that day ("VAS global allergy symptoms");
ii. How much nasal symptoms bothered the user on that day ("VAS nose");
iii. How much ocular symptoms bothered the user on that day ("VAS eyes");
iv. How much asthma symptoms bothered the user on that day ("VAS asthma");
v. How well the user slept on the previous night ("VAS sleep"); and
vi. How sleepy the user was during the day ("VAS sleepiness").
In addition, if users report that they are working on that day, they are asked how much their allergic symptoms affected work activities on that day ("VAS work").
When reporting daily VAS, users are asked to provide their daily medication using a scroll list customized for each country. 15 When responding to the MASK-air® daily monitoring questionnaire, it is not possible to skip any of the questions, precluding missing data. While symptoms should be monitored on a daily basis, some users may have provided more than one daily input.
In addition to the daily monitoring of symptoms, MASK-air® users need to provide further clinical information (e.g., indicate their diagnosed allergies), and may respond to other questionnaires, including EuroQol-5 Dimensions (EQ-5D-5L), the Control of Allergic Rhinitis and Asthma Test (CARAT), Work Productivity and Activity Impairment: Allergic Specific (WPAI:AS), and the Epworth Sleepiness Scale (ESS). MASK-air® users can answer these questionnaires on any day they want to, and either before or after answering the daily monitoring questionnaire. EQ-5D-5L assesses the respondents' health status through five dimensions/questions (each with five levels) followed by a VAS assessing the general health status on that day. 16,17 CARAT is a 10-item questionnaire assessing the control of allergic rhinitis and asthma in the previous 4 weeks, with four questions specifically concerning nasal symptoms. 18,19 WPAI:AS is a 9-item questionnaire assessing the weekly impact of allergies on work and academic productivity, with three of its questions allowing estimation of the overall work impairment due to allergy, and the ninth question assessing the overall activity impairment due to allergy. 20,21 Finally, ESS evaluates respondents' sleepiness by assessing the chances of dozing in eight possible scenarios on that day. 22,23

| Biases
Potential information biases were addressed by restricting our analyses to data from users with a self-reported diagnosis of allergic rhinitis. Exclusion of data from users aged less than 16 years allowed us to address potential variability associated with age (i.e., differences between children and adults).

| Sample size
We did not perform sample size calculation, but rather analyzed all data from users meeting the eligibility criteria and with valid data.
Nevertheless, analyses were not performed for situations/comparators in which the number of users providing data for at least one daily monitoring VAS never exceeded 50.

| Data analysis
Concurrent validity was assessed using statistical functions in Microsoft Excel 2016. All the other analyses were performed using software R (version 4.0, R Foundation for Statistical Computing, Vienna, Austria), with intraclass correlation coefficients (ICCs) being calculated with the "irr" package. 24

2.7.1 | Concurrent validity
Concurrent validity was assessed by computing Spearman correlation coefficients for the associations between three daily monitoring VASs ("VAS global allergy symptoms," "VAS Nose," and "VAS work") and EQ-5D VAS, CARAT (considering CARAT as a whole, as well as just the first four questions of CARAT, which concern the upper airways, and just the last six questions) and WPAI:AS (considering the "percentage of overall work impairment due to allergy" and question number 9, "percentage of activity impairment due to allergy"). In addition, correlations between different daily monitoring VASs were computed. Confidence intervals were estimated with alpha at 0.001 (indicating a 99.9% confidence level) and with standard deviation being Fisher's z-transformation of the Spearman correlation coefficient. We considered coefficients of 0.5 to 0.8 (or −0.8 to −0.5) to indicate moderate correlation, and of >0.8 (or <−0.8) to indicate strong correlation.
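The correlation analysis described above can be illustrated with a minimal Python sketch (the study itself used Microsoft Excel for this step). It computes a Spearman coefficient with tie-averaged ranks and a confidence interval via Fisher's z-transformation at alpha = 0.001. The function names, the standard error of 1/sqrt(n − 3), and the "below moderate" label for coefficients under 0.5 are assumptions made for illustration and are not taken from the study.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_with_ci(x, y, alpha=0.001):
    """Spearman rho with a (1 - alpha) confidence interval via Fisher's z."""
    rho = pearson(average_ranks(x), average_ranks(y))
    if abs(rho) >= 1:
        # Fisher's z is undefined at +/-1; return a degenerate interval.
        return rho, rho, rho
    z = math.atanh(rho)
    se = 1 / math.sqrt(len(x) - 3)  # assumption: simple large-sample standard error
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return rho, math.tanh(z - crit * se), math.tanh(z + crit * se)


def correlation_strength(rho):
    """Apply the thresholds stated in the text to the absolute coefficient."""
    magnitude = abs(rho)
    if magnitude > 0.8:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "below moderate"  # assumption: the text does not name this range
```

For example, `spearman_with_ci(vas_values, eq5d_values)` would return the coefficient together with the lower and upper bounds of its 99.9% interval, where both argument names are hypothetical lists of paired observations.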

| Assessment of intra-rater reliability
The assessment of intra-rater reliability for daily monitoring VASs was estimated considering users providing multiple inputs within the same day. For each day of use (of the same user) with multiple inputs for the same VAS, we computed the difference between the first and the second inputs for each VAS. We computed the average difference for each VAS, as well as the frequency of cases in which (i) such same-day values were the same, (ii) the difference between such same-day values did not exceed 10 units, and (iii) the difference between such same-day values exceeded 10 units (differences lower than 10 units in MASK-air® VAS point to low intra-individual response variability 5 ). We calculated the ICC for assessment of intra-rater reliability (using two-way models estimating absolute agreement, based on average measurements 25,26 ), also taking into account the first and second inputs within the same day by the same user. A sensitivity analysis was performed considering the first and last inputs for each VAS (instead of the first and second inputs).
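As a rough illustration of the two computations described in this subsection, the following Python sketch summarizes same-day differences and computes a two-way, absolute-agreement, average-measures ICC from mean squares. The study used the R "irr" package; this standalone version is only a sketch. The function names are hypothetical, the mean difference is taken as signed (first minus second), and the "within 10 units" share is assumed to include identical values.

```python
def same_day_difference_summary(pairs, threshold=10):
    """Summarize (first, second) same-day VAS inputs.

    Assumptions: the mean difference is signed (first minus second) and the
    "within threshold" share also counts days with identical values.
    """
    n = len(pairs)
    diffs = [first - second for first, second in pairs]
    return {
        "mean_difference": sum(diffs) / n,
        "pct_same": 100 * sum(d == 0 for d in diffs) / n,
        "pct_within_threshold": 100 * sum(abs(d) <= threshold for d in diffs) / n,
        "pct_above_threshold": 100 * sum(abs(d) > threshold for d in diffs) / n,
    }


def icc_agreement_average(rows):
    """Two-way, absolute-agreement, average-measures ICC.

    rows: one sequence per subject (here, per user-day), each holding the
    same number k of measurements (here, k = 2 same-day inputs).
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-measurement mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
```

Passing the first and last inputs of each day instead of the first and second ones would correspond to the sensitivity analysis mentioned above.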

| Assessment of test-retest reliability
We assessed test-retest reliability for each daily monitoring VAS.
Assessment of test-retest reliability implies the identification of users with two measurements of a validated comparator indicating clinical stability. In this study, we used three different validated comparators for assessment of test-retest reliability: EQ-5D VAS, CARAT, and "VAS work" (despite being a daily monitoring VAS, the validity of "VAS work" was demonstrated in previous studies as well as in the assessment of its concurrent validity with WPAI:AS).
WPAI:AS ("percentage of overall work impairment due to allergy") was also used to define clinical stability when assessing test-retest reliability for "VAS work." On the other hand, the ESS was used to define clinical stability when assessing test-retest reliability for "VAS sleep" and "VAS sleepiness." Clinical stability was assumed whenever a user had two consecutive measurements less than 5 weeks apart, with results for validated comparators having a difference smaller than the minimal clinically important difference (MCID) value. Whenever the same user had more than two consecutive measurements (or more than one set of measurements) meeting the aforementioned criteria, the first two measurements were selected. Agreement was assessed by estimating ICCs using two-way models estimating absolute agreement, based on average measurements. 25,26 We considered that ICCs of <0.5 indicate low reliability (both for test-retest reliability and intra-rater reliability), those of 0.5-0.75 indicate moderate reliability, and those of >0.75 indicate high reliability. 25 In the case of CARAT, differences ≤3 were considered to be lower than the MCID. 27 For the remaining comparators, no MCID for patients with allergic rhinitis has been defined. Therefore, such values were determined based on distribution-based methods: we considered the MCID to correspond to 0.5 × standard deviation of the baseline observations. 28 Based on such an approach, we estimated an MCID of 10 points for EQ-5D-VAS, of 11 points for "VAS work," and of 14% for WPAI:AS. For ESS, we considered clinical stability if the same category was observed for the different daily measurements. 29
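The stability rule and the distribution-based MCID described above can be sketched in Python as follows. This is an illustrative sketch rather than the study's code: the function names are hypothetical, the sample standard deviation is assumed, and "5 weeks" is taken as 35 days.

```python
from statistics import stdev


def distribution_based_mcid(baseline_values, factor=0.5):
    """MCID as 0.5 x standard deviation of the baseline observations.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    return factor * stdev(baseline_values)


def is_clinically_stable(first, second, days_apart, mcid,
                         max_days=35, inclusive=False):
    """Two consecutive comparator measurements indicate clinical stability.

    Stability requires the measurements to be less than `max_days` apart
    (assumption: 5 weeks = 35 days) and to differ by less than the MCID.
    With inclusive=True a difference equal to `mcid` still counts as stable,
    which matches the CARAT rule of differences <= 3.
    """
    if days_apart >= max_days:
        return False
    difference = abs(second - first)
    return difference <= mcid if inclusive else difference < mcid


def reliability_label(icc):
    """Interpret an ICC with the thresholds stated in the text."""
    if icc < 0.5:
        return "low"
    if icc <= 0.75:
        return "moderate"
    return "high"
```

For instance, with the EQ-5D VAS threshold of 10 points reported above, `is_clinically_stable(80, 85, days_apart=14, mcid=10)` classifies the pair as stable, whereas a 10-point change does not qualify.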

| Responsiveness
We assessed responsiveness for each daily monitoring VAS. Assessment of responsiveness implies the identification of users with two measurements of a validated comparator indicating clinical change. In this study, validated comparators to indicate clinical change included the EQ-5D VAS and "VAS work." We were not able to use CARAT, WPAI:AS (to assess "VAS work" responsiveness), or ESS (to assess "VAS sleep" and "VAS sleepiness" responsiveness) as comparators, given that clinical change based on such measurements was observed in less than 50 users for every daily monitoring VAS.
Clinical change was assumed whenever a user had two consecutive measurements more than 5 weeks apart, with results for validated comparators having a difference equal to or higher than the MCID value. Following the impossibility of using CARAT (which assesses a period of 4 weeks) as a comparator, we performed a subanalysis defining the EQ-5D VAS and "VAS work" clinical change based on periods more than 3 weeks apart. Whenever the same user had more than two consecutive measurements (or more than one set of measurements) meeting the aforementioned criteria, the first two inputs were selected.
Responsiveness was determined by calculating Cohen's effect size and the standardized response mean (SRM). 12 Cohen's effect size was calculated by dividing the mean difference between daily monitoring VASs by the standard deviation of the "baseline" VAS.

Daily monitoring VAS median values ranged from 0 ("VAS asthma") to 17 ("VAS sleep") (Table 1). For the EQ-5D VAS and CARAT, median values of 80 (interquartile range = 14) and 16 (interquartile range = 10) were respectively observed.
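The two responsiveness statistics named in the methods (Cohen's effect size and the SRM) can be illustrated with a short Python sketch. The effect size follows the definition given in the text (mean change divided by the standard deviation of the baseline values). For the SRM, the conventional definition (mean change divided by the standard deviation of the changes) is assumed here, since its formula is not spelled out in this section. The function name and the use of sample standard deviations are also assumptions.

```python
from statistics import mean, stdev


def responsiveness(baseline, follow_up):
    """Return (Cohen's effect size, standardized response mean).

    baseline, follow_up: paired VAS values for the same users at the two
    time points between which the comparator indicated clinical change.
    Assumption: sample (n - 1) standard deviations are used throughout.
    """
    changes = [after - before for before, after in zip(baseline, follow_up)]
    mean_change = mean(changes)
    effect_size = mean_change / stdev(baseline)  # scaled by baseline variability
    srm = mean_change / stdev(changes)           # scaled by variability of change
    return effect_size, srm
```

Because the SRM denominator reflects how consistently users change rather than how spread out they were at baseline, the two statistics can differ substantially for the same data.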

| Intra-rater reliability
Between 2412 ("VAS work") and 5827 ("VAS nose" and "VAS eyes") days with more than one daily monitoring VAS input provided by the same user were recorded. For all VASs, more than 50% of the days had no differences in the first and second values provided within the same day (Table 3). The proportion of days on which the first and second daily values differed by more than 10 units ranged between 11.2% ("VAS asthma") and 24.4% ("VAS nose").
ICCs varied between 0.870 ("VAS global allergy symptoms") and 0.937 ("VAS sleep"). Similar results were observed when analyzing data from MASK-air® European users, or when taking into account the first and last daily measurements (Tables 3 and 4).
Similar results were observed when analyzing data from MASK-air® European users (Table 5).

| Responsiveness analysis
Using the EQ-5D VAS to define clinical change (based on observations at least 5 weeks apart), we assessed the responsiveness of daily monitoring VASs based on data from up to 85 users ("VAS global allergy symptoms," "VAS nose," "VAS eyes," and "VAS asthma").
We observed moderate effect sizes for "VAS asthma"  Table 6).
When defining clinical change based on a period more than 3 weeks apart, similar results were observed when using "VAS work" as the comparator to define clinical change, while overall lower effect sizes were observed with EQ-5D as a comparator (Table 6).

TABLE 3 Results of intra-rater reliability for daily monitoring visual analog scales (VASs) using data from users in all countries where MASK-air® is available

| DISCUSSION
In this study, we observed that, overall, daily monitoring VASs presented with high intra-rater reliability and moderate-high concurrent validity, test-retest reliability and responsiveness. This is particularly relevant when taking into account the fact that, overall, the VAS has been shown to be a simple and sensitive instrument for measuring allergic rhinitis symptoms, having been used in both randomized controlled trials and observational studies. 30 The incorporation of VASs into a mobile app and the demonstration of their reliability and responsiveness supports their use as a tool to guide users in controlling disease activity and adapting medication. Previous studies have already assessed other properties of daily monitoring VASs, observing strong correlation between "VAS work" and other VASs (namely "VAS global allergy symptoms," "VAS nose," "VAS eyes," and "VAS asthma"). 5,31 The high intra-rater reliability observed across the different daily monitoring VASs suggests that values provided within the same day for such scales do not tend to be substantially different. This is further supported by the consistency of results observed when the first and last daily inputs are taken into account (instead of the first and second ones).
On the other hand, test-retest analysis results indicate that daily monitoring VASs remain reasonably stable when clinical stability is attested by other validated comparators (EQ-5D VAS, "VAS work" and CARAT). Overall, the presented ICCs are not dissimilar to those estimating test-retest reliability within the context of other measurement tools used in patients with rhinitis.
For the Rhinitis Control Assessment Test, an ICC of 0.78 was observed, 32 while for CARAT, the ICC was 0.82. 19 The lower ICC observed for sleep-related VAS suggests that several other factors can potentially affect sleep, that the question is too simple (simply asking whether the user had slept well in the previous night or felt sleepy during the day), and/or that these VASs are not associated with the severity of allergy. However, there are several studies suggesting that sleep is impaired by rhinitis. 33-36 "VAS sleep" was also found to be associated with lower responsiveness. Responsiveness measures the occurrence of change (in this case, of daily monitoring VASs) in cases where clinical change is attested by other validated comparators. 12 Responsiveness is typically assessed by effect size measures reporting on the magnitude of variation in relation to baseline or between-subject variability, with higher values indicating larger changes. 12 Therefore, in this study, we observed that daily monitoring VASs more strongly accompanied clinically relevant changes in "VAS work" than clinically relevant changes in EQ-5D VAS. This may be related to the fact that the latter is not specific to allergic diseases. Differences of methods used to assess responsiveness impair comparisons with other tools used in rhinitis patients, such as the Rhinitis Control Assessment Test, 32 CARAT, 19 WPAI:AS, 20 and ARIA-C. 37
This study has important limitations that are worth noting. Firstly, for some analyses (e.g., assessment of responsiveness in relation to the EQ-5D VAS and CARAT), the number of users/days meeting the required conditions was relatively small (although not too dissimilar to the sample sizes used in the assessment of test-retest reliability in CARAT 19 and in some groups of Rhinitis Control Assessment Test 32 ), negatively affecting estimate precision. Such small numbers are explained not only by the conditions required to assess test-retest reliability and responsiveness, but also by the fact that, in MASK-air®, CARAT, EQ-5D, WPAI:AS, and ESS are not mandatory questionnaires within the context of daily monitoring. On the other hand, for each analysis, the number of users/days of use is not the same for each daily monitoring VAS. This is explained by the later introduction of certain VASs (namely "VAS sleep" and "VAS sleepiness") in the daily monitoring questionnaire, as well as by the fact that "VAS work" is only answered on days when users report to be working.

TABLE 4 Results of intra-rater reliability for daily monitoring visual analog scales (VASs) using the first and last daily inputs

Intra-rater reliability was assessed based on different same-day questionnaires by the same patient. However, within the same day (e.g., during the pollen season), a patient may experience changes in his/her symptoms. This potential limitation, however, results in an underestimation of intra-rater reliability. That is, real intra-rater reliability values may possibly be even higher than those we obtained. On the other hand, we used consecutive measurements less than 5 weeks apart in the definition of clinical stability. We cannot, however, exclude the possibility of "unstable periods" (i.e., clinically relevant changes) between these measurements. Nevertheless, the magnitude of such phenomenon is not expected to be particularly high, on account of the observed similarities in the test-retest ICC calculated based on CARAT (which assesses the previous month) and EQ-5D-VAS or "VAS work" (which assess a single day). Another important limitation concerns the main validated comparators used to assess concurrent validity, test-retest reliability and responsiveness. In fact, the EQ-5D VAS is not specific for allergic diseases, measuring instead how good or bad the health of the respondent is on that day. On the other hand, while CARAT is specific for asthma and allergic rhinitis, it assesses allergic rhinitis (and asthma) control within the period of the last 4 weeks, 38 while only one single day (the day being assessed) is contemplated in daily monitoring VASs. In addition, we were not able to use it as a comparator when assessing responsiveness, due to an insufficient sample size.
Furthermore, no MCID for allergic rhinitis patients had been previously defined for the EQ-5D VAS and for "VAS Work." We determined the corresponding MCID values within the context of this study, based on distribution-based methods instead of anchor-based methods. While the latter are often preferable, they imply the existence of an anchor-based estimate, 39 which was unavailable in this context (i.e., there was no "gold-standard" variable to which the EQ-5D VAS or "VAS work" could be compared accurately). Finally, a selection bias may also be present on account of the representativeness of users and of days on which daily monitoring questionnaires were completed. In fact, it is expected that MASK-air® users may not be representative of allergic rhinitis patients: only 5% of the daily monitoring data concerned users aged more than 65 years, and only 7% concerned current smokers. This suggests that MASK-air® users are probably younger and more concerned about their health than overall allergic rhinitis patients. It is possible, however, that MASK-air® users are representative of those allergic rhinitis patients using apps in the management of their disease. On the other hand, and despite the low median values of daily monitoring VASs, daily monitoring may more often be performed when users feel bothered about their allergic symptoms. In other words, regarding the days on which daily monitoring VASs were used, there may be an overrepresentation of "more troublesome" days. This is all the more relevant taking into account that, in this study, the mean adherence to MASK-air® was found to be 2.9% (with adherence/intensity of use calculated as the number of actual reporting days divided by the reporting period, following the methods of Di Fraia et al., 40 and with the reporting period computed as the period between the date the user first used MASK-air® and December 6, 2020).
While this low adherence may be motivated by the fact that, for most users, MASK-air® was not prescribed and promoted by their allergists, future studies should assess whether adherence patterns may have an impact on the validity and reliability of MASK-air® VASs.
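The adherence measure mentioned above (reporting days divided by the reporting period) can be illustrated with a small Python sketch. The function name is hypothetical, and counting both the first day of use and the cutoff date as part of the reporting period is an assumption, as the text does not state whether the period is inclusive.

```python
from datetime import date


def adherence(reporting_days, first_use, cutoff=date(2020, 12, 6)):
    """Share of days in the reporting period with at least one daily report.

    reporting_days: dates on which the user completed daily monitoring.
    Assumption: the period runs from first_use to cutoff, both inclusive,
    and several reports on the same date count as one reporting day.
    """
    period_days = (cutoff - first_use).days + 1
    distinct_days = {d for d in reporting_days if first_use <= d <= cutoff}
    return len(distinct_days) / period_days
```

Under these assumptions, a user who first used the app on November 27, 2020 and reported on 2 distinct days would have an adherence of 2/10 = 20%.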
This study also has important strengths. We analyzed real-world data from a diverse set of users with allergic rhinitis in 25 different countries and 19 languages. The structure of the app precluded the existence of missing data within each response to the MASK-air® daily questionnaire. The validity, reliability and responsiveness of several VASs were assessed, with these VASs reflecting different types of allergic symptoms and different ways by which such symptoms can impact on users' activities. In our analyses, we used three different main validated comparators (along with ESS for sleep-related daily monitoring VASs), with consistent results being mostly obtained when identifying the best-performing and worst-performing daily monitoring VASs across the different analyses performed with different comparators. Finally, our results were robust to different sub-analyses or sensitivity analyses performed: similar results were observed when analyzing data restricted to European users or, for intra-rater reliability, when considering the first and last daily measurements (instead of the first and second ones).
In conclusion, in this study, we observed that daily monitoring VASs present high intra-rater reliability and moderate-high concurrent validity and test-retest reliability, with responsiveness being more variable across the different daily monitoring VASs and according to the chosen comparator. These results indicate that daily monitoring VASs are accurate instruments for measuring the daily impact of allergic rhinitis, possibly providing support for adapting medication and for controlling disease activity. Future research may focus on improving already existing scores or on developing new tools: (i) assessing daily rhinitis control, (ii) combining results from different daily monitoring VASs, (iii)