Multicentre sleep‐stage scoring agreement in the Sleep Revolution project

Summary Determining sleep stages accurately is an important part of the diagnostic process for numerous sleep disorders. However, as sleep stage scoring is done manually following visual scoring rules, there can be considerable variation in the sleep staging between different scorers. Thus, this study aimed to comprehensively evaluate the inter-rater agreement in sleep staging. A total of 50 polysomnography recordings were manually scored by 10 independent scorers from seven different sleep centres. We used the 10 scorings to calculate a majority score by taking the most scored sleep stage for each epoch. The overall agreement for sleep staging was κ = 0.71 and the mean agreement with the majority score was 0.86. The scorers were in perfect agreement in 48% of all scored epochs. The agreement was highest in rapid eye movement sleep (κ = 0.86) and lowest in N1 sleep (κ = 0.41). The agreement with the majority scoring varied between the scorers from 81% to 91%, with large variations between the scorers in sleep stage-specific agreements. Scorers from the same sleep centres had the highest pairwise agreements at κ = 0.79, κ = 0.85, and κ = 0.78, while the lowest pairwise agreement between the scorers was κ = 0.58. We also found a moderate negative correlation between sleep staging agreement and the apnea–hypopnea index, as well as the rate of sleep stage transitions. In conclusion, although the overall agreement was high, several areas of low agreement were also found, mainly between non-rapid eye movement stages.

cially between centres, each interpreting the rules slightly differently with distinct protocols and practices. These differences exist even in the healthy population and become even more apparent in patients with sleep-disordered breathing (Norman et al., 2000; Penzel et al., 2013), most likely due to fragmented sleep with abnormal sleep patterns and frequent sleep stage transitions. Overall, the transition from the Rechtschaffen and Kales scoring rules to the AASM scoring rules has improved the scoring agreement (Danker-Hopfe et al., 2009), but the inter-scorer agreement could still be improved significantly. For these reasons, investigating the agreement in sleep staging practices systematically in a multicentre and multidisciplinary manner is crucial to gaining a better understanding of the shortcomings of the current sleep staging rules and identifying the uncertain areas with the most significant variation in scoring.
In addition, previous studies have only reported quantitative sleep parameters or overall agreement metrics without providing more detailed investigations of the segments with scoring disagreement.
Therefore, the aim of this study was to provide a comprehensive and detailed evaluation of the inter-rater agreement in sleep staging.
Our aim was to focus on highlighting the areas of disagreement and the possible reasons behind these disagreements instead of only reporting the agreement values. The sleep staging agreement was evaluated between specialised sleep units across Europe and Australia within the Sleep Revolution project (Arnardottir et al., 2022). We utilised 50 PSG recordings, each scored once by 10 experienced individual scorers from seven different sleep centres, to investigate the agreement and highlight the uncertain areas with the most disagreement in scoring. We hypothesised that the agreement in scoring N1 sleep would remain the lowest, as reported in previous studies, and that the transitions between sleep stages would cause uncertainty in the scoring.

| Dataset
This study was based on 50 prospective type II PSG recordings conducted at Reykjavik University from February 2021 to June 2021 using a Nox A1 device (Nox Medical, Reykjavik, Iceland). To get a well-rounded study population, the subjects were recruited based on information gathered from online screening questionnaires. The initial goal was to have a similar ratio of subjects from the following groups: obstructive sleep apnea (OSA) risk group, restless legs syndrome (RLS) risk group, insomnia risk group, and healthy individuals. However, this goal was not fully reached in the end and the population ended up slightly OSA-risk dominant. The STOP-BANG (snoring, tiredness, observed apnea, high blood pressure, body mass index, age, neck circumference, and male gender) questionnaire (Chung et al., 2016) was used to assess OSA risk, the Insomnia Severity Index (Morin et al., 2011) was used to assess insomnia risk, and the International Restless Legs Syndrome Study Group Questionnaire (Horiguchi et al., 2003) was used to determine RLS risk. The RLS risk was determined in the same manner as in Benediktsdottir et al. (2010).
The study was approved by the National Bioethics Committee of Iceland (21-070, 16.3.2021). All subjects gave written informed consent to participate in the study.
All 50 PSGs were manually scored by 10 scorers from seven different sleep centres (Table 1) between April 2021 and September 2021. All scorers used the Noxturnal software (Research Version). The scorings for all scorers and subjects were exported (Nikkonen et al., 2022a, b) as xls-files, and all data analyses were performed using Matlab R2022a.

| Data analysis
To aid in evaluating the scoring agreement, we used the 10 manual scorings to calculate a majority score for each analysed epoch. The majority score was formed by taking the sleep stage that was the most scored stage for that epoch. For example, if two scorers had scored the stage as N2, five scorers as wake, and three scorers as N1, the majority score was wake. If there was a tie between two or more stages, the stage higher in the tiebreaker order was selected. Observed agreement was defined as the number of rater pairs in agreement relative to the number of all possible rater pairs, that is, the observed agreement without adjustment for chance agreement.
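As a minimal sketch of these two calculations (the study's analyses were done in Matlab, and the exact tiebreaker order is not specified in this excerpt, so the priority list below is an assumption for illustration only):

```python
from collections import Counter
from math import comb

# Assumed tiebreaker priority (the study's actual order is not given here):
# stages earlier in this list win a tie.
TIEBREAK_ORDER = ["W", "N1", "N2", "N3", "R"]

def majority_stage(ratings):
    """Majority score for one epoch; ratings holds one stage label per scorer."""
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [stage for stage, c in counts.items() if c == top]
    # resolve ties by the assumed priority order
    return min(tied, key=TIEBREAK_ORDER.index)

def observed_agreement(ratings):
    """Rater pairs in agreement relative to all possible rater pairs."""
    agreeing = sum(comb(c, 2) for c in Counter(ratings).values())
    return agreeing / comb(len(ratings), 2)
```

For the 10-scorer example above (2 × N2, 5 × wake, 3 × N1), `majority_stage` returns wake, and the observed agreement is (1 + 10 + 3)/45 ≈ 0.31.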
As some of the recordings included excessive amounts of scored wake before and after the sleep period, we defined the analysis period to start from the first non-wake epoch scored by any scorer and to end at the last non-wake epoch scored by any scorer. Thus, effectively all excess wake periods were trimmed from the start and end of the recordings. In addition, as different scorers started and ended their analyses at different times, the unscored epochs before the first scored epoch and after the last scored epoch were considered as wake. This trimming method also makes the amount of scored wake directly comparable between the scorers, as the exact same epochs are considered for each scorer.
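The trimming described above could be sketched as follows (a simplified illustration assuming each scorer's hypnogram is a list of per-epoch labels with `None` marking unscored epochs; the study's actual implementation was in Matlab):

```python
def trim_to_analysis_period(scorings, wake="W"):
    """scorings: dict scorer -> list of per-epoch labels (None = unscored).
    Unscored epochs are first counted as wake, then leading/trailing epochs
    scored as wake by every scorer are trimmed away."""
    n = len(next(iter(scorings.values())))
    filled = {k: [wake if e is None else e for e in s] for k, s in scorings.items()}
    # epochs where at least one scorer scored something other than wake
    any_sleep = [any(filled[k][i] != wake for k in filled) for i in range(n)]
    start = any_sleep.index(True)                 # first non-wake by any scorer
    end = n - 1 - any_sleep[::-1].index(True)     # last non-wake by any scorer
    return {k: s[start:end + 1] for k, s in filled.items()}
```

Because every scorer keeps exactly the epochs from `start` to `end`, the amount of scored wake becomes directly comparable between scorers, as the text notes.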
We evaluated the scoring agreement using κ statistics as well as by calculating the observed agreement and the agreement with majority scoring. We used Cohen's κ (κc; Cohen, 1960) for pairwise comparisons.

| RESULTS

The overall agreement across all sleep stages was κf = 0.714. The agreement was highest when the majority sleep stage was REM and lowest when the majority sleep stage was N1 (Table 3). The agreement was also calculated for N1 and N2 as a combined light sleep stage (Table 3). The agreement remained highest when the majority sleep stage was REM, and the mean agreement with the majority score was lowest when the majority sleep stage was the light sleep stage. However, the observed agreement of the light sleep stage was then higher than for wake and N3. The κfb values were 0.763 for wake, 0.406 for N1 sleep, 0.673 for N2 sleep, 0.696 for light sleep, 0.742 for N3 sleep, and 0.861 for REM sleep.
The total minutes scored of each sleep stage varied between the scorers (Table 4). The variation was greatest in N1 where, e.g., scorer 8 scored over three times as much N1 as scorer 3. The variation was lowest in REM, where even the largest difference between the scorers was <25%.
The scorers were also compared pairwise (Figure 1). The lowest agreement was between scorers 9 and 10 at κc = 0.58. Scorers 9 and 10 also had the lowest overall agreement with all other scorers. Scorers 1 and 2, 4 and 5, and 6 and 7 were from the same sleep centres and also had the three highest pairwise κ values at 0.79, 0.85, and 0.78, respectively. Scorers 6 and 7 also reached the highest overall agreement with majority scoring. The scoring and scoring agreement also varied between the 50 subjects (Figure 2). Similarly, the least variation was found in REM and the most in N1 at the subject-by-subject level. Although there were considerable variations in the recording lengths and clock times between the subjects, these had little effect on the scoring agreement (Figure 3). Confusion matrices against the majority score highlighted the differences between the scorers in scoring specific sleep stages (Figure 4). There was great variation in which sleep stage the disagreement occurred, with no clear pattern between the scorers. For example, although scorer 9 had the lowest overall agreement with the majority score, their N3 agreement with the majority score was the highest. In addition, scorer 8 had a high overall agreement but considerably lower wake agreement compared with the other scorers.
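Pairwise comparisons of this kind can be illustrated with a short sketch (Cohen's κ computed from two scorers' epoch-by-epoch labels; the dictionary layout is an assumption for illustration, and the study itself used Matlab):

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa between two scorers' epoch-by-epoch stage labels."""
    n = len(a)
    # observed agreement: fraction of epochs scored identically
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each scorer's marginal stage frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[s] * cb[s] for s in set(a) | set(b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def pairwise_matrix(scorings):
    """scorings: dict scorer -> label list; returns kappa for each scorer pair."""
    return {(i, j): cohens_kappa(scorings[i], scorings[j])
            for i, j in combinations(sorted(scorings), 2)}
```

A heat map of `pairwise_matrix` output is essentially what a figure like Figure 1 visualises.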
Investigating the frequency and distribution of different scoring combinations showed that 48.0% of all scored stages reached 100% agreement among the scorers (Figure 5a). The most scored combination was N2 with 100% agreement, followed by N2/N3 with 50%–90% agreement (Figure 5b). All stages where the agreement was 100% were in the top six most scored combinations, except for N1, which was only scored with 100% agreement in a total of 141 epochs (0.3%). Mean apnea–hypopnea index (AHI; Figure 6a), arousal index (ArI; Figure 6b), and the frequency of stage transitions (Figure 6c) were negatively correlated with the scoring agreement. The frequency of stage transitions was also correlated with the proportion of confusions between the stages for wake/N1 (Figure 6d), N1/N2 (Figure 6e), and N2/N3 (Figure 6f).

| DISCUSSION
Although the overall agreement between the scorers was high, with κf = 0.714 and a mean agreement with the majority score of 0.863, there was considerable variation between the sleep stages. The relatively lower wake agreement in the present study is likely due to the fact that all excess wake was trimmed from the start and end of the recordings. The wake periods before the subject first falls asleep and the periods after they wake up in the morning should be easier to score, and including large portions of this type of wake would over-inflate the wake accuracy. It should also be noted that the agreement values are not directly comparable between studies, as the values are not calculated in the exact same way due to the differing study setups, subject populations, numbers of recordings, and numbers of scorers. In addition, how the sleep stage-specific agreements and κ values were determined has not always been fully elaborated (Danker-Hopfe et al., 2009; Magalang et al., 2013; Rosenberg & Van Hout, 2013; Zhang et al., 2015).
The agreement was highest when the majority sleep stage was REM. N3 also had a higher-than-average agreement, while N1 had the lowest agreement. The low agreement in N1 was expected, as it has also been low in previous studies (Danker-Hopfe et al., 2009; Magalang et al., 2013; Rosenberg & Van Hout, 2013; Zhang et al., 2015). Thus, we also evaluated the agreement of N1 and N2 combined into a light sleep stage. This is often done in, e.g., automatic sleep staging applications, as it has been noticed that many models have great difficulty separating N1 and N2 correctly (Korkalainen et al., 2019, 2020). When combined, the mean agreement with majority scoring was 0.827, while with standard five-stage scoring, it was 0.656 for N1 and 0.860 for N2. The agreement values also slightly increased for all other stages, which was to be expected as the total number of classes was reduced by one. From the agreement values, it is not fully clear whether combining N1 and N2 actually significantly increases the agreement or whether the much larger amount of N2 (Table 4) simply dominates the light sleep stage, and thus the agreement appears higher. In addition, there is considerable disagreement between wake and N1 (Figures 4 and 5b), suggesting that the change to a light sleep stage would not remove this effect entirely. However, the correlation between confusions and transitions was considerably higher in N1/N2 than in wake/N1 or N2/N3 (Figure 6). In addition, mixed N1-N2 was the fifth most scored scoring combination with a 7.7% frequency, while N1 alone was only scored with a 0.3% frequency (Figure 5b).

Furthermore, while overall 48.0% of all scored epochs reached 100% agreement among the scorers (Figure 5a), in N1 this was achieved in only 141 epochs in total, which is 4.3% of the N1 epochs defined by the majority scoring. In contrast, all the other sleep stages where the agreement was 100% were in the top six most scored combinations, with frequencies around half the total frequency of each stage (Figure 5b, Table 4). Thus, there is a clear indication that differentiating between N1 and N2 is difficult and that there is almost never full confidence in N1 scoring, even for expert sleep technologists using the same scoring rules.
One of our hypotheses was that stage transitions would be areas of low agreement, as it might be difficult to accurately judge the exact point when the transition occurs. This was supported by our analyses, as we found a negative correlation between the number of stage transitions and the sleep staging agreement (Figure 6c).
This effect is also evident from Figures 7 and 8, where the agreement consistently drops during stage transitions, while long periods of the same sleep stage usually eventually reach a perfect agreement between all scorers. One likely reason for the drop in agreement during the transitions is that epochs often start as one stage and end as another, e.g., in an N2-N3 transition. Thus, even if all scorers agree that a transition happens during the same epoch, some scorers may still consider the epoch to be more N2 while others would score it as N3. Thus, it is expected that the agreement drops at least briefly during stage transitions. The number of respiratory events and arousals was similarly correlated with the scoring agreement, as were the stage transitions (Figures 6a–c). This is also not surprising, as many of these stage transitions are arousals, which are in turn caused by respiratory events. Thus, these effects are not independent.
However, the stage transitions are certainly not the only, or even necessarily the predominant, reason for disagreement in the scoring, as the wake/N1 and N2/N3 confusion percentages showed only weak correlations with wake⟷N1 and N2⟷N3 transitions (Figures 6d,f).
The correlation was, however, considerably higher between N1/N2 transitions and confusions (Figure 6e). Considering all of the agreement metrics, neither the low amount of N1 nor the stage transitions can fully explain the low agreement in N1, and it seems to be inherently more difficult to score. One possible explanation for this difficulty could be the individual differences in sleep onset alpha activity, as alpha frequencies vary markedly between individuals and ~10% of the general population display no alpha rhythm upon eye closure (Berry et al., 2020).
Interestingly, there were large variations in where each scorer disagreed with the majority score (Figure 4). For example, scorer 1 reached a very high agreement in N1 (85.7%) but one of the lowest in N2 (76.3%). Similarly, scorer 8 had a very high agreement in all other sleep stages but by far the lowest wake agreement (64.5%). Scorer 10 only reached 58.6% agreement in N3, where the agreement was otherwise high between all other scorers. REM was the only stage where each scorer reached ≥80% agreement with the majority score, and it was also the stage with the highest agreement in all metrics. It is worth noting that, as the largest differences were between the NREM stages, the disagreement may be less crucial from a clinical perspective, as most OSA metrics only consider the total sleep time and a NREM/REM distinction to assess REM-related OSA. This can also be seen in Figure 2: although there is considerable variation between the scorers, they mostly follow the same trend between subjects, and the total sleep time is not affected as much. However, there were also differences in sleep/wake scoring as, e.g., scorer 8 scored considerably less wake than the other scorers (Table 4, Figure 4). Overall, these findings show that the disagreement in scoring is not only due to events or stage transitions, but that there seem to be significantly different interpretations of the same visual scoring rules. For example, in N3 scoring, some disagreement might be caused by a different propensity to count the exact length of all delta waves within a given epoch versus looking at the overall morphology of the epoch without exact measurements. Differences in arousal scoring might also have an effect, as some scorers may score arousals where there are delta waves with superimposed alpha activity and exclude these segments when counting the length of delta activity in the epochs. Being from the same sleep centre reduced the variation between the scorers, as expected (Figure 1), although there were still considerable differences.
The overall agreement remained mostly the same regardless of the clock time, and there was no apparent improvement or decrease in the scoring agreement during the night (Figure 3a). More variation is visible in the early and late hours. However, this is simply because fewer subjects are sleeping during those hours. In comparison, during the middle of the night, the mean agreement is calculated from all 50 subjects, which limits the variation. The start or end time, or the length of the recording, had no apparent effect on the scoring agreement either (Figure 3b). No apparent change in the agreement was found from the beginning of the analysis to the end of the analysis either (Figure 3c). This indicates that there was no significant loss of attention or familiarisation with the specific EEG features of the subject during the night.
This study also has certain limitations. The tiebreaker method used when forming the majority score may have a slight impact on the results, as 3.6% of the epochs resulted in a tie between two or more stages. However, this should only be a minimal effect, as in these cases the agreement must already be very low, that is, ≤50%. The areas with low agreement would also still stay the same and, e.g., the mean agreement with majority scoring would be exactly the same regardless of the tiebreaker method. Some recordings had issues with signal quality and sensor connections, and some scorers handled these segments differently from others. For example, if the pulse oximeter was disconnected for a part of the recording, most scorers still scored sleep stages, as pulse oximetry is not required for sleep staging, while some scorers marked these segments as invalid periods or artefacts and thus did not perform sleep staging. For those scorers who did not score these sections, these epochs were considered as wake in the agreement analyses. We chose not to include an additional 'unscored' stage, as only 0.2% of all scored epochs were labelled as unscored.
Instead, we considered these epochs as wake, as it is the closest stage for them: artefacts and invalid periods are not counted towards total sleep time, and no respiratory events would be scored over them either. Finally, as the recordings were type II PSGs, they included varying amounts of wake before and after the actual sleep period, ranging from a few minutes to multiple hours, with no reliable markings for 'lights-on' or 'lights-off' times. This makes accurately determining sleep latency and sleep efficiency difficult. Therefore, as there were no reliable markings that could be used to determine the time in bed, we elected to trim all excess wake periods from the start and the end, even though the sleep latency period may also be partially cut. We chose this approach over including an arbitrary amount of wake before and after the sleep period, as this would have only caused a positive bias to the accuracy of wake scoring.
In conclusion, although the overall sleep staging agreement was high, there are several areas for improvement. The most frequent disagreement was in the NREM stages, especially in N1 sleep. As there is almost never 100% agreement in N1 scoring, there may be a need to re-evaluate its value in sleep staging and whether it should be scored separately in the future. In addition, although stage transitions were identified as a partial cause of disagreement, there seem to be fundamental differences in how different scorers perform sleep staging. As the agreement was higher between the scorers from the same sleep centres, the disagreement is likely at least partially due to different interpretations of the same scoring rules. Thus, it may be necessary to re-evaluate and improve some of the scoring rules if the sleep staging agreement is to be improved.
Fleiss' κ (κf; Fleiss, 1971) was used for multi-rater comparisons. To get sleep stage-specific κ values for comparison with previous studies, the κf for each sleep stage was also calculated in a binary manner, that is, each sleep stage was individually compared with all other stages. Thus, this binary κ (κfb) represents more of a detection accuracy for each sleep stage. However, it should be noted that as the κ values are dependent on the number of scorers and the number and length of recordings, the κ values between different datasets cannot be directly compared. In addition, we calculated the agreement in each sleep stage based on the majority score, both with the standard five sleep stages and with N1 and N2 combined into a light sleep stage. We also calculated how the number of sleep stage transitions and respiratory events affected the scoring agreement. For this analysis, we calculated confusions between sleep stages. A confusion was an epoch with mixed scoring of two stages, that is, an N1/N2 confusion would be an epoch where, e.g., two scorers had scored the epoch as N1 and eight scorers as N2. The level of agreement did not matter for this calculation, that is, 3 × N1 + 7 × N2 was counted as an N1/N2 confusion in the same way as 6 × N1 + 4 × N2. Finally, we calculated the proportions of all the different scoring combinations between the five sleep stages and confusion matrices for each scorer against the majority scoring.

T A B L E 4 Total scored minutes of each sleep stage across all 50 subjects.
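The multi-rater metrics described in this fragment can be sketched as follows (an illustrative Python version; the study's analyses were performed in Matlab, and the per-epoch rating lists are an assumed data layout):

```python
from collections import Counter

def fleiss_kappa(epochs, categories):
    """Fleiss' kappa; epochs is a list of per-epoch rating lists (one label per rater)."""
    n_raters = len(epochs[0])
    counts = [Counter(e) for e in epochs]
    # mean per-epoch observed agreement
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / len(epochs)
    # chance agreement from marginal category proportions
    total = len(epochs) * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)
    return (p_bar - p_e) / (1 - p_e)

def binary_stage_kappa(epochs, stage):
    """Binary kappa in the spirit of the text's κfb: one stage vs all others."""
    relabelled = [[s if s == stage else "other" for s in e] for e in epochs]
    return fleiss_kappa(relabelled, [stage, "other"])

def count_confusions(epochs, stage_a, stage_b):
    """Epochs whose ratings mix exactly two stages, regardless of the split,
    matching the text's definition (3 x N1 + 7 x N2 counts like 6 x N1 + 4 x N2)."""
    return sum(1 for e in epochs if set(e) == {stage_a, stage_b})
```

Note that `fleiss_kappa` is undefined when all raters use a single category for every epoch (chance agreement becomes 1), which does not arise in a full night of staging.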
F I G U R E 2 The specific scoring agreement for each of the 50 subjects and the amount of each sleep stage scored by all 10 scorers. The subjects are ordered from lowest to highest mean agreement. Mean agreement is calculated as the mean agreement with majority scoring.
F I G U R E 3 Mean agreement with majority score across all subjects during a specific clock time (a), the clock times when each recording was conducted (b), and the agreement as a function of time from analysis start (c).

The agreements presented in the present study are mostly in line with previous studies, which have reported overall agreements of between κ = 0.57 and κ = 0.76 (Danker-Hopfe et al., 2009; Magalang et al., 2013; Norman et al., 2000; Penzel et al., 2013; Rosenberg & Van Hout, 2013; Zhang et al., 2015). More detailed agreements reported in previous studies are presented in Table 5.

F I G U R E 4 Confusion matrices against the majority score for each of the 10 scorers.

F I G U R E 5 Cumulative distribution for the scoring agreement (a) and a bar chart of all scoring combinations and their proportion of all scored epochs from all 50 subjects (b).
The low agreement during stage transitions could also at least partially explain the low agreement in N1, as N1 periods are usually quite short and the stage transition areas therefore make up a much larger portion of the total amount of N1.

F I G U R E 7 Example of a subject with a high inter-scorer agreement. Hypnodensity represents a stacked histogram of scored stages for each epoch. Agr., agreement with majority score; R, rapid eye movement; W, wake.
F I G U R E 8 Example of a subject with a low inter-scorer agreement. Hypnodensity represents a stacked histogram of scored stages for each epoch. Agr., agreement with majority score; R, rapid eye movement; W, wake.

T A B L E 1

T A B L E 3 Scoring agreement in each sleep stage across all 50 subjects.