Learning curve using the Sunnybrook Facial Grading System in assessing facial palsy: An observational study in 100 patients

Little is known about facial function assessments by inexperienced observers in facial palsy. In this observational study, we examined the learning curve of two inexperienced observers assessing the facial function of 100 patients using the Sunnybrook Facial Grading System. Interobserver agreement gradually improved over time, stabilizing after approximately 70 assessments. Agreement was best for the voluntary movement subscore, followed by the synkinesis and resting symmetry subscores. Inexperienced observers can perform facial function assessments in facial palsy, but should be adequately trained first.

| INTRODUCTION
Assessment of facial function in facial palsy patients is important to evaluate current status and treatment effect. The Sunnybrook Facial Grading System (SB) is one of the clinician-graded scales of facial function. 1 Inter- and intraobserver reliability ranges from 0.838 to 0.980 and 0.831 to 0.997, respectively. 2 However, most reliability studies included experienced observers. In research projects, facial palsy assessment is often done by medical students. Additionally, general practitioners, starting residents and physical therapists may not have extensive experience in facial palsy assessment using the SB.
The aim of this study was to analyse the learning curve for facial function assessment in facial palsy using the SB in a 7-week prospective observational study with two inexperienced final-year medical students.

| Ethical considerations
This study was performed at the University Medical Center Groningen, the Netherlands. No formal ethical review by the Institutional Review Board was required. All patients provided written consent prior to this study.

| Procedure
Two medical students without previous extensive knowledge of facial palsy (TB and MA) participated in this 7-week training period in March and April 2019 using the SB. Both were final-year medical students, approximately 6 months from obtaining their MD degree, who had completed two years of clinical rotations. Prior to the start, the students were informed of the criteria for grading the SB, 3 watched the SB and eFACE tutorial videos (https://sunnybrook.ca/content/?page=facialgrading-system and http://links.lww.com/PRS/B355, respectively) and performed two SB assessments together with a researcher with experience in facial palsy grading (MMvV). Thereafter, in week one, the students independently watched 10 videos of facial palsy patients performing standard facial movements and performed an SB assessment of each. At the end of the week, a meeting was held in which the students and the experienced researcher watched the videos and discussed disagreements. In an open dialogue, reasons for choosing a certain grading were shared and discussed, creating a platform for reflection and learning. In each of weeks two to seven, 15 videos were assessed and then reviewed in a joint meeting at the end of the week, resulting in a total of 2 × 100 assessments. The learning curve was investigated by examining changes in interobserver agreement at the (sub)score and item level from week to week.
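The weekly schedule described above can be written out as a short sketch. It assumes, for illustration only, that videos are numbered consecutively in the order they were assessed; the names `videos_per_week` and `week_slices` are hypothetical and not part of the study.

```python
from itertools import accumulate

# Videos assessed per week, as described in the text:
# 10 in week 1, then 15 in each of weeks 2 to 7.
videos_per_week = [10] + [15] * 6

# Running total of assessments per observer at the end of each week.
cumulative = list(accumulate(videos_per_week))


def week_slices(batch_sizes):
    """Return (week, first_video, last_video) tuples, 1-indexed and inclusive.

    Assumes consecutive numbering of videos in assessment order (hypothetical).
    """
    slices = []
    start = 1
    for week, size in enumerate(batch_sizes, start=1):
        slices.append((week, start, start + size - 1))
        start += size
    return slices


schedule = week_slices(videos_per_week)
for week, first, last in schedule:
    print(f"week {week}: videos {first}-{last}")
```

Under this numbering, the end of week 5 corresponds to 70 assessments per observer, and week 7 ends at video 100.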

| Sunnybrook Facial Grading System
The SB is a clinical grading system of facial function in facial palsy.

| Statistical analysis
Descriptive statistics were presented as numbers and frequencies, mean and standard deviation (SD), and median and interquartile range (IQR) when appropriate. For describing the sample, the mean SB scores of both observers were used. Interobserver agreement was analysed by calculating intraclass correlation coefficients (ICC, two-way random effects model, single measures, absolute agreement) for the SB composite score and the three subscores. Item-level interobserver agreement was assessed by calculating Cohen's κ statistic and percentage absolute agreement. For the first SB item (resting eye), an unweighted κ was calculated, since its categories are not ordered. For the second to thirteenth items, a quadratic weighted κ was calculated, since the categories of these items are ordered.
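As an illustration of the ICC variant named above (two-way random effects, single measures, absolute agreement, often written ICC(2,1)), the following is a minimal sketch based on the standard two-way ANOVA mean squares. It is not the software used in the study; the function name `icc_2_1` and the example scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, single measures, absolute agreement.

    scores: list of rows, one row per patient, one score per observer.
    Uses ICC = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n).
    """
    n = len(scores)        # number of patients (rows)
    k = len(scores[0])     # number of observers (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-observer mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical composite scores: observer 2 always scores 5 points higher.
offset_scores = [[10, 15], [20, 25], [30, 35]]
print(round(icc_2_1(offset_scores), 3))
```

Because absolute agreement is requested, a systematic offset between observers lowers the ICC even when the two observers rank the patients identically; a consistency-type ICC would not penalise this.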
Additionally, we assessed final interobserver agreement between the two assessors of the last 50 SB assessments (videos 51-100), since it is generally advised to perform reliability studies with at least 50 participants. 4 A value of 0.7 was taken as an acceptable level of agreement for both the ICC and Cohen's κ, preferably for the lower border of the 95% CI. 4

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Interobserver agreement for individual SB items improved gradually, but improvement differed per item (Table 1). Interobserver agreement on the last 50 patients was best for voluntary movement items (κ from 0.65 to 0.78), compared to resting items (κ from 0.58 to 0.80) and synkinesis items (κ from 0.38 to 0.73). In total, seven of 13 SB items scored Cohen's κ larger than 0.7 (Table 1).

| Synopsis of key findings
Initial agreement between the two inexperienced medical students was poorer than that of experienced observers and below any acceptable threshold for agreement. During the 7-week training and feedback programme, they were able to reach acceptable interobserver agreement on the SB composite score and voluntary movement subscore; agreement on other subscores and individual items was lower. The ICC values for the last 50 observations remained lower than those reported for more experienced observers in the literature. 1-3,5,6 Although individual items should be examined with caution, since the items were not originally validated and investigated individually, it was our impression that some items performed better than others. For example, "brow lift" seemed relatively easy, while "lip pucker" seemed to be more difficult.

FIGURE 1 Graphical representation of interobserver agreement on Sunnybrook Facial Grading System (SB) (sub)scores over time. Circles with interconnecting lines represent the intraclass correlation coefficient (ICC) for interobserver agreement for the videos in each of the 7 wk. The 95% confidence interval is presented only for the SB composite score, since it is the most important scale. The circle with error bars (right) represents the interobserver agreement on the last 50 videos (point estimate of the ICC and 95% confidence interval). A horizontal line was placed at ICC = 0.7, the pre-set acceptable level of agreement.

Contrary to our findings, previous publications including inexperienced observers reported adequate inter- and intraobserver reliability (ICCs > 0.819). 3,5,7 Reasons for this difference could be that we used only two observers, or that these other studies included the full range of SB composite scores, thereby automatically increasing reliability.
In our study, by contrast, most of the facial palsy patients were referred to our plastic surgery department for smile reanimation surgery. Therefore, SB scores were at the lower end of the spectrum (SB composite score range: 0-62) and most patients presented with flaccid facial palsy.

| Strengths and limitations
We described the results of only one protocol. Therefore, we cannot draw conclusions on the optimal number of videos inexperienced observers should watch each week or the number and timing of feedback sessions. Future studies could focus on these questions in order to determine an optimal training protocol for inexperienced observers of facial function in facial palsy.
A limitation of this study was that we assessed only two observers. This number was a practical choice, since both students were performing a research project at our department at the same time, but limits generalisability to the general population of possible observers.
Additionally, since the SB is subjective, we chose not to use expert scores as a gold standard. Instead, disagreements were discussed with the experienced researcher. Including a "reference" score might have changed the results. Thirdly, the assessments were performed from standardised videos instead of in person. Although video assessments have been shown to be reliable in experienced observers, 6,8 this may have been an extra barrier for our students. Indeed, they reported that the ease of the SB assessment depended considerably on the quality of the videos. Additionally, the time needed to complete an SB assessment decreased considerably over time, although a formal analysis could not be done since we did not collect these data.
Two limitations relate to the setting of the study. First, our results are only valid for patients with an SB composite score in the range 0 to 62. Secondly, the "snarl" movement had not previously been incorporated in our standard video for facial palsy. Hence, scoring the "snarl" and its associated synkinesis demanded considerable insight from our students, although this was not reflected in the interobserver agreement.
Lastly, we assessed interobserver agreement using ICCs and Cohen's κ statistics. Although correct, these statistics depend strongly on the distribution of scores. 4 Since the theoretical range of scores for resting symmetry is much smaller than the range of scores for voluntary movement, for example, the ICC for resting symmetry tends to be lower. Likewise, especially for the individual items, the κ can sometimes be very low while the proportion of agreement is still relatively high. This makes interpretation of our results somewhat difficult.
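The dependence of κ on the distribution of scores can be illustrated with a small sketch. The data below are invented for illustration and are not taken from the study; the helper names `percent_agreement` and `cohen_kappa` are hypothetical. The sketch computes Cohen's κ in its disagreement-weight form, with either unweighted or quadratic weights.

```python
def percent_agreement(a, b):
    """Proportion of patients given an identical score by both observers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b, categories, quadratic=False):
    """Cohen's kappa for two observers; quadratic=True uses quadratic weights.

    Returns None when expected disagreement is zero (kappa undefined),
    for example when both observers use a single category throughout.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    table = [[0] * k for _ in range(k)]          # cross-tabulation of counts
    for x, y in zip(a, b):
        table[index[x]][index[y]] += 1
    row = [sum(table[i]) for i in range(k)]      # marginals, observer A
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]  # observer B

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, larger further from it.
        if quadratic:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    expected = sum(weight(i, j) * row[i] * col[j] for i, j in cells) / n ** 2
    if expected == 0:
        return None
    return 1 - observed / expected


# Invented example: an item scored 0 by both observers in most of 50 patients.
obs_a = [0] * 45 + [1, 1] + [0, 0] + [1]
obs_b = [0] * 45 + [0, 0] + [1, 1] + [1]

print(percent_agreement(obs_a, obs_b))               # 46 of 50 identical
print(round(cohen_kappa(obs_a, obs_b, [0, 1]), 2))   # about 0.29
```

In this invented example the observers agree on 92% of patients, yet κ is only about 0.29, because nearly all scores fall in one category and chance-expected agreement is therefore already high.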
a Unweighted kappa since this item is not ordered.
b Cohen's κ could not be calculated since one observer scored all patients as "0," and hence, a 2 × 2 table could not be formed.

| Clinical applicability
The results of our study can be used in future studies in which medical students participate, when training starting residents, or when communicating with colleagues less exposed to facial palsy. On visual inspection of the learning curve, there was little improvement in interobserver agreement from week 5 onwards, by which point each observer had completed 70 assessments. Therefore, we advise that inexperienced observers are supervised for at least 70 SB assessments of facial palsy before their assessments are considered to be adequate.

| CONCLUSIONS
Our study shows that initial agreement on facial function assessment in facial palsy between inexperienced observers is low. During our 7-week training period, agreement gradually increased to acceptable levels, especially for the SB composite score.