Agreement on the assessors' ratings for the individual items in the fall risk assessment tool
The results showed that the simple and weighted κ-values were high and statistically significant, indicating a high degree of agreement between the ‘gold standard’ assessor's and the facility assessors' ratings for the individual items in the fall risk assessment tool. The one exception was the item ‘effects of medications’, whose simple and weighted kappa values (κ and κw = 0.630), although significant and substantial, fell below the 0.8 needed for almost perfect agreement.
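The 0.8 cut-off and the label ‘substantial’ used above match the widely used Landis and Koch benchmarks, which the discussion appears to follow. A minimal sketch of that (assumed) scale applied to the reported values:

```python
# Landis and Koch benchmarks for interpreting kappa (an assumption here;
# the 0.8 cut-off in the text matches this scale's 'almost perfect' band).
def kappa_label(kappa):
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.00, "almost perfect")]:
        if kappa <= upper:
            return label

print(kappa_label(0.630))  # 'effects of medications' -> substantial
print(kappa_label(1.000))  # 'post-GA/RA' -> almost perfect
```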
Comparing the weighted κ-values, the range was relatively wide (0.630–1.000). The lower degrees of inter-rater reliability were possibly due to ambiguous operational definitions of some items. For instance, the work protocol stated that 15 points are scored if the patient verbalizes feeling weak, but the meaning of ‘weak’ was not defined or elaborated. This supports further refinement of the item definitions to enable accurate identification of high-risk patients.
For the item ‘use of assistive devices’, the weighted κ-value (κw = 0.835) was slightly lower than the simple κ-value (κ = 0.838). For illustration, the contingency table for this item is presented (Table 4). The two raters agreed exactly in 129 of the 142 ratings. Three ratings deviated from exact agreement by one category. However, maximum disagreement occurred in nine cases, in which the ‘gold standard’ rater rated ‘crutches, cane, walker’ whereas the facility raters rated ‘no device’. These strong disagreements lowered the weighted κ-value, perhaps because of the assessors' differing perceptions of individual patients' need for an assistive device. Training that includes evaluation of performance on real patients and the use of photographs to illustrate the items is recommended to reinforce learners' memory.
Table 4. Contingency table of item ‘use of assistive devices’
(rows: facility rater; columns: ‘gold standard’ rater)

| Facility rater | No device | Furniture, support (e.g. walls, wheelchair) | Crutches, cane, walker | Total |
|---|---|---|---|---|
| No device | 52 | 0 | 9 | 61 |
| Furniture, support (e.g. walls, wheelchair) | 0 | 10 | 3 | 13 |
| Crutches, cane, walker | 1 | 0 | 67 | 68 |
| Total | 53 | 10 | 79 | 142 |
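As a minimal sketch (not the authors' code), simple and linearly weighted Cohen's kappa can be computed from the 3×3 contingency matrix implied by the counts reported for Table 4 (rows = facility rater, columns = ‘gold standard’ rater; the ‘no device’ row is derived from the reported totals and agreement counts):

```python
# Contingency matrix implied by Table 4 and the counts in the text.
table = [
    [52, 0, 9],    # facility: no device
    [0, 10, 3],    # facility: furniture, support
    [1, 0, 67],    # facility: crutches, cane, walker
]

def cohens_kappa(table, weighted=False):
    m = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(m)) for j in range(m)]

    def w(i, j):
        # Linear agreement weights: 1 on the diagonal, 0 for maximum
        # disagreement; simple kappa credits only exact agreement.
        return 1 - abs(i - j) / (m - 1) if weighted else float(i == j)

    po = sum(w(i, j) * table[i][j] for i in range(m) for j in range(m)) / n
    pe = sum(w(i, j) * row_tot[i] * col_tot[j]
             for i in range(m) for j in range(m)) / n ** 2
    return (po - pe) / (1 - pe)

print(round(cohens_kappa(table), 3))                 # -> 0.838
print(round(cohens_kappa(table, weighted=True), 3))  # -> 0.835
```

Note how the nine maximum (two-category) disagreements receive zero credit under the linear weights, which is what pulls the weighted value slightly below the simple one for this item.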
In contrast to the item ‘use of assistive devices’, whose weighted κ-value (κw = 0.835) was slightly lower than its simple κ-value (κ = 0.838), the item ‘unsteady gait’ had a weighted κ-value (κw = 0.958) slightly higher than its simple κ-value (κ = 0.955). This indicates that relative agreement on ‘unsteady gait’ was slightly higher than exact agreement. Because the difference was minimal, the raters' disagreement would not have had a strongly negative impact. However, for mobility there should be absolute agreement between the raters to ensure that appropriate fall prevention strategies are implemented. The explanations of the terms may not have been specific enough; a more distinct measure of mobility, such as the ‘get up and go’ test, should be utilized.
It was expected that the items ‘fall(s) over the past 6 months’ and ‘continuous intravenous therapy’ would be assessed by all staff members in the same manner. Although the simple and weighted kappa values were relatively high for ‘fall(s) over the past 6 months’ (κ and κw = 0.968) and for ‘intravenous therapy’ (κ and κw = 0.913), neither achieved a perfect value of 1. The difficulty in rating ‘fall(s) over the past 6 months’ correctly was likely attributable to the patients' ambiguous answers and their limited ability, given their old age, to recall their fall histories. For instance, ratings could differ between assessors if a patient recalled events differently on separate occasions, introducing recall bias. For easy reference by the ward staff, any fall over the past 6 months should be documented in the case sheets on the patient's admission to the hospital.
Continuous intravenous therapy was assessed by direct observation of the patient's current status. However, there was a time gap between the ratings of the ‘gold standard’ assessor and the facility assessors, leading to discrepancies in the assessments of the patients' conditions. As stated in the study hospital's work protocol, a patient's fall risk status should be assessed on admission, when there is a change in the patient's condition/treatment, on transfer from another department, or after a fall. It was observed that the majority of the ward staff only reassessed patients' fall risk status while writing their nursing reports within 2 hours before the report passing time. Prompt reassessments could be compromised by busy work routines; nevertheless, the importance of accurate and prompt fall risk assessments should be reinforced among the ward staff to facilitate the implementation of effective fall prevention interventions.
Considering only the weighted κ-values, agreement between the raters was highest for the items ‘post-GA/RA’ and ‘risk-taking behaviour’, each with an excellent weighted κ-value of 1, showing that the ‘gold standard’ rater agreed with the facility raters on all assessment occasions. Conversely, the raters disagreed most on the item ‘effects of medications’ (κw = 0.63). It was observed that not all patient case folders or inpatient medication records had a list of ‘fall risk’ medications inserted for reference, and that the majority of the staff did not refer to the patient's medication records during the fall risk assessment. Similarly, there was no absolute agreement on the item ‘secondary diagnosis’ (κw = 0.902). This item could be assessed easily and accurately by reading through the patient's medical history in the case notes, but some of the ward staff did not do so. Assessment based on the nurses' prior knowledge of the patients' conditions could introduce inaccuracies into the documentation, directly leading to miscategorization of the patient's fall risk status. Counterchecking against reliable sources should be reinforced.
Correlation in the overall scores of the fall risk assessment tool between the assessors
The Spearman's rho coefficient (rs) was 0.89, and the significance level was very small (P < 0.001), below the preset 5% (0.05) level of significance. This showed a significantly high correlation between the overall scores of the fall risk assessment tool as rated by the ‘gold standard’ assessor and the facility assessors.
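Spearman's rho is the Pearson correlation computed on tie-adjusted ranks of the paired overall scores. A minimal sketch, using hypothetical score pairs rather than the study data:

```python
# Sketch of Spearman's rho for paired overall fall-risk scores.
# The score pairs below are hypothetical, for illustration only.
gold =     [80, 75, 55, 90, 40, 80, 65, 70]
facility = [78, 75, 50, 85, 45, 80, 60, 72]

def rank(values):
    # Assign 1-based ranks, averaging the ranks of tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Pearson correlation on the ranks.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(round(spearman_rho(gold, facility), 2))  # -> 0.99 (cf. rs = 0.89 in the study)
```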
Inter-rater reliability is an important quality of any assessment tool in health-care settings, and the overall agreement was relatively good, with a significantly high correlation coefficient of 0.89. Although the median scores of the ‘gold standard’ assessor (80.0) and the facility assessors (77.5) differed, the resulting difference in sum scores (2.5) was relatively small, and both scores were still categorized as ‘high fall risk’.
The original version of the Morse Fall Scale has been tested rigorously in other settings, and its inter-rater reliability coefficient has varied across study contexts: high inter-rater reliability (r = 0.96) reported by Morse in 1997, relatively low inter-rater reliability of 0.68 by McCollam in 1995 and moderate reliability (κ = 0.80) by Ang et al. in 2007. The inter-rater reliability in this study (rs = 0.89) appears relatively high compared with these previous studies. However, caution is warranted in comparing inter-rater reliability directly across studies, as they took place in different settings with different populations, study designs and data analyses; coefficients based on different statistics cannot be compared directly.
The variability in the overall scores between the assessors may have been due to different levels of knowledge about the patients being assessed. On admission, the facility assessor had the opportunity to observe the patient's condition and conduct a more detailed interview with the patient when completing the nursing assessment record. Their knowledge of the patient's background may therefore have been greater than that of the ‘gold standard’ rater, who focused only on the fall risk assessment tool without gaining more in-depth knowledge of the patients' backgrounds.
This was a single-centre study with purposive selection of the study wards; the generalizability of the results to the wider population is therefore limited. Secondly, for ethical reasons it was impossible to refrain from preventing a patient's fall while performing the fall risk assessment. Simultaneously predicting the criterion variable and preventing its occurrence renders the study imperfect, as the accuracy of the assessment tool was affected. Thirdly, a gold-standard and paired-observer method is used when resources are insufficient to allow all subjects to be assessed simultaneously by two raters; however, the time gap separating the ratings of the ‘gold standard’ rater and the facility raters might decrease the inter-rater reliability coefficient because of unpredictable changes during the time lapse. Fourthly, designating the researcher as the ‘gold standard’ is debatable: compared with the ward nurses, the researcher was in a less advantageous position, with fewer observation periods and less prior knowledge of the patients' conditions. Lastly, the ‘gold standard’ assessor might have been more careful in her fall risk assessments because the data were collected as part of a research study; this ‘Hawthorne effect’ can artificially alter the inter-rater reliability.