An intervention to improve the interrater reliability of clinical EEG interpretations


Address: Dr Hideki Azuma, Department of Psychiatry, Nagoya City University Medical School, Mizuho-cho, Mizuho-ku, Nagoya 467-8601 Japan.


Several studies have noted modest interrater reliability of clinical electroencephalogram (EEG) interpretations. Moreover, no study to date has investigated a means to improve the observed interrater agreement. The purpose of the present study was to examine (i) the interrater reliability of EEG interpretations among three raters (two psychiatrists and one pediatrician); and (ii) how to improve that reliability by establishing a consensus guideline for EEG interpretation. The three raters interpreted 100 consecutive EEG recorded at Tajimi General Hospital. After discussing the results of the first trial, the raters established a consensus guideline for EEG interpretation. They then interpreted 50 consecutive EEG recorded at Nagoya City University Hospital following this guideline. Kappa for the global judgment of EEG abnormality in three grades (abnormal/borderline/normal) was 0.42 on the first trial and 0.63 on the second. Kappa improved significantly with use of the guideline (P = 0.004). It is suggested that discussing and establishing a consensus guideline among raters offers a feasible method to improve interrater reliability in clinical EEG interpretations.


Psychometrics teaches us that reliability, or interrater agreement, sets the upper limit to validity. However, in any clinical examination there is bound to be some interrater variability, especially when the interpretation depends on the subjective judgment of the examiner.

Electroencephalogram (EEG) is just such an examination, and several studies have examined this problem. Earlier studies on EEG reliability generally found relatively low agreement among interpreters.1–4 However, none of these studies reported chance-corrected coefficients of reliability such as the kappa or intraclass correlation coefficient, and their results are hard to interpret in modern psychometric terms. Later studies reported chance-corrected coefficients for the overall judgment of whether the record was normal/abnormal or normal/borderline/abnormal; these studies reported kappas of around 0.5, 0.8 and 0.5, respectively.5–7 According to Landis and Koch, a kappa <0.00 means poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement.8 In other words, except for Struve et al.,6 these more recent authors confirmed the moderate interrater agreement in EEG readings.
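The Landis and Koch bands cited above can be expressed as a small lookup, which may help when reporting many kappas at once; `landis_koch` is an illustrative helper name, not software used in the study:

```python
def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis and Koch (1977) descriptive category."""
    if kappa < 0.00:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # guard for kappa values reported above 1.00

print(landis_koch(0.42))  # moderate (first-trial global judgment)
print(landis_koch(0.63))  # substantial (second-trial global judgment)
```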

Many factors seem to influence the interrater reliability of EEG interpretations.9 From a clinical point of view, however, it is not enough to examine the reliability of EEG interpretation per se. We must seek ways to increase the interrater reliability.10 However, as far as the present authors are aware, none of the aforementioned studies demonstrate how to improve the moderate reliability observed in EEG interpretations. In fact there are only a few studies that show how to improve reliability in the clinical laboratory overall.11–13 It is generally agreed that jointly examining the sources of interrater variability and establishing a consensus guideline to address the uncertainties can enhance the reliability.

The objectives of the present study are therefore twofold: first, to investigate the interrater reliability of EEG readings among three raters (two psychiatrists and one pediatrician); and second, to test if establishing a consensus guideline for EEG interpretation increases its reliability.


The first set of EEG comprised the 100 consecutive recordings made in 2001 at Tajimi General Hospital in Gifu, Japan, where referrals cover both sexes and a wide range of ages and diagnoses. Three electroencephalographers first interpreted the EEG independently. Two of them were psychiatrists and one was a pediatrician. All had >10 years' experience in clinical electroencephalography. Both psychiatrists were trained at the same department of psychiatry, but one had subspecialty electroencephalographic experience with dementia and the other did not.

After interpretation of the first 100 EEG, we examined whether there were notable disagreements among the three electroencephalographers. Where there was substantive variation, we discussed its possible sources; we then read 50 consecutive EEG recorded at Nagoya City University Hospital in Nagoya, Japan, in accordance with the guideline we established.

The EEG used in these trials were recorded with the international 10-20 system of electrode placement in usual clinical settings. All the EEG were recorded with referential montages (including average potential reference) and bipolar montages (linked bipolar, triangulation and circumferential bipolar recording). The methods of activation, such as eye opening and closure, hyperventilation, photic stimulation and sleep activation, were routine in both hospitals.

The three raters were not informed of the patients’ characteristics except for age. Furthermore, there was no special instruction for technicians, and the recordings were performed in their natural clinical settings. The EEG were rated for the presence/absence of the following three types of abnormalities: epileptic abnormalities; paroxysmal abnormalities (including epileptic ones); and non-paroxysmal abnormalities. In addition, the global judgment of abnormal/borderline/normal was used.


The interrater reliability was expressed in terms of kappa, which adjusts the observed agreement for chance. The kappa statistic is defined as (Po − Pc)/(1 − Pc), where Po is the percent observed agreement and Pc is the percent agreement expected to occur by chance alone. Kappa equals zero when observed agreement does not differ from chance agreement, and assumes increasingly positive values, to a maximum of +1, as observed agreement exceeds chance. The level of clinical significance for kappa is based on Landis and Koch.8 We used pc-agree (MS-DOS-based software)14,15 to calculate kappas. Two paired kappas were compared using Z-values.16
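A minimal sketch of the two-rater kappa computation from the formula above may clarify the chance correction. This is an illustrative implementation, not the pc-agree software the study actually used, and `cohens_kappa` and the example labels are hypothetical:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters: (Po - Pc) / (1 - Pc)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Po: fraction of records both raters labelled identically.
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Pc: chance agreement from the product of each rater's marginal proportions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    pc = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (po - pc) / (1 - pc)

# Hypothetical three-grade judgments on six records by two raters.
a = ["abnormal", "normal", "borderline", "normal", "abnormal", "normal"]
b = ["abnormal", "normal", "normal", "normal", "abnormal", "borderline"]
print(round(cohens_kappa(a, b), 2))  # 0.45: raw agreement 0.67, chance 0.39
```

Note that the raw agreement here (4/6 ≈ 0.67) overstates the chance-corrected value, which is the reason kappa rather than percent agreement is reported.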


The EEG recordings were obtained from subjects of all ages. On first trial the distributions of the subjects in the age ranges of 0–10, 10–20, 20–60 and 60–100 years were 19%, 17%, 41% and 21%, respectively. On second trial the distributions of the subjects aged 0–10, 10–20, 20–60 and 60–100 years were 10%, 36%, 40% and 14%, respectively.

Table 1 shows the base rates of EEG abnormalities and the agreement among the three raters on first and second trial. On first trial, kappas of global judgment and paroxysmal abnormalities were moderate, those of epileptic abnormalities were substantial, and those of non-paroxysmal abnormalities were fair.8

Table 1. Kappa among three raters (95%CI) and base rate on first and second trial

                                                   Base rate          Kappa (95%CI)
                                                First    Second   First trial          Second trial
Paroxysmal abnormalities (yes or no)             51%      42%     0.51 (0.42 to 0.60)  0.80* (0.70 to 0.90)
Epileptic abnormalities (yes or no)              31%      32%     0.69 (0.59 to 0.79)  0.87* (0.78 to 0.96)
Non-paroxysmal abnormalities (yes or no)         89%      78%     0.25 (0.17 to 0.33)  0.45* (0.33 to 0.58)
Global judgment (normal, borderline, abnormal)   79%      70%     0.42 (0.34 to 0.50)  0.63* (0.51 to 0.74)

*All kappas on the second trial (paroxysmal abnormalities, epileptic abnormalities, non-paroxysmal abnormalities and global judgment) improved statistically significantly compared with those on the first trial (P < 0.05).
The base rates of abnormal findings were defined as the ratio of the number of abnormalities noted by at least one rater to the total number of electroencephalograms (EEG). The base rates of paroxysmal and epileptic abnormalities were medium, but that of non-paroxysmal abnormalities was rather high. CI, confidence interval.
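The base-rate definition in the table footnote (records flagged by at least one rater, divided by the total number of EEGs) can be sketched as follows; `base_rate` and the ratings are hypothetical illustrations, not study data:

```python
def base_rate(ratings_by_rater):
    """Fraction of records flagged abnormal (True) by at least one rater.

    ratings_by_rater: one boolean list per rater, each with one entry per EEG.
    """
    n_records = len(ratings_by_rater[0])
    flagged = sum(
        any(rater[i] for rater in ratings_by_rater) for i in range(n_records)
    )
    return flagged / n_records

# Hypothetical ratings of 5 EEGs by 3 raters (True = abnormality present).
r1 = [True, False, False, True, False]
r2 = [True, False, True, False, False]
r3 = [False, False, False, True, False]
print(base_rate([r1, r2, r3]))  # 0.6: three of five records flagged at least once
```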

Consensus guideline

After reading the first set of 100 EEG, all investigators discussed the disagreements among the three raters. We found that each of us had followed pre-established interpretative guidelines that were, however, not consistent with one another. We therefore set up a guideline for the second trial among the three raters, following a single representative international textbook in electroencephalography.17–19

Basic rhythm

Absence of attenuation of the basic rhythm immediately after eye opening is not abnormal. A diffuse alpha pattern is an abnormal finding in adult EEG if its frequency is lower than 8 Hz. The basic rhythm is evaluated 5–10 s after eye closing, or immediately after eye closure.

Fast wave

Amplitude >30 µV is abnormal in adult EEG.

Slow wave

In adult EEG, prominent slow waves slower than 5 Hz are abnormal. In child EEG, slow components are not abnormal when the frequency of the basic rhythm is within the normal limits for age.

Sleep spindle

Spindle duration of >2 s is abnormal.


14 and 6 Hz positive bursts, 6 Hz spike-and-slow-waves and small sharp spikes are regarded as abnormal. An EEG in which it is difficult to decide whether the subject is awake with scant basic rhythm or is nearly in stage 1 sleep is regarded as a borderline record.

The second set of EEG was interpreted following this guideline. The kappas for global judgment and paroxysmal abnormalities were now substantial, that for epileptic abnormalities almost perfect, and that for non-paroxysmal abnormalities moderate. All these kappas improved significantly in comparison with those on the first trial (Table 1).


Interrater agreement among the three raters on the first trial was fair to moderate, and the kappa for global judgment was 0.42 (95% confidence interval (CI): 0.34–0.50). Overall, our study can be said to have once again confirmed the moderate interrater reliability of EEG readings in clinical settings. For global judgment we chose normal, borderline, and abnormal; we considered a gray zone to be necessary in current EEG interpretations, just as is the case with many clinical judgments.20

We discussed the three raters’ first-trial disagreements and evaluated their observations. Some of the statements in our guideline represent issues that remain controversial among neurophysiologists. However, in making a summary evaluation of an EEG as normal/borderline/abnormal, we must establish what we mean by these expressions. The situation is analogous to that of diagnostic criteria in psychiatry: the criteria in the Diagnostic and Statistical Manual of Mental Disorders (4th edn; DSM-IV) and the International Statistical Classification of Diseases and Related Health Problems (10th revision; ICD-10) do not provide the ultimate truth in psychiatric nosology, but we must all know explicitly what we mean when we use a certain diagnostic label.21,22 We consider that our guideline applies only in the setting of this second trial and is inadequate for interpreting EEG findings in routine clinical settings. Because there were already many observations on which the investigators agreed, we were able to raise their reliability with a limited guideline targeted at their disagreements concerning clinical EEG interpretations.

We know of no study that actually demonstrates increased reliability in clinical EEG interpretation by the same electroencephalographers. Two studies reported good reliability in EEG interpretation or seizure classification when explicit definitions were set down.23,24 Another study reported a better reliability coefficient than a former, separate study when the investigators defined recognizable ictal EEG findings.25,26 The present study shows that constructing a guideline targeted at the disagreed-upon observations improved the reliability from 0.42 in the first trial to 0.63 in the second trial (P < 0.05).

After discussing and establishing the consensus guideline, observer variability in paroxysmal abnormalities, epileptic abnormalities, non-paroxysmal abnormalities and global judgment was markedly reduced. Of note, the kappa for epileptic abnormalities in the second trial was almost perfect, whereas that for non-paroxysmal abnormalities was only moderate.8 Kappa for paroxysmal abnormalities increased significantly from 0.51 to 0.80 and seemed to be influenced by that for epileptic abnormalities, which in turn increased from 0.69 on the first trial to 0.87 on the second; we included epileptic abnormalities among the paroxysmal ones.18 One reason for the improved kappa was that the raters had recognized that epileptic discharges were overlooked on the first trial; Houfek and Ellingson called this kind of omission a ‘mental lapse’.2 Our kappa of 0.69 for epileptic abnormalities on the first trial was superior to that of 0.50 by Donselaar et al.7 We chose a dichotomous evaluation (yes or no) for epileptic abnormalities, although continuous evaluation has been reported to be superior to dichotomous evaluation.7,27

For non-paroxysmal abnormalities, kappa increased significantly from 0.25 to 0.45. The reliability for non-paroxysmal abnormalities was lower than those for paroxysmal and epileptic abnormalities, which were related to each other, and the reliability for global judgment seemed to be influenced mostly by that for non-paroxysmal abnormalities. This may have been due to one rater adopting strict interpretative standards for slow waves; establishing the consensus guideline helped reduce this variance. It is difficult to compare our result for non-paroxysmal abnormalities directly with a correlation coefficient of 0.77 for theta waves in dementia patients.28 Some studies have pointed out that interpretive disagreements over whether slowing is abnormal may result from confusion over whether the observed slowing represents transient periods of normal drowsiness or sleep activity.4,6

In conclusion, we suggest that discussing and establishing consensus guidelines among raters offers a feasible method to improve interrater reliability in clinical EEG interpretations. We are fully aware, however, that the guideline we developed in the present study may be controversial in some respects, because the clinical significance of some EEG findings, such as 14 and 6 Hz positive bursts, is not yet established. Ultimately, advances in the science of clinical neurophysiology are needed before a validated guideline for interpreting EEG can be developed. To resolve that problem we will need to assess the validity of EEG against a gold standard of precise diagnosis, pathological observation or other laboratory findings. Until then, we need to make our interpretative standards explicit.


We would like to thank SD Walter, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, for his kind and helpful advice.