Nonalcoholic fatty liver disease (NAFLD) is a complex metabolic liver disease of increasing clinical importance that encompasses simple steatosis, nonalcoholic steatohepatitis (NASH), and cirrhosis.1, 2 The diagnosis of NAFLD is clinicopathological. Although the clinical component of the diagnosis is largely one of exclusion of significant alcohol ingestion and other liver diseases, the histopathological examination is the key part of the evaluation of suspected NAFLD.3 The absence of reliable surrogate markers (either serum, radiologic, or genetic) of NAFLD presence or severity emphasizes the pivotal requirement that the liver biopsy accurately represents true disease extent. Typically, histopathology confirms the suspected diagnosis and permits the assessment of the grade and stage of disease, providing prognostic information that is otherwise unavailable. The central role of histology in assessing the response to therapeutic intervention is further emphasized as potential therapeutic agents emerge for the treatment of NASH.4–7 One large clinical trial8 showed apparent improvement rates for inflammation and fibrosis of 23% and 24%, respectively, in the placebo arm, adding further uncertainty to the true natural history of the condition and the extent of sampling variability.9 The histological findings of NAFLD are assumed to be homogenously distributed throughout the liver and therefore representative of overall hepatic involvement. 
However, few published data exist to support these presumptions.10, 11 Recent publication of the NAFLD Activity Score (NAS) also provides us the first opportunity to use this simple, validated, semiquantitative, discriminative scoring system for the assessment of a range of histological features of NAFLD.12 Therefore, the aim of this study was to determine the correlation of detailed histological findings of paired contemporaneous right and left lobe liver biopsies in the diagnosis of NAFLD using commonly employed diagnostic grading/staging systems and also the NAS in persons with a high probability of having NAFLD—morbidly obese patients undergoing bariatric surgery. We hypothesized that the composite histological data from two biopsy cores would provide a more accurate representation of the true histopathology of the liver than a single core. Consequently, by comparing a single-sided biopsy with the composite findings of two biopsies, we aimed to determine whether the data provided by a second biopsy would impact the diagnostic yield.
Abstract
In the absence of surrogate markers, the evaluation of suspected nonalcoholic fatty liver disease (NAFLD) is highly dependent on histological examination. The extent of sampling variability affecting the reliability of a single liver biopsy in patients with suspected NAFLD is poorly characterized. This prospective study aimed to correlate precise histological findings in paired biopsies (right and left lobe) in the diagnosis of NAFLD in morbidly obese subjects undergoing bariatric surgery, employing both the Brunt and Matteoni classifications and the NAFLD Activity Score (NAS). We also aimed to determine whether the composite histopathological findings of the two biopsies would improve diagnostic accuracy. Consecutive subjects had an intraoperative biopsy from both right and left lobes, evaluated and scored in a blinded manner. Intraobserver agreement was also assessed. Kappa coefficients of agreement were calculated. Forty-one subjects had acceptable biopsies. Agreement was excellent for steatosis and moderate for fibrosis. Concordance was only fair for most features of necroinflammation. Intraobserver agreement was only moderate for lobular inflammation. Excellent agreement was seen for the diagnosis of NASH using Brunt criteria and good agreement when using Matteoni and NAS scoring systems. Composite biopsy data particularly improved identification of hepatocyte ballooning. The diagnostic accuracy also improved substantially when composite features were compared with single-sided biopsy features, especially for the Matteoni and NAS scoring systems. In conclusion, significant sampling variability occurs in NAFLD, particularly for features of necroinflammation. This should be factored into the design of clinical trials and studies of the natural history of the disease. (HEPATOLOGY 2006;44:874–880.)
Patients and Methods
Consecutive, morbidly obese subjects without known liver disease undergoing elective bariatric surgery for the management of morbid obesity at a single center were enrolled from July 2002 until December 2004. Subjects were excluded if they reported a history of excessive alcohol use (greater than 20 g/d male, or 10 g/d female). All subjects had standard testing to exclude chronic hepatitis C and B and iron overload. Subjects with histopathological findings suggestive of another liver disease also were excluded. Standard clinical, anthropometric, and biochemical measurements were obtained. The study was approved by the institutional Committee on Human Research, and all subjects provided informed consent. All persons underwent a Roux-en-Y gastric bypass, performed laparoscopically in 90% of cases. Intraoperatively, each subject had a percutaneous 16 gauge Tru-cut (Cardinal Health, McGaw Park, IL) biopsy performed of both the right and left lobes of the liver under direct laparoscopic guidance but without targeting of a specific area in each lobe. Each biopsy was individually coded and processed separately for histopathological interpretation in a standard manner, stained with both hematoxylin-eosin and Masson trichrome. Biopsies were interpreted by a single experienced hepatic histopathologist (L.D.F.), blinded both to patient identity and site of biopsy, with each biopsy of a pair read randomly and at separate times. A minimum of six portal tracts per biopsy specimen was deemed adequate for study analysis.13
The study was performed prospectively. A detailed descriptive scoring system for NAFLD was employed using precisely the recently published histological scoring system for NAFLD of Kleiner et al.12 Briefly, steatosis was graded 0 to 3 as follows: 0 representing <5%, 1 representing 5% to 33%, 2 representing >33%-66%, and grade 3 representing >66%. The predominant zonal distribution pattern of fat was assessed as zone 3, zone 1, azonal, or panacinar. Lobular inflammation was graded 0 to 3, with 1 representing less than two inflammatory foci per high-power field; 2, two to four per high-power field; and 3, more than four per high-power field. Liver cell injury as reflected by ballooned hepatocytes was determined as absent (0), few (1), or many/prominent (2). Fibrosis was staged as none (0); mild, zone 3 perisinusoidal (1a) requiring trichrome staining; moderate, zone 3, perisinusoidal (1b) evident on hematoxylin-eosin staining; portal or periportal only (1c); perisinusoidal and portal/periportal (2); bridging fibrosis (3); and cirrhosis (4). Other more uncommon histological features considered less important in the characterization of NAFLD were noted and scored precisely as described by Kleiner et al.12
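The grading thresholds described above can be sketched in a few lines of code. This is only an illustration: the function names are hypothetical, the handling of boundary values follows the ranges as written, and treating zero foci as grade 0 is an assumption, because the text defines only grades 1 to 3 for lobular inflammation.

```python
def steatosis_grade(percent: float) -> int:
    """Map a steatosis percentage to grade 0-3 using the ranges in the text."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 5:
        return 0      # <5%
    if percent <= 33:
        return 1      # 5% to 33%
    if percent <= 66:
        return 2      # >33% to 66%
    return 3          # >66%


def lobular_inflammation_grade(foci_per_hpf: float) -> int:
    """Map inflammatory foci per high-power field to grade 0-3."""
    if foci_per_hpf < 0:
        raise ValueError("foci_per_hpf cannot be negative")
    if foci_per_hpf == 0:
        return 0      # assumption: no foci means grade 0
    if foci_per_hpf < 2:
        return 1      # fewer than two foci
    if foci_per_hpf <= 4:
        return 2      # two to four foci
    return 3          # more than four foci


print(steatosis_grade(40), lobular_inflammation_grade(1))
```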
Three different rigorous grading systems were employed to evaluate the extent of sampling variability in both the diagnosis of NASH and the determination of disease activity. In the first, the diagnosis of NASH was made according to Brunt et al.,14 with modifications, defined a priori: NASH was defined based on the presence of at least grade 1 steatosis with hepatocyte ballooning or at least grade 1 steatosis with perisinusoidal fibrosis (but not isolated periportal fibrosis) with lobular inflammation. Also, simple steatosis was defined as grade 1 or more fat but without the above defining features whereas subjects without pathological steatosis were classified as normal. In the second, biopsies were categorized using the classification of Matteoni et al.15 as types 1 to 4 (with the addition of type 0 reflecting the absence of pathological steatosis). This categorization encompasses the spectrum of NAFLD and also appears to have prognostic significance.14 For the third, the NAS was calculated, the unweighted sum of the scores for steatosis (0-3), lobular inflammation (0-3), and ballooning (0-2) ranging from 0 to 8. This scale was used to score disease activity with an NAS ≥ 5 correlating with a diagnosis of “definite NASH,” an NAS of ≤ 2 correlating with a diagnosis of “not NASH,” and “indeterminate NASH” defined by an NAS of 3 to 4.12 To determine the extent of intraobserver agreement as a contributor to and potential confounder of sampling variability, 44 biopsies were randomly selected for a blinded second interpretation several months later, and these were then compared with the original interpretations. Finally, the composite biopsy histological features were defined as the highest severity of each individual feature gleaned from interpretation of both right and left lobe biopsies.
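As a minimal sketch of the NAS calculation and the composite definition given above, the following code sums the three component grades, assigns the activity category, and takes the highest grade of each feature across a right and left lobe pair. The class and function names are hypothetical, and the example pair is invented for illustration rather than taken from the study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiopsyScore:
    """Illustrative container for the three NAS components of one biopsy."""
    steatosis: int             # grade 0-3
    lobular_inflammation: int  # grade 0-3
    ballooning: int            # grade 0-2


def nas(score: BiopsyScore) -> int:
    """Unweighted sum of the three component grades (range 0-8)."""
    return score.steatosis + score.lobular_inflammation + score.ballooning


def nas_category(total: int) -> str:
    """Activity category as defined in the text."""
    if total >= 5:
        return "definite NASH"
    if total <= 2:
        return "not NASH"
    return "indeterminate NASH"


def composite(right: BiopsyScore, left: BiopsyScore) -> BiopsyScore:
    """Highest severity of each feature across the two biopsies."""
    return BiopsyScore(
        max(right.steatosis, left.steatosis),
        max(right.lobular_inflammation, left.lobular_inflammation),
        max(right.ballooning, left.ballooning),
    )


# Hypothetical pair (not study data)
right = BiopsyScore(steatosis=3, lobular_inflammation=1, ballooning=0)
left = BiopsyScore(steatosis=2, lobular_inflammation=2, ballooning=1)
both = composite(right, left)

for label, score in (("right", right), ("left", left), ("composite", both)):
    print(label, nas(score), nas_category(nas(score)))
```

Because the composite takes the maximum of each feature separately, its NAS can exceed the NAS of either single biopsy, as it does in this invented pair.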
The percentage of concordance was calculated for the histological interpretations of the right and left lobes. Agreement between observations (right vs. left lobes) was assessed using the kappa (κ) coefficient, which measures how much better the agreement is than would be expected by chance alone: κ = 1 indicates perfect agreement; κ > 0.80, excellent; κ > 0.60, good; κ > 0.40, moderate; κ > 0.20, fair; and κ > 0, poor agreement.16 For ordered categorical data, weighted kappa coefficients were calculated; otherwise, simple kappa values were determined. For intraobserver agreement, kappa coefficients were similarly calculated, comparing the original interpretation with the repeat interpretation. Data from single-lobe biopsies, either left or right, were compared with the composite findings of the two biopsies; the type, number, and percentage of discordant observations were determined.
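For reference, the simple and weighted kappa coefficients can be written out as follows. Here p_ij is the proportion of biopsy pairs graded i in one lobe and j in the other, p_i· and p_·j are the marginal proportions, and k is the number of categories. The linear weighting shown is an assumption; the text does not state which weighting scheme was used.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad p_o = \sum_{i=1}^{k} p_{ii},
\qquad p_e = \sum_{i=1}^{k} p_{i\cdot}\, p_{\cdot i}

\kappa_w = \frac{\sum_{i,j} w_{ij}\, p_{ij} - \sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}}
                {1 - \sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}},
\qquad w_{ij} = 1 - \frac{|i - j|}{k - 1}
```

With these weights, full credit is given for identical grades, partial credit for adjacent grades, and none for grades at opposite ends of the scale.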
Results
There were no complications associated with biopsy procurement. Sixty subjects had paired biopsies submitted: two biopsy samples from different pairs were deemed unsatisfactory for interpretation by the pathologist. Of the remaining 58 pairs, 17 pairs were excluded because one or both did not contain a minimum of six portal tracts, leaving 41 biopsy pairs that served as the basis for the study. The mean length of the biopsies was 14.0 mm (range, 9-24 mm), containing a mean of nine portal tracts (range, 6-17), and findings did not differ significantly between right- and left-sided samples. Relevant clinical and demographic features of the 41 subjects are presented in Table 1. Sixty-eight percent of subjects were either diabetic, hyperlipidemic, or hypertensive in addition to being morbidly obese. Two subjects had an elevated aspartate aminotransferase or alanine aminotransferase using our institutional laboratory criteria: this increased to 26 subjects if the more recently proposed upper limits of alanine aminotransferase (male ≤ 30 U/L; female ≤ 19 U/L) were applied.17 Table 2 summarizes the correlation of histopathological features of right and left lobe liver biopsies.
Table 1. Clinical and demographic features of the study subjects
|Subjects||n = 41|
|Sex: Female||32 (78%)|
|Age: years, median (range)||40 (22-65)|
|Body mass index: kg/m2 median (range)||50 (34.5 – 69.8)|
|Diabetes mellitus||10/41 (24.4%)|
|AST, median (normal range 16-41 U/L)||22*|
|ALT, median (normal range 11-59 U/L)||24*|
Table 2. Kappa coefficients of agreement for histopathological features: right vs. left lobe biopsies and intraobserver comparison
|Kappa Coefficient (95% CI)|
|Right vs. Left Lobe||Intraobserver|
|Steatosis grade||0.88 (0.80-0.96)||0.98 (0.95-1.0)|
|Lobular inflammation||0.32 (0.04-0.60)||0.58 (0.31-0.84)|
|Hepatocyte ballooning||0.20 (−0.12-0.53)||0.64 (0.37-0.92)*|
|Portal inflammation||0.19 (−0.15-0.53)*|
|Pigmented macrophages||0.48 (−0.12-1.0)*|
|Glycogenated nuclei||0.81 (0.62-0.99)*|
|Absolute NAS score||0.66 (0.57-0.75)|
|Fibrosis||0.53 (0.34-0.72)||0.68 (0.51-0.86)|
|Diagnosis by Brunt criteria:|
|Normal/steatosis/NASH||0.89 (0.79-0.99)||0.90 (0.81-0.99)|
|NASH vs. non-NASH||0.82 (0.62-1.0)*|
|Matteoni classification:|
|Type 0/1/2/3/4||0.71 (0.55-0.87)||0.85 (0.74-0.96)|
|Types 0,1,2 vs. types 3,4||0.32 (−0.10-0.74)*|
|Disease activity by NAS:|
|≤2, 3-4, or ≥5||0.69 (0.55-0.83)||0.86 (0.74-0.97)|
|0-4 vs. ≥5||0.09 (−0.27-0.45)*|
Forty-seven of 82 biopsies had evidence of pathological steatosis. Thirty-four biopsy pairs (82.9%) were in complete agreement regarding the grade of steatosis (0-3), including 17 pairs showing identical grades of pathological steatosis (1-3). Seven biopsy pairs were nonconcordant, with six of those seven pairs showing steatosis differing by no more than one grade and with only one biopsy pair interpretation in disagreement as to the presence (grade 1) or absence of steatosis. The weighted kappa coefficient of agreement for grade of steatosis, comparing right and left lobe biopsies, was excellent at 0.88. A very high degree of concordance also was seen in the zonal distribution pattern of fat (zone 3, zone 1, azonal, or panacinar), with 38 of 41 biopsy pairs (91.4%) assessed identically (weighted kappa value of 0.86).
Ballooning degeneration of hepatocytes was evident in 12 of 82 biopsies. The extent of ballooning when present in both biopsies was identical in one biopsy pair (2.4%), and one further pair (2.4%) differed in the severity of ballooning (grade 1 vs. 2). Eight biopsy pairs (19.5%) were not in agreement as to the absence or presence (grade 1) of ballooned hepatocytes. Thirty-one biopsy pairs (75.6%) had no evidence of ballooned hepatocytes. Overall, the weighted kappa value for hepatocyte ballooning was 0.20.
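The ballooning result shows how a high raw agreement can coexist with a low kappa when a feature is uncommon. The sketch below rebuilds a right-by-left cross-tabulation from the pair counts above and the per-lobe prevalences in Table 3 (grades 0/1/2 in 36/4/1 right lobe and 34/7/0 left lobe biopsies, the grade 2 counts being obtained by subtraction from 41). The cross-tabulation is therefore an inference from the reported counts, and the linear weights are an assumption; under these assumptions the computed value comes out close to the reported 0.20.

```python
def weighted_kappa(table):
    """Linearly weighted Cohen's kappa for a square table of pair counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            w = 1 - abs(i - j) / (k - 1)   # linear weights (assumed)
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / (n * n)
    return (observed - expected) / (1 - expected)


# Rows: right lobe grade 0, 1, 2; columns: left lobe grade 0, 1, 2.
# Reconstructed from the text and Table 3, not reported as such.
ballooning = [
    [31, 5, 0],
    [3, 1, 0],
    [0, 1, 0],
]

n_pairs = sum(sum(row) for row in ballooning)
raw_agreement = sum(ballooning[i][i] for i in range(3)) / n_pairs
kappa = weighted_kappa(ballooning)
print(f"raw agreement = {raw_agreement:.1%}, weighted kappa = {kappa:.2f}")
```

Most of the agreement comes from the 31 pairs in which neither biopsy showed ballooning, which is also what chance alone would predict for a rare feature; kappa discounts exactly that portion.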
Lobular inflammation was apparent in 68 of 82 biopsies. Twenty-two biopsy pairs had identical grades of inflammation, whereas four were in agreement as to the complete absence of inflammation. Of the remainder, when inflammation was identified in both biopsies, seven pairs differed by one grade only, and only two differed by more than one grade. Six pairs were discordant as to the presence (grade 1) or absence of lobular inflammation. The calculated weighted kappa value for lobular inflammation was 0.32.
Glycogenated nuclei were present in 42 biopsies, with 90.2% of pairs in agreement. Micro-granulomata were noted infrequently, and other minor aspects of necroinflammation were rare, with pigmented macrophages identified in two specimens, lipogranulomata noted in one biopsy, and acidophil bodies noted in one specimen. Mallory bodies and megamitochondria were not seen in any biopsies. Portal inflammation was present in only 17 biopsies: three pairs were concordant and 11 in disagreement (κ = 0.19).
Fibrosis was evident in 35 of 82 biopsies, with 33 of these biopsies showing stage 2 or less fibrosis. Where fibrosis was present in both biopsies of a pair, 8 pairs were in agreement as to the precise stage; when absent, consensus was seen between 19 pairs. Nine biopsy pairs were discordant in the absence or presence (stage 1a-1c) of fibrosis. Five pairs were in disagreement as to the extent of fibrosis when it was evident in both biopsies: two differed by more than one stage, and three pairs differed by one stage only. For fibrosis, the weighted kappa value was 0.53.
Diagnostic Category and Disease Activity.
The NAFLD disease category was compared between paired right and left lobe biopsies by diagnosis using modified Brunt criteria and the Matteoni classification and by disease activity using the NAS grading system. Employing the Brunt criteria, 37 of 41 of the diagnoses (90.2%) were concordant, including 10 of NASH, 10 of steatosis, and 17 of normal. Four pairs were not in agreement: three showing NASH versus steatosis and one showing steatosis versus normal: the weighted kappa score was 0.89. Using the Matteoni classification of NAFLD, 32 of 41 of the diagnoses were in agreement (type 0, 16; type 1, 1; type 2, 13; type 4, 2), with a weighted kappa of 0.71. Finally, by using the NAS, when the absolute NAS (0-8) was compared, 19 of 41 paired observations (46.3%) were in agreement with a weighted kappa value of 0.66. Scoring disease activity within pairs by the NAS category, either definite NASH, indeterminate NASH, or absence of NASH, 32 of 41 paired observations (78%) were in agreement: 1 for definite, 8 for indeterminate and 23 for absent NASH. One pair was not in agreement as to a diagnosis of absent versus indeterminate, and eight pairs differed in a diagnosis of indeterminate versus definite NASH. Overall, the weighted kappa statistic for categorization of disease activity using the NAS measured 0.69. When the overall assessment of biopsy pairs was dichotomized using Brunt criteria into either NASH or non-NASH (steatosis or normal), 38 pairs (92.7%) were in agreement with a kappa value of 0.82. Similarly, dichotomization by types of NAFLD (types 0, 1, 2 vs. 3, 4) and NAS (a score of 0-4 vs. ≥5) yielded kappa values 0.32 and 0.09, respectively.
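The contrast between the dichotomized Brunt and NAS results can be checked with a simple two-by-two kappa. The counts below are reconstructed from the text rather than reported directly: for Brunt, 10 concordant NASH pairs, 28 concordant non-NASH pairs, and 3 discordant pairs, split 1 and 2 so that the right lobe total matches the 11 NASH diagnoses cited in the composite comparison; for NAS, 1 concordant definite pair and 8 discordant pairs, split 4 and 4 to match the five definite readings per lobe in Table 3. The function name is hypothetical.

```python
def kappa_2x2(both_yes, right_only, left_only, both_no):
    """Return (raw agreement, simple kappa) for a dichotomous paired reading."""
    n = both_yes + right_only + left_only + both_no
    observed = (both_yes + both_no) / n
    right_yes = (both_yes + right_only) / n
    left_yes = (both_yes + left_only) / n
    # Agreement expected by chance from the two marginal prevalences
    expected = right_yes * left_yes + (1 - right_yes) * (1 - left_yes)
    return observed, (observed - expected) / (1 - expected)


# Counts reconstructed from the text (see the assumptions above)
brunt_agreement, brunt_kappa = kappa_2x2(10, 1, 2, 28)   # NASH vs. non-NASH
nas_agreement, nas_kappa = kappa_2x2(1, 4, 4, 32)        # NAS >= 5 vs. 0-4

print(f"Brunt: agreement {brunt_agreement:.1%}, kappa {brunt_kappa:.2f}")
print(f"NAS:   agreement {nas_agreement:.1%}, kappa {nas_kappa:.2f}")
```

Raw agreement differs only modestly between the two systems, but with so few concordant definite NASH pairs under the NAS cutoff, nearly all of that agreement is what chance would produce, so kappa falls toward zero.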
Single-Biopsy Versus Composite Biopsy Features.
We compared the histological findings of a single-sided biopsy, either right or left, with the composite histopathological findings of the right and left lobe biopsies. Table 3 presents the prevalence of histological features in the right, left, and composite biopsies. It also details the precise type, number, and total percentage of discordant observations obtained by comparing the composite features with those determined on a single-sided biopsy, either right or left. Important nuances emerge from the comparison of the findings from the right lobe biopsy (typically the site of most percutaneous biopsies) with the composite findings of the two biopsies. For steatosis, using the composite biopsy findings, only one additional observation change is noted, and this altered the interpretation from absent steatosis to pathological steatosis. More importantly, however, in the case of hepatocyte ballooning, the composite interpretation yielded a further five instances in which the ballooning grade changed from 0 to 1 compared with the findings in an isolated right lobe biopsy. Though this represented only a 12.2% change in total number of observations for that feature, the number of subjects with ballooning doubled from 5 to 10, indicating that ballooning would not have been detected in 50% of cases if the right-sided biopsy were solely relied on. The grading of lobular inflammation changed most, with an increase in 9 of 41 instances (22%) when composite features were compared with the right-sided biopsy (though mostly accounted for by an increase from grade 1 to 2). In a similar comparison, five of 41 fibrosis stage assignments (12.2%) were altered, three of them from stage 0. Diagnosis or disease activity categorization also changed substantially (Table 3).
Comparing the composite biopsy versus right lobe features resulted in an increase in the number of subjects diagnosed with NASH by modified Brunt criteria from 11 to 13 (18.2%). Dichotomizing the NAFLD types and NAS as before (NASH vs. non-NASH) yielded an increase in the number of subjects with findings consistent with a diagnosis of NASH by 60% and 80%, respectively. These figures translate to 15%, 38%, and 44% of diagnoses or disease activity classifications (for the Brunt, Matteoni, and NAS scoring systems, respectively) consistent with NASH having been missed by relying solely on a single right-sided biopsy; the corresponding figures for the left lobe biopsy were very similar at 8%, 38%, and 44%, respectively.
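The two sets of percentages in this paragraph use different denominators, which a short calculation makes explicit. The sketch below uses the right lobe and composite counts given in the text and Table 3 for the Brunt criteria (11 and 13) and for NAS ≥ 5 (5 and 9); the Matteoni counts are not stated explicitly, so they are left out. The function name is hypothetical.

```python
def yield_change(single: int, composite: int) -> tuple[float, float]:
    """Return (relative increase over single biopsy, fraction of composite missed)."""
    gained = composite - single
    return gained / single, gained / composite


# Right lobe vs. composite counts of NASH-consistent subjects
brunt_increase, brunt_missed = yield_change(single=11, composite=13)
nas_increase, nas_missed = yield_change(single=5, composite=9)

print(f"Brunt: +{brunt_increase:.1%} with composite, {brunt_missed:.1%} missed by right lobe alone")
print(f"NAS:   +{nas_increase:.1%} with composite, {nas_missed:.1%} missed by right lobe alone")
```

The increase is expressed relative to the single-biopsy count, whereas the missed fraction is expressed relative to the composite count, which is why an 80% increase corresponds to 44% missed.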
Table 3. Histological features in right (R), left (L), and composite (R & L) biopsies, with discordant observations between single and composite biopsies
|Prevalence of Histological Features in Right, Left, and Composite Biopsies||Type, Number (and % of Total) of Discordant Observations|
|R||L||R & L||R vs. R & L||L vs. R & L|
|Steatosis grade:|
|0||18||17||17||0 → 1:1|
|2||3||9||3||2 → 3:6|
|Hepatocyte ballooning grade:|
|0||36||34||31||0 → 1:5||0 → 1:3|
|1||4||7||9||1 → 2:1|
|Lobular inflammation grade:|
|0||7||7||4||0 → 1:3||0 → 1:3|
|1||30||27||27||1 → 2:5||1 → 2:2|
|2||2||5||7||1 → 3:1||1 → 3:1|
|Fibrosis stage:|
|0||22||25||19||0 → 1b:1||0 → 1a:2|
|1a||4||2||3||0 → 1c:2||0 → 1c:4|
|1b||2||4||3||1a → 2:1||1a → 2:1|
|1c||6||3||7||1b → 3:1|
|2||5||7||7||1c → 2:1||2 → 4:1|
|Diagnosis (modified Brunt criteria):|
|Normal (0)||18||17||17||0 → 1:1|
|Steatosis (1)||12||12||11||1 → 2:2||1 → 2:1|
|Matteoni type:|
|1||2||1||1||0 → 2:2|
|2||16||19||16||1 → 2:1|
|3||0||0||0||2 → 4:3||2 → 4:3|
|NAS category:|
|NAS ≤ 2 (0)||23||24||23||0 → 1:1|
|NAS = 3-4 (1)||13||12||9||1 → 2:4||1 → 2:4|
|NAS ≥ 5 (2)||5||5||9||(9.8%)||(12.2%)|
The analysis of intraobserver agreement (Table 2) shows that correlation of the major histopathological features is good to excellent for most features. However, agreement on lobular inflammation was only moderate, with 35 of 44 pairs (79.5%) interpreted identically; the weighted kappa value measured 0.58. Intraobserver agreement for the diagnosis of NASH using either of the two diagnostic classifications or the activity score was excellent, with kappa values ≥0.85.
Discussion
Unlike most other common hepatological conditions, NAFLD lacks widely accepted surrogate markers of either disease presence or severity; consequently, histological evaluation assumes a unique, elevated role, central to both the diagnosis and management of NAFLD.18 Although some sampling variability is inevitable in a diffusely distributed disease, the extent and acceptable degree of variability is unclear.
Our study shows that the agreement for grade of hepatic steatosis was excellent, with a kappa value of 0.88. Agreement for most of the varied components of necroinflammation (with the exception of glycogenated nuclei) was only fair, especially notable for the variables often considered most important: lobular inflammation (κ = 0.32) and hepatocyte ballooning (κ = 0.20). For lobular inflammation, only 63.4% of paired observations were in agreement, and though agreement was better for hepatocyte ballooning (78%), the low trait prevalence of the latter feature likely contributed to a similarly low kappa coefficient.19 Intraobserver variation (rather than sampling error) appears not to contribute significantly to the lack of agreement for ballooning. In contrast, intraobserver agreement for lobular inflammation was moderate (κ = 0.58), echoing similar rates of 0.37 to 0.60 in other series.10, 12 Interobserver variation has also been shown to be substantial for lobular inflammation in other series, with kappa values ranging from 0.33 to 0.45.12, 20 Overall, the extent of variability in these two key features (ballooning and lobular inflammation) is troubling. They play a central role in the diagnosis, determination of disease activity (particularly in the NAS system), and, for ballooning, predicting the natural history of NASH.15 Whereas agreement in the assessment of fibrosis was moderate (κ = 0.53), this modest value could not be significantly attributed to intraobserver variation (κ = 0.68). In spite of variability in the interpretation of necroinflammation, when the diagnosis of NASH was compared, agreement was excellent using the Brunt criteria (κ = 0.89) and good for Matteoni type (κ = 0.71). These exceeded agreement for either the absolute NAS score (κ = 0.66) or NAS category (κ = 0.69), both of which are heavily weighted by the scores for lobular inflammation and ballooning.
When the overall assessment of biopsy pairs was dichotomized into either NASH or non-NASH using different grading systems, the Brunt criteria performed excellently.
When the findings of either an isolated right lobe or left lobe biopsy were compared with the composite features of both, the features that changed most were lobular inflammation and fibrosis (Table 3). Composite findings reflected the true liver histopathology more accurately than those from a single biopsy, especially for lobular inflammation and ballooning, which in turn increased accuracy in both the diagnosis of NASH and the assessment of disease activity. For example, when NAS category was determined from an isolated right- or left-sided biopsy rather than from the composite features, 44% of subjects with definite NASH were missed in each case; with the Brunt criteria, the corresponding figures were a more modest 15% and 8%, respectively.
This study has notable strengths, involving a large number of subjects evaluated prospectively with precisely detailed histological findings. The high prevalence of NASH disease associations, biochemical abnormalities, and significant histopathological findings of the group expand the relevance of these findings to patients being evaluated for suspected NASH. The use of several scoring systems broadens the interpretability and applicability of the data and should encourage further consideration of the merits of each, particularly in clinical trials design. It should encourage a broader discussion as to whether a single biopsy should be considered adequate to assess NAFLD, especially when the biopsy sample size is limited.
Few data exist quantifying the extent of sampling variability in NAFLD.10, 11, 20 Ratziu et al.10 compared findings in 51 subjects, each of whom had two right lobe biopsies, though the grading and staging systems employed limit interpretation and direct comparison. Although good agreement was also noted for steatosis, agreement was moderate for ballooning (κ = 0.45) and poor for lobular inflammation (κ = 0.13). Contrasting our study findings with those on sampling variation in chronic hepatitis C is instructive. Though methodologies limit direct comparisons, an analysis of paired lobar biopsies in subjects with chronic hepatitis C showed kappa values for grading and staging of 0.64 and 0.57, respectively.21
Our study is arguably limited by several aspects of study design. The study population was morbidly obese and therefore may not be representative of all NAFLD patients. The low prevalence of certain pathological features such as ballooning contributed to their low coefficients of agreement; however, rather than selecting a biased cohort of individuals with a diagnosis of NASH, one may argue that in fact this study more accurately reflects clinical practice when these findings indeed can be infrequent. By comparing composite biopsy features versus single biopsy features, we have assumed that this disorder is distributed homogenously and that the variation identified reflects variability in sampling rather than in disease. This quandary could only be truly addressed by numerous geographically distributed biopsies, an impractical undertaking. The liver biopsies included in the analysis were deemed adequate for interpretation based on the number of portal tracts included with an average biopsy size that closely approximated the usually accepted length.7, 10, 13 In persons with suspected NAFLD, procurement of a long biopsy sample in an often obese population can be a practical challenge.
In summary, through a careful comparison of paired lobar biopsies in subjects at high risk of NAFLD, we have shown that agreement is excellent for steatosis, moderate for fibrosis, and only fair for most components of necroinflammation. Agreement between biopsies of a pair on diagnosis was excellent by Brunt criteria and good using either the Matteoni classification or NAS category. The addition of findings from a second biopsy substantially improved the diagnostic yield both for specific histological features and when used to generate a diagnosis or disease activity category, particularly when employing the Matteoni and NAS classifications, though it added little when using the Brunt criteria. For a liver disease so uniquely dependent on histology in its clinical assessment, the substantial degree of sampling variation revealed in this study should be factored into the design of clinical trials and studies of NAFLD natural history.
The authors acknowledge the statistical assistance of Peter Bacchetti, Ph.D.