
Estimating chronic disease prevalence among the remote Aboriginal population of the Northern Territory using multiple data sources


Correspondence to:
Dr Yuejen Zhao, Health Gains Planning Branch, Dept of Health and Families, 4th Floor, AANT Centre, PO Box 40596, Casuarina, NT 0811. Fax: (08) 8985 8077; e-mail: yuejen.zhao@nt.gov.au


Objective: To determine the prevalence rates of hypertension, diabetes, ischaemic heart disease (IHD), renal disease and chronic obstructive pulmonary disease (COPD), and their co-occurrence among the remote Aboriginal population of the Northern Territory (NT) in 2005.

Methods: Information from a primary care chronic disease register (CDR) and a hospital inpatient database was linked to a population list using a unique patient identifier. A capture-recapture method (CRM) and multivariate log-linear models were then applied to the linked datasets to estimate the prevalence rates for the selected diseases and the case ascertainment in each data source.

Results: The NT remote Aboriginal communities had considerably higher prevalence rates across all five chronic diseases than national health survey figures. At ages 50 years and over, the prevalence rates for hypertension and renal disease were above 50%, diabetes 40%, COPD 30% and IHD above 20%. In terms of data completeness, CDR and hospital sources were both relatively incomplete, generally around 20–60%. The most common co-occurrences for the five chronic diseases were between hypertension, diabetes, IHD and renal disease.

Conclusions and implications: The prevalence rates calculated using this method are comparable to estimates from rigorous small area studies, but are markedly higher than those from single clinical data sources. The results indicate that there is a considerable under-diagnosis of preventable chronic diseases in the Aboriginal communities.

For preventive and early intervention strategies for chronic disease management to be effective, it is essential to have reliable information about the pattern of disease and exposures to major risk factors. This requirement is not limited to one-off or intermittent estimates, but extends to systematic monitoring and surveillance of the conditions to inform ongoing management. The implementation of systematic monitoring is particularly important among the Northern Territory (NT) Aboriginal population, for whom non-communicable diseases are estimated to contribute 77% of the life expectancy gap between Aboriginal and non-Aboriginal populations.1 The importance of chronic disease management has prompted a comprehensive NT intervention strategy, with a focus on five preventable chronic diseases (PCDs): hypertension, diabetes, ischaemic heart disease (IHD), renal disease and chronic obstructive pulmonary disease (COPD).2 These five conditions are also significant in the broader Australian population and make up 22% of the national burden of disease.3 The five conditions share several underlying characteristics. They commonly develop at working age after latent exposure to a limited number of core risk factors including poverty, childhood malnutrition, systemic infections, tobacco smoking, alcohol abuse, poor access to fruit and vegetables, obesity and physical inactivity. The conditions are characterised as being preventable, costly to manage and rarely curable. They also have an uncertain time of onset, are of prolonged duration and are complicated by acute manifestations. In the NT, the acute manifestations of the five conditions consume 40-56% of public hospital resources4 and result in pain, disability and premature death for many Territorians. There are also substantial indirect costs through the impact on quality of social and family life, and work-related productivity.

Despite national efforts to collate existing health survey, hospital morbidity and mortality data, there remains a paucity of relevant epidemiological information on prevalence, incidence and survival rates of the chronic diseases, particularly at a regional level.5 Existing electronic, clinical datasets remain an under-utilised option, with only sporadic reports on the prevalence rates of selected preventable chronic diseases among Indigenous Australians. The existing estimates have been based on single-source data and have generally focused on diabetes,6–8 hypertension,9,10 and renal diseases.11,12 Recent reports have combined these as the metabolic syndrome related conditions.13–15 Less is known about current prevalence rates of IHD and COPD for Indigenous Australians although the mortality rates from these causes are known to be extremely high.16 This project responds to the need for improved data on chronic diseases through examining the prevalence and co-occurrence of the selected chronic diseases among the remote Aboriginal population by linking multiple large-scale data sources and using capture-recapture methods (CRM).17 The study also has wider resonance by providing an alternative approach to ongoing chronic disease surveillance advocated by the World Health Organization,18 and reiterated through the Australian Health Ministers’ Conference endorsement of a nation-wide surveillance network for chronic diseases and associated determinants.19


The CRM is devised to use the degree of overlap between linked multiple data sources for estimating the true number of patients with chronic diseases. It was applied to two clinical datasets. The first dataset was the chronic disease register (CDR) central database, which is a computer-based clinical management system for remote NT Aboriginal communities. The CDR was developed in 1991 and has been managed by district medical officers (DMOs). The data covers medical consultations for 45 remote community health centres (CHC), with 28 health centres from the Central Australian region and 17 from the Top End region of the NT. The data for this project were sourced as a snapshot in 2006, from all consultations during the five-year period between 1 January 2001 and 31 December 2005. The CHCs are the sole primary care providers in these communities.

The second dataset was drawn from the NT public hospital morbidity database for the same five-year period. To assess and adjust for source dependency, the central hospital morbidity database was split into two data sources: one for large hospitals (Royal Darwin Hospital and Alice Springs Hospital) and the other for small hospitals (Katherine Hospital, Gove District Hospital and Tennant Creek Hospital).

The CDR data and hospital morbidity data were matched by using hospital registration numbers (HRN) obtained from the population list of each CHC. The HRN is a unique patient identifier shared between hospital and primary care services in the NT. The population tables contained all potential users of the health centres. Indigenous status was validated through matching HRN with total hospital morbidity data. Death records prior to 1 January 2005 were identified by cross-referencing separation modes in the datasets and eliminated from both patient and population data sources. New diagnoses of PCDs after 31 December 2005 were also excluded from the patient datasets. The five PCDs are defined according to International Classification of Diseases (ICD-9-CM and ICD-10-AM) codes in hospital morbidity datasets, mapped with the health problem codes in CDR. There is widespread use of standard procedures and definitions in NT remote area primary care services, including a specific emphasis on chronic disease.20 Diagnoses of PCDs were made by DMOs or hospital medical officers and were assumed to conform with the standard NT definitions. For this study, hypertension includes a blood pressure persistently over 140/90 mmHg (130/80 in diabetes), or the prescription of antihypertensive treatment. Diabetes includes insulin and non-insulin dependent diabetes mellitus, identified by medical history, prescription of antidiabetic drugs or raised blood glucose level (random ≥11 mmol/L or fasting ≥7.7 mmol/L). IHD covers myocardial infarction and unstable angina, identified by recorded clinical diagnoses based on symptoms, positive findings in electrocardiogram and anti-angina medications. Renal disease includes micro-albuminuria (albumin-creatinine ratio (ACR): 3.4-35 mg/mmol), macro-albuminuria (ACR: >35 mg/mmol), and reduced glomerular filtration rate (less than 100 mL/minute). COPD includes chronic bronchitis, emphysema and bronchiectasis, mainly identified by clinical diagnoses and treatment history. The remission rates have been assumed to be zero for all five PCDs. Suspected diagnoses or health problem codes were excluded to minimise false positive diagnoses. By definition, the prevalence rate in this study was derived by using all existing PCD cases at 1 January 2005 plus all new PCD cases from 1 January to 31 December 2005 inclusive, divided by the number of people in the 2005 population list.
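
The prevalence definition above can be written compactly. The symbols are introduced here for illustration only and are not the authors' notation: $C_{0}$ is the number of existing PCD cases at 1 January 2005, $C_{\mathrm{new}}$ the number of new PCD cases diagnosed from 1 January to 31 December 2005, and $N$ the number of people on the 2005 population list.

```latex
\[
  P_{2005} \;=\; \frac{C_{0} + C_{\mathrm{new}}}{N} \times 100\%
\]
```

Because remission is assumed to be zero, a person counted in $C_{0}$ remains a case throughout 2005.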

To assess the dependency and overcome the need to assume independency between data sources, we separated the hospital data into two sections (large hospitals and small hospitals) and used log-linear models to deal with the data source dependency, as suggested by Hook and Regal.17 The multivariate log-linear models are instrumental in CRM estimation of prevalence and completeness adjusted for multi-source dependence and demographic co-variates simultaneously.21 The different levels of dependency were investigated between large, small hospitals and CDR.
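
As a sketch of what such a model looks like, a generic three-source log-linear model with two pairwise interaction terms is shown below. This is the textbook form under indicator (0/1) coding, not necessarily the authors' exact parametrisation, and it omits the demographic covariates. Here $i$, $j$ and $k$ indicate capture (1) or non-capture (0) by sources 1, 2 and 3, and $n_{ijk}$ is the number of cases with that capture history; all symbols are assumptions introduced for illustration.

```latex
\[
  \log \mathrm{E}[n_{ijk}] \;=\; u + u_{1}\,i + u_{2}\,j + u_{3}\,k
      + u_{12}\,ij + u_{13}\,ik ,
  \qquad i, j, k \in \{0, 1\}
\]
\[
  \hat{n}_{000} \;=\; \exp(\hat{u}), \qquad
  \hat{N} \;=\; \sum_{(i,j,k) \neq (0,0,0)} n_{ijk} \;+\; \hat{n}_{000}
\]
```

The cell $n_{000}$ (cases missed by all three sources) is never observed, so the model is fitted to the seven observed cells and the intercept supplies the estimate of the missing cell. A positive $u_{12}$ or $u_{13}$ corresponds to positive dependence between the two sources concerned.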

The model selected was the one combining the best goodness of fit with the fewest interaction terms.

Stepwise forward-model selection was applied on the basis of likelihood ratio changes and Bayesian Information Criterion (BIC) (a goodness-of-fit measure — a small and negative value in BIC indicates a good fit). The confidence intervals were deduced from the log-linear model selected.22
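
The source does not state the BIC formula used. One form commonly used for log-linear capture-recapture models, given here as an assumption, is consistent with the remark that a negative value indicates a good fit:

```latex
\[
  \mathrm{BIC} \;=\; G^{2} \;-\; \mathrm{d.f.} \times \ln\!\left(N_{\mathrm{obs}}\right)
\]
```

In this form $G^{2}$ is the likelihood-ratio (deviance) statistic of the candidate model against the saturated model, d.f. is the residual degrees of freedom and $N_{\mathrm{obs}}$ is the number of observed cases. A model that fits nearly as well as the saturated model while using fewer parameters has a small $G^{2}$ and a larger d.f., which pushes the BIC below zero.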

The analysis followed six steps:

  • Obtain population list for each health centre with key demographics (age, sex and Indigenous status) and HRN;
  • Cross-match HRN on the population list with CDR and hospital data sources;
  • Assess independence using log-linear model;
  • Conduct three-source log-linear multivariate analysis for each condition and key demographic variables;
  • Estimate completeness of data sources, observed and expected true prevalence;
  • Use log-linear models to assess the co-occurrence of PCDs.
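
The cross-matching step can be sketched as follows. This is a minimal illustration in Python, not the authors' Stata code: the function name, the identifiers and the data are hypothetical, and it assumes each source has already been reduced to the set of HRNs carrying a given diagnosis.

```python
from collections import Counter


def capture_histories(population, cdr, large_hosp, small_hosp):
    """Tally capture histories for people on the population list.

    population: iterable of HRNs on the population list
    cdr, large_hosp, small_hosp: sets of HRNs recorded with the diagnosis
    Returns a Counter keyed by (in_LH, in_SH, in_CDR), each 0 or 1.
    """
    tally = Counter()
    for hrn in set(population):  # count each person once
        history = (
            int(hrn in large_hosp),
            int(hrn in small_hosp),
            int(hrn in cdr),
        )
        tally[history] += 1
    return tally


# Hypothetical identifiers, for illustration only.
population = ["H001", "H002", "H003", "H004", "H005", "H006"]
cdr = {"H001", "H002", "H004"}
large_hosp = {"H002", "H003", "H009"}  # H009 is not on the list, so it is ignored
small_hosp = {"H002", "H004"}

tally = capture_histories(population, cdr, large_hosp, small_hosp)
```

The (0, 0, 0) count holds everyone on the list with no recorded diagnosis, which mixes true non-cases with unrecorded cases; separating the two is what the log-linear modelling step is for.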

Stata 9.1 was used for statistical analysis. The observed prevalence represents the proportion of patients actually recorded by the data sources. Expected true prevalence includes the observed cases as well as an estimate of the unrecorded cases. The unrecorded cases are estimated statistically by using the probability of overlapping of the multiple data sources. A substantial gap between observed and expected true prevalence rates indicates a large number of unrecorded or undiagnosed cases, and an associated unmet need. Completeness refers to the percentage of patients ascertained by a particular data source. Comparisons of prevalence data are based on National Health Survey results,23 with the national estimates age adjusted to the 2005 NT Aboriginal population.
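
The relationship between these quantities can be illustrated with a short Python sketch. The counts are hypothetical, the function and argument names are introduced here for illustration, and the sketch assumes that completeness is measured against the estimated true number of cases.

```python
def prevalence_summary(population, n_cdr, n_hospital, n_observed, n_unrecorded_est):
    """Summarise prevalence and completeness for one condition.

    population: people on the population list
    n_cdr, n_hospital: cases recorded in each source
    n_observed: cases recorded in at least one source
    n_unrecorded_est: estimated cases missed by every source
    """
    n_true = n_observed + n_unrecorded_est  # estimated true number of cases
    observed = 100 * n_observed / population
    expected = 100 * n_true / population
    return {
        "observed_prevalence": observed,
        "expected_prevalence": expected,
        "gap": expected - observed,
        "completeness_cdr": 100 * n_cdr / n_true,
        "completeness_hospital": 100 * n_hospital / n_true,
    }


# Hypothetical counts, for illustration only.
summary = prevalence_summary(
    population=1000, n_cdr=60, n_hospital=70, n_observed=100, n_unrecorded_est=50
)
```

With these assumed counts the observed prevalence is 10%, the expected true prevalence is 15%, the gap is 5 percentage points and the CDR captures 40% of the estimated cases.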

Ethics approval for this study was granted by both the Human Research Ethics Committee of NT Department of Health and Families and Menzies School of Health Research (Approval Number: 07/30) and Central Australian Human Research Ethics Committee (Approval Number: CD2007/00178).


The total population of this study was 29,687 Aboriginal residents from remote communities in the NT, with a sex ratio of 100:107 (M:F). The population represented 96% of the total service population of the 46 CHCs and covered approximately 72% of the total remote NT Aboriginal population (41,408). More than one-third (35%) were under the age of 15 years, 54% were between 15 and 49 years and 11% were 50 years and over. Regionally, 54% were from the Top End (Darwin Rural, Katherine and East Arnhem health districts) and 46% from Central Australia (Alice Springs Rural and Barkly health districts).

The results of the log-linear model for assessing interactions between the three data sources are remarkably similar across the five PCDs (Table 1). Model 1 is the base independent model and assumes large hospitals, small hospitals and CDR are totally independent. Among single two-way interactions (Models 2, 3 and 4), large hospitals and small hospitals appear to be closely related for all PCDs except COPD, in which large hospitals and CDR show a closer relationship. Among the double two-way interactions (Models 5, 6 and 7), the best model is consistently Model 6, showing statistically significant interactions between large and small hospitals, and between large hospitals and CDR, for all PCDs with the exception of IHD. The coefficient estimates are all positive, indicating that two-source estimates (using CDR and combined hospital data) underestimate the true prevalence.17
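
For a model of this shape (two pairwise interactions that share one source), the estimate of the unobserved cell has a simple closed form when no covariates are included: small hospitals and CDR are treated as independent among cases not seen at a large hospital. The Python sketch below illustrates that closed form with hypothetical counts. It is not the authors' analysis, which fitted covariate-adjusted log-linear models in Stata, and the function name is introduced here for illustration.

```python
def missing_cell_two_interactions(counts):
    """Estimate cases missed by all sources under the LH•SH, LH•CDR model.

    counts: dict keyed by capture history (LH, SH, CDR), each 0 or 1.
    SH and CDR are assumed independent given LH, so the estimate uses
    only the cases that were not captured by large hospitals.
    """
    n010 = counts[(0, 1, 0)]  # small hospital only
    n001 = counts[(0, 0, 1)]  # CDR only
    n011 = counts[(0, 1, 1)]  # small hospital and CDR, not large hospital
    if n011 == 0:
        raise ValueError("no SH/CDR overlap outside large hospitals")
    return n010 * n001 / n011


# Hypothetical capture-history counts, for illustration only.
counts = {
    (1, 1, 1): 20, (1, 1, 0): 30, (1, 0, 1): 40, (1, 0, 0): 110,
    (0, 1, 1): 25, (0, 1, 0): 50, (0, 0, 1): 100,
}

n_observed = sum(counts.values())
n_missing = missing_cell_two_interactions(counts)
n_total = n_observed + n_missing
```

With these assumed counts, 375 cases are observed, 200 are estimated to be missed by every source and the estimated total is 575.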

Table 1.  Independence assessment for data sources using log-linear models.
     LH•SH,LH•SH,SH•CDR,Model 6 coefficient estimate (95%CI)
Notes: LH=large hospitals; SH=small hospitals; CDR=chronic disease register; comma (,)=independency; dot (•)=dependency; CI=confidence interval; BIC=Bayesian Information Criterion; d.f.=degrees of freedom; (a) p<0.01; (b) p>0.05; IHD=ischaemic heart disease; COPD=chronic obstructive pulmonary disease.

Table 2 presents observed and estimated true prevalence rates for the five PCDs and completeness by data sources using the best model (Model 6). The results are adjusted for sex, age group, and region simultaneously using multiple log-linear models. There is a clear female propensity in the prevalence rates of the index conditions except for IHD. The prevalence rates for all conditions increased progressively with age. Prevalence rates in NT remote Aboriginal population for people aged 50 years and over were above 50% for hypertension and renal disease, around 40% for diabetes, 30% for COPD and over 20% for IHD. The gap between the observed and estimated true prevalence was greatest for renal disease followed by COPD, and was larger for older age groups. Aboriginal people from the Central Australian remote regions were more likely to have renal disease, diabetes and hypertension than those from the Top End, and the reverse was true for COPD. The estimated true prevalence of IHD is similar in the two regions. For the whole study population there were substantially higher prevalence rates across all five PCDs than the Australian population estimates from the National Health Survey (Figure 1).23

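Table 2.  Estimated prevalence rates of preventable chronic diseases and completeness of data sources, remote Aboriginal population of the Northern Territory, 2005.

| Condition | Demographics | Observed prevalence (%) | Estimated true prevalence (%) (95% CI) | Potential gap (%) | Completeness, CDR (%) | Completeness, hospital (%) |
|---|---|---|---|---|---|---|
| Hypertension | Sex: Male | 8.8 | 11.9 (11.1-12.8) | | | |
| | Sex: Female | 10.5 | 14.1 (13.2-15.2) | 3.6 | 41.8 | 45.1 |
| | Age: <15 | 0.5 | 0.8 (0.6-0.9) | 0.3 | 2.6 | 10.0 |
| | Age: 15-49 | 9.4 | 12.6 (11.8-13.5) | 3.2 | 34.5 | 38.4 |
| | Age: ≥50 | 40.6 | 54.5 (51.2-58.7) | 14.0 | 48.5 | 53.1 |
| | Region: Central | 11.5 | 15.6 (14.6-16.8) | | | |
| | Region: Top End | 8.1 | 10.8 (10.1-11.7) | 2.8 | 38.6 | 43.8 |
| Diabetes | Sex: Male | 7.5 | 9.1 (8.7-9.6) | 1.6 | 45.5 | 56.5 |
| | Sex: Female | 11.5 | 14.0 (13.4-14.7) | 2.4 | 57.3 | 57.3 |
| | Age: <15 | 0.2 | 0.4 (0.3-0.4) | 0.1 | 45.5 | 27.8 |
| | Age: 15-49 | 11.0 | 13.2 (12.6-13.9) | 2.3 | 50.1 | 52.2 |
| | Age: ≥50 | 32.8 | 39.7 (38.0-41.8) | 6.9 | 56.7 | 65.6 |
| | Region: Central | 13.4 | 16.2 (15.5-17.1) | 2.8 | 54.0 | 58.1 |
| | Region: Top End | 6.3 | 7.7 (7.3-8.1) | 1.3 | 50.3 | 55.0 |
| Ischaemic heart disease | Sex: Male | 3.6 | 5.3 (4.7-6.0) | 1.7 | 30.4 | 56.6 |
| | Sex: Female | 3.4 | 4.9 (4.4-5.6) | 1.6 | 28.9 | 54.9 |
| | Age: <15 | 0.1 | 0.3 (0.2-0.4) | 0.2 | 14.3 | 12.5 |
| | Age: 15-49 | 3.1 | 4.5 (4.0-5.0) | 1.4 | 24.7 | 50.0 |
| | Age: ≥50 | 16.0 | 23.3 (21.1-26.4) | 7.4 | 34.4 | 61.9 |
| | Region: Central | 3.3 | 4.9 (4.4-5.6) | 1.6 | 28.5 | 53.1 |
| | Region: Top End | 3.6 | 5.3 (4.7-6.0) | 1.7 | 30.6 | 57.9 |
| Renal disease | Sex: Male | 9.3 | 17.0 (15.1-19.3) | 7.6 | 26.5 | 24.3 |
| | Sex: Female | 12.9 | 23.1 (20.6-26.4) | 10.3 | 28.1 | 22.3 |
| | Age: <15 | 4.1 | 8.6 (7.4-10.0) | 4.4 | 8.9 | 36.7 |
| | Age: 15-49 | 11.9 | 21.1 (18.8-23.9) | 9.1 | 25.4 | 18.7 |
| | Age: ≥50 | 29.8 | 52.6 (47.0-59.8) | 22.8 | 46.7 | 28.7 |
| | Region: Central | 13.7 | 25.2 (22.3-28.8) | 11.5 | 27.9 | 22.2 |
| | Region: Top End | 9.0 | 15.8 (14.1-18.0) | 6.8 | 26.8 | 24.3 |
| COPD | Sex: Male | 3.4 | 6.0 (5.2-7.0) | 2.6 | 23.8 | 41.9 |
| | Sex: Female | 4.3 | 7.7 (6.7-8.9) | 3.3 | 16.5 | 40.0 |
| | Age: <15 | 1.4 | 2.7 (2.3-3.2) | 1.3 | 16.8 | 25.4 |
| | Age: 15-49 | 2.6 | 4.5 (4.0-5.3) | 1.9 | 12.4 | 29.6 |
| | Age: ≥50 | 18.2 | 31.6 (27.8-36.6) | 13.3 | 24.8 | 51.7 |
| | Region: Central | 3.0 | 5.4 (4.7-6.3) | 2.5 | 18.2 | 34.4 |
| | Region: Top End | 4.7 | 8.1 (7.1-9.4) | 3.4 | 20.2 | 44.8 |

Notes: CDR=chronic disease register; CI=confidence interval; COPD=chronic obstructive pulmonary disease.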
Figure 1.  Observed prevalence rate of preventable chronic diseases in remote Aboriginal population by NT Regions, 2005, compared with Australia, 2004/5.

In terms of data ascertainment, CDR and hospital sources were relatively incomplete, both between 20% and 60% for most conditions. Diabetes had the highest completeness of case ascertainment, followed by hypertension. Renal diseases had the lowest completeness, with only a quarter of the cases being captured by either of the data sources. Hospital sources were more complete for IHD, COPD and hypertension than CDR, whereas CDR was more complete for renal disease than the hospital sources. Completeness varied between regions. Central Australia had more complete case ascertainment for hypertension and diabetes, but less complete for IHD and COPD compared with the Top End (Table 2). Overall, the completeness was better for older age groups.

In terms of co-occurrence, 40% of PCD patients had at least two conditions before age 50. This percentage increased to more than 60% after age 50. About 30% of PCD patients had at least three conditions after age 50 (Figure 2). The more common interactions were between hypertension, diabetes, IHD and renal disease.
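
The percentages above are simple tabulations of the number of conditions per patient. A minimal Python sketch of that tabulation follows; the patient records are hypothetical and the function name is introduced here for illustration. It does not reproduce the log-linear assessment of which conditions interact.

```python
from collections import Counter


def share_with_at_least(patients, k):
    """Percentage of PCD patients with at least k conditions, by age group.

    patients: iterable of (age_group, set_of_conditions) pairs.
    People with no recorded condition are not PCD patients and are skipped.
    """
    totals, hits = Counter(), Counter()
    for age_group, conditions in patients:
        if not conditions:
            continue
        totals[age_group] += 1
        if len(conditions) >= k:
            hits[age_group] += 1
    return {group: 100 * hits[group] / totals[group] for group in totals}


# Hypothetical records, for illustration only.
patients = [
    ("<50", {"hypertension"}),
    ("<50", {"hypertension", "diabetes"}),
    ("<50", {"renal"}),
    ("<50", {"diabetes", "renal", "hypertension"}),
    ("<50", {"copd"}),
    ("50+", {"hypertension", "renal"}),
    ("50+", {"hypertension", "diabetes", "ihd"}),
    ("50+", {"copd"}),
    ("50+", {"diabetes", "renal"}),
    ("50+", set()),
]

two_or_more = share_with_at_least(patients, 2)
three_or_more = share_with_at_least(patients, 3)
```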

Figure 2.  Numbers of preventable chronic diseases by age group, remote Aboriginal population of the Northern Territory, 2005.


The results from this study confirm that the prevalence estimates for diabetes, hypertension and renal disease from previous intensive community surveys, for example Hoy et al.,24 apply more generally across remote Aboriginal communities of the NT. The study estimates are markedly higher than those from single-source surveys for hypertension,9,10 diabetes,7,8 and renal disease.11–13,15 Importantly, the study provides prevalence data for IHD and COPD, as well as highlighting regional variation of the prevalence of the five conditions across the NT. Co-morbidities and complications of PCDs were common and strongly associated with increasing age.13,25 From a technical perspective the study quantifies the incompleteness of case identification in the major clinical datasets for the NT. The incompleteness of identification highlights significant health care service gaps in diagnosis and treatment of PCDs in NT remote Aboriginal communities.

For this article, national prevalence estimates are presented for broad comparison only. The national estimates are based on ABS health surveys, which rely on self-assessment of largely urban residents.23 There were also differences in the case definitions between this study and ABS health surveys; for example, renal disease in the National Health Survey included all genito-urinary diseases other than urine incontinence and diseases of female pelvic organs and genital tract. Prevalence estimates for hypertension and diabetes from this study were marginally higher than those reported from the National Aboriginal and Torres Strait Islander Health Survey (NATSIHS),26 although again, there are differences in case definition. The more substantial differences in reporting category and case definition hindered direct comparison with NATSIHS results for IHD, renal disease and COPD.

Application of CRM in epidemiology dates back to the 1970s.27 The method has been widely used to estimate the prevalent and incident cases, including unreported and undetected cases, from multiple incomplete sources for both infectious and non-infectious diseases.28,29 Another application of CRM is ascertaining completeness for data coverage or undercount assessment.30 CRM is generally believed to be more accurate and cost-effective than conventional methods of single-sourced estimation and is regarded as the gold standard for prevalence determination.31–33

The application of classic (two-source) CRM relies on three assumptions. First, the population under study must be closed. Second, the probability of case ascertainment must be equal for all cases within a single source. Third, the data sources must be collected independently. In this study, the first assumption appears to be satisfied, because the study population is defined by a fixed list of clients identified by a unique client number. The second assumption was handled by stratifying the population by sex and age, because case ascertainment is largely influenced by these demographic characteristics. The third assumption was almost certainly violated, because both the CDR and the hospital morbidity datasets were managed by the public sector and, appropriately, information would have been exchanged between the two clinical data systems. If two sources are not independent, two-source CRM can either underestimate or overestimate the true prevalence. Underestimation is caused by positive dependence between data sources, such as referrals from CDR to hospitals. Overestimation is caused by negative dependence, for example where CDR registration (and prompt treatment) precludes hospitalisation. The nature of this dependency cannot be determined from two data sources, so we separated the hospital data into two sections (large hospitals and small hospitals) to create a total of three data sources and then used log-linear models to adjust for data source dependency.17,21 As indicated in Table 1, there were positive dependences between CDR and hospitals. The magnitude of the underestimation by the two-source independent model was about 10% for diabetes, largely driven by referral and treatment patterns.
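
The direction of the bias under dependence can be seen with the classic two-source (Lincoln-Petersen) estimator. The Python sketch below uses a hypothetical scenario with a known number of true cases; the capture probabilities and helper names are assumptions chosen for illustration, not values from this study.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source estimate of the total number of cases.

    n1, n2: cases found in source 1 and in source 2; m: cases found in both.
    Assumes a closed population and independent sources.
    """
    if m == 0:
        raise ValueError("no overlap between sources; estimate undefined")
    return n1 * n2 / m


# Hypothetical scenario: 1,000 true cases, 400 of them recorded in source 1.
TRUE_CASES, IN_SOURCE_1 = 1000, 400
NOT_IN_SOURCE_1 = TRUE_CASES - IN_SOURCE_1


def estimate(p_given_in, p_given_out):
    """Source 2 records a share p_given_in of source-1 cases
    and a share p_given_out of the remaining cases."""
    overlap = round(IN_SOURCE_1 * p_given_in)
    only_2 = round(NOT_IN_SOURCE_1 * p_given_out)
    return lincoln_petersen(IN_SOURCE_1, overlap + only_2, overlap)


independent = estimate(0.3, 0.3)  # same capture probability either way
positive = estimate(0.5, 0.2)     # e.g. referral makes joint capture more likely
negative = estimate(0.1, 0.4)     # e.g. registration makes hospitalisation less likely
```

Under these assumed probabilities the independent case recovers the true 1,000 cases, positive dependence gives 640 (an underestimate) and negative dependence gives 2,800 (an overestimate).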

This study used a primary care CDR and hospital morbidity data, both collected from routine health care practices. As far as we are aware this is the first application of CRM to multiple routine data collections to estimate and monitor chronic disease prevalence in Australia. Using CRM has both strengths and limitations. It can increase accuracy of prevalence estimation without additional case-finding costs, and provides an alternative to an extensive and expensive biometric population survey. CRM quantifies the extent of under-diagnosis/treatment and the associated gap of unmet health care need. CRM also enables us to ascertain the completeness of a reporting data source, which assists data quality improvement. However, the use of information within both the CDR and hospital data relies on the accuracy and completeness of underlying recording in a clinical rather than research database. These are data sources that cannot be controlled with the rigour of a well-organised survey. First, measurement errors may exist between different diagnosing doctors and different clinical equipment. Second, the results were reliant on routine clinical opportunistic testing rather than comprehensive and specifically designed survey screening, leading to potential underestimation of the true prevalence rates. There may also be a discrepancy between a ‘known diagnosis’ and accurate recording in the CDR. Third, different demographic groups may have different participation rates, which could impact on the prevalence estimation. The strategies to deal with the last two types of bias were to check the medical history over a long period (five years), and apply multivariate log-linear modelling to stratify and analyse data dependency and case under-ascertainment by demographic and geographic covariates. There were still limitations in using HRN as a record linkage key. Many people had more than one HRN, and sometimes it was recorded wrongly in the medical charts and information systems. The same caveat also applied to the Medicare number, which could be useful for developing a national chronic disease surveillance system. Multiple HRNs of a single client had been consolidated before commencement of the project. The quality of HRNs was validated and maintained by the acute care information team.

The completeness of primary care and hospital data was similar in terms of PCD case ascertainment. Shifting from the population-based CRM to an area-based analysis would be helpful for completeness of assessment and data quality improvement for a defined service population. An area-based data source will also accommodate the substantial mobility of the Aboriginal population in the NT. The availability of area-based data will be greatly enhanced by current expansion and linkage of electronic clinical data systems such as the Primary Care Information System.34 A future chronic disease surveillance system may reasonably have three important features: the first is comprehensive coverage of the health of the service population, the second is prompt reporting and the third is that it be readily accessible to inform health policy and service planning.


The authors gratefully acknowledge the many district medical officers for their long-term commitment in the development and use of the chronic disease register. We also extend our thanks to Julia Seemann and Julie Robinson for their assistance.