A feasible method for linkage studies avoiding clerical review: linkage of the national HIV/AIDS surveillance databases with the National Death Index in Australia


Correspondence to:
Ms Fatemeh Nakhaee, National Centre in HIV Epidemiology and Clinical Research, level 2, 376 Victoria Street, Darlinghurst, New South Wales 2010. Fax: (02) 9385 0920; e-mail: Fnakhaee@nchecr.unsw.edu.au


Objective: To assess the sensitivity and specificity of linkage of HIV/AIDS diagnoses in Australia to the National Death Index (NDI).

Methods: An aggregated file containing 19,772 matched HIV/AIDS diagnoses reported to the national HIV/AIDS databases from 1980 to 30 June 2004 was linked to the NDI using probabilistic linkage methods based on the namecode, date of birth, and sex as identifiers. Based on the 6,900 HIV/AIDS known deaths reported by 1 January 2003 and 1,455 known non-deaths with an active follow-up beyond 1 January 2003, the different combinations of weights assigned to matched pairs were examined to obtain maximum sensitivity and specificity.

Results: The trade-off between sensitivity and specificity was used to obtain an optimal linkage. The optimal linkage matched 5,658 of the 6,900 known HIV/AIDS deaths (a sensitivity of 82%) and produced 116 false-positive matches among the 1,455 known non-deaths (a specificity of 92%). Causes of death were recorded for 86.5% of deaths that were linked to the NDI.

Conclusions: This is a feasible method for conducting linkage studies if both known deaths and known non-deaths are available. The relatively poor sensitivity could be due to the limited identifiers available for linkage on the HIV/AIDS databases.

Health authorities need to have complete ascertainment of death reports to estimate the number of people living with HIV/AIDS and the survival rate to enable appropriate evaluation and planning of health and treatment programs.

In Australia, HIV and AIDS are monitored by a well-established national surveillance system consisting of the National HIV Database (NHD) and the National AIDS Registry (NAR). Mortality following HIV or AIDS is not captured completely in Australia through present surveillance systems. Deaths following HIV infection but prior to AIDS may have been collected by some State/Territory health departments, but they have been reported to the NHD only since January 2000. Deaths following AIDS have been reported by physicians to State/Territory health departments, and on to the NAR, since 1982, but loss to follow-up is thought to have been substantial during the first decade of the AIDS epidemic, with consequent under-reporting of mortality.1 Moreover, data on causes of death are limited to either AIDS or non-AIDS prior to 2000, and to six broad categories more recently. Complete data on causes of death, particularly prior to AIDS, are limited to those available through the Australian HIV Observational Database, which covers a relatively small subsample of HIV-infected people.2

In Australia, the National Death Index (NDI) is maintained by the Australian Institute of Health and Welfare (AIHW) and contains all registered deaths occurring in Australia since 1980. Each death is reported to the NDI on a death certificate with an underlying cause of death and any other conditions that led to death indirectly.3

National death registries are becoming a widely used epidemiological resource for mortality and causes of death in most developed countries. Several studies in the United States (US) and Australia have involved linkage with the NDI to establish whether death occurred and the cause of death.3–11 The usual method of linkage uses probabilistic software programs that assign each possible match a weight indicative of the likelihood that the match is a true one. Clear matches and non-matches can generally be established for many cases, but there often follows a time-consuming and subjective clerical review of cases where it is not immediately clear whether the match is a true one.

In this paper we present a feasible method of classifying whether linkage matches are true. The method essentially uses HIV/AIDS patients who have either been reported to have died or have recent active follow-up indicating that they are alive. The cut-offs are selected to maximise linkage among the known deaths (sensitivity) while minimising linkage among the known non-deaths. We also report the results of validation of the linkage conducted between the NHD and the NAR with the NDI.


AIDS is a condition notifiable by physicians. All AIDS diagnoses are reported to the health departments in all States and Territories and then on to the NAR.12 All HIV cases diagnosed by State reference laboratories are reported to State health departments and are subsequently registered with the NHD.13 Both the NHD and the NAR databases are maintained at the National Centre in HIV Epidemiology and Clinical Research (NCHECR), University of New South Wales.

Deaths following AIDS are notified by the treating doctors to the State and Territory health departments and then forwarded to the NAR. Moreover, deaths after AIDS may also be identified by linkage with the State and Territory Register of Deaths in some health jurisdictions. Deaths among cases of HIV infection without an AIDS notification have been routinely notified through State and Territory health departments to the NHD from January 2000. The date of last medical contact on the NAR is updated each year by State and Territory health authorities for people living with AIDS. Approval was obtained from the ethics committees at the University of New South Wales and the AIHW to link the NHD and NAR with the NDI.

Merging NHD with NAR

Before performing the linkage of the NAR and the NHD with the NDI, HIV and AIDS cases diagnosed from 1980 to the end of 2003 and reported to the NCHECR by 30 June 2004 were merged into a single file using the following procedure to establish how many HIV diagnoses have also been reported with AIDS. For the purposes of confidentiality, HIV and AIDS cases in Australia are registered with a namecode (the first two letters of family name and given name). First, 6,452 HIV diagnoses without full identifying information available (namecode, date of birth and sex) in the NHD were discarded since those cases would not be able to be linked with the NDI. All AIDS diagnoses on the NAR have complete identifying information. The HIV cases on the NHD with complete identifying information were merged with cases in the NAR in three stages:

  • Stage 1. Matching on the National AIDS Registry number. Some of the AIDS cases are registered with the same national AIDS number in the NHD.
  • Stage 2. Exact match on the namecode, date of birth, and sex.
  • Stage 3. Matching on the State number. Each HIV/AIDS diagnosis is assigned a number by the State/Territory health authority. All HIV/AIDS diagnoses in Queensland and Western Australia are registered with the same State number in both the NHD and NAR, and in these two States cases could be further matched using this State number.

All cases on the NAR and NHD were then aggregated, identifying each case as reported to both the NAR and the NHD (i.e. matched by the methods above), to the NAR only, or to the NHD only. The merge yielded 6,065 records of HIV infection matched to a subsequent AIDS diagnosis, 3,232 records of AIDS diagnoses without a matching HIV diagnosis, and 10,475 cases of HIV diagnosis only.
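The three-stage merge can be illustrated with a minimal sketch. The record layout, field names, State abbreviations, and the greedy one-to-one pairing below are assumptions made for illustration; the paper does not describe the software used for this step, and the sketch tests the three stages in order for each pair rather than running each stage over the whole file in turn.

```python
from dataclasses import dataclass
from typing import Optional


def make_namecode(family_name: str, given_name: str) -> str:
    """First two letters of the family name followed by the first two of the given name."""
    return (family_name[:2] + given_name[:2]).upper()


@dataclass(frozen=True)
class Case:
    # Hypothetical record layout, not the actual NHD/NAR schema.
    namecode: str
    dob: str                       # ISO date string, e.g. "1960-05-17"
    sex: str
    state: str
    state_no: Optional[str] = None
    aids_no: Optional[str] = None  # national AIDS registry number, if recorded


# States whose cases carry the same State number on both registers.
SHARED_STATE_NO = {"QLD", "WA"}


def same_person(hiv: Case, aids: Case) -> Optional[int]:
    """Return the stage (1, 2 or 3) at which the pair matches, else None."""
    if hiv.aids_no is not None and hiv.aids_no == aids.aids_no:
        return 1
    if (hiv.namecode, hiv.dob, hiv.sex) == (aids.namecode, aids.dob, aids.sex):
        return 2
    if (hiv.state in SHARED_STATE_NO and hiv.state == aids.state
            and hiv.state_no is not None and hiv.state_no == aids.state_no):
        return 3
    return None


def merge(nhd: list[Case], nar: list[Case]) -> dict[str, int]:
    """Count cases on both registers, on the NAR only, and on the NHD only."""
    matched_aids: set[int] = set()
    both = 0
    for hiv in nhd:
        for i, aids in enumerate(nar):
            if i not in matched_aids and same_person(hiv, aids) is not None:
                matched_aids.add(i)
                both += 1
                break
    return {"both": both, "nar_only": len(nar) - both, "nhd_only": len(nhd) - both}
```

The three counts returned by `merge` correspond to the three groups reported above (HIV with a subsequent AIDS diagnosis, AIDS only, and HIV only).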

Linkage strategy

The aggregated file of the HIV/AIDS cohort was linked to all deaths registered in the NDI using probabilistic linkage software. Since all deaths registered in the NDI to the end of March 2005 were included in the linkage, it was expected that deaths recorded in the NHD and NAR by 30 June 2004 would also be included. Records in the NDI are provided in two parts. The first part, from 1997 onwards, contains records for periods when dates of birth are available, and the second part, up to 1996, contains records for which exact dates of birth are unavailable and an estimated year of birth was derived. Linkage was undertaken through 10 passes based on different combinations of date of birth as a blocking variable and with the following matching fields: the first two letters of the family name, the first two letters of the given name, and sex (see Table 1).

Table 1.  The strategy of matching between HIV/AIDS databases and National Death Index, Australia, 1980–2003.

Pass(a)   Description of the blocking variable(b)
1         Date of birth; exact on day, month, and year
2         Date of birth; day wrong
3         Date of birth; month wrong
4         Date of birth; year wrong by ≤3 years
5         Date of birth; year exact(c)

Notes:
(a) Five further passes were identical to these, except that the family name and given name were swapped.
(b) The matching identifiers were the first two letters of family name, the first two letters of given name, and sex.
(c) Blocking on the part of the NDI where only year of birth is available (up to 1996).
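The blocking rules in Table 1 can be sketched as a small classifier. This is an illustration under stated assumptions, not the linkage software's logic: we assume "day wrong" means month and year agree, "month wrong" means day and year agree, and "year wrong" means day and month agree. The function names are hypothetical.

```python
from datetime import date
from typing import Optional


def dob_pass(cohort_dob: date, ndi_dob: date) -> Optional[int]:
    """Return the first of passes 1-4 whose blocking rule the pair satisfies."""
    same_day = cohort_dob.day == ndi_dob.day
    same_month = cohort_dob.month == ndi_dob.month
    year_gap = abs(cohort_dob.year - ndi_dob.year)
    if same_day and same_month and year_gap == 0:
        return 1                      # exact on day, month and year
    if same_month and year_gap == 0:
        return 2                      # day wrong
    if same_day and year_gap == 0:
        return 3                      # month wrong
    if same_day and same_month and year_gap <= 3:
        return 4                      # year wrong by <= 3 years
    return None


def year_only_pass(cohort_dob: date, ndi_birth_year: int) -> Optional[int]:
    """Pass 5: NDI records up to 1996 carry only an estimated year of birth."""
    return 5 if cohort_dob.year == ndi_birth_year else None


def swapped(namecode: str) -> str:
    """Swap the family-name and given-name halves of a four-letter namecode."""
    return namecode[2:] + namecode[:2]
```

Running the same five rules against `swapped(namecode)` would give the five further passes described in note (a).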

In each pass, the software package assigns a weight to each of two compared records indicating the likelihood that the two compared records are really from the same individual. Within each pass, a higher weight indicates a greater likelihood of a genuine match. However, weights are not directly comparable across different passes and can only be used to rank the likelihood of a match within a pass. In the matching process, possible matches where the date of death preceded the date of last active contact were considered not to be matches.
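For readers unfamiliar with probabilistic weights, the following sketch shows one common way such weights are formed: a sum of log-likelihood ratios over the matching fields, in the style of Fellegi and Sunter. This is an assumption for illustration only. The paper does not state which formula the software used, and the m- and u-probabilities below are hypothetical values, not estimates from the data. The second function applies the exclusion rule described above.

```python
import math
from datetime import date
from typing import Optional

# Hypothetical m- and u-probabilities per matching field (not from the paper):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELDS = {
    "family2": (0.95, 0.01),
    "given2": (0.95, 0.02),
    "sex": (0.99, 0.50),
}


def pair_weight(agreements: dict[str, bool]) -> float:
    """Sum log2 likelihood ratios over the matching fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_admissible(date_of_death: date, last_contact: Optional[date]) -> bool:
    """Reject a candidate match whose date of death precedes the last active contact."""
    return last_contact is None or date_of_death >= last_contact
```

Under this scheme, agreement on a discriminating field such as the two-letter family-name code adds much more weight than agreement on sex, which is consistent with weights being useful for ranking candidates within a pass.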

However, date of death was not used as an identifier in the matching process because this would only be available for cases known to have died.

Accepting the matches

In the HIV/AIDS cohort, there were 6,900 people with HIV/AIDS diagnoses reported by State/Territory health departments to be dead by 1 January 2003. A further 1,455 people with an HIV/AIDS diagnosis had an active follow-up beyond 1 January 2003 and could be assumed to be alive beyond this date. Based on these 6,900 known deaths and 1,455 known non-deaths, we chose combinations of weights across the 10 passes that maximised sensitivity (i.e. matched a high proportion of known deaths) while also maximising specificity (i.e. matched a low proportion of known non-deaths). For fixed values of specificity between 90% and 96%, we chose the combination of weights that maximised sensitivity. The final set of weights across the 10 passes used to indicate matches and non-matches was chosen so that sensitivity and specificity were maximised.
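The selection of cut-off weights can be sketched as a constrained search: for a required minimum specificity, find the per-pass cut-offs that maximise sensitivity. The data layout (each person represented by the best candidate weight found in each pass), the exhaustive grid search, and all names below are assumptions for illustration; the paper does not describe the search procedure actually used, and an exhaustive search over 10 passes would need a coarse grid or a smarter optimiser.

```python
from itertools import product

# For each person: best candidate weight found in each pass (pass number -> weight).
Weights = dict[int, float]


def is_linked(person: Weights, cutoffs: dict[int, float]) -> bool:
    """A person is linked if any pass yields a weight at or above its cut-off."""
    return any(w >= cutoffs[p] for p, w in person.items())


def evaluate(cutoffs, known_deaths, known_alive):
    """Sensitivity among known deaths and specificity among known non-deaths."""
    sens = sum(is_linked(d, cutoffs) for d in known_deaths) / len(known_deaths)
    spec = sum(not is_linked(a, cutoffs) for a in known_alive) / len(known_alive)
    return sens, spec


def best_cutoffs(known_deaths, known_alive, grid, min_specificity):
    """Exhaustive search over per-pass cut-off grids (adequate for a toy example)."""
    passes = sorted(grid)
    best = None
    for combo in product(*(grid[p] for p in passes)):
        cutoffs = dict(zip(passes, combo))
        sens, spec = evaluate(cutoffs, known_deaths, known_alive)
        if spec >= min_specificity and (best is None or sens > best[0]):
            best = (sens, spec, cutoffs)
    return best
```

Calling `best_cutoffs` repeatedly with `min_specificity` set to 0.90, 0.91, and so on up to 0.96 would trace out the kind of sensitivity-specificity trade-off described above.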


The maximum sensitivities obtained for fixed specificities between 90% and 96% are summarised in Figure 1.

Figure 1.

Maximum sensitivity for fixed values of specificity in an optimal trade-off between sensitivity and specificity of linkage between HIV/AIDS databases and NDI, Australia, 1981–2003.

A specificity of 92% (confidence interval (CI) 0.87–0.97) and sensitivity of 82% (CI 0.80–0.84) were found to optimise the linkage in terms of the area under the ROC curve (see Figure 1). Using the cut-off weights for maximum specificity and sensitivity, we matched 5,658 of 6,900 (82%) HIV/AIDS deaths to the NDI (see Table 2). The false-positive matching rate of known non-death was 116 of 1,455 (a specificity of 92%).
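The reported rates follow directly from the counts given above. The short calculation below reproduces them; the function names are ours, and only the four counts come from the paper.

```python
def sensitivity(linked_deaths: int, known_deaths: int) -> float:
    """Proportion of known deaths that were linked to an NDI record."""
    return linked_deaths / known_deaths


def specificity(false_positives: int, known_non_deaths: int) -> float:
    """Proportion of known non-deaths that were correctly left unlinked."""
    return 1 - false_positives / known_non_deaths


sens = sensitivity(5658, 6900)
spec = specificity(116, 1455)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
# prints: sensitivity = 82.0%, specificity = 92.0%
```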

Table 2.  Sensitivity and specificity of optimal linkage of NHD and NAR with NDI, Australia, 1980–2003.

Sensitivity by year of death as reported to the NAR is shown in Figure 2. Sensitivity was high during the first few years of the AIDS epidemic in Australia, then decreased into the range 77% to 81% during 1986 to 1992. Thereafter, sensitivity increased into the range 86% to 92% during 1993 to 2002.

Figure 2.

Sensitivity of linkage to the NDI by year of death reported to the HIV/AIDS databases, Australia, 1981–2002.

Since date of death for the known deaths was not used in the linkage process as an identifier, it was possible for known deaths to be matched to deaths in the NDI with a later date of death. The time difference between the date of death reported to the HIV/AIDS databases and the date of death registered in the NDI was examined. The results revealed that 1,234 of the 6,900 deaths reported to the HIV/AIDS databases had a later date of death in the NDI. The interval from date of death on the NAR to date of death on the match in the NDI for these 1,234 deaths is summarised in Figure 3. Most of these known deaths with later matched dates of death had a relatively short interval, but for some cases the interval was as large as 16–19 years. However, the highest proportion of known deaths with discrepant dates of death was for deaths occurring in 1981–86 (see Figure 4). For the period 1994 to 2002, the period containing the majority of HIV/AIDS cases with unknown vital status on the NAR, the proportion of known deaths with a discrepant matched date of death was below 10%.
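The discrepancy measures described here are simple to compute once each known death is paired with its matched NDI record. The sketch below assumes a hypothetical list of (registry date, NDI date) pairs; only the counts 1,234 and 6,900 are taken from the paper.

```python
from datetime import date


def discrepancy_years(registry_death: date, ndi_death: date) -> float:
    """Years from the registry date of death to the NDI date of death."""
    return (ndi_death - registry_death).days / 365.25


def proportion_later(pairs: list[tuple[date, date]]) -> float:
    """Share of matched known deaths whose NDI date of death is later."""
    later = sum(1 for registry, ndi in pairs if ndi > registry)
    return later / len(pairs)


# Overall proportion implied by the counts reported above.
overall = 1234 / 6900
print(f"{overall:.1%}")  # prints: 17.9%
```

Grouping the pairs by year of registry death before calling `proportion_later` would give the by-year proportions of the kind plotted in Figure 4.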

Figure 3.

Time difference from date of death on NAR and NHD to date of death on NDI for matches with discrepant date of death in linkage between HIV/AIDS databases and NDI, Australia, 1981–2002.

Figure 4.

Proportion of matches for which date of death reported to the HIV/AIDS databases was different from date of death registered in the NDI by year of death, Australia, 1981–2002.


By choosing weights of matched pairs that maximise sensitivity and specificity, we matched 82% of known deaths among HIV/AIDS cases to a death record, and 92% of known non-deaths were correctly left unmatched. Since the methodology was based on using known deaths and known non-deaths to choose cut-off weights, the estimates of sensitivity and specificity were obtained without a lengthy subjective clerical review. If it can be assumed that subjects with unknown vital status have similar chances of being accurately linked as subjects who are known to be dead or alive, it seems reasonable to suppose that these rates of sensitivity and specificity will apply to HIV/AIDS cases of unknown vital status. It may be that a proportion of subjects with unknown vital status are more difficult to trace at a State level, and that these subjects would be more difficult to link accurately. This would mean that our estimates of the sensitivity and specificity of the linkage procedure would be too optimistic when applied to subjects with unknown vital status. However, data are not available that would allow the extent to which subjects with unknown vital status have poorer linkage to be quantified.

Previous studies have reported a higher sensitivity on matching to the NDI in Australia. Powers et al., in a study establishing the vital status of older women, obtained a sensitivity of 95% and specificity of 98%.9 In an attempt to assess the accuracy of the NDI, Kelman reported sensitivity and specificity of 88.8% and 98.2% respectively using a cohort of patients who had received one of five types of medical implant.4 Magliano et al. assessed the accuracy of the NDI by linking the deaths among participants in the long-term Intervention with Pravastatin in Ischaemic Heart disease study and reported 93.7% sensitivity and 100% specificity.3 Kariminia et al. have demonstrated sensitivity of 88.4% and specificity of 99.7% in a study of mortality among prisoners.8 The relatively poorer sensitivity seen in our study is likely a consequence of having namecodes only, as opposed to full names, available for the linkage process. Namecodes are susceptible to a variety of errors, such as patients using aliases, variation in coding of apostrophes or common prefixes such as Mc or Mac, and inadvertent misordering of family and given names. All of these errors would reduce the efficiency of the linkage and would be difficult, if not impossible, to pick up from the namecode alone.

We found that sensitivity was high for deaths during the first years of the AIDS epidemic in Australia. This might be related to the relatively few HIV/AIDS deaths that occurred during those years. Sensitivity was lower for deaths during the period 1984 to 1994, a time when effective antiretroviral treatment was not available and the number of HIV/AIDS deaths was high. The relatively poorer sensitivity for deaths reported during this period might be related to the lack of exact dates of birth on the NDI before 1997. Sensitivity was relatively high after 1995, the period during which there are most HIV/AIDS cases of unknown vital status on the NAR and for whom the linkage with the NDI will be most useful.

Although sensitivity for linkage of fact of death was found to be 82%, 18% of these matches had a date of death later on the NDI, and for some cases it was many years later. At least some of these matches, but an unknown proportion, will be incorrect. However, the proportion of matches with discrepant dates of death was relatively low for the period 1994–2002, which, as noted above, is the time period during which there are most HIV/AIDS cases of unknown vital status. Most of the matches with a very discrepant date of death were prior to 1996 (see Figure 4), when information on the date of birth was not available on the NDI.

HIV diagnoses without full identifying information were not entered in the linkage with the NDI. This was because these diagnoses, which often had no identifiers, could not be expected to be linked reliably. These HIV cases were mostly diagnosed in the early 1980s to the mid 1990s, when confidentiality of HIV notifications was a real concern to many individuals. Although these HIV cases have incomplete identifiers, were not linked, and hence probably have quite incomplete survival information, it seems reasonable to assume that survival outcomes in these subjects would be similar to contemporaneous HIV diagnoses with identifying information. Hence, overall survival rates, and numbers of people living with HIV, can be estimated by applying mortality rates in cases that were included in the linkage to the HIV cases for which identifying information was not available.

Although the linkage in our study was not perfect, which was expected given the limited identifiers available, the information available from this linkage validation on sensitivity and specificity, the proportion of matches with discrepant dates of death, and the time period from date of death on the NAR to date of death on the NDI, can all be used when estimating overall mortality rates among HIV/AIDS cases reported to the NHD and NAR.

Our study presents a feasible method for conducting linkage studies whenever known deaths and known non-deaths are available, one that allows the subjective and often time-consuming process of clerical review to be avoided. The success of this kind of linkage method depends on reported deaths and non-deaths, and therefore a follow-up system is strongly suggested for surveillance case registries, even if vital status is planned to be established through linkage with national death registries.


The authors thank John Harding, Mark Short and other staff of the Australian Institute of Health and Welfare for their help. The HIV/AIDS linkage study was funded by a PhD scholarship (Iran Ministry of Health) and National Centre in HIV Epidemiology and Clinical Research. NCHECR is funded by the Australian Government Department of Health and Ageing and is affiliated with the Faculty of Medicine, University of New South Wales.