Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative data: summary of findings and suggestions for future research


Correspondence to: R. Carnahan, Department of Epidemiology, The University of Iowa College of Public Health, C 21-H General Hospital, 200 Hawkins Dr., Iowa City, IA 52242, USA. E-mail:



The validity of findings from surveillance activities that use administrative and claims data to link exposures to adverse events depends in part on the validity of the algorithms used to identify health outcomes in these data. This review provides a high-level overview of the findings of 19 systematic reviews of studies that examined the validity of such algorithms. The author categorized outcomes on the basis of the strength of evidence supporting valid algorithms to identify acute or incident events and suggested priorities for future validation studies.


The 19 reviews were evaluated, and key findings and suggestions for future research were summarized by a single reviewer. Outcomes with algorithms that consistently identified acute events or incident conditions with positive predictive values of greater than 70% across multiple studies and populations are described as low priority for future algorithm validation studies.


Algorithms to identify cerebrovascular accidents, transient ischemic attacks, congestive heart failure, deep vein thrombosis, pulmonary embolism, angioedema, and total hip arthroplasty revision performed well across multiple studies and are considered low priority for future validation studies. Other outcomes were generally thought to require additional validation studies or algorithm refinement to be confident in algorithms. Few studies examined the validity of International Classification of Diseases, 10th Revision, codes.


Users of these reviews need to consider the generalizability of findings to their study populations. For some outcomes with poorly performing codes, it may always be necessary to validate cases. Copyright © 2012 John Wiley & Sons, Ltd.


The Food and Drug Administration (FDA) Mini-Sentinel contract is a pilot program that aims to conduct active surveillance to detect and refine safety signals that emerge for marketed medical products. The program uses administrative and claims data from several collaborating partner organizations to examine and strengthen signals related to the safety of drugs, medical devices, and biologics. Because diagnosis and other codes found in administrative and claims data are not always accurate, it is important to understand the performance characteristics (i.e. predictive values, sensitivity, and specificity) of various algorithms for identifying health outcomes from the codes contained in administrative and claims data. This understanding is vital in determining the validity of exposure–outcome associations observed in these data sources. Because establishing a temporal relationship helps provide causal inference when evaluating adverse events related to medical products, it is useful to know whether an algorithm identifies incident or acute events as opposed to prevalent or preexisting conditions. It is also important to understand the populations from which these validation estimates were obtained to determine whether they can be generalized to Mini-Sentinel or other data sources. Predictive values are dependent on the prevalence of an outcome in a population, so they may be higher in a high-risk population compared with a low-risk population. Differences in the validity of codes across settings of care and data sources are also important to consider. If the validity of algorithms to identify acute or incident conditions is well characterized as excellent on the basis of studies in populations and data sources similar to those in which evaluations are conducted, some may argue that it is not necessary to require medical record review to validate cases, although this is debatable. 
If the validity is not well characterized, positive predictive values (PPVs) are inadequate, algorithms do not differentiate acute events or incident conditions, or the populations of algorithm validation studies differ from those in which the algorithm will be applied, it is generally necessary to conduct some type of outcome validation to ensure that the findings of evaluations are valid.
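Because predictive values depend on outcome prevalence, the same algorithm can yield very different PPVs in high-risk and low-risk populations. As a rough illustration (the sensitivity, specificity, and prevalence values below are hypothetical, not drawn from any of the reviews), a minimal sketch applying Bayes' rule:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule:
    true positives / (true positives + false positives), per unit population."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical algorithm: 80% sensitive, 98% specific.
# High-risk population (10% outcome prevalence):
print(round(ppv(0.80, 0.98, 0.10), 2))  # ~0.82
# Same algorithm in a low-risk population (1% prevalence):
print(round(ppv(0.80, 0.98, 0.01), 2))  # ~0.29
```

The identical code set drops from a high to a poor PPV solely because the outcome is rarer, which is why validation estimates from one population cannot be assumed to generalize to another.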

The Mini-Sentinel program contracted with collaborators to conduct systematic reviews of studies focused on the validity of algorithms for identifying 20 health outcomes of interest (HOIs) in administrative data. These outcomes were selected by the FDA because they were thought to be important in monitoring the safety of medical products. Outcomes that had been the subject of prior reviews commissioned by the Observational Medical Outcomes Partnership (OMOP) were generally excluded from consideration because reviews of validated algorithms had been recently conducted. These reviews can be found on the OMOP Web site. Because of significant overlap of two outcomes (revision and excision of orthopedic medical devices) and overlap of the studies examining these outcomes, the responsible investigators provided a single report examining both outcomes. Thus, a total of 19 systematic reviews were conducted. The purposes of this manuscript are to briefly summarize the results of these reviews, to discuss the major gaps in evidence for performance characteristics of the algorithms for these HOIs, and to provide recommendations for HOIs for which future validation studies are most important for understanding performance characteristics of algorithms.


The methods for these reviews are overviewed in the accompanying manuscript by Carnahan and Moores.[1] This manuscript provides a high-level summary of the results of the reviews on the basis of the current author's judgment and interpretation of the key findings and gaps. It should be noted that any judgment of an algorithm as good, poor, and so forth is not a classification based on scientific evidence, because there are no established thresholds by which an algorithm can be judged. Generally, PPVs of 70% or greater will be described as high, PPVs of 50% to 70% as moderate, and PPVs of less than 50% as low or poor. These categories are not absolute, and their implications for the application of algorithms to future research depend on the needs of end users. Statements regarding priorities for future research are mainly based on the amount and consistency of evidence on the validity of algorithms, the ability to identify acute events or incident conditions, the performance characteristics of the algorithms, and the populations within which they have been studied.


Cerebrovascular accident/transient ischemic attack

Andrade et al.[2] identified 35 studies that reported on the validity of algorithms for cerebrovascular accidents (CVAs) and/or transient ischemic attacks (TIAs). The algorithms ranged from examining composite end points to studying a specific type of cerebrovascular event. Algorithms for studying stroke or intracranial bleeds generally had PPVs of 80% or greater. Algorithms for TIA generally had PPVs of 70% or greater. Algorithms for composite end points had PPVs ranging from 57% to 92% for acute events and from 33% to 96% for those examining a history of or current disease. Algorithms that used more focused sets of codes generally performed better. Most studies examined incident events or hospitalizations. For stroke, 24/25 studies examined acute events and 22/25 evaluated only hospitalizations for stroke. Hospitalization is an important indicator because it implies a serious event. For TIA, 6/7 studied acute events and 5/7 studied hospitalizations or emergency department visits only. For intracranial bleed and subarachnoid hemorrhage, 5/5 studied incident events. Composite definitions of cerebrovascular disease were the exception, as 6/10 studies included prevalent cerebrovascular disease.

Because a large number of studies have been conducted that found relatively high PPVs, these outcomes should probably be considered a low priority for future validation studies. However, some questions do remain. The report authors recommended that future studies might examine differences in validity of algorithms based on age and sex, differentiation of ischemic strokes caused by thrombosis versus emboli, comparing algorithms based on standard criteria, and comparing algorithms for first events versus recurrent events. Also, only a small number of studies examined International Classification of Diseases, 10th Revision (ICD-10), codes.

Congestive heart failure

Saczynski et al.[3] identified 35 studies that reported on the validity of algorithms for heart failure. These included inpatient and outpatient settings, incident or prevalent heart failure, and studies examining performance characteristics on the basis of the position of heart failure codes. Most algorithms had PPVs greater than 90%. Incident heart failure, the focus of 7/35 studies, could be identified quite well by requiring a multiyear disease-free baseline period of eligibility. Hospitalization for heart failure was the focus of 24/35 studies.

The large number of studies and high PPVs suggest that this outcome should be a low priority for future validation studies. However, the authors did note that most of the algorithms used only inpatient codes for heart failure. Because heart failure is often managed in the outpatient setting, it may be important to further establish the validity of outpatient codes. The authors also recommended more studies of specific subgroups of high interest, and more studies examining ICD-10 code validity.

Atrial fibrillation

Jensen et al.[4] identified 16 studies that reported on the validity of algorithms for atrial fibrillation. They found that both inpatient and outpatient atrial fibrillation codes had PPVs ranging from 70% to 97%. Sensitivity in six studies ranged from 57% to 95%. Only one study specifically sought to identify incident atrial fibrillation and found a PPV of 77% for this algorithm.

Despite a large number of studies on this topic, only one study sought to identify incident atrial fibrillation. Given that this is likely to be the specific outcome of interest for studies examining exposures that cause atrial fibrillation, a confirmatory validation study may be considered of moderate priority. The report authors recommended the use of electrocardiogram data, where available, as part of future algorithms to be studied. They also suggested future research on whether both an ICD code and electrocardiogram code should be required in an algorithm, the number of diagnoses that should be required in an algorithm, and the time period within which these diagnoses should occur to consider a person to have incident atrial fibrillation.

Ventricular arrhythmias

Tamariz et al.[5] identified nine studies that reported on the validity of algorithms to identify ventricular arrhythmias. PPVs ranged widely, from 5% to 100%. Algorithms using more specific sets of codes had PPVs in the 70% to 80% range or better. In the studies that examined sensitivity and specificity, sensitivity was 77% or better and specificity was 84% or better. All studies seemed to focus on acute events.

A fair amount of evidence exists on algorithms to identify ventricular arrhythmias. However, the variability in findings may increase the priority of further algorithm validation studies from low to moderate, depending on whether the more restricted sets of codes with better PPVs are considered satisfactorily sensitive. The report authors suggested that future research might make more use of pharmacy, procedure, or diagnosis-related group codes. They also suggested future studies focused on high-risk groups, such as those with heart failure, or groups of different race/ethnicity. Also, few studies have examined ICD-10 codes. Lastly, most case identification and validation strategies were based on medical claims and records and did not incorporate death certificates. Many fatal arrhythmias would likely be missed by any study that used only administrative and claims data to identify these events. Future work might examine the number of events identified in administrative and claims data versus death certificates.

Venous thromboembolism

Tamariz et al.[6] identified 15 studies that reported on algorithms to identify deep vein thrombosis (DVT) or pulmonary embolism in administrative data and the validity of the algorithms. A wide range of populations were studied, and there was some variability in findings. Although the performance of specific codes varied, the better algorithms using more specific codes generally had PPVs greater than 70%, and some PPVs exceeded 90%. This was true for both DVT and pulmonary embolism, although pulmonary embolism codes generally performed better than those for DVT. All studies focused on acute events.

Given the number of studies available that generally showed relatively good performance of algorithms, these outcomes might be considered low priority for future validation studies. However, the report authors suggested that more research should be conducted on the potential value of adding other types of codes to algorithms, such as procedure codes, diagnosis-related groups, or pharmacy claims for anticoagulant medications.


Depression

Townsend et al.[7] identified 11 studies that examined the validity of algorithms to identify depression. Validation methods varied, including patient self-report, depression screening instruments, and medical record diagnoses. PPVs ranged widely, from 31.5% to 98.8%, and were not clearly dependent on the method of validation. Only one study seemed to focus on new-onset depression, which could be considered a substantial limitation in evidence. It is also important to consider that depression is underdiagnosed, and no algorithm using diagnosis codes from medical encounters will capture depression that remains undiagnosed but might be detected by screening instruments. Generally, the evidence suggested that most people with a depression diagnosis code are depressed, although many depressed individuals do not receive a diagnosis.

Given that several studies have been conducted examining algorithms to identify clinically diagnosed depression, future validation studies might be considered of moderate priority. However, the potential importance of this outcome in youth and the lack of studies in this patient subgroup may raise its priority level. The lack of focus of validation studies on newly diagnosed depression is also a major gap because this is of greatest interest for surveillance. The authors suggested several priorities for future research, including a replication study of the most promising algorithm to identify clinically diagnosed depression, assessment of this algorithm against an independent standard in settings that conduct routine screening for depression (though screening tools have limitations), and research on algorithms to identify depression in youth.

Suicide and suicide attempts

Walkup et al.[8] identified six studies examining the validity of algorithms to identify suicide or suicide attempts. The studies and their findings varied considerably. Several studies examined the validity of E-codes to determine intent of self-harm, with PPVs ranging from 36.5% to 100%, depending on the study population and method of validation. Sensitivity was evaluated in two studies. When compared with death certificates, coroner reports, and law enforcement reports in one study, the sensitivity of administrative data in identifying suicide was approximately 14%. The algorithms used in this comparison were relatively nonspecific but included a broad algorithm examining the presence of any visit to an emergency department or hospital within a day of suicide death. This suggests that administrative and claims data had very low sensitivity in capturing suicide in this study population. Another study used a death registry as the criterion standard and found a sensitivity of 65% for a more specific algorithm. It is difficult to reconcile these differences, but they may reflect regional variation in coding or factors related to a suicide reaching medical attention and thus appearing in claims. Overall, the report findings suggest a moderate to good PPV for most algorithms using administrative and claims data to identify hospitalization for self-harm, but poor to moderate sensitivity of administrative and claims data for identifying suicides.

On the basis of the limited and inconsistent evidence, this outcome might be considered high priority for future research. The authors concluded that it is difficult to recommend any single algorithm to identify suicide or suicide attempts. Because E-codes are inconsistently coded, the authors suggested future research that would help determine the completeness of E-codes in different settings. They also suggested research on the validity of algorithms in patient subgroups of interest. Finally, the potentially poor sensitivity of administrative and claims data to completed suicide indicates a need for linkages with death certificates, at a minimum, to effectively identify suicide.

Seizures, convulsions, or epilepsy

Kee et al.[9] identified 11 studies that reported on the validity of algorithms to identify seizures, convulsions, or epilepsy. Performance characteristics varied widely, but the more focused algorithms to identify epilepsy in adult populations generally had PPVs approximately 80% or higher. These studies focused on prevalent epilepsy, although some studies of acute events included new epilepsy diagnoses in their case definitions. PPVs in studies of vaccines and seizures in children varied by the setting of diagnosis. PPVs of inpatient and emergency department codes ranged from 60% to 94%, whereas outpatient codes had very low PPVs.

Given the available evidence, future validation studies for this outcome might be considered of moderate priority. Algorithms to identify seizures, not epilepsy, in adult populations may benefit from further refinement. Two studies of seizures in children receiving vaccines suggest that the preferred algorithm might use inpatient and emergency department codes and not outpatient codes.


Pancreatitis

Moores et al.[10] identified eight studies that examined the validity of algorithms to identify pancreatitis. PPVs ranged widely, but most studies found PPVs in the 60% to 80% range. Algorithms focused on acute pancreatitis performed better than those that included chronic pancreatitis. Future validation studies of pancreatitis might be considered of moderate priority on the basis of available evidence. It should be noted that most validation studies examined focused populations from single centers. More generalizable validation statistics may be desirable. It would also be useful to examine the potential for improvement of algorithms in settings where laboratory results can be linked with administrative data.


Lymphoma

Herman et al.[11] identified only one study that examined the validity of algorithms to identify lymphoma. Of four algorithms studied, the highest PPV was 62.8%. Sensitivity ranged from 55.2% to 88.7%. The reference standard for validation in this study was a single state cancer registry, and it was noted that such registries may not identify lymphoma consistently. Given that only one validation study has been conducted that used a single state cancer registry as the validation reference, future validation studies of lymphoma algorithms might be considered a high priority. Future studies should consider medical record review to augment state cancer registries to validate potential cases, that is, the medical records of cases not confirmed by the registries should be reviewed.

Infection related to blood products, tissue grafts, or organ transplantation

Carnahan et al.[12] identified only one validation study of an algorithm to identify infections in recipients of blood products, tissue grafts, or organ transplants. This study examined the validity of an algorithm to identify aspergillosis in transplant recipients. No studies validated or even used algorithms to identify infections transmitted by these sources. Such infections are of greatest interest for surveillance. Two studies found that a code for allogeneic red blood cell transfusion had near perfect specificity, but sensitivity ranged from 21% to 83%. Other evidence suggests that administrative data may lack sensitivity for identifying transfusions. No studies were found that validated algorithms to identify other types of transfusions, tissue graft, or organ transplants.

Given the lack of studies to identify infections transmitted by blood products, tissue grafts, or organ transplantation in administrative data, this outcome might be considered a high priority for future research. Given that confirming the source of infection in a clinical setting can be difficult, identifying such infections in administrative data might be difficult. However, codes for infections resulting from medical care may improve the specificity of such algorithms.

Transfusion-associated sepsis

Carnahan et al.[13] identified no studies that reported on the validity of algorithms to identify transfusion-associated sepsis. Because no such studies were identified, they instead reported on four validation studies of sepsis algorithms and two validation studies of a code for allogeneic red blood cell transfusions, which were discussed previously. Two studies reported PPVs of approximately 80% or greater for sepsis algorithms, whereas one study in Veterans Affairs surgical patients had a much lower PPV of 44% and a sensitivity of 32%. The coding in Veterans Affairs settings may differ because codes are not linked to reimbursement in the same way as in other settings. Another multicenter study found sensitivity of 75.4% for an algorithm to identify sepsis.

The lack of studies that attempted to confirm transfusion as a cause of sepsis suggests that this outcome should be a high priority for future validation studies. Such studies should examine the effect of including codes for infections transmitted through medical care and a newly adopted code that indicates acute infection after transfusion. Sepsis algorithms varied, so the validity of specific codes might be further examined. Platelet transfusion has historically been among the more common sources of infection because platelets are stored at room temperature, which can allow bacterial contamination. Thus, it would also be very useful to examine the validity of codes to identify platelet transfusion.

Transfusion-related ABO incompatibility reactions

Carnahan et al.[14] identified no studies examining the validity of algorithms to identify transfusion-related ABO incompatibility reactions. Some evidence suggests that administrative data might be insensitive to identifying these reactions, but they have been identified in some studies using administrative data. Given the lack of available information, future algorithm validation studies might be considered of high priority.

Erythema multiforme, Stevens–Johnson syndrome, and toxic epidermal necrolysis

Schneider et al.[15] identified four studies that examined the validity of algorithms to identify erythema multiforme, Stevens–Johnson syndrome, or toxic epidermal necrolysis. Only three provided validation statistics. The PPVs ranged from 44% to 51%. The most recent data from these studies were from 1986, suggesting a need for more current information on the performance of the relevant codes. This outcome might be considered a high priority for future validation studies. Evidence is outdated, and all studies examined a single code to identify this outcome. It is possible that less specific codes for drug-induced hypersensitivity reactions may also identify some cases. Future studies might examine the validity of algorithms dependent on the setting of care and specialty of the diagnostician.

Anaphylaxis, including anaphylactic shock and angioedema

Schneider et al.[16] identified six studies that provided validation statistics for anaphylaxis, anaphylactic shock, or angioedema. The code for anaphylactic shock performed better than other codes to identify anaphylaxis, with PPVs ranging from 52% to 57.1% in three studies that reported on this specific code. PPVs for other anaphylaxis codes were very low. Codes for angioedema performed quite well, with PPVs of 90% and 95.3% in two studies.

Generally, codes for anaphylaxis seem to perform at a level that is less than that desired. It may be necessary to confirm every case through medical record review in studies of this outcome unless an algorithm with a higher PPV is identified. It may also be beneficial to include a broader set of codes to capture more cases, although care should be taken to prevent major losses in efficiency of the medical record review process from selection of an algorithm with too many codes. Alternatively, potential strategies that might increase the PPV include examining the addition of codes for epinephrine injections. Additional research may further delineate codes appropriate for identifying potentially drug-related anaphylaxis, as opposed to that from other causes. Such research might further examine the value of using E-codes for this purpose. Because angioedema codes seem to perform well, future validation studies for this outcome might be considered a low priority.

Hypersensitivity reactions other than anaphylaxis

Schneider et al.[17] identified four studies that reported on the validity of algorithms to identify hypersensitivity reactions other than anaphylaxis and another that focused on anaphylaxis but explored some related codes. Two of the studies that reported validation statistics were focused on angioedema. They found PPVs >90%, as previously described. Another study used data mining to examine a long list of candidate codes to predict the presence of abacavir-associated hypersensitivity reactions among new users of the drug and found an algorithm with 95% sensitivity and 90% specificity. Unfortunately, this focused on a type of hypersensitivity reaction that is somewhat unique to this drug, so the algorithm cannot be generalized to other hypersensitivity reactions. A final study found a PPV of 15% for an algorithm to identify drug-related allergic reactions other than anaphylaxis and was unable to identify a well-performing algorithm despite attempts at algorithm refinement.

Very few useful data are available to support the validity of algorithms to identify hypersensitivity reactions other than anaphylaxis or angioedema. This outcome might be considered a high priority for validation studies, except for the fact that this category of hypersensitivity reactions generally seems to represent a less serious set of outcomes compared with the others, for example, urticaria is highly preferable to anaphylaxis. Research might explore the differential utility of specific codes for hypersensitivity reactions, and also consider the utility of including E-codes that indicate a drug as the cause of a hypersensitivity reaction.

Pulmonary fibrosis and interstitial lung disease

Jones et al.[18] identified no studies that reported on the validity of algorithms to identify pulmonary fibrosis (PF)/interstitial lung disease (ILD). Given that no studies are available to determine the validity of algorithms to identify PF/ILD, this outcome may be considered a high priority for validation studies. The ICD-9-CM codes used to identify the outcome were fairly consistent across the studies that applied such algorithms without validating them. However, research might examine the effect of using only inpatient versus inpatient and outpatient codes or requiring diagnostic procedure codes consistent with confirming the outcome. Finally, the clinician reviewer noted that PF is typically considered a subtype of ILD, and ILD actually represents a large number of discrete diseases and conditions. The two major codes used to identify the outcome, 516.3 and 515.x, were described as too narrow and too broad, respectively, to capture all diagnoses of interest. Codes for this condition may vary by provider specialty or health care setting.

Acute respiratory failure

Jones et al.[19] identified no studies that reported on the validity of algorithms to identify acute respiratory failure. Because of the lack of studies identified in the original search, a Google Scholar search was conducted for this gap analysis on 13 January 2011 to help ensure that important studies were not missed. Further details can be found in the complete gap analysis and lessons learned document on the Mini-Sentinel Web site.[20] PPVs in three studies using algorithms to identify acute respiratory distress syndrome (ARDS) ranged from 4% to 38%.[21-23] Sensitivity was 79% and 88% in two studies and less than 10% in another study in an intensive care unit.

Because of the limited evidence, this outcome might be considered a high priority for future algorithm validation studies. However, the low PPVs and highly variable sensitivity in studies of ARDS raise concern about whether any algorithm will be sufficiently valid to identify this outcome. Even if all potential cases were confirmed by medical record review, codes may lack sensitivity in some settings. If acute respiratory failure presenting in an emergency setting is the specific outcome of interest, algorithms might perform differently; however, there is no evidence on algorithms to identify this outcome.

Orthopedic implant revision and removal

Singh et al.[24] identified five studies that examined the validity of algorithms to identify orthopedic implant revision. None specifically evaluated implant removal. Two studies reported on total knee arthroplasty (TKA) revision and three studies on total hip arthroplasty (THA) revision. The PPV for TKA revision was 32% in one study, despite a sensitivity of 77.2% and a specificity of 97.6%. Another study found a sensitivity of 87.2% and a specificity of 99%. Neither study used medical record review as the criterion standard but rather used other administrative or claims data as the criterion standard. Thus, the validation statistics should be interpreted with caution. The PPVs for revision THA were >90% in two studies and 71% in another, all using Medicare data to identify cases and medical record review as the criterion standard. The PPV of 71% related to whether the revision was performed on the same side as an index THA, and hence reflects a more stringent criterion standard. CPT codes were used in the better-performing algorithms.

Studies to improve the performance of algorithms to identify revision TKA might be considered of higher priority given the low PPV in the one study in which it was calculated and suboptimal criterion standards. Studies on algorithms for revision THA found relatively good performance of algorithms that used CPT codes, so this outcome is probably of low priority for future validation studies. An important caveat is that the validation studies were conducted using Medicare data, so generalizability to other data sources or younger patients could be questioned. Because no study examined implant removal, this might be considered a high priority for future validation studies.


The systematic reviews overviewed provide a wealth of information about algorithms to identify various health outcomes in administrative data. They also identify substantial gaps in evidence to support algorithms to identify several outcomes and many opportunities for future research. For some outcomes without algorithms with high PPVs, it may always be necessary to validate potential cases identified in administrative data, unless algorithms can be altered to enhance specificity.

Some limitations need to be considered. The reviews, although informative, may not have identified all relevant evidence. As mentioned in an accompanying manuscript describing the methods,[1] the indexing of algorithm validation studies is inconsistent, and searches are likely to miss some relevant publications. Also, this high-level overview did not provide extensive details on the study populations in which most outcomes were studied but only highlighted outcomes for which the populations studied to date clearly presented limitations to generalizability. It is recommended that users of these reviews examine them carefully to determine whether the information is generalizable to the populations they plan to study. This is an important consideration because PPVs depend on the prevalence of the outcome in the population studied. For example, the PPV of a code for venous thrombosis may be higher in postsurgical populations with mobility restrictions than in lower-risk outpatient populations. If the populations in which algorithms were validated are not representative of the population in which the algorithm will be applied, the generalizability of the results should be questioned. This logic can be extended to another consideration: the setting of the diagnosis. The PPV of seizure codes observed after vaccinations in children is higher in emergency department and inpatient settings than in outpatient settings, where the PPV is quite low.[9] This is likely due to parents seeking immediate medical care for more obvious and severe seizures, although it may also relate to the ability to confirm an event because of the availability of electroencephalograms in acute care settings. One caveat should be considered, particularly for diagnoses that are less objective or take multiple medical encounters to confirm.
In some populations, a high diagnostic suspicion, because of a high risk in that population, may lead to more frequent provisional or presumptive diagnoses without clear evidence to confirm the diagnosis. In theory, this could result in a reduced PPV of some algorithms in high-risk populations. For most outcomes, however, conventional wisdom that a higher prevalence leads to a higher PPV is likely to apply.
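The dependence of PPV on prevalence follows directly from Bayes' theorem. The sketch below illustrates the point with hypothetical values (the sensitivity, specificity, and prevalence figures are illustrative only, not drawn from the reviews): the same algorithm can yield a very different PPV in a low-risk outpatient population than in a high-risk postsurgical one.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Hypothetical algorithm: 80% sensitivity, 95% specificity.
# Low-risk population (1% prevalence):
print(round(ppv(0.80, 0.95, 0.01), 2))  # 0.14
# High-risk population (10% prevalence):
print(round(ppv(0.80, 0.95, 0.10), 2))  # 0.64
```

With sensitivity and specificity held fixed, a tenfold increase in prevalence raises the PPV from 14% to 64%, which is why validation results from one population may not generalize to another.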

Another limitation of this summary is that confidence intervals for PPV point estimates and other performance characteristics were very rarely reported by individual studies or reviews, so the precision of these estimates is not discussed. Precision depends on sample size. For example, a PPV of 70% based on 50 cases (35/50 confirmed) has a 95%CI of 57% to 83%. The same PPV based on 100 cases (70/100 confirmed) has a 95%CI of 61% to 79%, and the 95%CI narrows to 64% to 76% when based on 200 cases (140/200 confirmed). The imprecision of PPVs based on very small numbers of cases should therefore be taken into account when interpreting the evidence. Future reviews of algorithm validation studies should consider calculating CIs for these estimates when they are not reported by the source studies, so that precision can be better described.
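The interval widths quoted above are consistent with the standard normal-approximation (Wald) formula for a proportion. A minimal sketch of that calculation, assuming the Wald method:

```python
import math

def wald_ci_95(confirmed: int, total: int) -> tuple[float, float]:
    """95% Wald confidence interval for a proportion such as a PPV."""
    p = confirmed / total
    half_width = 1.96 * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

for confirmed, total in [(35, 50), (70, 100), (140, 200)]:
    lo, hi = wald_ci_95(confirmed, total)
    print(f"{confirmed}/{total}: {lo:.0%} to {hi:.0%}")
# 35/50: 57% to 83%
# 70/100: 61% to 79%
# 140/200: 64% to 76%
```

Note that the Wald interval is itself an approximation that degrades with small samples or proportions near 0 or 1; exact (Clopper–Pearson) or Wilson score intervals are preferable in those situations.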

On the basis of the amount of evidence available, the consistency of findings, and the performance characteristics of algorithms studied, the outcomes can be broadly classified into several categories. Those HOIs with algorithms that perform consistently well across several studies might be considered low priority for future validation studies. These include CVA, TIA, congestive heart failure, DVT, pulmonary embolism, angioedema, and THA revision. The study populations still need to be carefully considered since, for example, THA revision algorithms were only validated in Medicare populations. Thus, validation studies that include younger patients may fill a gap in available evidence. Most studies of these outcomes studied acute events rather than a history of an event and many restricted case definitions to hospitalizations. This increases the utility of algorithms for surveillance activities linking medical products to adverse events.

Other outcomes with a fair amount of evidence but less consistent findings on the validity of algorithms might be considered of moderate priority for future validation studies. These include atrial fibrillation; ventricular arrhythmias; depression; seizures, convulsions, or epilepsy; and pancreatitis. Most studies of atrial fibrillation and depression did not differentiate acute events or incident diagnoses, which may also increase their priority for future validation studies. The designations of moderate and high priority are arguable and do not consider the influence of decisions about which outcomes will be the focus of surveillance activities. In practice, any outcome for which there is insufficient evidence to be confident in an algorithm might need to be validated if it is the subject of a risk evaluation. An important question that influences this prioritization is how much is likely to be learned from another study focused specifically on validating or possibly refining algorithms. The outcomes in the moderate priority category have generally been the focus of multiple algorithm validation studies, but algorithms that identify acute or incident events with consistently high PPVs across multiple studies are generally not available. Improvement on some PPVs may or may not be possible, as certain codes may simply have poor validity.

Outcomes for which the evidence is sparse and relatively inconsistent, or for which algorithms performed poorly, might be considered of high priority for future validation studies. More evidence is generally needed to develop algorithms or confirm or refute their validity. These include suicide and suicide attempts; lymphoma; infection related to blood products, tissue grafts, or organ transplants; transfusion-associated sepsis; transfusion-associated ABO incompatibility reactions; hypersensitivity reactions other than anaphylaxis; erythema multiforme, Stevens–Johnson syndrome, and toxic epidermal necrolysis; PF and ILD; total knee replacement revision; and orthopedic implant removal. Finally, two outcomes have limited evidence that suggests algorithms may perform quite poorly. This might place them in a category that is high priority for future algorithm validation studies, or one in which all potential cases should be consistently confirmed because of poor performance of algorithms. The outcomes that fall into this category are anaphylaxis and acute respiratory failure. The only evidence on the latter outcome relates to ARDS, so there may be too little evidence available to conclude for certain that a well-performing algorithm could not be designed. Codes for anaphylaxis generally had low to moderate PPVs, with the code for anaphylactic shock having the only moderate PPV.

A final consideration for future research is whether validation studies have been conducted for algorithms using ICD-10 codes. This coding system must be adopted in the USA by October 2013, so research using administrative data will have to adapt accordingly. A small number of reports included studies that examined the validity of algorithms using ICD-10 codes. These included seizures, convulsions, or epilepsy (one study),[9] anaphylaxis (one study),[16] congestive heart failure (two studies), [3] and CVA or TIA (two studies).[2] Although mapping across coding systems and generalizing ICD-9 algorithm validity to ICD-10 codes might be reasonable for some outcomes, many algorithms will need to be reconfigured and revalidated when ICD-10 coding is more completely implemented in the USA. It is also possible that validity during the initial period of transition may be different because of the learning curve in implementing ICD-10.


The 19 systematic reviews conducted by Mini-Sentinel investigators provide a relatively comprehensive review of the literature on the validity of algorithms to identify the HOIs. The reviews identified many useful algorithms, gaps in evidence, and suggestions for future research. Little evidence was found to support algorithms using ICD-10 codes. This overview provides a high level summary of the findings of these reports and one author's suggestions on the relative prioritization of future algorithm validation studies. Ultimately, the selection of outcomes for validation studies in Mini-Sentinel will depend on the US FDA's prioritization of future surveillance activities and the willingness to accept the results of prior validation studies as generalizable to Mini-Sentinel data and the populations that are the subject of risk evaluations.


The author reports no conflicts of interest related to this work.


  • The availability of evidence on algorithms to identify the HOIs reviewed varied widely.
  • The need for future research on algorithm validity is prioritized based on available evidence.
  • It is possible that some outcomes for which algorithms have low PPVs will always need to be validated in administrative database studies to ensure validity of study findings.
  • Few studies examined the validity of algorithms using ICD-10 codes. Coding in the USA must transition to ICD-10 by October 2013.


Mini-Sentinel was funded by the FDA through the Department of Health and Human Services contract number FDA HHSF2232009100061. The project would not have been possible without the valuable input and work of many people. The author would particularly like to acknowledge the following individuals. Kevin Moores, PharmD, and Ronald Herman, PhD, of IDIS, and Jonathan Koffel of the University of Iowa libraries provided advice or worked on designing and conducting searches, managing citations, and producing the abstract review documents. Patrick Ryan, PhD, provided helpful insights into the integrated search strategy developed by OMOP, on which our searches were built. Carol Mita of the Harvard library conducted several Embase searches that provided insight on the potential value of this database. Swati Sharma provided essential project management assistance. Elizabeth Chrischilles, PhD, Sean Hennessey, PhD, Darren Toh, PhD, Kimberly Lane, MPH, and Judy Racoosin, MD, MPH, provided important input on many aspects of the project through their work with the Mini-Sentinel Protocol Core. Richard Platt, MD, MSc, provided valuable advice whenever it was requested. Finally, the project would not have been possible without the hard work and input of the HOI report authors and the reviewers who generously donated their time to improving the reports.