Is This “My” Patient? Development and Validation of a Predictive Model to Link Patients to Primary Care Providers


  • Presented in part at the annual meeting of the Society of General Internal Medicine, New Orleans, LA in May, 2005.

Address correspondence and requests for reprints to Dr. Atlas: General Medicine Division, Massachusetts General Hospital, 50 Staniford Street, Boston, MA 02114 (e-mail:


BACKGROUND: Evaluating the quality of care provided by individual primary care physicians (PCPs) may be limited by not knowing which patients the PCP feels personally responsible for.

OBJECTIVE: To develop and validate a model for linking patients to specific PCPs.

DESIGN: Retrospective convenience sample.

PARTICIPANTS: Eighteen PCPs from 10 practice sites within an academic adult primary care network.

MEASUREMENTS: Each PCP reviewed the records for all outpatients seen over the preceding 3 years (16,435 patients reviewed) and designated each patient as “My Patient” or “Not My Patient.” Using this reference standard, we developed an algorithm with logistic regression modeling to predict “My Patient” using development and validation subsets drawn from the same patient set. Quality of care was then assessed by “My Patient” or “Not My Patient” designation by analyzing cancer screening test rates.

RESULTS: Overall, PCPs designated 11,226 patients (68.3%, range per provider 15% to 93%) to be “My Patient.” The model accurately categorized patients in development and validation subsets (combined sensitivity 80.4%, specificity 93.7%, and positive predictive value 96.5%). To achieve positive predictive values of >90% for individual PCPs, the model excluded 19.6% of PCP “My Patients” (range 5.5% to 75.3%). Cancer screening rates were higher among model-predicted “My Patients.”

CONCLUSIONS: Nearly one-third of patients seen were considered “Not My Patient” by the PCP, although this proportion varied widely. We developed and validated a simple model to link specific patients and PCPs. Such efforts may help effectively target interventions to improve primary care quality.

Despite considerable resources devoted to health care in the United States, major deficiencies in quality persist.1–4 Much effort has focused on defining specific quality indicators to generate the “numerators” of quality measurements.5 However, less effort has been applied to defining the patient populations to which these quality measures should be applied, or the “denominator” of quality measurements.6–10 For a health insurer, a self-insured employer, or a health maintenance organization, the denominator represents their “covered” population. Defining the denominator of patients for an individual primary care provider (PCP), however, is not so straightforward.9 Over the course of a year, PCPs may see their “own” patients, patients cared for primarily by practice colleagues, and patients primarily cared for elsewhere (or not at all). Quality improvement efforts at the physician level must therefore account for the range of responsibility PCPs have for the spectrum of patients they encounter.

Assessing a physician's performance based on administratively generated patient lists that do not accurately reflect the patients he or she is actually managing undermines the validity of the quality measurement.11,12 Furthermore, to the extent that providers serve as critical agents of change with regard to health care delivery, patient interventions directed through providers are hampered by such misclassification if the provider does not have continuing contact with the patient.

To address the paucity of research on defining patient populations linked to specific providers, we sought to: (1) determine if PCPs could identify their “true” patients from the population of all patients they saw in the prior 3 years, (2) develop and validate a predictive model to link patients with specific PCPs, and (3) assess whether such linkage to a PCP was associated with higher quality of care.


Study Setting

The Massachusetts General Hospital (MGH) primary care network includes approximately 220 PCPs working in 15 clinically and demographically diverse practices. Over 161,000 unique patients were seen for 372,000 visits in these primary care practices in 2004. A convenience sample of 23 PCPs representing 12 of 15 practices was invited to participate, and 18 individuals from 10 practices agreed to participate and completed the study (78% response rate). The study was approved by our Institutional Review Board.

Protocol Development

A series of meetings was held to develop local physician consensus about what defines patient-provider linkage, to determine the degree of model specificity required for physicians to accept panels assigned without their specific review, and to identify the minimal clinical variable set necessary to enable providers to efficiently assess whether or not a patient was theirs.

Based upon these discussions, guidelines and examples were developed illustrating the link between patients and providers (Appendix A). Because the objective was to use electronic data such as patient age and visit history to model the providers' judgments, the guidelines did not provide specific criteria. A positive predictive value of at least 90% for any given provider was the consensus goal for model development. Finally, a web-based tool was developed to facilitate the providers' review of patient data. The tool displayed the patient's name, past and future outpatient appointment schedule, the last progress note by the physician reviewing the record, and a link to the patient's full electronic medical record (Appendix A).

Identification of Patients Seen by PCPs

For participating physicians (n=18), all patients with whom they had an outpatient clinical visit between April 1, 2000 and March 31, 2003 were identified using electronic billing records. Patients seen by the physician only in his/her capacity as supervisor of a resident physician were not included in the data retrieval. Among 18,358 patients seen by these physicians, we excluded 1,460 (7.9%) patients who were less than 18 years old, 221 (1.2%) who were found to have died on review of electronic Social Security records through November 30, 2003, and 242 (1.3%) with missing predictor variables. Of the remaining 16,435 patients, 165 saw 2 different participating physicians, leaving 16,270 unique patients.
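The cohort arithmetic in this paragraph can be checked with a short sketch. The function name `cohort_flow` and the dictionary keys are hypothetical labels chosen for illustration; only the counts come from the text.

```python
def cohort_flow(seen, exclusions, seen_by_two_physicians):
    """Return (patient-physician records remaining, unique patients).

    `exclusions` maps a reason to the number of patients removed for it.
    A patient seen by 2 participating physicians appears in 2 panels, so
    the unique count subtracts one record per such patient.
    """
    remaining = seen - sum(exclusions.values())
    unique = remaining - seen_by_two_physicians
    return remaining, unique


# Counts as reported in the text above.
exclusions = {
    "younger than 18 years": 1460,
    "died": 221,
    "missing predictor variables": 242,
}
remaining, unique = cohort_flow(18358, exclusions, seen_by_two_physicians=165)
print(remaining, unique)
```

The two values returned match the 16,435 reviewed records and 16,270 unique patients stated above.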

Study Procedure to Classify Patients Seen by PCPs

Patient panels for each provider were then created for physician review. For each patient, the physician assigned 1 of the following categorizations: “My Patient,” “Not My Patient,” or “My Patient with Reservations.” The last category was used to identify patients the provider would consider to be his/hers, but who were generally not compliant with care recommendations. It was determined in advance that a patient would be considered linked to the provider if the patient was categorized as either “My Patient” or “My Patient with Reservations.” Physicians reviewed their patient panels between July and October 2003 and were remunerated for their time based upon the size of their panel.

Variables Considered in the Model

Variables considered in the modeling process are shown in Table 1. Broadly these included variables from patient registration fields, physician characteristics, patient scheduling, and billing information. The PCP designee field used information from the hospital registration system including data provided by managed care plans or patients at initial registration and subsequent visits. Distance in miles was calculated from the patient's listed residence to the practice site. Physician characteristics were obtained from the hospital registrar.

Table 1. Variables Evaluated for Predictor Model

Data Source | Variable
Patient registration | Primary care physician (PCP) designee
 | Date of birth
 | Marital status
 | Health insurance status
 | Miles from residence to practice
 | State of residence
 | Years cared for within care system
Physician characteristics | Gender
 | Years since graduation
 | Years employed
 | Clinical sessions per week
 | Accepting new patients
 | Years since closed to new patients
Scheduling and billing data | Total number of PCP visits
 | Total number of other provider visits
 | Months since last PCP visit
 | Months since last practice visit
 | Future appointments with PCP
 | Future appointments with other provider
 | Total number of missed provider appointments*
 | Visit type†
 | Days waiting for provider appointments
 | PCP practice style‡

  • * Includes no shows, cancellations, and rescheduled appointments.

  • † Visit designated urgent care, follow-up, annual.

  • ‡ Number of visits where PCP is PCP designee divided by the total number of patients seen. See text for description of variable.

Quality Assessment Measures

Three cancer-screening tests were assessed using data from electronic billing and medical records supplemented by additional data from other Partners Healthcare affiliated providers and available insurance billing data. Breast cancer screening included any mammogram for women 52 to 69 years old in the prior 2 years. Cervical cancer screening included any Papanicolaou smear performed for women 21 to 64 years old in the prior 3 years. Colorectal cancer screening for patients 52 to 69 years old included any colonoscopy within 10 years, sigmoidoscopy or double contrast barium enema within 5 years, or home fecal occult blood testing within 1 year.
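The three screening definitions above can be written as eligibility and look-back rules. The following is a minimal sketch, not the study's code: the `ScreeningHistory` fields, the function names, and the choice to treat the age bounds and look-back windows as inclusive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScreeningHistory:
    """Years since the most recent test of each type; None means no record found."""
    mammogram: Optional[float] = None
    pap_smear: Optional[float] = None
    colonoscopy: Optional[float] = None
    sigmoidoscopy_or_barium_enema: Optional[float] = None
    fecal_occult_blood: Optional[float] = None


def _within(years_since: Optional[float], window_years: float) -> bool:
    """True if a test is on record and falls inside the look-back window."""
    return years_since is not None and years_since <= window_years


def screening_status(age: int, is_female: bool, history: ScreeningHistory) -> dict:
    """Return {measure: completed?} for each measure the patient is eligible for."""
    status = {}
    if is_female and 52 <= age <= 69:
        status["breast"] = _within(history.mammogram, 2)
    if is_female and 21 <= age <= 64:
        status["cervical"] = _within(history.pap_smear, 3)
    if 52 <= age <= 69:
        status["colorectal"] = (
            _within(history.colonoscopy, 10)
            or _within(history.sigmoidoscopy_or_barium_enema, 5)
            or _within(history.fecal_occult_blood, 1)
        )
    return status


example = screening_status(60, True, ScreeningHistory(mammogram=1.5, colonoscopy=8))
print(example)
```

Patients outside every eligibility window yield an empty dictionary, so they drop out of each measure's denominator.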


Patients were randomly split into 2 groups, 1 for model development (70% sample) and 1 for model validation (30% sample). Patient variables were compared among development and validation cohorts using χ2 and t-tests. The model's designation of “My Patient” versus “Not My Patient” was compared with the physician designation. Test characteristics including sensitivity, specificity, and positive predictive value were calculated for the development and validation cohorts and for the overall study population. For each provider, positive predictive values and percentages of “My Patients” excluded from the model or misclassified were calculated. Rates of completed cancer screening tests were compared for patients designated as “My Patient” versus “Not My Patient” using logistic regression models with repeated measures analyses to control for provider. All analyses were performed using SAS (SAS Institute, Cary, NC).
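The random 70%/30% split can be illustrated as follows. The study's analyses were performed in SAS; this Python sketch, including the function name `split_patients` and the fixed seed, is an assumption for illustration only.

```python
import random


def split_patients(patient_ids, development_fraction=0.70, seed=0):
    """Randomly partition patient IDs into development and validation subsets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_development = round(len(ids) * development_fraction)
    return ids[:n_development], ids[n_development:]


development, validation = split_patients(range(1000))
print(len(development), len(validation))
```

Every patient lands in exactly one subset, so model performance on the validation subset is not influenced by patients used to fit the model.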


Provider and Patient Characteristics

The 18 participating PCPs included 8 women (44%). Overall, they had graduated from medical school a mean of 19.1 years earlier (median 21, range 4 to 36), had been in practice at MGH for a mean of 10.1 years (median 9, range 1 to 24), saw patients 4 half-day clinical sessions per week (range 1 to 8.25), and had a mean patient panel size of 1029 (median 964, range 226 to 2372). The 5 physicians who did not review their patient panels were similar to participating physicians except that they worked fewer sessions (mean 2.3) and had smaller patient panels (mean 476).

The average patient age was 48.1 years (SD 16.7, range 18 to 99), 60.5% were women, 51% were married, and the last visit to the PCP was a median of 5.3 months previously. Eighty-one percent were non-Hispanic white, 6% were Hispanic, 4% African American, 2% Asian, and 7% other or unknown. The primary insurer was a commercial carrier for 70%, Medicare for 18%, Medicaid for 6%, and self-pay or free care for 6%.

Model Derivation

We developed a predictive model to categorize patients as “My Patient” or “Not My Patient” using the physician classification as the reference standard. Preliminary analyses identified the PCP designee field from the hospital registration system as a key predictor variable, but it was not specific enough to be used alone. A modified algorithm using the PCP designee field was combined with logistic regression models to assign patients to specific PCPs.

Stepwise logistic regression was used in the development subset to identify potential predictors of patient status. Model performance using predictor variables that were readily available from computer systems was compared with a more complex model using additional predictor variables requiring manual manipulation (e.g., physician characteristics available from credentialing data but only in paper format) or more complex calculations (e.g., distance from patient residence to practice site). The c-statistics of the 2 models were similar, so the simpler model was chosen. Probability cutoffs were selected to maximize the positive predictive value for each physician. The final model constructed using the development patient subset was then validated using the test subset. The combined subsets were then used to generate β coefficients (weights) with the previously chosen predictors and cutoffs.
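The text does not spell out how the probability cutoffs were chosen. The sketch below shows one possible rule consistent with the stated consensus goal of a positive predictive value of at least 90% per provider: take the lowest cutoff that meets the target, which keeps as many patients as possible on the list. The function names and the rule itself are assumptions, not the study's procedure.

```python
def ppv_at_cutoff(probabilities, labels, cutoff):
    """PPV when patients with predicted probability >= cutoff are called 'My Patient'.

    `labels` holds the physician's designation (1 = "My Patient", 0 = not).
    Returns None if no patient reaches the cutoff.
    """
    predicted_positive = [y for p, y in zip(probabilities, labels) if p >= cutoff]
    if not predicted_positive:
        return None
    return sum(predicted_positive) / len(predicted_positive)


def lowest_cutoff_meeting_ppv(probabilities, labels, target_ppv=0.90):
    """Return the smallest observed probability usable as a cutoff that meets the target."""
    for cutoff in sorted(set(probabilities)):
        ppv = ppv_at_cutoff(probabilities, labels, cutoff)
        if ppv is not None and ppv >= target_ppv:
            return cutoff
    return None  # no cutoff reaches the target for this physician


# Toy data for one hypothetical physician.
probs = [0.2, 0.4, 0.6, 0.8, 0.9]
designations = [0, 0, 1, 1, 1]
print(lowest_cutoff_meeting_ppv(probs, designations))
```

Raising the cutoff trades sensitivity for positive predictive value, which is why a physician-specific cutoff can exclude a large share of that physician's true patients.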

The final algorithm appears in Figure 1 and is detailed in Appendix B. Patients seen by a physician who was also listed as the PCP in the designee field of the registration system were entered into the logistic regression model. Patients for whom the listed PCP differed from the physician seeing the patient were considered not to be the physician's patient. Variables considered in the logistic regression model included patient age, time since the most recent visit, in-state residence, and PCP practice style (Table 2). The PCP practice style variable was defined based on the proportion of their “own” patients typically seen. Thus, physicians who were the listed provider for at least 70% of the patients they saw were categorized as following a solo model, whereas physicians who were the listed provider for less than 70% of their patients were designated as practicing in a collaborative practice model of care (Appendix B).
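The two-stage logic described above (a PCP designee check followed by a logistic regression score) can be sketched as below. This is an illustration under stated assumptions: the weights are invented placeholders, not the published coefficients from Appendix B; the single cutoff of 0.5 is arbitrary; and treating practice style as one covariate in one model is a simplification, since the study refers to logistic regression models in the plural.

```python
import math

SOLO_THRESHOLD = 0.70  # from the text: listed provider for at least 70% of patients seen


def practice_style(n_seen_as_designee: int, n_seen_total: int) -> str:
    """Classify practice style from the share of patients seen for whom
    the physician is the listed PCP designee."""
    share = n_seen_as_designee / n_seen_total
    return "solo" if share >= SOLO_THRESHOLD else "collaborative"


# PLACEHOLDER weights, invented for illustration only. They are NOT the
# study's coefficients, which are reported in Appendix B.
PLACEHOLDER_WEIGHTS = {
    "intercept": 0.0,
    "age_years": 0.01,
    "months_since_last_visit": -0.05,
    "in_state": 0.5,
    "solo_style": 0.5,
}


def predict_my_patient(physician_id, pcp_designee_id, age_years,
                       months_since_last_visit, in_state, style,
                       cutoff=0.5, weights=PLACEHOLDER_WEIGHTS):
    """Return True if the patient is predicted to be the physician's own."""
    # Stage 1: a different listed PCP means "Not My Patient" without scoring.
    if pcp_designee_id != physician_id:
        return False
    # Stage 2: logistic regression score compared against a probability cutoff.
    z = (weights["intercept"]
         + weights["age_years"] * age_years
         + weights["months_since_last_visit"] * months_since_last_visit
         + weights["in_state"] * (1 if in_state else 0)
         + weights["solo_style"] * (1 if style == "solo" else 0))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= cutoff


style = practice_style(n_seen_as_designee=800, n_seen_total=1000)
print(style, predict_my_patient("dr_a", "dr_a", 50, 1, True, style))
```

Stage 1 is what makes the approach depend on a reliable PCP designee field, which is the generalizability concern the authors raise for other practice settings.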

Figure 1. Schematic of algorithm with logistic regression models.

Table 2. Variables in Current Model and Alternative Independently Developed Model Using Same Data Set

Current Model | Alternative Model*
Primary care physician designee in registration field | Days since last visit with physician
Physician practice style | Physician practice style
Patient age | Waiting fraction†
Months since last visit with physician | Visit difference‡
Patient's residence listed as in state | Future difference§
 | Idle ratio∥

  • * Independently developed model using same development and testing data sets, but designed to use data that would be available in most electronic scheduling systems (Lasko et al.).9

  • † Total number of days waited for appointments with the given physician, divided by the total days waited for all physicians combined.

  • ‡ Total number of visits that a patient has made to the given physician minus the total to all other physicians combined.

  • § Total number of appointments scheduled for future visits with the given physician, minus the total for all other physicians combined.

  • ∥ Number of days since the last visit to the given physician, divided by the number of days since the first visit.

Because several model variables (e.g., PCP in designee field, PCP practice style) may not be widely generalizable to other practice settings, we also independently created a second, more complex model using data that would be available electronically in most scheduling and billing systems. This second model has been reported elsewhere and its variables are shown in Table 2.9

Provider Versus Model Classification as “My Patient”

Study PCPs categorized 11,226 of 16,435 (68.3%) patients seen in the preceding 3 years as “My Patient” (range 15% to 93%, Fig. 2). Among “My Patients,” PCPs considered 742 (4.5% of all patients reviewed) to be noncompliant (“My Patient with Reservations”). “My Patients” (including “My Patient with Reservations”) were significantly more likely to be older, non-Hispanic white men, and insured by a commercial carrier or Medicare than “Not My Patients” (data not shown).

Figure 2. Percentage of all patients seen by each primary care provider rated as “My Patient.” In general, physicians rating a low percentage as “My Patient” primarily worked as urgent care providers within their practice.

Provider classification was initially compared with model classification separately for development (11,535 patients) and validation (4,900 patients) cohorts, but results are presented for the overall cohort as the models created with development data performed well with the validation data (Table 3). The model's sensitivity was 80.4% (9,025 of 11,226), specificity was 93.7% (4,882 of 5,209), positive likelihood ratio was 12.8, negative likelihood ratio was 0.2, positive predictive value was 96.5% (9,025 of 9,352), and negative predictive value was 68.9% (4,882 of 7,083). C-statistics for models using development and validation data ranged from 0.79 to 0.82.

Table 3. Provider Designation Versus Model Prediction of “My Patient”*

Model Prediction | PCP Designation: “My Patient” | PCP Designation: “Not My Patient” | Total
“My Patient” | 9,025 | 327 | 9,352 (56.9%)
“Not My Patient” | 2,201 | 4,882 | 7,083 (43.1%)
Total | 11,226 (68.3%) | 5,209 (31.7%) | 16,435 (100%)

  • * Combined results from development and validation data sets.
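The test characteristics can be recomputed from the four cell counts in Table 3 using the standard 2×2 definitions. The function name `diagnostic_metrics` is a hypothetical label for this sketch; the counts are those in the table, with the PCP designation as the reference standard.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard test characteristics from a 2x2 table.

    tp: model "My Patient", PCP "My Patient"
    fp: model "My Patient", PCP "Not My Patient"
    fn: model "Not My Patient", PCP "My Patient"
    tn: model "Not My Patient", PCP "Not My Patient"
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "positive_lr": sensitivity / (1 - specificity),
        "negative_lr": (1 - sensitivity) / specificity,
    }


table3 = diagnostic_metrics(tp=9025, fp=327, fn=2201, tn=4882)
for name, value in table3.items():
    print(f"{name}: {value:.3f}")
```

The complement of the sensitivity is the share of physician-designated “My Patients” that the model leaves off the list in exchange for a high positive predictive value.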

A study goal was to provide PCPs with highly accurate patient lists, defined as a positive predictive value of ≥90% for any given PCP. The model met this goal, with the positive predictive value for each provider ranging from 90.1% to 100%. We compared the results of the current model to a simpler model that only included the PCP designee field from the MGH registration system. The simpler model had a higher sensitivity (90.3%, 10,139 of 11,226 patients), but a lower specificity (84.9%, 4,421 of 5,209 patients). Although the positive predictive value remained excellent (92.8%, 10,139 of 10,927 patients), the positive predictive value in this simple model was under 90% for 5 (28%) providers (overall provider range 81.7% to 99.6%).

We compared the model's results to a separately derived model that used generic billing and scheduling variables available in most electronic systems.9 Both models used similar development and testing sets, but combined results are presented as there were no differences among paired individuals from these cohorts (Table 4). Paired comparisons of the same patients in both models showed very similar results, although the alternative model was somewhat more accurate (McNemar's test, P<.0001).

Table 4. Current Model Versus Alternative Model Accuracy in Predicting “My Patient”*

Alternative Model† | Current Model: Correct | Current Model: Not Correct | Total
Correct | 13,125 (81.5%) | 1,458 (9.0%) | 14,583
Not correct | 671 (4.2%) | 857 (5.3%) | 1,528
Total | 13,796 | 2,315 | 16,111 (100%)

  • * Matched pairs of patients using combined results from development and validation data sets for independently developed predictor models.

  • † Independently developed model using same development and testing data sets, but designed to use data that would be available in most electronic scheduling systems.9
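McNemar's test depends only on the discordant pairs in Table 4: patients classified correctly by one model but not the other. The sketch below uses the uncorrected chi-square form of the statistic; whether the original analysis applied a continuity correction or an exact version is not stated, so that choice is an assumption.

```python
import math


def mcnemar_chi_square(b, c):
    """McNemar's chi-square statistic (no continuity correction) and p-value.

    b, c: counts of the two kinds of discordant pairs.
    For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    """
    statistic = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Discordant cells from Table 4:
# alternative model correct, current model not correct: 1,458
# alternative model not correct, current model correct: 671
statistic, p_value = mcnemar_chi_square(1458, 671)
print(f"chi-square = {statistic:.1f}, p < .0001: {p_value < 0.0001}")
```

Because the concordant cells do not enter the statistic, a small difference in overall accuracy can still be highly significant in a sample of this size.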

Model Classification and Cancer Screening Rates

Cancer screening rates were compared for patients predicted in the model to be “My Patient” or “Not My Patient.” For each cancer-screening test, patients predicted to be “My Patient” were more likely to have completed the test than patients predicted to be “Not My Patient,” and this was statistically significant for breast and colorectal cancer screening rates (Table 5).

Table 5. Cancer Screening Test Rates by Provider Designation and Model Prediction of “My Patient”

Variable | Population, N (%) | Provider Designation: “My Patient,” N (%) | Provider Designation: “Not My Patient,” N (%) | P Value* | Model Prediction: “My Patient,” N (%) | Model Prediction: “Not My Patient,” N (%) | P Value*
Breast cancer, Mammography† | 2,636 (82.7%) | 1,917 (85.9%) | 719 (74.4%) | .01 | 1,740 (87.9%) | 896 (72.7%) | <.001
Cervical cancer, Pap Smear‡ | 8,242 (82.7%) | 5,231 (83.8%) | 3,011 (80.7%) | .77 | 4,356 (84.9%) | 3,886 (80.2%) | .20
Colorectal cancer§ | 4,336 (47.6%) | 3,256 (48.7%) | 1,080 (44.1%) | .74 | 2,975 (50.3%) | 1,361 (41.6%) | .02

  • * P-values comparing actual and predicted patient loyalty designation to the provider using multiple regression modeling controlling for individual provider cluster with generalized estimating equation methods.

  • † Mammogram performed over the prior 2 years in eligible women aged 52 to 69 years old.

  • ‡ Papanicolaou (Pap) smear performed over the prior 3 years in eligible women aged 18 to 65 years old.

  • § Colorectal cancer screening test performed in all patients aged 52 to 69 years old: colonoscopy in the prior 10 years; flexible sigmoidoscopy or double contrast barium enema in the prior 5 years; or fecal occult blood test in the prior year.


Quality improvement efforts directed toward PCPs require accurate lists of the patients they actually care for. Moreover, these lists should not be limited to those patients with commercial health insurance. We developed and validated a predictive model using readily available electronic data to categorize patients as being linked to a specific provider or not. Patients linked to a PCP were significantly more likely to have completed appropriate cancer screening tests than patients who were not linked to a PCP.

A basic tenet of organized primary care is the assignment of 1 patient to 1 PCP. In reality, many patients receive episodic care from multiple different providers without necessarily establishing a strong relationship with any 1 provider.13,14 Because it can be difficult to easily identify the usual source of care for a patient within a network, available data may be associated with error rates that are sufficiently high to limit the usefulness of such patient lists. We hypothesized that greater attention to and understanding of patients with well-established links to a PCP would help facilitate the goal of making physicians an integral part of quality improvement—something that is often recommended but has proven elusive.12,15

For individual PCPs or practice networks that do not have their own data on patient panels, commercial insurers with managed care contracts often provide lists of their patients who have signed up with a PCP. There is little information on the accuracy of these lists. They may be inaccurate because patients may have signed up but never seen the provider, may have switched providers but not informed the insurer, or may have recently switched insurers. A major limitation of commercial insurer-derived patient lists is that they systematically exclude uninsured or government-insured patients. Indeed, focusing on patient populations from commercial insurer lists may have the unintended effect of exacerbating health care inequities in the general population because patients of lower socioeconomic status, elderly patients, and ethnic minorities are underrepresented on such lists.16

Initially we expected the PCP designee field in our hospital registration system would be accurate enough to identify a provider's patient panel. Although the quality of this variable is very good in our data system, it led to the incorrect designation of a sizeable fraction of patients (more than 10%) seen by 5 of 18 PCPs (28%). This field is in part derived from managed care insurer-provided designations of the PCP and does not reflect care utilization such as clinic visits with specific providers. Primary care networks in different settings wishing to address this issue of accurate PCP designation will likely need to derive care system-specific models modeled after the approach described here.

A key study finding is that physicians who participated in assigning patient status had a clear and accurate notion of who “their” patients were. This is reflected in the observation that the model had a high positive predictive value for each of the physicians. If physicians had different concepts about what constituted “their” patient, one would have expected that while the overall model may have a high predictive value, these proportions would have varied considerably among providers. These results reinforce prior qualitative work.17 Future research needs to determine whether patients' views of who their PCP is differ from physicians' views.

Patients identified by the model as being linked to a PCP were significantly more likely to have completed appropriate cancer screening tests than patients who could not be linked to a specific physician. As patient linkage to a PCP is likely related to continuity of care and patient-provider loyalty, our results showing better quality of care are congruous with prior studies.18,19 However, whether interventions designed to improve the linkage between patients and their PCPs would result in higher quality of primary care is unproven.

Our study has several important limitations. Although participating physicians reflected a range of experience and practice styles, they may not completely represent all providers who work in our large network. Additional work is needed to validate the model within our broader network. As previously mentioned, it is also possible that the simple and specific model developed for our network will not generalize to other practice settings. For this reason, we concurrently developed an independent model from the same development and test cohorts that used variables likely to be found in most electronic billing and scheduling systems.9 Comparing each model's ability to predict “My Patient” in the same patients showed similar accuracy. Future studies should examine whether such predictor variables can identify patients linked to providers in other primary care practice settings. Our model linked patients and providers using office visits. It is likely that such predictive models will need to be revised as clinical practice and data systems change over time. Finally, lower screening test rates for patients not linked to a PCP may reflect outside tests not available in our electronic systems.

In summary, our participating PCPs did not feel personally responsible for nearly one-third of patients recently seen, and the percentage varied widely among PCPs. We developed and validated a novel method to link patients with PCPs using the physician's designation as the reference standard. To achieve a high predictive value for each physician, 19.6% of our PCPs' patients were excluded. Patients who were linked to the provider were more likely to undergo appropriate cancer screening tests. Accurately identifying patients linked to a specific provider may help quality improvement efforts directed at PCPs.


The authors gratefully acknowledge the assistance of Nancy Wong, BS, with data management and statistical analyses. We also appreciate the assistance of the physicians who evaluated their patient panels, and Dr. Daniel Singer for his helpful comments on a prior version of this manuscript. Supported by institutional funding through the Massachusetts General Hospital Primary Care Operations Improvement program.