The PU‐PROM: A patient‐reported outcome measure for peptic ulcer disease

Abstract
Objective Patient-reported outcome measures (PROMs) were conceived to describe treatment-related effects from the patient's perspective; they have the potential to improve clinical research and to provide patients with accurate information. The aim of this study was therefore to develop a patient-centred peptic ulcer patient-reported outcome measure (PU-PROM) and to evaluate its reliability, validity, differential item functioning (DIF) and feasibility.
Method To develop a conceptual framework and item pool for the PU-PROM, we performed a literature review and consulted other measures created in China and other countries. We also interviewed 10 patients with peptic ulcers and consulted six key experts to ensure that all germane parameters were included. In the first item selection phase, classical test theory and item response theory were used to select and adjust items to shape the preliminary measure, which was completed by 130 patients and 50 controls. In the next phase, the measure was evaluated using the same methods with 492 patients and 124 controls. Finally, we used the same population as in the second item reselection to assess the reliability, validity, DIF and feasibility of the final measure.
Results The final peptic ulcer PRO measure comprised four domains (physiology, psychology, society and treatment), with 11 subdomains and 54 items. The Cronbach's α coefficient of each subdomain of the measure was >0.800. Confirmatory factor analysis indicated that the construct validity fulfilled expectations. Model fit indices, such as RMR, RMSEA, NFI, NNFI, CFI and IFI, showed acceptable fit. The measure showed a good response rate.
Conclusions The peptic ulcer PRO measure had good reliability, validity, DIF and feasibility, and can be used as a clinical research evaluation instrument for patients with peptic ulcers to assess their condition, with a focus on treatment. This measure may also be applied in other health areas, especially in clinical trials of new drugs, and may be helpful in clinical decision making.


| INTRODUCTION
Peptic ulcer is defined as an ulcer occurring in a region exposed to gastric acid and pepsin, and usually refers to a gastric or duodenal ulcer. Although peptic ulcers have a very low mortality rate, they can lead to complications such as haemorrhage and perforation if not treated in time. They can cause significant physical pain to patients and increase financial and service burdens. The lifetime prevalence of peptic ulcer disease in the general population has been estimated at about 5%-10%, and the incidence at 0.1%-0.3% per year. [1][2][3] Although peptic ulcers are a benign, non-fatal disease, they do result in lost productivity and associated economic loss.
Distress caused by peptic ulcers includes psychological, social and behavioural problems, which may interfere with a patient's ability to fully participate in their health care, and to manage their illness and its consequences. Patient-reported outcomes (PROs) are a central aspect of the widely accepted bio-psycho-social medical model. 7 A PRO is a measurement of any aspect of a patient's health status that originates directly from the patient (i.e., without the interpretation of the patient's responses by a physician or other person). 8 A PRO can directly reflect the influence of disease on a patient, help in the treatment of that patient and help to establish good communication between the patient and medical staff in determining treatment efficiency. In addition, a PRO can help to explain clinical outcomes and the treatment decisions made.
In recent years, many effective and reliable measures for the digestive system have been developed internationally and applied in practice. Commonly used measures include the Sickness Impact Profile (SIP), 9 the Nottingham Health Profile (NHP) 10 and the Quality of Well-being (QWB). 11 The Medical Outcomes Study Short Form-36 (SF-36) is the most well-known and widely accepted measure. 12 Specific health-related quality of life (HRQoL) measures commonly used include the Quality of life in peptic diseases (QPD) questionnaire, 13 the Peptic ulcer diseases questionnaire (PUDQ), the Ulcer esophagitis subject symptom (UESS) and the Quality of life in duodenal ulcer patients (QLDUP). 14 Many studies of the demographic and clinical characteristics and the health-related quality of life of disease-specific patients have been based on these measures. However, these scales all have a specific focus, and cannot fully reflect the actual situation of patients with peptic ulcers. Therefore, in this study, we aimed to develop a PRO measure for peptic ulcer patients (PU-PROM) that (I) is developed from the perspective of patients and can be widely applied in evaluations of clinical curative effect; (II) is established across four domains (physical, psychological, social and treatment) to comprehensively reflect health status and quality of life in patients with peptic ulcers; (III) can be specifically used to report the clinical outcomes of patients with peptic ulcers; and (IV) comprises items that are easy to understand and answer (responses on a 5-point Likert scale).

| Ethics statement
The study protocol and the PU-PROM were reviewed and approved by the Medical Ethics Committee of Shanxi Medical University.

| Study population and design
Based on the principles of the United States FDA, we performed a comprehensive review of the literature and related measures, conducted semi-structured interviews with patients with peptic ulcers and consulted relevant experts. We extracted and integrated information from these sources to form a theoretical framework and an initial item pool. The interview participants were 10 patients with peptic ulcers whose sex and age distribution was consistent with that of the disease: six men and four women aged 40.01±9.81 years. The six experts consulted included three chief peptic ulcer physicians, one psychologist, one sociologist and one ethics expert. Based on the patient interviews and expert consultation, we refined the wording of items in the initial item pool.
For the first item screening stage, 200 participants were sampled from eight hospitals at different levels in Shanxi Province, China. This included 150 patients with peptic ulcers and 50 controls. Completed questionnaires were examined using classical test theory (CTT) and item response theory (IRT), to select and adjust the item pool, and a preliminary measure was developed. This was followed by a formal investigation, with 550 patients with peptic ulcers and 150 controls, using the same method to reduce the items and form the final measure. Finally, the reliability, validity, differential item functioning (DIF) and feasibility of the measure were verified.

| Development of the PU-PROM
The PU-PROM was developed in three phases: (i) conceptual framework construction and initial item generation; (ii) formation

| Step 1: Item generation
Identifying the conceptual framework and initial item content
We searched the literature on PRO measures to build the conceptual framework of the PU-PROM. The theoretical framework comprised four domains and 12 subdomains: physiology (subdomains: physical symptoms, independence and physical status); psychology (subdomains: work stress, anxiety, depression and fear); society (subdomains: social support and social adaptation); and treatment (subdomains: compliance, degree of satisfaction and availability). The patients were then interviewed to understand their main symptoms, the psychological and social impact of the disease, and their evaluation of treatment satisfaction; all of this information was collected and collated to form the item pool. Next, the item pool was revised based on discussions with the six experts and 10 patients. They recommended that items that were ambiguously worded or difficult to understand be deleted, and that some items be added. Subsequently, an initial version of the measure was developed, with four domains, 11 subdomains (independence was excluded from the physiology domain) and 64 items.

Sampling survey
We selected a sample comprising 150 peptic ulcer patients and 50 controls from the eight participating hospitals in Shanxi Province.
Patients who were diagnosed with a definite peptic ulcer, who were fully competent and who volunteered to participate were included in this study. Patients were excluded if they had deficiencies in language or cognitive abilities that meant they could not understand or complete the questionnaire; mental illness; or disturbance of consciousness. The controls did not have peptic ulcers, malignant tumours or mental illness, and had a similar age distribution to the patient group.
In the process of evaluating the patients' completed questionnaires, 20 were invalid and 130 were valid, giving a valid response rate of 87%. The valid response rate for the controls was 100%.
Responses to all items were on a 5-point Likert scale and were recorded as 0-4 points. The 5-point scale form was most commonly chosen as the easiest to complete, and item omission was least frequent with this form. Nagata suggests that the 5-point scale is most useful for measuring health status. 15 The measure contained positive items and negative items. Positive items were scored as the original score plus one and negative items were scored as five minus the original score. Missing data were tested by Little's Missing Completely at Random test, and the P-value was <.001. 16 Items considered as missing at random were imputed based on the expectation-maximization algorithm. 16
Statistical methods for item selection
The item reduction for the preliminary measure was based on CTT and IRT. CTT includes discrete trend, factor analysis, correlation coefficients, Cronbach's α if an item is deleted, retest reliability and differentiation degree analysis. These methods were combined with specialized knowledge to assess the items. Items selected by at least five methods were kept, even if other methods suggested that the item be removed. The final version of the preliminary measure comprised four domains, 11 subdomains and 54 items (10 items were deleted).
FIGURE 1 Developmental process flow chart
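The recording and scoring rule described in this section (responses recorded as 0-4, positive items scored as the original score plus one, negative items as five minus the original score) can be sketched as follows. This is a minimal illustration, not the authors' scoring software; the function names `score_item` and `score_responses` and the example responses are hypothetical.

```python
def score_item(raw, positive):
    """Convert one raw 0-4 Likert response into a 1-5 item score.

    Positive items: original score plus one.
    Negative items: five minus the original score.
    """
    if raw not in range(5):
        raise ValueError(f"raw response must be an integer from 0 to 4, got {raw!r}")
    return raw + 1 if positive else 5 - raw


def score_responses(raw_responses, positive_flags):
    """Score a full response vector; positive_flags marks each item's direction."""
    if len(raw_responses) != len(positive_flags):
        raise ValueError("one direction flag is needed per response")
    return [score_item(r, p) for r, p in zip(raw_responses, positive_flags)]


# Hypothetical respondent: three items, the second one negatively worded.
example_scores = score_responses([0, 4, 2], [True, False, True])
```

Under this rule every scored item lies between 1 and 5, and a raw response of 4 on a negative item maps to the lowest score.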

CTT
Discrete trend: The scores of all items (derived from the 5-point Likert scale) could be treated as approximately normally distributed. A low discrete trend indicates that respondents tend to select the same answer, so the item has poor discriminating ability. We used the standard deviation (SD) to measure the dispersion of items. Items with a low SD (<1.0) were deleted.
Correlation coefficient: Items were screened for representativeness and independence, that is, for whether the item served the purpose of the PROM. An item that showed a low correlation coefficient (<0.6) with its subdomain was deleted.
Cronbach's alpha if item was deleted (CAID): This method used internal consistency to choose items and ensure homogeneity. 17 The internal consistency of items was evaluated by calculating the corrected item-total correlation (CITC) and the CAID value (Cronbach's α). A CITC value above 0.45 indicated that the item contributed strongly to the measured construct. The CAID values identified which items contributed strongly to the reliability of the PROM. A large increase in Cronbach's α after an item was removed indicated that the item weakened internal consistency and should be deleted.
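The two internal-consistency statistics named above can be computed with the standard formulas, as in the sketch below. It assumes complete data arranged as one list of scores per item and at least three items per subdomain; the function names and the toy data are hypothetical, and the 0.45 CITC cut-off is the one stated in the text.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of scores per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def citc_and_caid(items):
    """Return one (CITC, CAID) pair per item.

    CITC: correlation of the item with the total of the remaining items.
    CAID: Cronbach's alpha of the subdomain with that item removed.
    """
    results = []
    for i, col in enumerate(items):
        rest = items[:i] + items[i + 1:]
        rest_total = [sum(row) for row in zip(*rest)]
        results.append((pearson(col, rest_total), cronbach_alpha(rest)))
    return results


# Toy subdomain: three consistent items and one noisy item (hypothetical data).
toy_items = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [3, 1, 4, 1, 5]]
toy_results = citc_and_caid(toy_items)
```

In the toy data the fourth item has a CITC below 0.45 and its removal raises α, which is the pattern the text treats as grounds for deletion.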
Retest reliability: Retest reliability reflects stability and consistency across time. We selected 4 days as the retest interval and calculated the correlation coefficient of each item score across the two surveys. Items with low correlation coefficients (<0.6) were deleted.
Differentiation degree analysis: An item that could not distinguish between different groups of respondents should be deleted. We compared each item score for the patient and control groups by performing independent two-sample t-tests (α=0.05). Items with no statistically significant difference were deleted.
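The patient-versus-control comparison for a single item can be sketched as below. This is an illustration under stated assumptions, not the authors' analysis: it uses the pooled-variance (equal-variance) form of the independent two-sample t statistic, and it approximates the two-sided P-value with the standard normal distribution, which is reasonable only for large samples. The function name `pooled_t_test` and the example scores are hypothetical.

```python
from statistics import NormalDist, mean, variance


def pooled_t_test(group_a, group_b):
    """Pooled-variance two-sample t statistic with an approximate P-value.

    Returns (t, degrees_of_freedom, approximate_two_sided_p). The P-value uses
    a normal approximation to the t distribution (large-sample assumption).
    """
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / df
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    t = (mean(group_a) - mean(group_b)) / standard_error
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


# Hypothetical item scores for 40 patients and 40 controls.
patients = [4, 5, 4, 5] * 10
controls = [2, 1, 2, 1] * 10
t_stat, dof, p_value = pooled_t_test(patients, controls)
keep_item = p_value < 0.05  # the item distinguishes the two groups
```

For small groups an exact t distribution should replace the normal approximation.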

IRT
Item response theory is a nonlinear model used for item selection and test construction. It establishes a functional relationship between a participant's response to an item and their ability. This relationship is reflected by an item characteristic curve (ICC). Items were assessed using Multilog 7.03 with a graded response model. Each item's discrimination (α) and difficulty (b) parameters were estimated. In general, items with a discrimination value of <0.4 should be deleted. Difficulty was divided into four grades (b1, b2, b3 and b4) ranging from −3 to 3. Items outside this range should be considered for deletion.
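For a 5-point item, the graded response model turns one discrimination parameter and four difficulty thresholds into five category probabilities. The sketch below assumes the logistic form of Samejima's model without the 1.7 scaling constant; it does not reproduce Multilog's estimation, and the parameter values, the function names and the reading of the screening rule (discrimination below 0.4, or any threshold outside −3 to 3) are illustrative assumptions.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a logistic graded response model.

    theta: respondent ability; a: item discrimination;
    thresholds: ordered difficulties b1..b4 for a 5-category item.
    """
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1 (lowest category or higher) and 0 (above the highest).
    cumulative = [1.0]
    cumulative += [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


def flag_item(a, thresholds, min_a=0.4, b_range=(-3.0, 3.0)):
    """True if the item should be considered for deletion."""
    low_discrimination = a < min_a
    out_of_range = any(b < b_range[0] or b > b_range[1] for b in thresholds)
    return low_discrimination or out_of_range


# Hypothetical item parameters, evaluated for an average respondent (theta = 0).
probs = grm_category_probs(0.0, 1.5, [-2.0, -1.0, 1.0, 2.0])
```

Plotting each category probability against theta gives the item characteristic curves that the text refers to.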

Second sampling survey
To verify the measure, we recruited 550 patients and 150 controls from the same eight hospitals, but only 492 and 124, respectively, were available to participate in the study. The response rate met the requirements, and the number of participants satisfied Nunnally's proposal. 18 We used the same methods as used for the preliminary measure to revise the items.
Classical test theory and item response theory were used for item reselection. Combined with professional knowledge, we used at least five methods to evaluate the items. Three item evaluation methods were discarded. The results indicated that all 54 items should be kept to form the final measure.

| Step 3: Validation of the measure
Finally, we used the data obtained from these 492 patients and 124 control participants to examine the reliability, validity, DIF and feasibility of the final measure.

Reliability
Reliability refers to the consistency of the test results; the higher the measured value of homogeneity, the better the reliability. 19 Cronbach's α coefficient and retest reliability are frequently used in reliability analyses, and Cronbach's α coefficient is the most commonly used method. In this study, we calculated Cronbach's α coefficient for every subdomain. Generally, the α value should be more than 0.7. Retest reliability reflects stability and homogeneity across time, and it is generally believed that the correlation coefficient should be more than 0.6.

Validity
Validity analysis evaluates whether a questionnaire measures what it is intended to measure. This involves content validity, construct validity and discriminant validity. Content validity reflects the degree to which the selected items represent the expected content. 19 Construct validity (or structural validity) examines whether the multi-index measurement has a professionally ideal structure, testing the structure from clinical and common sense perspectives. A measure with good construct validity can capture the true latent trait of subjects during measurement. We used confirmatory factor analysis (CFA) to build a measurement model between each item and the subdomain that included it. We used LISREL 8.70 software (Scientific Software International, Inc., 7383 North Lincoln Avenue, Suite 100, Lincolnwood, IL 60712-1704) for the CFA. 20 In addition, the fit of the model for every domain was evaluated using multiple indices; commonly used fit indices include GFI, RMR, NFI, NNFI, IFI and CFI. 21 Discriminant validity reflects the ability of the measure to detect differences across populations and over time, and thus the different traits of the selected subjects. Discriminant validity was assessed by comparing the mean scores of the patients and the controls to determine whether each subdomain correctly distinguished the two groups. We used a simple independent two-sample t-test to compare patients and controls. When the P-value was <.05, we considered the difference to be statistically significant, and the measure to have a good degree of differentiation.

DIF
As a test result has personal, social and political ramifications, it should be reliable, valid and fair. 22 A plethora of research on DIF has been conducted to investigate whether a test item is fair among members of different subgroups, such as males and females or majority and minority groups. 23,24 DIF analysis verified the quality of the questionnaire and ensured the validity and fairness of the measure. DIF is generally divided into two types: uniform and non-uniform. Uniform DIF is present when an item differs across groups in its difficulty parameters, while non-uniform DIF is present when an item differs across groups in its discrimination parameters. 25 The mean and covariance structure (MACS) model identifies non-uniform DIF and uniform DIF using unidimensional multistage scoring. If no DIF was found in the items of the subdomains, we confirmed this by comparing nested models: we compared the chi-square difference between the "measurement equivalence model" and the "baseline model," and if the difference was not significant, the items of the subdomains showed no DIF. If DIF was found in the items of the subdomains, we compared the chi-square difference between the "measurement equivalence model" and the "partial measurement equivalence model"; a statistically significant difference indicated that the items showed DIF.

Feasibility
Feasibility reflects the degree of acceptability of a measure. It is characterized by acceptance rate, response rate and completion time. In general, the recovery rate of the questionnaires should be more than 85%, and the response rate should also be more than 85%. In addition, the time each person takes to answer should be kept within 15 minutes.
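The feasibility thresholds above amount to a simple check, sketched here. The text does not define the rates formally, so treating each rate as a count divided by the number of questionnaires distributed is an assumption, as are the function names. The 130 valid questionnaires out of 150 come from the first patient survey described earlier; the recovery rate and completion time in the usage lines are hypothetical.

```python
def rate(count, distributed):
    """Share of distributed questionnaires, as a fraction between 0 and 1."""
    if distributed <= 0:
        raise ValueError("distributed must be positive")
    return count / distributed


def meets_feasibility(recovery_rate, response_rate, mean_minutes,
                      min_rate=0.85, max_minutes=15.0):
    """Both rates must exceed 85% and the completion time must be within 15 minutes."""
    return (recovery_rate > min_rate
            and response_rate > min_rate
            and mean_minutes <= max_minutes)


# First patient survey: 130 valid questionnaires out of 150 distributed.
valid_rate = rate(130, 150)

# Hypothetical recovery rate (0.90) and mean completion time (12 minutes).
feasible = meets_feasibility(0.90, valid_rate, 12.0)
```

The computed rate of about 0.867 rounds to the 87% valid response rate reported for the first patient sample.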

| Item generation
Through an extensive literature review, expert consultation and patient interviews, we constructed a conceptual framework consisting of four domains, 12 subdomains and a pool of 77 items. Next, we selected

| Item selection
The two-step item selection process was based on CTT and IRT. This iterative process resulted in a final version comprising four domains, 11 subdomains and 54 items. For the first phase, statistical results for the items are given in Table 3. Ten items (PHD15, PHD16, PSD14, PSD15, PSD16, PSD20, SOD4, SOD8, SOD9, THA8) were deleted based on the criteria described in the Methods. To ensure the reliability and validity of each item, we used the same methods to re-select items and considered their practical significance based on the experts' opinions for each item. All items remained after this process. Figure 2 shows the matrix plot of the ICCs for each item. Ideally, the first ICC curve should decrease monotonically, the last curve should increase monotonically and the other curves should follow a normal distribution. The closer the ICC distribution is to this ideal state, the more information the item contains, and vice versa. As seen in Figure 2, most items were satisfactory.

| Validation of the measure
The reliability, validity, DIF and feasibility of the 54 items were assessed, and the results are presented below.

| Reliability
Cronbach's α coefficient is an important indicator of reliability. In general, its value should be more than 0.70. For our PU-PROM, the Cronbach's α of each subdomain was more than 0.800 and ranged from 0.817 to 0.907 for the four domains, indicating that the measure was reliable (Table 4). In addition, we conducted a repeated survey of 50 patients, and the correlation coefficients of each item were >0.60, which showed that the measure has high retest reliability.

| Validity
The measure was also found to have good content validity, as it allows direct communication with patients 26,27 and they thought the PU-PROM was easy to understand and respond to. In the item building and modifying phase, experts also agreed that the measure was reasonable and comprehensive in content. We conducted CFA for the 54 items to investigate the factor structure of the measure. We also found that the indices of fit (GFI, RMR, NFI, NNFI, CFI, IFI) met the expected structure. On the whole, GFI, NFI, NNFI, CFI and IFI were all more than 0.90, and RMR was <0.09. The results are shown in Tables 5 and 6. The measure was able to distinguish between patients with peptic ulcers and controls. Our analysis of the different average scores using the independent two-sample t-test showed that all P-values were <.05, indicating that the measure had a good degree of differentiation (Table 7). As controls did not receive treatment and could not answer items in the treatment domain, no comparison was made for the SAT (satisfaction) subdomain.

| DIF
This study used a mean and covariance structure (MACS) model based on sex for the DIF analysis, through which we examined whether there were differences between men and women. We performed DIF analysis for the 11 subdomains by sex (

| Feasibility
The acceptance rate and completion rate of the measure exceeded 85%, and the average completion time was within 15 minutes, indicating that the measure was feasible.
Therefore, the final PU-PROM comprises four domains, 11 subdomains and 54 items (Appendix 3), and the theoretical framework is shown in Table 9.

| DISCUSSION
Peptic ulcers have various causes and mainly occur in the gastric and duodenal mucosa. A peptic ulcer is a chronic disease characterized by high incidence, low mortality, spontaneous remission and periodic recurrence. The main peptic ulcer complications are haemorrhage and perforation. The disease leads to a decrease in productivity and brings significant loss to society, and the treatment cost contributes to social and economic burdens. Therefore, a PRO measure for peptic ulcers may help to improve patients' quality of life and can be used to evaluate clinical curative effect.
Patients are increasingly involved in evaluations of health-care quality. [28][29][30] Internationally, many HRQoL measures have been developed and applied in chronic diseases of the digestive system.
However, existing measures have some deficiencies. For example, the NHP focuses on a more serious disability level and is not sensitive to relatively mild condition changes; the SIP has wide coverage, but is very long; and the SF-36 is widely used, but does not identify specific areas of the disease. The QPD is restricted to functional dys-
In developing the measure, we used cognitive investigation to form the conceptual framework, CTT and IRT to select items, and CFA to validate items. The United States FDA has emphasized the importance of clinical outcomes and provided guidance for the establishment of PRO measures. 31 This study was based on FDA guidance, and our methods of investigation strictly followed the established PRO measure production process. We used expert and patient opinions and suggestions to build an initial measure. A sample investi-
In contrast to previous assessments of measures, we directed attention to the items rather than the structure validity of the measure. Items for the PU-PROM were based on a review of the literature and other related questionnaires, face-to-face interviews with patients and discussions with expert professionals. The content validity index further strengthened the content validity of our preliminary measure, as in the development process, we selected items by this index. The role of content validity index in the process of preparation and evaluation cannot be ignored, and can meet the quality requirements of the measure before implementation.
In the development stage, we used CTT and IRT to select the items, striving for items that had strong representativeness, independence and high sensitivity. CTT is easier to understand and is more commonly used than IRT. However, it neglects the estima-
TABLE 9 Construction frame of final PU-PROM
too small, meaning that in the process of DIF analysis, a "baseline model" could not be implemented. Additional studies should further expand the scope as well as the number of participants. We should also compare our PU-PROM with a criterion measure and assess the measure in terms of the range, reliability, validity and applicability.

| CONCLUSIONS
Our study makes important contributions to the treatment and outcomes of patients with peptic ulcers and develops a specific measurement instrument for this group. We found strong evidence for the reliability and validity of our PU-PROM. However, we do not consider that our PU-PROM is able to replace other related questionnaires. Therefore, we need to further improve the measure to ensure that it is suitable for a wider range of people. We could also extend the measure to other uses, such as assessing patients' health conditions, evaluating clinical effects, new drug development, health service deployment and clinical research.