Secondary analysis of national survey datasets

Authors


Sunjoo Boo, Department of Physiological Nursing, University of California San Francisco, San Francisco, CA 94143-0610, USA. Email: sunjoo.boo@gmail.com

Abstract

Aim:  This paper describes the methodological issues associated with secondary analysis of large national survey datasets.

Methods:  Issues related to survey sampling, data collection, and non-response and missing data are discussed in terms of methodological validity and reliability.

Results:  Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis.

Conclusions:  Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses.

INTRODUCTION

Advances in digital technology have made it possible to access large datasets easily and to analyze them with personal computers. As a result, secondary data analysis has become commonly recognized as a legitimate form of scientific inquiry in nursing research (Kneipp & Yarandi, 2002). Simply defined, secondary data analysis is the analysis of existing datasets (Clarke & Cossette, 2000). Particularly at a time of limited funding for non-experimental studies, secondary data analysis offers an alternative means of conducting studies at lower cost. It is an expedient and cost-effective way of producing knowledge, especially when it involves a nationally representative database. Large national surveys on behaviors, health, or health care commonly contain multiple variables important to nursing research and practice (Bibb & Sandra, 2007; Bierman & Bubolz, 2003; Kneipp & Yarandi, 2002). Analysis of these representative data not only provides reliable national estimates but also offers the opportunity to test or generate nursing theories based on a large representative sample.

While the benefits of reanalysis of national survey datasets are considerable, several limitations need to be considered. Because researchers using existing survey datasets were not involved in the study or sampling design, data collection, or the data entry process, it is important to fully understand the accuracy of the datasets before delving into the database. Without a full understanding of the limitations or complexities of the analysis of large national survey datasets, the process will be challenging and the results may be unreliable. Researchers planning to analyze national survey datasets must recognize unique issues pertinent to survey data quality at the beginning so that the potential for introducing threats to reliability and validity can be addressed and their impact on the results considered.

Several previous articles have covered a wide range of topics concerning general theoretical (Clarke & Cossette, 2000), methodological (Castle, 2003; Clarke & Cossette, 2000; Doolan & Froelicher, 2009; Magee, Lee, Giuliano, & Munro, 2006; Pollack, 1999), and practical (Clarke & Cossette, 2000) considerations, as well as statistical issues involved in analyzing existing datasets (Kneipp & Yarandi, 2002). This paper describes the methodological issues specific to the secondary analysis of large national survey datasets.

MAKING GOOD USE OF EXISTING SURVEY DATA

Secondary data analysis can be carried out rather quickly because there is ready access to datasets. In addition, most research projects that consist entirely of secondary data analysis raise few ethical concerns; thus, they are usually eligible for expedited Institutional Review Board (IRB) review, and the review process is often speedy. Where good datasets are available, researchers can save time and money by using them to answer their own new research questions. Currently, many large survey databases sponsored by the national government are easily accessible to researchers. Some are freely available online for anyone to use, while others require permission for use. Examples of large national surveys are the National Health and Nutrition Survey, the National Survey on Family and Economic Conditions, and the National Health Interview Survey. These survey datasets provide the benefits of nationally representative samples, which often are difficult to obtain directly. When national survey datasets contain many health-related variables, they can be reanalyzed to answer a wide variety of nursing research questions, cost-effectively yielding reliable estimates of public health (Bibb & Sandra, 2007; Bierman & Bubolz, 2003). Analysis of national survey datasets can be carried out to describe phenomena, to test nursing theories, to generate knowledge for nursing practice, or to understand the present, the past, or trends over time.

Although secondary data analysis serves as an economical alternative to an expensive and time-consuming data collection process, the research process remains basically the same as in any other research study. Secondary data analysis requires the researcher to begin with a sound conceptualization of the research question to be studied, including a conceptual or theoretical framework (Magee et al., 2006). The framework serves to delineate the inclusion of variables and to define how the variables are conceptualized. Formulating the research question and theoretical framework allows researchers to narrow the range of possible datasets (Magee et al., 2006).

Once the researcher has defined the research questions, he or she needs to identify the most appropriate dataset available. Working with an existing dataset requires the researcher to work within that dataset. Therefore, it is necessary to achieve the best possible fit between the proposed research question and the available datasets (Castle, 2003; Doolan & Froelicher, 2009; Magee et al., 2006). Identifying and obtaining a proper data source may require substantial time and effort, but it increases the probability that the research will yield valuable results. Before evaluating datasets, the researcher should specify the population and variables of interest as well as the ideal definitions and measures of the variables. When selecting from existing datasets, researchers should consider only high-quality datasets with a sufficient level of accuracy and detail for the proposed research.

Datasets identified may require refinement or modifications for the research questions or scope of the study (Doolan & Froelicher, 2009). The process of secondary data analysis is often an iterative process, rather than a linear one. The process includes research question development, identification of potential datasets, and modification or refinement of the research question depending on the data available in order to balance feasibility and limitations (Doolan & Froelicher, 2009).

In designing a secondary analysis of national surveys and in choosing the most appropriate data source for answering the research question, researchers face a number of potential problems. Some are inherent in any study using existing datasets, while others are specific to the use of large survey datasets. Issues related to survey sampling, data collection, and non-response and missing data are discussed below in terms of methodological validity and reliability.

SAMPLING CONSIDERATIONS

Sampling design

When designing a study using a large national survey dataset, it is very important to understand the sampling design used to collect the data, because the specific sampling design affects the accuracy of the statistical results and their generalizability. In large national surveys, simple random sampling is rarely used; not only is it extraordinarily expensive, but it is also cumbersome and tedious. Instead, stratified random or multistage sampling methods are commonly used as efficient strategies to provide a nationally representative sample. The multistage sampling strategy combines probability sampling approaches in various ways, usually clustering, stratification, or oversampling (Trochim, 2001). It may begin by selecting representative geographic units (clustering). Strata within the selected geographic units are then identified for a random sample.
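As a rough sketch of how a two-stage design works (clusters first, then individuals within selected clusters), the following Python snippet is illustrative only: the cluster counts, sizes, and equal-probability selection are assumptions, and real surveys often select clusters with probability proportional to size.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical population: 200 geographic clusters, each with 500 residents.
n_clusters, cluster_size = 200, 500

# Stage 1: select 20 clusters at random (real surveys often select clusters
# with probability proportional to size rather than with equal probability).
sampled_clusters = rng.choice(n_clusters, size=20, replace=False)

# Stage 2: select 50 residents within each sampled cluster.
sample = [(cluster, resident)
          for cluster in sampled_clusters
          for resident in rng.choice(cluster_size, size=50, replace=False)]

print(len(sample))  # 20 clusters x 50 residents = 1000 sampled individuals
```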

When any sampling method other than simple random sampling is used, for example multistage sampling, the concept of weights becomes important in analyzing the data. There are two categories of weights: sample weights and variance estimation weights (Kneipp & Yarandi, 2002). Sample weights reflect the probability of being sampled based on sample size or design, as well as adjustments for non-response (Kneipp & Yarandi, 2002). As a simple example, if a sample of 100 individuals were drawn from a population of 10 000, the sample weight would be 100 (sample weight = 1/sampling fraction = 10 000/100), because each individual in the sample represents 100 individuals in the population. Variance estimation weights, on the other hand, are needed to adjust variance estimates for the effects of the design (Kneipp & Yarandi, 2002). When multistage cluster sampling is used, individuals within the same geographic area may have more characteristics in common with each other than individuals selected at random from the population, thus affecting the variance of the survey estimates.
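The weight calculation above can be written out as a brief sketch; the population and sample sizes mirror the example in the text, and the count of 37 respondents with a condition is a made-up figure for illustration.

```python
# Hypothetical figures mirroring the example in the text: a sample of 100
# drawn from a population of 10 000.
population_size = 10_000
sample_size = 100

sampling_fraction = sample_size / population_size  # 0.01
sample_weight = 1 / sampling_fraction              # 100.0

# Applying the weight: if 37 respondents report a given condition, the
# weighted total estimates how many people in the population are affected.
respondents_with_condition = 37
estimated_population_total = respondents_with_condition * sample_weight
print(estimated_population_total)  # 3700.0
```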

If researchers are interested only in relationships among variables within the sample itself, weights are not necessary (Lohr, 2009). However, if researchers want to make estimates for the population from the sample, weights are absolutely necessary. In this case, the data from each sampled case need to be multiplied by the appropriate weight to obtain unbiased estimates (Kneipp & Yarandi, 2002; Lohr, 2009). Ignoring the sampling design in the analysis results in underestimated standard errors and narrowed confidence intervals, frequently leading to results that seem to be statistically significant when, in fact, they are not (type I error). To adjust for survey weights, specific survey software such as STATA, SAS, SUDAAN, or Epi Info is needed. In the case of SPSS, the Complex Samples module, an optional add-on that can be purchased separately, incorporates the sample design into the survey analysis, accounting for weights (Centers for Disease Control and Prevention, 2009). A study by Boo and Froelicher (2012) is one example of using SPSS Complex Samples to appropriately analyze a national survey dataset. Regular statistical software not designed for survey data was created with the assumption that data were collected using simple random sampling (Centers for Disease Control and Prevention, 2009; Kneipp & Yarandi, 2002). Kneipp and Yarandi (2002) provided clear examples in which failing to account for weights in regular SPSS underestimated standard errors and led to incorrect conclusions. Researchers analyzing national survey datasets need to understand the sampling design and weights in order to obtain unbiased population estimates.
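To make the effect of weighting concrete, here is a minimal sketch with hypothetical blood pressure values and weights, contrasting an unweighted mean with a weighted one. Note that this toy example does not produce design-based standard errors, which require the strata and cluster information handled by the survey procedures named above.

```python
import numpy as np

# Hypothetical respondents: systolic blood pressure and sample weights
# (weights differ because, e.g., some subgroups were oversampled).
sbp = np.array([118, 142, 125, 131, 160, 122])
weights = np.array([250, 250, 800, 800, 800, 1200])  # persons each case represents

unweighted_mean = sbp.mean()
weighted_mean = np.average(sbp, weights=weights)

print(f"unweighted: {unweighted_mean:.1f}  weighted: {weighted_mean:.1f}")
# Only the weighted estimate is unbiased for the population mean; correct
# standard errors additionally require the design (strata, clusters) to be
# specified in survey-capable software.
```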

Sampling frame

Many available national survey datasets provide the benefits of nationally representative large samples. This makes analytic results from these datasets more generalizable and less vulnerable to threats to external validity than results from small studies using convenience samples. However, before making inferences from survey results, researchers need to consider the sampling frame used and the population it represents. A sampling frame is the list of the population from which the sample is chosen, or the means of accessing the population to choose the sample (Trochim, 2001). How the sampling frame is defined determines what population is represented, and thus gives some idea of the subgroups that may have been excluded. For example, in a telephone survey in which names are selected from a telephone directory, the directory is the sampling frame. In this case, persons who have no telephone or who are not at home during the day are likely to be excluded or underrepresented. Yet persons without telephones are more likely to be poor, and persons not at home during the day are more likely to be employed. Such a difference between the sampling frame and the population can cause bias. Researchers using a pre-existing national survey dataset need to be aware that, even in a nationally representative survey, some subgroups that differ from the population may have been excluded, depending on the sampling frame used, and this should be acknowledged in the discussion.

Sample size and effect size

In the case of secondary analysis, the sample size is predetermined. A large sample size is one major advantage of using national survey datasets. However, a researcher using a pre-existing large dataset needs to know that an extremely large sample often yields statistically significant results for very small differences (Magee et al., 2006). Statistical significance is not the same as the clinical significance of a difference. It is important to discuss explicitly whether the findings are not only statistically significant but also clinically significant.
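The point can be illustrated with a hypothetical two-group comparison: a difference in mean systolic blood pressure of only 0.5 mmHg, which is clinically negligible, becomes highly statistically significant once each group contains 50 000 respondents. All figures here are assumed for illustration.

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics: a 0.5 mmHg difference in mean systolic
# blood pressure between two groups of 50 000 respondents each.
t_stat, p_value = ttest_ind_from_stats(
    mean1=128.5, std1=15.0, nobs1=50_000,
    mean2=128.0, std2=15.0, nobs2=50_000,
)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
# The p-value is far below 0.05 even though a 0.5 mmHg difference has no
# clinical importance, which is why clinical significance must be discussed
# alongside statistical significance.
```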

In addition, if a researcher is interested in a subgroup of a specific age or race, the sample size may not be adequate. Therefore, as in any study, the researcher needs to make sure that the sample size in the dataset provides sufficient power to investigate the new research questions. This requires a power calculation involving four parameters: (i) alpha level (α); (ii) power (1-β); (iii) sample size; and (iv) effect size (Saba, Pocklington, & Miller, 1998). These four parameters are related in such a way that if any three of them are fixed, the fourth can be determined. The alpha level (α), the probability of mistakenly rejecting the null hypothesis when it is true (type I error), is commonly set at 0.05. Power, the probability of correctly rejecting the null hypothesis when it is false, is 1-beta (β). The beta value (β) is the probability of failing to reject the null hypothesis when it is false (type II error). Conventionally, a beta (β) of 0.20 is selected, producing 80% power. Researchers using pre-existing datasets need to confirm whether the given sample has sufficient power to detect a statistical difference. An insufficient sample size causes a lack of power and increases the risk of a type II error.
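As a sketch of how the four parameters trade off against one another, the following uses the power routines in the statsmodels library for an independent-samples t-test; the effect size (Cohen's d = 0.2) and the subgroup size of 150 per group are illustrative assumptions.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Fix alpha, power, and effect size; solve for the required sample size.
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.80)
print(round(n_per_group))  # about 393 per group (round up to 394 in practice)

# Conversely, fix the subgroup size available in an existing dataset and
# solve for the power achieved to detect the same effect.
achieved_power = analysis.solve_power(effect_size=0.2, nobs1=150, alpha=0.05)
print(round(achieved_power, 2))  # about 0.41, i.e. the analysis is underpowered
```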

DATA MEASUREMENT CONSIDERATIONS

Evaluation of measurement of dataset

The major threats to the reliability and validity of datasets used in secondary analysis arise from the precision and accuracy of the methods used in the primary data collection process (Clarke & Cossette, 2000; Magee et al., 2006; Pollack, 1999). The quality of the data collected determines the quality of the research results. Researchers reanalyzing existing data may use the dataset to answer research questions other than those for which the original data were collected. Thus, specific variables of interest may not have been assessed or may have been measured with less than ideal instruments. Alternatively, important variables may have been defined or categorized differently from what the researcher would prefer. For example, race may have been defined with only two categories (white/other), or age may have been collected as a categorical rather than a continuous variable. For these reasons, researchers making use of secondary data need to judge whether there is a good fit between the proposed research questions and the dataset. With secondary analysis, a good fit between the research questions and the dataset is mandatory to minimize errors and increase validity (Castle, 2003; Doolan & Froelicher, 2009; Magee et al., 2006). The dataset should be evaluated carefully to confirm that it includes the important variables of interest and that the data are operationalized and coded in a manner that allows the desired analysis. This can be accomplished by thoroughly reviewing the purpose and summary reports, the codebooks, the manual of operations, and previously published papers related to the dataset. Most national surveys provide extensive documentation on their methods and data reliability on their websites, allowing secondary analysts easy access to this valuable information.

Measurement error

Measurement error is the extent to which responses differ from the truth (Bierman & Bubolz, 2003). Large national surveys are designed to minimize error, but it is not feasible to remove all possible sources of error. Furthermore, measurement errors made in the original survey are often invisible, and it can be difficult to determine where an error came from. Therefore, a researcher using existing datasets needs to pay attention to the details of the survey methodology and understand how it may influence the results.

Factors related to the survey methods, the respondents, and the instruments or measurements can introduce measurement error (Bierman & Bubolz, 2003). Differences in the survey method administered may affect respondents' answers. Data collected by interview are often more complete than data from self-administered questionnaires. In a personal interview, the interviewer works directly with the respondent, which makes it more useful than a mail survey for probing questions to elicit better answers, especially when asking about feelings or opinions. However, interviewers can also be a source of error through the way they administer the survey and record the results, so they must be well trained. Mail surveys, on the other hand, are easy to administer, but response rates are often low, and it may not be known who actually responded to the questionnaire.

Survey respondents can sometimes provide inaccurate information for several reasons (Bierman & Bubolz, 2003; Trochim, 2001). For example, in reporting behaviors such as drug use or sexual behaviors, they are likely to respond in more socially desirable ways. This may be more problematic with interviews than with questionnaires. Recall bias can be a threat when reporting events that happened in the past.

The use of a reliable and valid instrument to collect data is an important consideration that can minimize measurement error. Not only the reliability reported by the instrument's author, but also that reported in the original research and obtained in the current sample, should be evaluated carefully (Magee et al., 2006). The wording or ordering of questions in the instrument and the timing of data collection can also influence responses (Bierman & Bubolz, 2003). Even though national surveys generally adopt properly designed questionnaires and rigorous interviewing procedures to minimize error, researchers using existing data need to be aware of the potential sources of error. They must consider how potential errors could bias the research results and describe these as limitations. It is helpful to maintain open channels of communication with the individuals involved in the collection of the original data in order to obtain information about the accuracy of the data.

Non-response and missing data

Both non-response and missing data reduce the sample size and can bias results. The response rate in a survey is the percentage of the selected sample who complete the survey. A low response rate may result in low accuracy (Bierman & Bubolz, 2003; Pollack, 1999). Selection bias can be a problem, especially when those who responded differ from those who did not. In the case of secondary analysis, the response rate may not simply be the proportion of persons who completed the original survey (Bierman & Bubolz, 2003). For example, if an original survey had a response rate of 80% and 90% of the responses had sufficient data to permit inclusion in the secondary analysis, the response rate of the secondary analysis is the product of the two numbers, 72% (0.8 × 0.9). Non-response may be adjusted for by weighting the sample, using specific survey software (Kneipp & Yarandi, 2002).
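A minimal sketch of this arithmetic, using the figures from the example above:

```python
# Figures from the example in the text: 80% responded to the original survey,
# and 90% of those responses were usable for the secondary analysis.
original_response_rate = 0.80
usable_fraction = 0.90

effective_response_rate = original_response_rate * usable_fraction
print(f"{effective_response_rate:.0%}")  # 72%
```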

Missing data can be an important concern in secondary data analysis. Large surveys usually contain many variables, and the resulting burden on respondents to complete the questionnaire appears to increase the risk of missing data. Excessive missing data may be a sign of poor data quality; if that is the case, the decision to use the dataset may be inappropriate (Doolan & Froelicher, 2009; Pollack, 1999). Before deciding how to handle missing data, the researcher needs to determine the nature of the missingness. There are several reasons why data may be missing. Data can be considered missing completely at random (MCAR) if the missingness is related neither to the value of the variable itself nor to the values of other variables (Allison, 2001; Polit, 2010). This is the most desirable case, providing unbiased results, but it is almost never the case. When missing data are not MCAR, they may be categorized as missing at random (MAR). When a variable has missing data that are MAR, the missingness is not related to the value of the variable that has the missing values but is related to other variables (Allison, 2001; Polit, 2010). For example, data would not be considered MCAR if women were simply less inclined than men to report their weight, because missing data on weight would then be related to sex. It is possible to test this relationship by dividing the sample into those who did report their weight and those who did not, and then testing for a difference by sex. However, if among women the probability of reporting weight is the same regardless of their actual weight, the missingness is random within sex and the data would be considered MAR, though not MCAR. If data are MAR, the missingness is relatively inconsequential because, given the observed variables, the outcome of interest is independent of the missingness.
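The check described above (comparing those who did and did not report weight by sex) can be sketched as follows; the toy data frame, column names, and choice of a chi-square test are illustrative assumptions rather than part of any particular survey.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data frame with a 'sex' column and a 'weight' column containing
# missing values (None).
df = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "F", "M", "F", "M"],
    "weight": [None, 61.0, 82.5, 79.0, None, 88.0, 57.5, None],
})

# Flag whether weight is missing, then test for an association with sex.
df["weight_missing"] = df["weight"].isna()
table = pd.crosstab(df["sex"], df["weight_missing"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"p = {p:.2f}")

# A small p-value would suggest that missingness depends on sex (not MCAR);
# the data could still be MAR if, within each sex, missingness does not
# depend on the weight values themselves.
```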

The third type is missing not at random (MNAR), in which the missingness is related to the value of the variable itself and, often, to other variables as well (Allison, 2001; Polit, 2010). For example, if obese women tended to refuse to report their weight, the data would be considered MNAR. Distinguishing MAR from MNAR is sometimes difficult and requires prior knowledge about the variable (Allison, 2001).

Non-random missing data may be a threat to validity because it means that some respondents have chosen not to answer one or more questions or items for some unknown reason (Munro, 2004; Pollack, 1999). It is not possible to predict from other cases what the missing data would have been. Missing data increase the chance of a type II error and affect the generalizability of the study findings (Munro, 2004). The most commonly used ways to handle missing data are: (i) to delete the respondent for whom there is a missing value for any variable (listwise deletion); (ii) to delete the respondent only when a variable with missing data is involved in a particular analysis (pairwise deletion); or (iii) to substitute some value for the missing data (Munro, 2004; Polit, 2010). A researcher using an existing dataset needs to examine patterns of missing data in order to assess potential biases and to include in the analytic plan strategies to deal with the missing data properly.
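A brief pandas sketch of the three strategies listed above, applied to a toy data frame with made-up values; it shows listwise deletion, the pairwise behavior of a correlation matrix, and simple mean substitution (more principled approaches such as multiple imputation are not shown).

```python
import numpy as np
import pandas as pd

# Toy data frame with missing values scattered across variables.
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 47, 62],
    "weight": [70.2, np.nan, 65.0, 81.4, np.nan],
    "sbp":    [118, 131, 125, np.nan, 142],
})

# (i) Listwise deletion: drop every respondent with any missing value.
listwise = df.dropna()

# (ii) Pairwise deletion: each pair of variables uses all cases complete for
# that pair; pandas' correlation matrix behaves this way by default.
pairwise_corr = df.corr()

# (iii) Substitution: replace missing values with, e.g., the variable mean
# (a simple form of single imputation).
mean_imputed = df.fillna(df.mean())

print(listwise, pairwise_corr, mean_imputed, sep="\n\n")
```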

CONCLUSION

Secondary data analysis is a legitimate way to enhance knowledge development in nursing. It is less time-consuming and less costly than undertaking a prospective study. Readily available national survey datasets are valuable and versatile sources for nursing research that can produce generalizable results.

While the potential of secondary analysis of national surveys is tremendous, successful investigations require methodological consideration of the intrinsic limitations of secondary survey analysis. A sound conceptualization of the research question and a good fit between the research question and the dataset are prerequisites for yielding valuable insights. Potential sources of error associated with sampling, data measurement, and non-response or missing data in survey datasets should be considered when evaluating the appropriateness of a dataset. Even though it is impossible to eliminate all potential sources of error, researchers must be aware of their possible influence on the results of the analyses in order to maximize the opportunity for accurate and useful insights.
