The Robustness of the Corporate Social and Financial Performance Relation: A Second-Order Meta-Analysis

For many decades, there has been a debate about the relation between corporate social/environmental performance (CSP) and corporate financial performance (CFP). Our study presents a review of academic research on this topic by applying a second-order meta-analysis. The data sample combines 25 previous meta-analyses yielding a sample size of about one million observations. Our results demonstrate a highly significant, positive, robust, and bilateral CSP-CFP relation. The relation is positive regardless of whether firms focus on ecological or social aspects, though corporate reputation turns out to be a key CSP determinant. We find a particularly strong CSP-CFP relation for operational CFP. Furthermore, we add a new perspective on potential biases resulting from the studies’ publishing source: social issues-oriented journals and methodologically weaker papers do not distort the positive CSP-CFP relation. Our conclusion is that, based on the extant literature, the business case for being a good firm is undeniable. © 2018 The Authors. Corporate Social Responsibility and Environmental Management published by ERP Environment and John Wiley & Sons Ltd. Received 24 April 2017; revised 22 October 2017; accepted 3 November 2017


What is the relation between corporate social/environmental performance (CSP) and corporate financial performance (CFP)? For nearly 50 years, scholars have investigated this question. The search for the answer has become particularly prominent since the end of the 1990s. Our study sample comprises 1214 empirical primary studies on the CSP-CFP relation. While considerable research has been conducted, it is still less clear how different CSP dimensions and CFP categories interrelate and in which way various moderating factors determine the CSP-CFP relation (Aguinis & Glavas, 2012; McWilliams & Siegel, 2001; Peloza, 2009). Moreover, the robustness of the business case over time has only occasionally been analyzed (Borgers, Derwall, Koedijk, & ter Horst, 2013; Derwall, Koedijk, & Ter Horst, 2011). The aim of this study is to provide a more detailed understanding of the CSP-CFP relation.
relevance or interest to a variety of external parties. For example, reputation is likely to influence customers' behaviours and investors' decisions (Orlitzky et al., 2003). This intangible character of corporate reputation, which is difficult to replicate, leads to sustained superior profit over time (P. W. Roberts & Dowling, 2002). Literature reviews and meta-analytical results point in a similar direction (Margolis et al., 2009; Orlitzky et al., 2003; Sabate & Puente, 2003). We thus hypothesize that CSP reputation plays a dominant role: H2b: Within the CSP dimensions, reputational aspects have the strongest relation to CFP.

CSP Disclosure
A more controversial debate exists regarding the link between CSP disclosure and CFP. Wiseman (1982) finds considerable discrepancies between CSP disclosures and actual CSP. Ullmann (1985) concludes that voluntary disclosure should not be considered a proxy for CSP. R. W. Roberts (1992) discovers that stakeholder power, philanthropic activities, and previous economic performance significantly explain concurrent CSP disclosure. Orlitzky et al. (2003) report a minimal positive relation between CSP disclosure practices and CFP, with almost all of the observed effect size variance being due to statistical artifacts. Research in environmental management likewise points to a weak relation between the two measures. Schultze and Trommer (2012, p. 396) summarize, based on a review of empirical research, that 'as disclosure is mostly voluntary, many companies only provide beneficial rather than accurate, unbiased information on their current performance'. This could, in turn, leave market participants with distorted information and increase stock volatility once investors realize that their investment decisions were based on wrong assumptions (Orlitzky, 2013). Thus, we expect a relatively weaker link between CSP disclosure and CFP: H2c: Within the CSP dimensions, CSP disclosure has the weakest relation to CFP.

The Relation of CSP and various CFP Categories
Perceptional CFP
Perceptual evaluations of business performance by senior executives are considered to be valid substitutes for objective CFP measures. In empirical studies, the product-moment correlation coefficient between subjective and objective measures ranges between 0.4 and 0.8 (Guthrie, 2001; Venkatraman & Ramanujam, 1987; Wall et al., 2004). Despite managers' discretionary assessment of firm performance, the judgment is often a fair reflection of financial reality and not per se overrated (Venkatraman & Ramanujam, 1987). Depending on the research context, the application of survey-based measures can even be the preferred choice of researchers. A quarter of empirical studies in top management journals therefore use some kind of subjective organizational performance measure (P. J. Richard, Devinney, Yip, & Johnson, 2009). Perceptual CFP measures are compiled by one-fifth of the meta-studies in our sample. We formulate our hypothesis in accordance with meta-analyses that report high correlations between CSP and perceptual CFP measures (Margolis et al., 2009; Orlitzky et al., 2003; Wu, 2006): H3a: CSP has a stronger relation to perceptional CFP than to other CFP categories.

Operational CFP
Operational performance such as productivity allows a nuanced conceptualization of business performance as it focuses on those operational success factors that might lead to superior CFP (Venkatraman & Ramanujam, 1986). As an intermediate-outcome metric (Peloza, 2009), it reflects information that is closer to the value creation process within a firm and more remote from the variety of interacting and overlapping factors like discretionary accounting practices or general stock market movements (P. J. Richard et al., 2009). McWilliams, Siegel, and Teoh (1999) caution against using share prices as the sole metric for CFP when investigating the CSP-CFP relation. Effects of specific plant or product-level information on value creation cannot be fully captured by simply looking at market-based measures such as shareholder value. Therefore, the effects of CSP on operational CFP measures should be less likely to be drowned out by capital market 'noise' (Peloza, 2009, p. 1527). Empirical evidence for this argument was found in a meta-analysis on human resource factors (Combs et al., 2006): operational CFP measures such as staff turnover yield higher correlations than accounting- or market-based performance categories. Similar results were obtained in the environmental supply chain context: operational CFP measures such as reduced material and waste streams demonstrated slightly higher correlations than market-based measures (Golicic & Smith, 2013). Thus, we hypothesize: H3b: CSP has a stronger relation to operational CFP than to market-based CFP.

Traditional CFP Categories
When considering the range of the traditional CFP measures, Cochran and Wood (1984) propose that accounting-based CFP such as return on assets may be the best proxy for measuring CFP, presuming that the accounting practices of firms are comparable. Such measures more adequately reflect internal decision-making capabilities and managerial performance rather than external evaluations of investors. Empirical evidence supports this: accounting-based CFP measures have a stronger relation to CSP than market-based CFP measures (Margolis et al., 2009; Orlitzky et al., 2003; Wu, 2006).
Growth-based CFP, typically determined by sales, profit, or employee growth, is not directly affected by capital market developments. However, companies' growth can be distorted by inorganic growth from acquisitions or divestments. Such activities usually have a relatively higher impact on growth measures (O'Shaughnessy & Flanagan, 1998) compared to changes in accounting figures (King, Dalton, Daily, & Covin, 2004). Accordingly, empirical research finds a weaker CSP relation for growth as a CFP measure compared to accounting-based CFP (Dixon-Fowler et al., 2013; Unger et al., 2011).
With regard to risk, primary research (Bansal & Clelland, 2004; Godfrey, Merrill, & Hansen, 2009; J. B. McGuire et al., 1988) suggests that CSP reduces companies' risk. Most of the empirical studies measure risk either as accounting-based risk (i.e. debt to assets) or market-based risk (i.e. beta-factor, total volatility). However, the observed correlations in meta-analyses are relatively low (e.g. Orlitzky & Benjamin, 2001). Summing up, these arguments bring us to the following hypothesis: H3c: Accounting-based CFP demonstrates a stronger relation to CSP than growth-, risk-, and market-based CFP.

Mutual Funds
Several studies on mutual funds, financial indices, and virtual portfolios suggest that the positive CSP-CFP relation is weaker compared to studies using firm-specific CFP measures. To date, this result presents a puzzling outcome in CSP-CFP research. The first mutual funds studies appeared in the early 1990s. These early studies found that the performance of CSP-oriented mutual funds is not statistically different from the performance of conventional mutual funds (e.g. Hamilton, Jo, & Statman, 1993). Since then, further investigations have been carried out in several primary and review studies (Bauer, Koedijk, & Otten, 2005; Capelle-Blancard & Monjon, 2014; Rathner, 2013; Revelli & Viviani, 2015), hardening the conclusion that the CSP-CFP link is weaker for mutual funds than for other CSP-CFP studies. In turn, we hypothesize: H3d: Studies on mutual funds demonstrate a weaker CSP-CFP relation compared to studies using other CFP measures.

(Aguinis & Glavas, 2012; Buehler & Shetty, 1974; Griffin & Mahon, 1997; Orlitzky, 2008) several other moderating factors have been used in primary studies. These are visibility to the public and consumers (Fry, Keim, & Meiners, 1982; Jiang & Bansal, 2003), research and development investments (McWilliams & Siegel, 2000), economic conditions (Golicic & Smith, 2013), the level of diversification, advertising, government sales, consumer income, and stage in the industry as well as the firm life cycle (Aguinis & Glavas, 2012; Elsayed & Paton, 2009; McWilliams & Siegel, 2001). Since not all primary studies use the same bundle of control variables and/or moderating factors, it is difficult to compare such effects in meta-analyses. Thus, as meta-analysts have not adjusted for these factors in a comparable manner, we also cannot investigate these effects. Nonetheless, a few central moderating factors can be assessed at second-order level.

Effects over Time
It is not unusual that academic studies investigating a specific effect for the first time demonstrate impressive effect sizes which partly diminish in subsequent research studies (Trikalinos & Ioannidis, 2005). Griffin and Mahon (1997) have already called for research that looks at the CSP-CFP relation over time. To date, it is still an open question as to whether this is a time-independent relation. Scholars suggest that institutional investors increasingly focus on stakeholder issues (Borgers et al., 2013; Derwall et al., 2011). Owing to learning effects and changes in investor values, these issues have come into focus in recent years, and the CSP-CFP relation, in this case measured as market-based CFP, has diminished considerably since the mid-2000s. However, in meta-research on the CSP-CFP link, the evidence of a time dependency is not consistent. In their meta-analysis, Pavie and Filho (2008) show considerable effect size differences before and after 1998. Others find no or only non-significant relations (Albertini, 2013; Endrikat et al., 2014; Rubera & Kirca, 2012). Based on suggestions from primary studies and meta-analyses, we hypothesize about the moderating role of time: H4a: The CSP-CFP relation shrinks over time.

Sample Size
Hedges (1981) demonstrates that effect sizes for small samples are biased upward. Evidence from cumulative meta-analyses in different fields supports this finding and cautions that meta-analyses with a small number of subjects may overestimate effects (Ioannidis, Trikalinos, Ntzani, & Contopoulos-Ioannidis, 2003; Trikalinos et al., 2004). By using the number of included studies as a weighting factor, the calculation design of meta-analyses seeks to adjust for this circumstance. If the CSP-CFP relation were a statistical outlier, we would observe effect size differences depending on the study sample size. Initial observations of our sample, based on the number of primary studies included in a meta-analysis and the corresponding overall effect size of the meta-study, support such a conclusion. We therefore hypothesize: H4b: The CSP-CFP relation shrinks with the growing number of primary studies in the meta-sample.
The Role of the Studies' Publishing Source
Orlitzky (2011) examines in a review of the 2003 data whether the empirical evidence on the CSP-CFP relation differs depending on the kind of publishing journal. The author finds that the average correlations in economics, finance, and accounting journals are less than half of the correlations published in social issues-oriented journals (SIM journals, where SIM stands for social issues in management). The deviation is statistically significant. Allouche and Laroche (2005) reflect on their meta-analysis results by differentiating the publishing sources: business ethics journals, accounting journals, and management journals. They find no significant differences. In order to investigate the role of the publishing source for our sample, we follow a three-step approach. First, we differentiate between SIM and general management (GM) journals, similarly to Orlitzky (2011). Second, we classify the journals of our 25 meta-analyses according to the Thomson Reuters' Journal Citation Reports (JCR) classification standard [Social Sciences Citation Index (SSCI) subject categories]. Third, we take into consideration the JCR journal impact factor of the publishing journals for the meta-analyses in our sample. We expect that effect sizes will tend to be smaller in more rigorous journals (e.g. Aguinis, Beaty, Boik, & Pierce, 2005). We thus hypothesize: H4c: The CSP-CFP relation is distorted by the publication source; SIM and lower-impact journals report more positive outcomes.

Methodological Quality
An analysis of meta-analyses in education research suggests that 'there seems to be a particular tendency for smaller effect sizes to be associated with higher methodological quality' (Tamim, Bernard, Borokhovski, Abrami, & Schmid, 2011, p. 14). However, the results were not significant. Based on nearly 200 meta-studies in management research, Aguinis, Dalton, Bosco, Pierce, and Dalton (2011) find that methodological choices have very little impact on the overall effect sizes. Lipsey and Wilson (1993) reveal for 302 meta-analyses in psychology that effect sizes in meta-studies are not inflated by the inclusion of methodologically weak studies. To create generalizable conclusions for the CSP-CFP debate, we build on Aguinis et al. (2011) and apply a methodological strength assessment. We hypothesize: H4d: The CSP-CFP relation is independent of methodological choices.

Methods and Data
The first comprehensive summary of meta-analyses was published at the beginning of the 1990s, aggregating the work of 302 meta-analyses (Lipsey & Wilson, 1993). Since then, various examples of how to aggregate existing research summaries have emerged (Ahn, Ames, & Myers, 2012; Aytug, Rothstein, Zhou, & Kern, 2012; Cooper & Koenka, 2012; Hattie, 2009; Peterson, 2001; D. Richard, Bond, & Stokes-Zoota, 2003; Tamim et al., 2011; Wilson & Lipsey, 2001). The main challenges of second-order meta-analyses are the precise calculation of the second-order average effect and estimation of its sampling error. Hunter and Schmidt (2004) and Borenstein, Hedges, Higgins, and Rothstein (2009) already address these challenges, but only recently have sufficient statistical methods been provided.
This study uses Schmidt and Oh's (2013) method for second-order meta-analysis. A second-order meta-analysis combines a number of independent but methodologically comparable first-order meta-analyses. It enables better generalization across contexts with a far larger sample size and enhanced meta-analytic moderating factor analyses. Conclusions can be derived not just from a fraction of available primary studies, but from a sample that is presumably close to the complete set of studies. The Schmidt and Oh (2013) second-order meta-analysis is an advancement of its psychometric first-order predecessor, the Hunter and Schmidt (2004) approach, which is used by more than 80% of meta-analyses in management research (Aguinis et al., 2011).
Schmidt and Oh's (2013) method has two important advantages compared to other second-level meta-analytic approaches. First, their method determines how much variance across a set of comparable first-order meta-analyses is due to second-order sampling error. This can considerably reduce the remaining sampling error variance of first-order meta-analyses and allows estimation of the true (non-artifactual) variance across these mean effect sizes. Second, the analysis enables the computation of reliability estimates for the first-order meta-analyses and, thus, allows us to produce a more accurate estimation of the true mean effect size in each first-order meta-analysis.
Schmidt and Oh (2013) present three approaches for second-order meta-analysis: bare-bones second-order meta-analysis; psychometric second-order meta-analysis with artifact distribution of first-order meta-analysis estimates; and psychometric second-order meta-analysis with individually corrected first-order meta-analysis estimates. Because of the high number of first-order meta-analyses in our sample using artifact distribution correction without correcting primary studies individually, we opted for the first two approaches to calculate our results.
The two approaches differ in terms of the input data (uncorrected or attenuated correlations vis-à-vis corrected or disattenuated correlations), the weighting scheme of the first-order meta-analyses, and the calculation of the second-order mean effect size. Like its first-order meta-analytical predecessor methodology (Hunter et al., 1982; Hunter & Schmidt, 2004), second-order meta-analysis is a fully random-effects model.

The Search Process
We included a CSP-CFP meta-analysis in our sample insofar as the CSP definition applied in the meta-analysis matched the CSP understanding expressed by Wood (2010) and Clarkson (1995). Wood (2010, p. 64) argued for 'CSP as an overarching, multi-dimensional construct' where vast literature on organizational culture, managerial decision-making, or employee relations practices 'waits to be brought into the field of CSP'. Clarkson (1995) also outlined a diverse set of social and stakeholder issues. He included, for example, employee-related aspects like compensation/rewards, training and development, employment equity, and employee communication. We follow their line of reasoning and have also included meta-analyses beyond a narrow CSP focus. These are, for instance, corporate environmental performance (CEP), ethical corporate governance, and broader employee-related aspects. We did not differentiate as to whether the motives for firm CSP were altruistic or strategic (Baron, 2001; J. W. McGuire, 1969; McWilliams, Siegel, & Wright, 2006). In terms of CFP, the studies needed to consider at least one of the following categories: accounting-based performance, market-based performance, operational performance, perceptual performance, growth metrics, risk, or the performance of CSP mutual funds/indices. In cases where the entire meta-analysis was beyond this scope, we extracted only the relevant parts of the whole analysis. Meta-analyses with probit- or logit-correlations were excluded, as transformations to the r-family of correlations result in effect sizes with excessively large sampling errors. Non-standardized effects, which could not be classified as CSP-CFP sub-effects, were also disregarded for this second-order meta-analysis. Our cut-off date for study inclusion was December 2014. All studies needed to be available in electronic format.
Initially, we applied ancestry research, using widely cited meta- and review studies to extract their cited predecessor studies. The sample was subsequently extended with a forward search in Google Scholar looking for potentially relevant cross-references to studies not yet known to us. After that, we performed searches in relevant databases and publisher sites: Academy of Management journals, ABI/Inform, Ebsco, Emerald, Google Scholar, Oxford Journals, Sage, Science Direct, Springer Link, Web of Science, and Wiley, and also searched for non-published manuscripts on Econbiz, National Bureau of Economic Research (NBER), Research Papers in Economics (RePEc), and Social Science Research Network (SSRN). We used several search combinations and variations of the term 'Corporate Social Performance' including (corporate) sustainability, (corporate) social responsibility, (corporate) governance, and (corporate) environmental performance, all in relation to (corporate) financial performance. For each database query, we further processed the first 100 hits sorted by relevance. The raw data set of several thousand studies was first reduced by removing overlapping search hits; second, the remaining results were searched for keywords. The keywords were meta, review, literature, overview, analysis, study/ies, and examination. This yielded an overall raw sample of 149 working papers and studies which were analyzed in more detail by examining the abstract or full paper. Seventy-six studies were dropped in the next step, because they either did not contain a relevant CSP-CFP categorization or were single-study designs. The remaining studies were narrowed to the final sample of 25 studies by the exclusion of: (1) 47 narrative reviews and vote-count studies; and (2) meta-studies with non-transferable effect sizes and different paper versions. If different versions of meta-studies with comparable data sets were identified, the latest version remained in our sample.
Our search was open to meta-analyses in all asset classes, but it was not possible to locate meta-studies that analyzed asset classes other than equities. Thus, published (primary) research on the CSP effect in corporate bonds, real estate, and private equity is not considered, as we could not yet identify dedicated meta-analyses.

Sample
Our total sample consists of 25 meta-analyses, which combined yield a gross number ($n_m$) of 1,902 primary studies. It aggregates 4,507 separate effects ($k_m$) and nearly a million observations ($N_m$ = 992,239) (Table 1). Apart from the meta-analytical summary effects, we obtained 129 CSP-CFP sub-effects of first-order meta-analyses, combining in total 2,387,215 observations. Besides the summary and sub-effects at the meta-level, we extracted all provided data at primary study level. Adjusted for overlapping studies and primary studies which either were not made transparent by the meta-analyses' authors (n = 404) or needed manual corrections (n = 38), the final study sample comprises 1214 (n) unique studies. Manual corrections were necessary for author typing errors, different years (i.e. published vs unpublished version), or adding or changing 'a' or 'b' when two different studies of the same author team appeared in the same year. When retrieving the primary study information from the meta-studies, we coded it in a unique form as author1, author2, …, author i (year). For 45% of studies (n(1) = 551 primary studies), the first-order meta-analyses provided data about effect sizes. On the basis of this information, it was possible to calculate a conventional first-order meta-analysis (so-called second-order omnibus meta-analysis).
Table note (fragment recovered from the page layout): (e) variance estimated based on study sample size $n_i$ and overall n-weighted attenuated effect size $r$ with the formula for $\hat{\sigma}^2$. See text for abbreviations.
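The identifier coding and de-duplication of primary studies described above can be illustrated with a short sketch. It is not the procedure used in the study: the function names (`study_key`, `unique_studies`), the normalization rules, and the sample records are hypothetical and only show the idea of collapsing differently typed references into one key of the form author1, author2, … (year).

```python
import re

def study_key(authors, year):
    """Build a canonical 'author1, author2, ... (year)' key (hypothetical format)."""
    # Lower-case names and collapse whitespace so that trivial typing
    # differences do not create spurious duplicates.
    names = [re.sub(r"\s+", " ", a.strip().lower()) for a in authors]
    return f"{', '.join(names)} ({str(year).strip().lower()})"

def unique_studies(records):
    """Return the set of unique study keys across all meta-analyses."""
    return {study_key(authors, year) for authors, year in records}

# Hypothetical records pooled from several meta-analyses.
records = [
    (["Smith", "Jones"], 2001),
    (["smith ", "Jones"], "2001"),   # same study, typed differently
    (["Miller"], "1999a"),           # 'a'/'b' suffix kept as part of the year
]
print(len(unique_studies(records)))  # number of unique studies
```

Keeping the 'a'/'b' suffix inside the year field is one simple way to separate two studies by the same author team in the same year; other encodings would work equally well.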

Coding
Besides the basic identifier information, we coded three groups of data.
(1) We retrieved a variety of methodological information. This included the meta-analytical approach, the family of reported effect sizes, if and how corrections for sampling and measurement error were made, and, finally, which attenuation factors were used.
(2) We collected the granularity of primary studies (transparency level of statistics, all effect sizes, N, k, and t-test).
(3) Finally, we gathered all meta-analytical results for summary and sub-effects (effect sizes, attenuated and disattenuated variances, standard errors, credibility and confidence intervals, reliability information of dependent and independent variables, the overall reliability estimate, and results of the homogeneity test, significance tests, and Fail-Safe N analysis). If results for both fixed and random effects were reported, we selected the random effect. All transparent data in primary studies which were also included in a first-order meta-analysis were extracted (e.g. the study identifier and results like study size, effect size(s), and reliabilities). In some cases, rudimentary interpretation was needed. For these cases, two independent coders reviewed the context. The original inter-rater reliability was above 0.9.

Independence Considerations
An important prerequisite for conducting a second-order meta-analysis is statistical independence of individual first-order meta-analyses. This would require that the primary studies or samples used in one first-order meta-analysis not be included in any of the other analyses (Schmidt & Oh, 2013). In fact, it is typical for second-order meta-analyses that this assumption cannot be met completely. Scholars therefore suggest that minimizing the lack of independence might be the best that can be expected (Cooper & Koenka, 2012; Schmidt & Oh, 2013). Various simulation analyses support this approach and find negligible effects of statistical independence violations (Bijmolt & Pieters, 2001; Romano & Kromrey, 2008; Tracz, Elmore, & Pohlmann, 1992). Our approach for minimizing data dependency applied three steps: meta-analyses selection; average overlap assessment; and the elimination of overlapping studies as a robustness check. First, we selected only the most recent version of a study. We considered multiple contributions from single authors only if they had different research topics or reported different correlation coefficients for the same primary study (Orlitzky, 2001; Orlitzky et al., 2003; Orlitzky & Benjamin, 2001). Second, for every meta-analysis with transparent primary studies, we calculated the pairwise overlap with each of the others. On average, a single meta-analysis overlaps with the rest by 5.4%, and 78% of pairs have less than 5% overlap. Third, Wilson and Lipsey (2001) reduce disproportionate dependency in second-order meta-analysis by eliminating studies that have more than 25% of primary studies in common with other meta-analyses. Thus, we eliminated all meta-analyses with more than 25% overlap with at least one other study. The results of an additional assessment to rule out distorting effects of a potential independence violation are reported in Table 6.
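The overlap assessment in the second and third steps can be sketched in a few lines. The text does not state how overlap is normalized, so the sketch assumes that the overlap of meta-analysis A with B is the share of A's primary studies that also appear in B; the function names and the toy data are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Share of primary studies in `a` that also appear in `b` (assumed definition)."""
    return len(a & b) / len(a) if a else 0.0

def overlap_report(metas, threshold=0.25):
    """Return (average pairwise overlap, meta-analyses above the threshold)."""
    # Directed pairs, because the assumed overlap measure is not symmetric.
    pairs = {(x, y): overlap(metas[x], metas[y]) for x, y in permutations(metas, 2)}
    average = sum(pairs.values()) / len(pairs)
    flagged = sorted({x for (x, _), share in pairs.items() if share > threshold})
    return average, flagged

# Toy data: each meta-analysis is a set of primary-study identifiers.
metas = {
    "MA1": {"s1", "s2", "s3", "s4"},
    "MA2": {"s4", "s5", "s6", "s7"},
    "MA3": {"s1", "s2", "s8"},
}
average, flagged = overlap_report(metas)
print(f"average overlap: {average:.3f}; above 25%: {flagged}")
```

Meta-analyses in `flagged` would be the ones dropped in a robustness check of the kind described; with a symmetric definition (for example, shared studies divided by the smaller set) the flagged list could differ.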
We conclude that the level of average non-independence for this study appears small and we have no indication that it could distort effect sizes and variances significantly.

Analysis and Statistical Models
We calculated summary effects in three alternative ways and also for various subgroups and moderating factors. First, we conducted a second-order omnibus meta-analysis with all primary studies in our meta-analyses sample which contained transparent data for effect sizes. Technically, this is a traditional psychometric first-order meta-analysis (Hunter et al., 1982; Schmidt & Hunter, 2015). So-called omnibus meta-analyses have the advantage of established calculation procedures from first-order meta-analysis. Moreover, such analyses yield the same grand mean estimate, provided that all else is equal. But they also suffer from shortcomings such as typically scarce data transparency at primary study level and, most importantly, the impossibility of estimating second-order sampling error variance and correcting for it (Schmidt & Oh, 2013). To our knowledge, this omnibus meta-analysis constitutes the largest meta-analysis in the field to date, based on the number of included primary studies. It combines n(1) = 551 studies with N(1) = 152,437 observations retrieved from previous meta-analyses. We utilized this first-order meta-analysis to gain a more granular understanding of the CSP-CFP relation and its statistical variation at primary study level.
T. Busch and G. Friede
Second, we calculated a second-order meta-analysis combining all meta-analytical summary effects of the 25 meta-analyses in our sample (m = 25).
Third, we aggregated all relevant sub-effects of these meta-analyses (m = 129) in various subgroups in order to derive more granular conclusions about the CSP-CFP relation. These sub-effects and their reliability measures provided the foundation for the calculation and correction of a second-order sampling error and the subsequent moderating factor analysis. Moreover, as part of the robustness checks, we analyzed the impact on results when changing from best-effort variances to simple variance estimates. Further documentation on the determination of effect sizes and variances, the correction for measurement error, and the assessment of heterogeneity is provided in online Supplementary Appendix SA.
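For the first of the three calculations, the omnibus first-order meta-analysis, a bare-bones computation in the Hunter-Schmidt tradition can be sketched as follows. This is a simplified illustration without the psychometric artifact corrections the study applies; the function name `bare_bones` and the input values are hypothetical, and the sampling error variance uses the common approximation based on the average sample size.

```python
def bare_bones(rs, ns):
    """Bare-bones summary for correlations `rs` with study sample sizes `ns`."""
    total_n = sum(ns)
    k = len(rs)
    # Sample-size-weighted mean correlation.
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Sample-size-weighted observed variance of the correlations.
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling error variance, using the average sample size.
    var_err = (1 - mean_r ** 2) ** 2 / (total_n / k - 1)
    # Residual variance after removing sampling error, floored at zero.
    var_res = max(var_obs - var_err, 0.0)
    return mean_r, var_obs, var_err, var_res

# Hypothetical primary-study correlations and sample sizes.
mean_r, var_obs, var_err, var_res = bare_bones([0.10, 0.20, 0.30], [100, 200, 100])
print(round(mean_r, 3), round(var_obs, 4), round(var_err, 4), round(var_res, 4))
```

In this toy input the expected sampling error variance exceeds the observed variance, so the residual variance is floored at zero, that is, all observed variation is attributed to sampling error.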

Adjusting for the Second-Order Sampling Error
The adjustment for second-order sampling error is comparable to the adjustment for first-order measurement error variance in psychometric meta-analysis. Just as measurement error is the random deviation of an observed score from a true score, second-order sampling error is the random deviation of a first-order meta-analytic effect size from the actual second-order average. Thus, the adjustment function is comparable to that for measurement error (Schmidt & Oh, 2013). In this function, $\hat{\bar{\rho}}_{ir}$ is the i-th first-order meta-analytic estimate adjusted for second-order sampling error and for unreliability in the first-order meta-analytic results, $r_{\bar{\rho}\bar{\rho}}$ is the reliability of the first-order meta-analytic values, $\hat{\bar{\rho}}_i$ is the i-th first-order meta-analytic estimate, and $\hat{\bar{\bar{\rho}}}$ is the second-order grand mean meta-analytic estimate. The reliability coefficient $r_{\bar{\rho}\bar{\rho}}$ equals 1 − ProVar and is calculated as the ratio $\hat{\sigma}^2_{\bar{\rho}} / S^2_{\hat{\bar{\rho}}}$, where $\hat{\sigma}^2_{\bar{\rho}}$ is the estimate of the actual (non-artifactual) variance across the m meta-analyses of the population mean disattenuated correlations $\hat{\bar{\rho}}$. This is the variance that remains after the expected second-order sampling error variance $E(S^2_{e_{\hat{\bar{\rho}}_i}})$ has been subtracted from the variance of the mean disattenuated correlations $S^2_{\hat{\bar{\rho}}}$ across the m meta-analyses (Schmidt & Oh, 2013).
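The displayed equation did not survive in this copy of the text. Judging only from the four quantities defined above (the adjusted estimate, the reliability, the first-order estimate, and the grand mean), it presumably has the form of a regressed, Kelley-type true-score estimate. The following is a reconstruction under that assumption, not a quotation from Schmidt and Oh (2013):

```latex
\hat{\bar{\rho}}_{ir}
  = r_{\bar{\rho}\bar{\rho}}\,\hat{\bar{\rho}}_{i}
  + \bigl(1 - r_{\bar{\rho}\bar{\rho}}\bigr)\,\hat{\bar{\bar{\rho}}},
\qquad
r_{\bar{\rho}\bar{\rho}}
  = 1 - \mathrm{ProVar}
  = \frac{\hat{\sigma}^{2}_{\bar{\rho}}}{S^{2}_{\hat{\bar{\rho}}}},
\qquad
\hat{\sigma}^{2}_{\bar{\rho}}
  = S^{2}_{\hat{\bar{\rho}}} - E\bigl(S^{2}_{e_{\hat{\bar{\rho}}_{i}}}\bigr).
```

Under this reading, ProVar is the share of the observed between-meta-analysis variance that is due to second-order sampling error, so a reliability near 1 leaves each first-order estimate almost unchanged, while a reliability near 0 pulls it toward the grand mean.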
Adjustments for second-order sampling error are useful to reduce the remaining sampling error variance of first-order meta-analyses and therefore produce more accurate estimations of the true mean effect size and (non-artifactual) variance in first-order meta-analyses. These figures are displayed in columns 6-13 of Table 4.
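A numerical sketch of this adjustment is given below. It is written for generic first-order means and assumes inverse-variance weights in which the sampling variance of each first-order mean is its observed effect size variance divided by its number of effects; these weights, the function name `second_order`, and the input values are assumptions for illustration, not the study's data or exact computation.

```python
def second_order(means, variances, ks):
    """Combine m first-order mean effect sizes.

    means:     first-order mean effect sizes
    variances: observed effect size variance within each meta-analysis
    ks:        number of effect sizes in each meta-analysis
    """
    m = len(means)
    # Assumed weight: inverse of the sampling variance of each first-order mean.
    weights = [k / v for v, k in zip(variances, ks)]
    w_sum = sum(weights)
    grand_mean = sum(w * x for w, x in zip(weights, means)) / w_sum
    # Weighted observed variance of the first-order means around the grand mean.
    var_obs = sum(w * (x - grand_mean) ** 2 for w, x in zip(weights, means)) / w_sum
    # Expected second-order sampling error variance.
    var_err = m / w_sum
    # Share of observed variance due to second-order sampling error, capped at 1.
    pro_var = min(var_err / var_obs, 1.0) if var_obs > 0 else 1.0
    reliability = 1.0 - pro_var
    # Regress each first-order mean toward the grand mean.
    adjusted = [reliability * x + (1 - reliability) * grand_mean for x in means]
    return grand_mean, pro_var, reliability, adjusted

# Hypothetical first-order results.
grand_mean, pro_var, reliability, adjusted = second_order(
    [0.10, 0.20, 0.30], [0.01, 0.01, 0.01], [10, 10, 10]
)
print(round(grand_mean, 3), round(pro_var, 3), [round(a, 3) for a in adjusted])
```

With these inputs, 15% of the variance across the three first-order means is attributed to second-order sampling error, so the outer estimates move slightly toward the grand mean.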

Moderating Factors
To determine potential moderating factors, the data were split into two or, if necessary, multiple subgroups. We calculated z-values to identify whether effect size differences among groups are significant. To assess the potential time dependence of the CSP-CFP relation, we analyzed the effect sizes as a function of the meta-study publication year and of the average publication year of the primary studies in each meta-analysis. To rule out a distortion of results from changing meta-analytic methods and correction factors over time, the relation was also assessed at the first-order level as a function of the publication year.
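The subgroup comparison can be illustrated with a standard z-test for the difference between two independent mean effect sizes. The text does not give the exact formula used, so the version below, which divides the difference by the square root of the summed squared standard errors, is an assumption; the function name and the numbers are hypothetical.

```python
import math

def z_difference(mean_a, se_a, mean_b, se_b):
    """z-value and two-sided p-value for the difference between two subgroup means."""
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical subgroup means and standard errors.
z, p = z_difference(0.20, 0.03, 0.10, 0.04)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The test treats the two subgroups as independent, which is only approximately true when meta-analyses share primary studies.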
To evaluate the role of sample size, we analyzed how the number of primary studies in each meta-analysis affects the correlation level. The influence of the publishing source was compared for SIM journals versus General Management (GM) journals based on Orlitzky's (2011) categorization. Since definitions of what can be considered a SIM or GM journal differ (AoM SIM Division, 2012), all publishing journals of the 25 meta-analyses were also classified according to the Thomson Reuters JCR classification standard. The SSCI subject categories present in our sample (business, environmental studies, ethics, management, political science, and applied psychology) were used as an alternative way to differentiate the sample of meta-analyses. If a journal was listed in more than one SSCI category, the one in which it ranks highest was selected. Furthermore, the JCR 5-Year Journal Impact Factors (2014), the annual citation frequency of the meta-analyses, and the deviation between working papers and published studies were evaluated.
Finally, the dependence of effect sizes on methodological quality was evaluated. The methodological quality assessment builds on Aguinis et al. (2011) and adopts those of their 21 parameters that could be expressed as dichotomous and qualitative. All parameters that were applied multiple times (e.g. range restrictions) were summarized into one overarching parameter. For each parameter identifiable in a meta-study, the meta-study received a score of 1 (present) or 0 (not present). The condensed set contained 11 parameters and therefore yielded a maximum methodological quality score of 11. The total scores for the set of 25 meta-analyses range between 2 and 9.
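The scoring rule amounts to counting the parameters that are present. The following sketch shows this under the assumption of a plain unweighted sum. The parameter labels are placeholders, not the actual parameters derived from Aguinis et al. (2011).

```python
def quality_score(checklist):
    """Count how many of the 11 dichotomous quality parameters are present."""
    if len(checklist) != 11:
        raise ValueError("expected the condensed set of 11 parameters")
    return sum(1 for present in checklist.values() if present)


# Hypothetical meta-study in which the first 7 placeholder parameters are present.
example = {f"parameter_{i}": (i <= 7) for i in range(1, 12)}
score = quality_score(example)
print(score)
```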

Addressing Publication Bias
A general concern of any meta-analysis is that research yielding no significant results remains unpublished (e.g. Dickersin, 2005). Rosenthal (1979) labelled this potential limitation the file drawer problem, referring to the many files that remain with the investigator and are not published by journals. We addressed this potential drawback by calculating the Fail-Safe N (or file drawer) statistic, displaying a funnel plot, and including unpublished literature. Fail-Safe N states the number of (future) studies with null results that would be needed before the effect size loses its significance (Rosenthal, 1979). Rosenthal suggested a tolerance level of 5n + 10, above which Fail-Safe N indicates that a file drawer problem is unlikely. The statistic was calculated for all summary and sub-effects. For a visual investigation of publication bias, we also included a funnel plot (Light & Pillemer, 1984; Sutton, 2009), which scatters the study size of a meta-analysis against its effect size. In our case, the funnel plot displays all m = 129 sub-effects (number of effects k) against their disattenuated effect sizes $\hat{\bar{\rho}}$ (Figure 1). The effects are symmetrically distributed around the underlying true effect, with larger variability for smaller studies (and vice versa). As such, we can exclude biases in the data. Furthermore, we applied the Egger test for publication bias, which tests for asymmetry of the funnel plot (Egger, Smith, Schneider, & Minder, 1997). Moreover, we separated attenuated and disattenuated effects from unpublished studies and compared them as a subgroup against published studies (Schmidt & Hunter, 2015). The proportion of these working papers within the total results was notably high: they contributed 28 of the 129 sub-effects. All checks revealed that the results are very likely not affected by publication bias (Figure 1; Table 5).
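The Fail-Safe N logic can be sketched as follows, assuming Rosenthal's (1979) formula based on the sum of the studies' standard normal deviates and a one-tailed critical value of 1.645. The function names and the z-scores are hypothetical and serve only as an illustration.

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's Fail-Safe N: null-result studies needed to lose significance."""
    k = len(z_scores)
    return (sum(z_scores) ** 2) / (z_alpha ** 2) - k


def tolerance_level(k):
    """Rosenthal's tolerance level of 5k + 10 for k included studies."""
    return 5 * k + 10


# Hypothetical z-scores of ten studies.
z_scores = [2.0] * 10
n_fs = fail_safe_n(z_scores)
robust = n_fs > tolerance_level(len(z_scores))
print(round(n_fs, 1), tolerance_level(len(z_scores)), robust)
```

In this illustration, Fail-Safe N exceeds the tolerance level, so a file drawer problem would be considered unlikely.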

Robustness Checks
The empirical outcomes of this analysis can be considered fairly robust for a number of reasons. First, we analyzed the impact of switching from our best-effort variances to simple variance estimates (calculation according to Equation 6 in Online Supplement). The results in Table 6 show that our best-effort variances are conservative estimates, which produce lower effect sizes and variances, in particular for disattenuated results. Second, we performed a sensitivity analysis (Aguinis, Gottfredson, & Joo, 2013; Greenhouse & Iyengar, 2009) and recalculated all effects after leaving out the top 5% and the bottom 5% of effect sizes and of effect size variances, respectively. Third, to control for a potential bias through perceptual measures of CFP and CSP, the effects were recalculated after leaving out the complete sample of perceptual CFP and reputational CSP. The results of all three robustness checks indicate that the findings are not distorted by outliers (Table 6).
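The leave-out step of the sensitivity analysis can be sketched as below. The sketch assumes inverse-variance weights for the mean effect and rounds the 5% share down to a whole number of effects. Both choices are assumptions, as the text does not specify them, and all names and data are hypothetical.

```python
import math


def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect size (assumed weighting scheme)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def drop_extreme(effects, variances, key="effect", tail="top", share=0.05):
    """Drop the given share of effects with the most extreme values.

    key  -- "effect" to rank by effect size, "variance" to rank by variance
    tail -- "top" to drop the highest values, "bottom" to drop the lowest
    """
    n_drop = math.floor(share * len(effects))
    values = effects if key == "effect" else variances
    order = sorted(range(len(effects)), key=lambda i: values[i])
    if n_drop == 0:
        dropped = set()
    elif tail == "top":
        dropped = set(order[-n_drop:])
    else:
        dropped = set(order[:n_drop])
    keep = [i for i in range(len(effects)) if i not in dropped]
    return [effects[i] for i in keep], [variances[i] for i in keep]


# Hypothetical sample: 19 similar effects and one large outlier.
effects = [0.1] * 19 + [0.9]
variances = [0.01] * 20
full_mean = weighted_mean(effects, variances)
trimmed_effects, trimmed_variances = drop_extreme(effects, variances, "effect", "top")
trimmed_mean = weighted_mean(trimmed_effects, trimmed_variances)
print(round(full_mean, 3), round(trimmed_mean, 3))
```

Comparing the full and the trimmed mean shows how strongly a few extreme effects drive the summary effect.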

The General CSP-CFP Relation
We find strong support for Hypothesis 1a. First, the second-order omnibus meta-analysis, which is technically a first-order meta-analysis with $n^{(1)}$ = 551 primary studies, reveals a highly significant (first-order) effect size of $\bar{r}^{(1)}$ = 0.119 (Table 2). This number represents the weighted average correlation across all primary studies adjusted for sampling error. When we additionally adjust for meta-analytic artifacts (measurement error in the CSP and CFP figures), we find a disattenuated correlation of $\hat{\bar{\rho}}^{(1)}$ = 0.169. The average effect is highly significant and different from zero. The 95% confidence interval for the weighted average attenuated correlation ranges from 0.105 to 0.134 (0.150 to 0.189 for the disattenuated correlation). The 95% credibility interval ranges from -0.206 to 0.444 (-0.284 to 0.624 for the disattenuated correlation). The corresponding variance $\hat{\sigma}^2_{\hat{\bar{\rho}}^{(1)}}$ of 0.054 (Table 2) is fairly comparable to the simple average of disattenuated correlations in the meta-analyses in our sample of 0.042 (Table 1). The effects are very heterogeneous [$Q^{(1)}$ = 4733 to 9288 and $I^{2(1)}$ = 0.88 to 0.94]. Second, the aggregation of all summary effects of the meta-analyses in our sample (m = 25) uncovers effect sizes that are very similar to those in the sample of primary studies. We determine an attenuated correlation of $\bar{\bar{r}}$ = 0.108 and a disattenuated correlation of $\hat{\bar{\bar{\rho}}}$ = 0.150 (Table 3).
Table 2. First-order meta-analytical results (attenuated bare-bones results and psychometric results with artifact correction).
Note: Table 2 outlines the first-order meta-analytical results based on extracted statistics for primary studies from m = 25 meta-studies (second-order omnibus meta-analysis). The additional index (1) is introduced to differentiate these first-order meta-analytical results from second-order results. The significance values are: *p < 0.10; **p < 0.05 and ***p < 0.01.
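The reported credibility interval can be approximated with a short sketch. It assumes the usual normal-theory form, the mean effect plus or minus 1.96 times the square root of the estimated true variance. The function name is hypothetical. The inputs are the first-order disattenuated figures reported in the text (0.169 and 0.054).

```python
import math


def credibility_interval(mean_effect, true_variance, z=1.96):
    """Return the (lower, upper) bounds of a normal-theory credibility interval."""
    half_width = z * math.sqrt(true_variance)
    return mean_effect - half_width, mean_effect + half_width


# First-order disattenuated correlation and variance as reported in the text.
lower, upper = credibility_interval(0.169, 0.054)
print(round(lower, 3), round(upper, 3))
```

The result lies within a few thousandths of the reported bounds of -0.284 and 0.624. The remaining gap is consistent with the variance being rounded to three decimals.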
Third, the correlations change only marginally when the sample of 129 sub-effects from the meta-analyses is aggregated ($\bar{\bar{r}}$ = 0.110, $\hat{\bar{\bar{\rho}}}$ = 0.157). But unlike effect sizes, (true) variances at second-order level are considerably lower than those in the omnibus analysis of primary studies. Second-order variances are only in the range of $\hat{\sigma}^2_{\bar{\bar{r}}}$ = 0.001 to $\hat{\sigma}^2_{\hat{\bar{\bar{\rho}}}}$ = 0.007 (Table 3). The 95% credibility interval for disattenuated sub-effects ranges from -0.001 to 0.316. Moreover, in our second-order results, we can no longer detect the large heterogeneity of effects that was present in the primary studies. Q and $I^2$ indicate homogeneous relations (Q = 11.4 to 75.5; $I^2$ = 0.00; Table 3). The third and preferred measure of heterogeneity for psychometric meta-analysis (1-ProVar) still indicates heterogeneity (0.96 to 0.97), albeit in very small variances, which have low statistical power. We conclude that all CSP-CFP summary effects, whether first or second order, are highly significant and positive. Second-order variances are furthermore very small, with credibility intervals that include zero only by a very thin margin and with effect sizes that are relatively homogeneous.
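The link between Q and $I^2$ can be illustrated with a minimal sketch. It assumes the standard definition of $I^2$ as the excess of Q over its degrees of freedom relative to Q, truncated at zero. The function name is hypothetical. The first two calls use the first-order Q range and the 551 primary studies reported above. Pairing Q = 11.4 with m = 25 in the third call is an assumption about which row of Table 3 that value belongs to.

```python
def i_squared(q, n_effects):
    """I-squared from Cochran's Q: max(0, (Q - df) / Q) with df = n_effects - 1."""
    df = n_effects - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


# First-order Q range reported for the 551 primary studies.
first_order_low = i_squared(4733, 551)
first_order_high = i_squared(9288, 551)

# Second-order case: Q below its degrees of freedom truncates to zero.
second_order = i_squared(11.4, 25)
print(round(first_order_low, 2), round(first_order_high, 2), second_order)
```

The first-order values reproduce the reported range of 0.88 to 0.94, and any Q below its degrees of freedom yields the $I^2$ of 0.00 reported at second-order level.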
With respect to Hypothesis 1b, the difference between prior CSP to subsequent CFP ($\bar{r}_i$ = 0.090) and prior CFP to subsequent CSP ($\bar{r}_i$ = 0.098) is small and non-significant (Table 4). We apply two layers of corrections to account for other artifacts besides first-order sampling error. Based on the first-order correction for other artifacts, the relation changes slightly ($\hat{\bar{\rho}}_i$ = 0.128 vs $\hat{\bar{\rho}}_i$ = 0.141). The additional correction for second-order sampling error reveals that the real grand mean correlation is marginally different ($\hat{\bar{\rho}}_{ir}$ = 0.130 vs $\hat{\bar{\rho}}_{ir}$ = 0.143). The difference between both groups falls short of any significance level by a wide margin. We can thus rule out that the slack resources hypothesis or the good management hypothesis yields significantly higher results. However, we find the strongest correlation for concurrent CSP and CFP, which supports Hypothesis 1b that there is a virtuous circle between CSP and CFP. The mean correlation for concurrent CSP and CFP, corrected for second-order sampling error, is $\hat{\bar{\rho}}_{ir}$ = 0.159, the highest of the group with the lowest variance.
Table 3. Second-order meta-analytical results (summary effects m = 25; meta sub-effects m = 129).
Note: Table 3 depicts the second-order meta-analytical results: a) for the bare-bones second-order meta-analysis; and b) for the psychometric second-order meta-results corrected with the artifact distribution factor. m stands for the number of included meta-analytical effects (m = 25 is the highest level of reported meta-analytical summary effects, m = 129 are extracted meta-analytical sub-effects); $k_m$ is the number of studies at primary level; $N_m$ states the overall sample size in the m meta-analyses; $\bar{\bar{r}}$ is the variance-weighted attenuated average second-order effect size; $\hat{\bar{\bar{\rho}}}$ is the variance-weighted disattenuated average second-order effect size; $S^2$ is the corresponding sample-size weighted observed variance for $\bar{\bar{r}}$ and $\hat{\bar{\bar{\rho}}}$; $\hat{\sigma}^2$ is the estimated true variance for $\bar{\bar{r}}$ and $\hat{\bar{\bar{\rho}}}$; StdE is the standard error; CI95% is the 95% confidence interval and CrI95% is the 95% credibility interval, in which L indicates the lower and U the upper bounds; 1-ProVar describes the reliability of the first-order meta-analytical correlations and indicates the proportion of (true) variance which is not due to first- and second-order artifacts; $I^2$ and Q are the corresponding heterogeneity measures. The significance values are: *p < 0.10; **p < 0.05; and ***p < 0.01.
Table 4. Second-order meta-analytical results for subgroups.

The Relation of Various CSP Dimensions and CFP
As Hypothesis 2a predicts, studies with social CSP reveal a higher relation ($\hat{\bar{\rho}}_i$ = 0.181) than studies focusing on environmental CSP ($\hat{\bar{\rho}}_i$ = 0.151) (Table 4). However, this difference is small and fails statistical significance (the p-value of the difference is 0.388). Thus, Hypothesis 2a is rejected. Contrary to a common notion in the literature, our findings suggest that good CSP pays, whether it is social or environmental. Within CSP dimensions, CSP reputation yields the highest relation ($\hat{\bar{\rho}}_i$ = 0.318), which is significantly higher than the group average of CSP measures of $\hat{\bar{\bar{\rho}}}$ = 0.225 at the 99% level (z-value 2.66). When we apply the reliability factor of 0.80 (Table 4, column 12) to account for second-order sampling error, the CSP reputation estimate is adjusted only marginally, to $\hat{\bar{\rho}}_{ir}$ = 0.299. Thus, we confirm Hypothesis 2b: reputational aspects have the strongest relation to CFP within the CSP dimensions.
CSP disclosure yields significantly lower correlations ($\hat{\bar{\rho}}_i$ = 0.119) than the group average ($\hat{\bar{\bar{\rho}}}$ = 0.225) at the 99% level (Table 4). On account of the high level of variance artifacts (1-ProVar = 0.58), an adjustment for second-order sampling error yields the final, potentially true, correlation level of $\hat{\bar{\rho}}_{ir}$ = 0.163. Hypothesis 2c is supported: within CSP dimensions, CSP disclosure has the weakest relation to CFP, also after adjustment for second-order sampling error.

The Relation of CSP and Various CFP Categories
In line with Hypothesis 3a, perceptual performance yields, by some distance, the highest relation to CSP ($\hat{\bar{\rho}}_i$ = 0.380) (Table 4). It deviates significantly at the 99% level from the CFP group average ($\hat{\bar{\bar{\rho}}}$ = 0.148), also when adjusted for second-order sampling error ($\hat{\bar{\rho}}_{ir}$ = 0.359). This finding clearly supports Hypothesis 3a: CSP has a stronger link to perceptual CFP than to any other CFP category.
Operational CFP exhibits the highest CSP-CFP relation within the group of objective CFP categories ($\hat{\bar{\rho}}_i$ = 0.222) (Table 4). The deviation from the group average ($\hat{\bar{\bar{\rho}}}$ = 0.148) is highly significant. Consideration of the second-order sampling error changes the result only negligibly ($\hat{\bar{\rho}}_{ir}$ = 0.216). This clearly supports Hypothesis 3b.
Accounting-based measures of CFP are more highly correlated with CSP ($\hat{\bar{\rho}}_i$ = 0.161) than growth measures ($\hat{\bar{\rho}}_i$ = 0.124), market-based CFP measures ($\hat{\bar{\rho}}_i$ = 0.115), and risk measures ($\hat{\bar{\rho}}_i$ = 0.062). Even though all these category results are significantly different from zero, market-based CFP and risk measures return significantly lower correlations than other CFP measures (Table 4). These findings support Hypothesis 3c.
In accordance with Hypothesis 3d, we obtain the lowest correlation of all CFP categories ($\hat{\bar{\rho}}_i$ = 0.016) for CSP mutual funds. It is the only result within the CFP categories that is not significantly different from zero. Nonetheless, if we adjust for the low statistical reliability of the CSP mutual fund relation (1-ProVar = 0.00), a correction for the second-order sampling error towards the average group effect $\hat{\bar{\bar{\rho}}}$ becomes necessary (outlined in Equation 1). The difference between $\hat{\bar{\rho}}_i$ and $\hat{\bar{\rho}}_{ir}$ of 0.132 could be a reflection of the overall 'noise', whether statistically or capital-market induced. We thus support Hypothesis 3d and conclude that mutual funds demonstrate a weaker CSP-CFP relation than other CFP categories. Figure 2 graphically summarizes the range and magnitude of the main effect sizes.

Moderating Factors
Our initial assessment reveals that more recent meta-analyses have significantly lower effect sizes than older ones. This holds whether recency is measured by the meta-study publication year ($\hat{\bar{\rho}}_i$ = 0.130 vs $\hat{\bar{\rho}}_i$ = 0.216) or by the average publication year of the primary studies in each meta-analysis (Table 5). However, the attenuation factors applied by the first-order meta-analytical reviewers for these two groups differ considerably. Older meta-analyses used an average attenuation factor of 0.66, compared with 0.78 for younger analyses. We also assess the relation at first-order level with data on 551 primary studies to rule out a distortion of initial results from changing meta-analytical methods and correction factors over time. This verification reveals that the shrinkage in correlation levels is no longer observable, as the differences in $\bar{r}^{(1)}$ and $\hat{\bar{\rho}}^{(1)}$ vanish (0.002 and -0.002). Based on this extensive set of primary studies, which avoids potential distortions through different meta-analytical attenuation factors, Hypothesis 4a cannot be confirmed.
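The role of the attenuation factor can be made concrete with a short sketch. It assumes the standard psychometric correction in which the observed correlation is divided by a compound attenuation factor. Whether every meta-analysis in the sample applied the correction in exactly this form is an assumption. The function name and the observed correlation of 0.10 are hypothetical. The factors 0.66 and 0.78 are the group averages reported above.

```python
def disattenuate(observed_r, attenuation_factor):
    """Correct an observed correlation by dividing by the attenuation factor."""
    if not 0.0 < attenuation_factor <= 1.0:
        raise ValueError("attenuation factor must lie in (0, 1]")
    return observed_r / attenuation_factor


# Same hypothetical observed correlation under both average attenuation factors.
observed_r = 0.10
older_style = disattenuate(observed_r, 0.66)    # factor used by older meta-analyses
younger_style = disattenuate(observed_r, 0.78)  # factor used by younger meta-analyses
print(round(older_style, 3), round(younger_style, 3))
```

An identical observed correlation thus yields a higher disattenuated value under the older factor, so part of the apparent decline over time can arise from the correction alone.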
When separating our meta-analytical results into two groups depending on the number of underlying studies (n), we obtain two summary effects that differ only marginally and insignificantly. Larger studies tend to be negligibly weaker in effect size ($\hat{\bar{\rho}}_i$ = 0.153) than smaller studies ($\hat{\bar{\rho}}_i$ = 0.160). We conclude that the CSP-CFP relation does not depend on the study sample size, as the difference is non-significant. Larger studies in our sample produce relations as strong as those of smaller samples, which can be interpreted as another sign of the stability of the CSP-CFP relation. Thus, we reject Hypothesis 4b.
As no meta-study in our sample was published in an economics, finance, or accounting journal, we are limited to sub-effects in SIM journals and GM journals. In contrast to findings at primary study level, we even find significantly lower effect sizes for meta-analyses in SIM journals ($\hat{\bar{\rho}}_i$ = 0.115) than in GM journals ($\hat{\bar{\rho}}_i$ = 0.174). Likewise, under the JCR journal classification, we find significantly lower effect sizes in environmental studies and ethics than in business and management ($\hat{\bar{\rho}}_i$ = 0.113 vs 0.150). These are interesting findings since prior research proposed the opposite. Working paper studies exhibit attenuated effect sizes comparable with those in published papers ($\bar{r}_i$ = 0.115), but their disattenuated effect sizes deviate significantly from those in published papers ($\hat{\bar{\rho}}_i$ = 0.186 vs 0.152). Moreover, we find that higher-impact factor journals (factor 5.1 on average for the sample) produce significantly higher correlations than lower-impact factor journals ($\hat{\bar{\rho}}_i$ = 0.177 vs $\hat{\bar{\rho}}_i$ = 0.127). Very frequently cited meta-analyses (on a per annum basis) exhibit effect sizes very similar to those of less often cited ones. All these findings clearly reject Hypothesis 4c. There is no upwardly distorted CSP-CFP relation through SIM or lower-impact journals. Finally, regarding the role of judgmental decisions in meta-analytic techniques and methodological choices, we find virtually no difference in effect sizes between methodologically stronger ($\hat{\bar{\rho}}_i$ = 0.161) and weaker studies ($\hat{\bar{\rho}}_i$ = 0.153). This finding supports Hypothesis 4d. The CSP-CFP relation exists regardless of the methodological sophistication of meta-analyses. Furthermore, the results of the additional robustness tests suggest that the results are not distorted by statistical outliers (Table 6).
Table 6. Robustness analysis for summary effects (m = 25) and sub-effects (m = 129).
Note: Table 6 displays the results of various robustness checks. The original results based on best-effort variances are (a) for m = 25 meta-analytic summary effects and (b) for m = 129 sub-effects. (c) recalculates (a) with simple variances instead of best-effort variances and (d) recalculates (b) with simple variances instead of best-effort variances. In particular, the comparison of (b) and (d) indicates that the applied best-effort variances lead to more conservative estimates for effect sizes and narrower credibility intervals. The $\hat{\bar{\bar{\rho}}}$-value for (d) is significantly higher than the corresponding value in (b), measured by its z-value. (e) and (f) display the results of additional measures to reduce overlap in the sample, eliminating for (e) eight out of 25 meta-analyses in which more than 25% of primary studies overlap with each other, and for (f) all sub-effects of those meta-analyses. We find no indication that small violations of independence affect effect sizes and credibility intervals despite the reduced data sample. From the m = 129 sub-effect sample, (g) eliminates the 5% of sub-effects with the lowest variances, which would lead to significantly higher disattenuated effect sizes, (h) eliminates the 5% of sub-effects with the highest variances, (i) eliminates the 5% of sub-effects with the lowest effect sizes, (j) eliminates the 5% of sub-effects with the highest effect sizes, and finally (k) eliminates all sub-effects on reputational CSP and perceptual CFP. Viewed from different angles, the robustness checks underline once more the very robust nature of the CSP-CFP relation and the conservative nature of the determined effect sizes. See text for abbreviations. The significance values are: *p < 0.10; **p < 0.05; and ***p < 0.01.

Discussion and Conclusion
This study makes several contributions to the academic inquiry on whether it pays to be a good firm. First, we can claim with more confidence than before that a highly significant, positive, and bidirectional CSP-CFP relation exists. The findings are robust at first-order and, in particular, at second-order meta-analytical level and do not change when we consider potential publication bias, apply less conservative variances, or conduct sensitivity analyses. True variances at second-order level are considerably lower than those at first-order level. Thus, the resulting credibility interval is much narrower, with a lower bound virtually at zero (m = 129). We can therefore safely claim that in 95% of cases we observe a positive average effect in meta-studies on the CSP-CFP relation. We find no indication that either CSP or CFP matters more from a cause-effect point of view. All three underlying theories (slack resources, good management, and virtuous circle) can be defended empirically. However, the effect sizes for a bidirectional relation are the highest and show the lowest variability. This leads us to acknowledge the virtuous circle theory as the most probable. To date, the academic CSP-CFP debate is often characterized by notions of 'contradictory', 'mixed', or 'ambiguous' findings. We propose that the business case for CSP can be considered well proved. Research designs that take non-linear CSP-CFP relations into account may further increase the explanatory power of the CSP-CFP relation (Barnett & Salomon, 2012; Gao, Wu, & Hafsi, 2017; Trumpp & Guenther, 2017; Wagner, Van Phu, Azomahou, & Wehrmeyer, 2002).
Second, we shed light on an academically well-established misinterpretation. We cannot detect statistically significant differences between the effects of environmental and social-related CSP on CFP. This outcome is in contrast to several empirical studies and conceptions which consider social aspects as the main driver for CFP. Our findings suggest a different result: good CSP pays off, whether it is social or environmental. The difference between both is not statistically significant. This result mirrors S. L. Hart's (1995) idea of a natural resource-based view and supports other empirical evidence for the business case of improving CEP (Albertini, 2013; Berchicci & King, 2007; Etzion, 2007). It may also be a way of revitalizing the resource-based view (Barney, Ketchen, & Wright, 2011) by acknowledging environmental- and social-related aspects as equally relevant sources of companies' success.
Third, within CSP dimensions, it is possible to identify different sources for a business case. The strongest relation to CFP is identified for CSP reputation. The intangible character of corporate reputation, which is difficult to replicate and potentially the most overarching outcome measure for CSP, leads to sustained superior financial profit (Melo & Garrido-Morgado, 2012; P. W. Roberts & Dowling, 2002; Sabate & Puente, 2003). On the other hand, the strong correlation for CSP philanthropy reiterates that instrumental and moral initiatives can supplement each other (Brammer & Millington, 2008; Hahn, Pinkse, Preuss, & Figge, 2016). CSP disclosure as well as CSP audits, processes, and policies show significantly weaker correlations than the rest of the CSP dimensions. The limited standardization of CSP disclosure practices (Busch, Bauer, & Orlitzky, 2016) seems, in fact, to encourage companies to provide primarily beneficial rather than unbiased CSP information (Schultze & Trommer, 2012; Ullmann, 1985). The significantly weaker correlation of CSP disclosure may point to reliability and validity issues of CSP data. One way of overcoming such data limitations can be to strengthen efforts toward a more stringent, and maybe even obligatory, extra-financial data disclosure. Standardization efforts of the Sustainability Accounting Standards Board (SASB, 2015) and the European Union (EU, 2014) are promising steps forward in this regard.
Fourth, our findings suggest that operational performance is significantly and more highly correlated to CSP than other CFP categories. This finding mirrors other researchers' claims that operational performance such as productivity is located closer to the value creation source within a firm; it better reflects organizational performance than potentially distorted CFP categories like accounting- or market-based measures (McWilliams et al., 1999; Peloza, 2009; Rowley & Berman, 2000; Venkatraman & Ramanujam, 1986).
Fifth, it is an interesting phenomenon that correlations shrink across CFP categories, with perceptual performance having the highest correlation and CSP mutual funds the lowest. The closer research gets to realized market performance, the smaller the effects. Several reasons may explain why empirical studies on CSP mutual funds exhibit a relatively weak CSP-CFP relation. Realized performance in mutual funds depends on the overlapping effects of systematic and idiosyncratic risks (Campbell, Lettau, Malkiel, & Xu, 2001; Luo & Bhattacharya, 2009), construction constraints (Clarke, de Silva, & Thorley, 2002), and costs for portfolio implementation, which can be as high as 2.5% per annum in various fees that are carried by the average mutual fund (Barber, Odean, & Zheng, 2005; Carhart, 1997; Khorana, Servaes, & Tufano, 2009). The 'drowned out by other noise' argument (Peloza, 2009, p. 1527) thus seems to be even more valid for mutual funds, which are the ultimate amalgam of market-based measures. Nevertheless, in the worst case, mutual fund investors could expect to lose nothing compared to conventional mutual fund investments.
Sixth, we expected that early published analyses would show greater effect sizes than subsequently published meta-analyses. This potential diminishing effect over time as data accumulate is well documented (Trikalinos & Ioannidis, 2005). Indeed, we find significantly lower correlations for more recent meta-studies than for older studies. However, the additional analysis at primary study level (n = 551) revealed that the shrinkage in correlations in the meta-analyses most likely does not indicate a shrinkage of the actual CSP-CFP relation. Correlations at primary study level, in contrast to our findings at the meta-study level, do not shrink. One explanation for this finding is that more recent meta-analytical research methods have become more sophisticated and use higher attenuation factors.
Finally, the suspicion has been raised that evidence of the CSP-CFP relation depends on the kind of publishing journal. The speculation was that SIM journals tend to publish 'better results'. Interestingly, we find the opposite result: 'truth is not made' in special SIM journals and low-impact factor journals, but rather, in particular, in high-impact factor and general management journals, at least at a meta-analysis level. We did not discover any publication bias, whether analyzed by Fail-Safe N, by the funnel plot, or with regard to the publishing status of a paper. In line with previous findings (Aguinis et al., 2011; Lipsey & Wilson, 1993), we also reveal that the methodological quality of meta-analyses has no influence on the reported effect sizes.
In conclusion, this study takes an instrumental point of view: we ask whether it pays to be good and find that in fact it does. In light of the urgent global challenges that society faces, it appears essential that firms and investors consider any action possible to safeguard the living conditions for future generations. They can be encouraged by the finding that instrumental and moral motivations, which at first glance compete, can supplement each other.