The relationship between environmental performance and environmental disclosure: A meta‐analysis

This research conceptually and empirically summarizes multiple aspects of the association between corporate environmental performance and corporate environmental reporting in previous literature, addressing the questions of (a) whether disclosure is a reliable indicator of performance and (b) whether variable measurement characteristics influence empirical outcomes. Systematic literature review and meta‐analytic techniques are employed to generate objective and valid summarized effects. The research covers a total of 251 effect sizes within 62 primary studies, representing a total of 56,387 observations. This study discovers a weak and negative association between environmental performance and environmental reporting, supporting the sociopolitical perspective that poor environmental performers have higher motivations to increase their level of disclosure than strong performers. At the same time, this research confirms the heterogeneity of previous studies in the field and verifies the effects of measurement methods on empirical outcomes.


Firm adjustment
A few scholars take firms' heterogeneity into consideration and normalize CEP by firms' individual features, for example, dividing the total amount of emissions or waste by sales value (e.g., Clarkson et al., 2008; Connors & Gao, 2011; Dawkins & Fraas, 2011a).

Definitions and measurements of CER
Similar to the case of CEP, interpretations and measurements for CER vary (Alrazi, de Villiers, & van Staden, 2015). Since environmental reports are often prepared for specific purposes of companies or target recipients (Cormier, Ledoux, & Magnan, 2011;da Silva Monteiro & Aibar-Guzmán, 2010;Dawkins & Fraas, 2011b) and their publication is mostly voluntary, firms can decide for themselves concerning the content and fashion of their disclosure (Abba et al., 2018;Meek, Roberts, & Gray, 1995). GRI (2013) refers to CER as "the actions of measuring, disclosing and being accountable to stakeholders for a firm's environmental impacts." Campbell (2004) explains in more detail that CER presents a company's organizational and operational processes that influence the natural environment. Abba et al. (2018) describe CER as pertaining to aspects such as environmental policies, management schemes, environment-related investments, or pollution remediation. Researchers also characterize the attributes of high-quality disclosure. Cormier, Magnan, and van Velthoven (2005) set standards for the combination of precision, relevance, and usefulness for decision making. Hummel and Schlick (2016) highlight the presentation of relevant and transparent numerical data. In that sense, we sum up the CER measurement characteristics as follows.

Reporting aspect
Some researchers take only the presence of CER into account, for example, whether firms participate in the GRI reporting scheme or respond to the Carbon Disclosure Project (CDP) (e.g., Dawkins & Fraas, 2011a, 2011b; Lu & Taylor, 2018). A few others consider the completeness of disclosure, for example, the proportion of reported items in a reporting standard (e.g., Deswanto & Siregar, 2018; Hassan & Romilly, 2018). The majority of studies look at the quality of information, since it is necessary for value relevance (Abba et al., 2018). Al-Tuwaijri et al. (2004) call into question quantification (i.e., counting the number of pages, sentences, lines, or words), a common technique in early research, because it is susceptible to manipulation by reporters and therefore prone to bias. Later research introduces third-party indexes, which offer higher reliability and comparability among industries or countries, for example, the Carbon Disclosure Leadership Index (CDLI) (e.g., Giannarakis, Konteos, Sariannidis, & Chaitidis, 2017a) or Bloomberg's Environmental, Social, and Governance (ESG) (e.g., Hassan & Romilly, 2018). Nevertheless, most scholars use scoring methods derived from content analysis (Meng et al., 2014), which allow for the transformation of texts into replicable numeric values (Krippendorff, 2012; Vourvachis & Woodward, 2015) and are considered more valid and meaningful. One of the most common methods was developed by Clarkson et al. (2008) and has often been applied or modified by subsequent researchers (e.g., Braam et al., 2016; Hassan & Guo, 2017; He & Loftus, 2014).

Quality aspect
Studies usually address information quality through the level and nature of disclosure. The CER level includes three categories: total disclosures (all indicators), hard disclosures (objective and not easily mimicked indicators), and soft disclosures (general and less verifiable indicators) (Clarkson et al., 2008). The nature of disclosure refers to characteristics of the information reported, such as the proportion of hard to total disclosures (e.g., Clarkson, Overell, & Chapple, 2011;He & Loftus, 2014), or the specificity of information, such as quantitative versus qualitative (e.g., Ingram & Frazier, 1980;Tadros & Magnan, 2019). Ingram and Frazier (1980) also inspect the types and time of the evidence presented.
CER scores can be adjusted by assigning weights to specific indicators based on their perceived importance (e.g., He & Loftus, 2014;Hughes et al., 2001) or to the quality of the information reported, that is, specific, detailed, numeric, transparent, and verifiable data versus generic, irrelevant, or imprecise data (e.g., Al-Tuwaijri et al., 2004;Meng et al., 2014). Deswanto and Siregar (2018) and Hassan and Romilly (2018) also consider the industry average scores. However, certain researchers are against the practices of weighting, stating that it leads to similar results (Hodgdon, Tondkar, Harless, & Adhikari, 2008) and does not reflect reality (Wallace & Naser, 1995).
These firms selectively disclose more to mitigate their negative impacts (Boiral, 2013;Brammer & Pavelin, 2006;Brown & Deegan, 1998;Freedman & Patten, 2004) or disclose more general, ambiguous, less verifiable (soft) information to appear as committed entities but do not truly reveal their performance (Clarkson et al., 2008;Clarkson et al., 2011). In short, sociopolitical theories imply a negative association between CEP and CER.

Empirical findings on the CEP-CER relationship
Early research in the field does not find a strong or significant association between CEP and CER, which is partly attributed to the lack of consideration for industry- and firm-specific characteristics (e.g., Fekrat et al., 1996; Li et al., 1997; Rockness, 1985; Rockness, Schlachter, & Rockness, 1986). Later studies demonstrate highly inconsistent and contradictory results.
There is considerable empirical evidence supporting sociopolitical theories. Ingram and Frazier (1980) suggest that poorer environmental actors disclose more. Hughes et al. (2001) demonstrate that poorly performing companies report more positive data to offset their impacts. Patten (2002) also discovers that low performance levels are associated with high levels of reporting. Cho and Patten (2007) maintain that low-performing companies report more proprietary information. Delmas and Montes-Sancho (2010) contribute to the discussion by showing that the firms that comply least with environmental laws disclose more than average. Excessive disclosure is also observed by de Villiers and van Staden (2011), where underperforming companies voluntarily publish information to lessen negative impacts. Clarkson et al. (2011) provide evidence showing that high-emission firms provide more information. Cho et al. (2012) conclude that poorer performers report more extensively. Recently, Aragón-Correa et al. (2016) and Braam et al. (2016) find that poor environmental actors report more as they face greater threats and pressures. Hassan and Romilly (2018) also signal that low performance levels are related to high reporting levels.
There are also numerous findings supporting economics-based theories. Deegan and Gordon (1996) and Deegan and Rankin (1996) reveal that firms' disclosures are biased toward positive information. Al-Tuwaijri et al. (2004) reveal that better performers, aiming for a candid public image, disclose more pollution-related information. In the same manner, Clarkson et al. (2008) show that overperforming companies are more active in disclosing discretionary and verifiable data. Boiral (2013) reports that many firms do not disclose a large proportion of their negative actions. Likewise, good environmental actors in the study of Iatridis (2013) exhibit better disclosure scores. More recently, He and Loftus (2014) provide evidence that better environmental performers have higher levels of disclosure. Qiu, Shaukat, and Tharyan (2014) state that firms with excellent performance have more incentives to increase disclosure quality.
Against this background, many scholars emphasize that disclosure is neither useful nor reliable enough to be an indicator of firms' environmental practices (Braam et al., 2016;Clarkson et al., 2011;Hughes et al., 2001). This study therefore focuses on the statistical relationship between CEP and CER and the effects of different measurement characteristics on empirical outcomes. Figure 1 presents the research framework.

Sampling procedures
For the sampling procedures, three steps are developed to bring transparency and reproducibility: establishing search scope, database, and criteria, developing search strategies, and screening for suitable results (Tranfield, Denyer, & Smart, 2003).
Empirical studies with statistical findings on the association between CEP and CER are included in this review. To achieve comprehensive results, not only are academic articles in high-impact, peer-reviewed journals included, but also conference papers, dissertations, and working papers. First, database searching by keywords is applied for the sake of extensive coverage (Crossan & Apaydin, 2010;Hahn & Kühnen, 2013). Subsequently, as suggested by Hunter and Schmidt (2004) and in line with recent meta-analyses (e.g., Busch & Lewandowski, 2018), the reference lists of relevant articles and recommendations from colleagues, reviewers, and the meta-analysis of Cho et al. (2016) are taken into account to extend the primary sample.
The following databases are chosen: Social Science Citation Index (Web of Science) with 17,000 journals in various fields, EBSCO Business Source Complete with 2,000 journals in business, management, and accounting, Emerald Insight with 300 management journals, and ECONIS with 1,700 economics-related journals. Such a combination of extensive and discipline-specific databases ensures the breadth and depth of the review (Podsakoff, Mackenzie, Bachrach, & Podsakoff, 2005). A trial phase is conducted to test keyword combinations and avoid missing relevant articles (Fink, 2014). Four terms are chosen as anchors: "environment*" (environment, environmental), "performance," "disclos*" (disclose, disclosure, disclosed, disclosing), and "report*" (report, reports, reported, reporting). The terms "disclos*" and "report*" are used alternately to ensure the reliability and scope of the search. Adding "relationship" or "association" to the search string causes important studies to be missed, so these terms are dropped. Extending the umbrella term "environment* performance" to specific categories (carbon, climate change, emission*, pollution, waste, toxic, resource*) leads to further relevant studies and is applied.
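The keyword strategy above can be illustrated with a short sketch that enumerates the term pairings. This is only an illustration under stated assumptions: the generic "(a) AND (b)" syntax is hypothetical (each database has its own query language), and we assume that each specific category simply takes the place of the umbrella term; the function name `build_queries` is ours.

```python
from itertools import product

# Terms taken from the text; the umbrella term comes first.
PERFORMANCE_TERMS = [
    "environment* performance",
    "carbon", "climate change", "emission*", "pollution",
    "waste", "toxic", "resource*",
]
REPORTING_TERMS = ["disclos*", "report*"]


def build_queries(performance_terms, reporting_terms):
    """Pair every performance term with every reporting term.

    Assumption: a generic "(a) AND (b)" syntax; real databases differ.
    """
    return [
        f"({p}) AND ({r})"
        for p, r in product(performance_terms, reporting_terms)
    ]


if __name__ == "__main__":
    for query in build_queries(PERFORMANCE_TERMS, REPORTING_TERMS):
        print(query)
```

Listing the pairings explicitly makes it easy to record which search strings were run in which database, which supports the reproducibility aim of the sampling procedure.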
Following the search, results that are not scientific articles, for example, book reviews, editorial notes, news, comments, lectures, presentations, and identical articles from different databases are screened out. Next, studies that are not particularly relevant to CEP and CER, for example, those which focus more broadly on environmental and social reporting, sustainability reporting, or CSR reporting, are excluded. Subsequently, articles that apply methods other than quantitative, for example, conceptual or theoretical reasoning, qualitative interviews, case studies, experiments, models, or surveys, are removed from the sample since meta-analyses require empirical estimates such as correlations or regression coefficients.
Accordingly, quantitative studies that do not provide these statistics are also excluded. The study of Freedman and Wasley (1990) could not be accessed through its publisher and is therefore not included in the sample.
The search period ends in October 2019, resulting in 62 studies from 1980 to September 2019, covering a sampling period from 1970 to 2017 and a total of 56,387 observations. Table 1 presents the primary studies and their research settings.

Coding procedures
In addition to the CEP and CER variables, their respective measurement characteristics, study locations, company types, effect size sources (presented in Section 3.3), and the reliability of publication (e.g., rankings of journals) are also categorized and coded (see Appendix S1 in Supporting Information S1 for the list of codes).
Within each primary study, the number of effect sizes, that is, the quantitative measures of the CEP-CER relationship, is identified. Some studies apply one measure for CEP, one measure for CER, and study one sample, resulting in one effect size. Many others use either multiple measures for CEP and/or CER (e.g., the amount of emissions and the amount of waste), and examine multiple samples or one sample in multiple time periods, resulting in multiple effect sizes. In these situations, different effect sizes are extracted separately to maintain their statistical dependence (Schmidt & Hunter, ). In total, there are 251 effect sizes coded from 62 studies (see Appendix S2 in Supporting Information S1 for the summary of effect sizes).
The metric to be analyzed is the Pearson product-moment correlation coefficient between CEP and CER variables, as it is a standardized metric that takes into account the differences between primary studies (Borenstein et al., 2009).

Analysis procedures
Prior to meta-analysis procedures, the signs of the correlations of positive impact CEP variables are reversed. This practice transforms all CEP variables into negative impact variables (a high CEP score indicates a poor performance level), establishing a consistent direction of interpretation. Subsequently, all correlations are converted to Fisher's z indexes to normalize the sampling distribution of Pearson correlations and mitigate the bias from distribution skew (Corey, Dunlap, & Burke, 1998;Fisher, 1958) (see Appendix S3 in Supporting Information S1 for the calculation procedures).
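The two preparation steps described above (sign alignment and the Fisher's z conversion) can be sketched as follows. This is a minimal illustration using the standard textbook formulas, not a reproduction of the calculations in Appendix S3; the function names and the example values are hypothetical.

```python
import math


def align_sign(r, positive_impact):
    """Reverse the sign for positive-impact CEP proxies so that a high
    CEP score always indicates poor performance."""
    return -r if positive_impact else r


def fisher_z(r):
    """Fisher's z transform: z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r)."""
    return math.atanh(r)


def z_variance(n):
    """Approximate sampling variance of z for n observations (n > 3)."""
    return 1.0 / (n - 3)


def inverse_fisher_z(z):
    """Convert a (mean) z value back to the correlation scale."""
    return math.tanh(z)


if __name__ == "__main__":
    # Hypothetical effect size: r = 0.30 from a positive-impact proxy, n = 103.
    r = align_sign(0.30, positive_impact=True)
    z = fisher_z(r)
    print(r, z, z_variance(103), inverse_fisher_z(z))
```

After this alignment, a positive pooled correlation means that poorer performers disclose more, which is the reading applied to the results in this study.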
With regard to meta-analysis models, because of the discrepancies in the study settings of the primary studies, the random-effects model introduced by DerSimonian and Laird (1986) is applied to estimate mean correlations. The confidence intervals are set at 95%. A general model including 251 effect sizes is constructed for the overall effect. Subsequently, eight models are run for eight CEP and CER measurement characteristics. As a robustness check, two models are made for the sources of effect sizes (i.e., correlations or partial correlations) and the reliability of publication. Further models are built for study locations and company types. After each model, the discrepancy in the true effect sizes, which implies the presence of a heterogeneity issue, is analyzed through the Q-statistic and the I² index. To test for publication bias, that is, bias arising when studies with more significant or stronger effect sizes have higher publication opportunities, a funnel plot is used to investigate asymmetrical distribution of standard errors, and Rosenthal's (1979) Fail-safe N is computed to see whether the summarized effect sizes are artifacts of bias. In case of publication bias, Duval and Tweedie's trim and fill method is applied to discover the hypothetical effect sizes that could be achieved if there were no information asymmetry. Extra sensitivity analyses are carried out in case some primary studies have significantly high proportions of effect sizes in the combined sample.
Figure 2 illustrates the numbers of effect sizes of individual proxies. In terms of the CEP aspect, the most popular of the four indicators (i.e., environmental impact, regulatory compliance, organizational processes, and integrated proxy) is environmental impact (71%). The use of regulatory compliance and organizational processes as single indicators is not widespread; nonetheless, they are still indispensable parts of the integrated proxy. Regarding measurements, qualitative techniques are used slightly more frequently than quantitative (129 vs. 122 effect sizes). A notable feature of CEP measurement is that significantly more studies employ negative rather than positive impacts to evaluate CEP (201 vs. 50). On another note, adjusting CEP scores to the firms' specific features is fairly common (30%).

FIGURE 2 Descriptive results
Note. Underlying data used to create this figure can be found in Supporting Information S2.

Concerning the characteristics of CER, the vast majority of scholars select disclosure quality as the principal assessment aspect (92%), while presence and the completeness of environmental reports are less popular. The most commonly applied technique to evaluate quality is content analysis (69%), while the use of third-party indexes and self-developed scoring techniques is less prevalent. Referencing only the quality aspect, a large proportion of effect sizes focus on the level of reporting (total, hard, and soft disclosures) (76%), while the rest direct their attention to the nature of the information. The adjustment of disclosure scores is slightly less common than the case of CEP, with only 25% considering the perceived importance of individual indicators.
In the absence of a standard classification scheme for study locations, we group them into the United States, other single countries, and multiple countries.
Table 2 presents the summarized correlations and other statistics of the meta-analyses. The overall CEP-CER relationship has a weak but statistically significant mean correlation (r = 0.147, p = 0.000). This result indicates that, although the association between CEP and CER is tangible, it is not substantial. Because all CEP variables are coded as negative-impact measures (a high score indicates poor performance), the positive sign of the correlation means that environmental performance is negatively associated with CER. The Q-statistic is significant (p_Q = 0.000), implying a critical extent of disparity between primary studies. The high I² index (95.2%) suggests that the majority of such variance comes from the true differences of the original effect sizes instead of random errors. These findings reveal the heterogeneous nature of previous empirical research, support the sociopolitical theories, which suggest that poor environmental performers and those who are under greater societal pressures have higher motivations to increase their disclosure (Lindblom, 1994; Gray et al., 1995), and lead to the conclusion that disclosure is not indicative of performance.
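To make the reported statistics concrete, the sketch below shows how a mean correlation, Cochran's Q, and the I² index can be obtained from Fisher's z values with the DerSimonian and Laird (1986) random-effects estimator. It follows the standard textbook formulas and is not the software or exact procedure used in this study; the function name and the input values in the usage example are hypothetical.

```python
import math


def dersimonian_laird(z_values, variances):
    """Random-effects pooling of Fisher's z values (DerSimonian-Laird)."""
    k = len(z_values)
    # Fixed-effect (inverse-variance) weights and mean, needed for Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * zi for wi, zi in zip(w, z_values)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (zi - fixed_mean) ** 2 for wi, zi in zip(w, z_values))
    df = k - 1
    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # I^2: share of total variation attributed to true heterogeneity (%).
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean_z = sum(wi * zi for wi, zi in zip(w_star, z_values)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    lower, upper = mean_z - 1.96 * se, mean_z + 1.96 * se  # 95% CI
    return {
        "mean_r": math.tanh(mean_z),
        "ci_r": (math.tanh(lower), math.tanh(upper)),
        "Q": q,
        "tau2": tau2,
        "I2": i2,
    }


if __name__ == "__main__":
    # Hypothetical inputs: two effect sizes with equal sampling variance.
    print(dersimonian_laird([0.0, 0.5], [0.01, 0.01]))
```

In this reading, a significant Q together with a high I² signals that most of the observed variation reflects real differences between studies, which is what motivates the separate models for each measurement characteristic.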

Meta-analysis results and discussion
Regarding the CEP definitions, all the mean correlations except that of organizational processes are significant, which validates the Delmas and Blass (2010) classification of CEP aspects (environmental impact, regulatory compliance, organizational processes). However, the overwhelming proportion of environmental impact suggests that it should be further classified into specific sub-categories (e.g., performance-based: carbon, waste, toxic). The insignificance of organizational processes could partly be attributed to the small number of effect sizes. This research also contributes to literature and practice by adding the integrated proxy to the definition of Delmas and Blass (2010), and advises contemporary researchers and practitioners to investigate companies' environmental performance in an in-depth and comprehensive manner rather than a fragmented approach. A new question, however, is raised about the formation of such a complex proxy, that is, the proportion and weighted importance of each aspect.
In terms of CEP measurement techniques, the results show a positive relationship between CEP measured by quantitative techniques and CER (r = 0.269), while the association in the case of qualitative techniques is statistically insignificant. We therefore support the opinion of Al-Tuwaijri et al. (2004) that quantitative measurement is more objective and informative, and recommend that future researchers and relevant parties involved in reporting assessment, assurance, governance, and standardization apply categorical data.
With regard to the directions of environmental impacts, negative-impact CEP proxies show a stronger correlation with CER (r = 0.221) in comparison to positive-impact proxies (r = −0.092). This finding sheds light on the directions of impacts that the majority of researchers in this field employ, confirming that the use of negative-impact proxies is not only more popular but also slightly better at demonstrating the CEP-CER relationship. This result does not diminish the role of positive-impact proxies; instead, it calls for a more balanced use of positive and negative indicators in future research. In other words, there should be more indicators that attend to the good conduct of corporations, for example, reducing resource use or emissions, and pioneering initiatives.
Concerning the adjustment of performance scores in accordance with specific features of firms such as environmental efficiency, the normalization of firms' performance data appears to have some validity, since adjusted CEP measures have a stronger relationship with CER (r = 0.231) compared to non-adjusted measures (r = 0.106). This study therefore upholds the popularity of data adjustment and suggests that this practice would mitigate firms' heterogeneity for better comparison or benchmarking in both academia and practice.
Among the three aspects used to define CER, presence and the completeness of reports are found to have insignificant relationships with CEP, implying that the availability or the quantity of reporting has little validity in assessing the relationship between CEP and CER. Although one possible explanation could be the small number of effect sizes in both cases, it is still recommended that CER be evaluated based on the quality of environmental reports. The positive association between the quality of reporting and performance (r = 0.169) indicates that poor performers tend to possess higher quality reports.
Concerning the measurement techniques of CER, there are significant CEP-CER relationships in the cases of content analysis and third-party index (r = 0.100 and 0.347, respectively), while the use of scoring techniques has little relevance. The strongest association, between third-party CER scores and CEP, confirms the validity and reliability of such indexes compared to methods developed by individual researchers. We also support the opinion of Al-Tuwaijri et al. (2004), which states that, even though self-developed scoring techniques are quite popular in early research, they are more prone to subjectivity and biases. With that in mind, we recommend that future researchers apply validated indexes or well-developed content analysis frameworks to assess environmental reports. The comparison of different measurement scales is also worthy of investigation, since it brings further insight into the accuracy and effectiveness of each individual scale (see, e.g., the study of Delmas and Blass (2010)).
Among the studies that evaluate the quality of reports, the CEP-CER relationship is significant in both the cases of measuring CER by the level and the nature (r = 0.088 and 0.423, respectively), confirming the validity of such categorization. Given the large number of indicators and small number of effect sizes to demonstrate the nature of CER in primary studies, it is not possible to provide concrete insights on the effects of each feature of the nature of CER. Future research should thus focus more on the specificity of information reported, which could reflect the behaviors of different performers (Clarkson et al., 2011).
Concerning the weighting of disclosure scores, the results cast doubt on its effectiveness, since the CEP-CER relationship is only significant in the case of non-adjusted measures (r = 0.236). We therefore support the opinion of Wallace and Naser (1995) and Hodgdon et al. (2008) that adjusting CER scores based on the importance of indicators or the quality of information does not influence empirical outcomes. However, since the use of such an adjustment places more emphasis on the quality of information reported, we do not advise researchers or practitioners against this practice; instead, further research on appropriate adjustment methods should be carried out.
One notable finding from the country perspective is that the results are meaningful in the cases of the United States and multiple countries (r = 0.126 and 0.375, respectively), but not in the case of other countries outside the United States. The higher correlation in the case of multiple countries is a positive sign that there are certain similarities in their background and driving factors of the CEP-CER relationship. These results indicate that the choice of study location influences empirical results, and raise the need for more theoretical frameworks and transnational comparison studies, especially those with similar contextual characteristics.
On another note, with regard to company classification, listed companies and companies from environmentally sensitive industries (ESI) show significant and high CEP-CER correlations (r = 0.244 and 0.250, respectively), supporting the sociopolitical perspective that companies that are under higher societal pressure disclose more (Bewley & Li, 2000). The size of companies shows no relevance to the CEP-CER relationship in this combined sample, though that could be attributed to the mixed samples of primary studies. Thus, we recommend that forthcoming research define specific company size criteria for better comparison and summarization.
As a robustness check, the sources of effect sizes show fairly equal relevance to and influence on empirical outcomes (r = 0.151 and 0.135 for correlations and partial correlations, respectively), signifying that the CEP-CER relationship holds whether or not moderating factors are considered. Publication reliability does not provide much insight since the results are only significant in one ranking (B), implying that the quality and importance of publication are not relevant to research in this area.
Across all variables, the presence of heterogeneity is quite apparent, indicating that previous research provides inconsistent results. Rosenthal's Fail-safe N test results in N = 22,150 (p_N < 0.0001), indicating that 22,150 non-significant effect sizes would have to be included to make the summarized effects insignificant. Therefore, the summarized effects are unlikely to be artifacts of publication bias. A sensitivity analysis that excludes the study of Delmas and Blass (2010), which accounts for 19% of the sample, also results in a significant, positive, and weak CEP-CER relationship (r = 0.051, p = 0.013), indicating that the results are not skewed by this study.
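For readers unfamiliar with the Fail-safe N, the sketch below shows the classic Rosenthal (1979) calculation: the number of additional null results needed to pull the combined one-tailed significance down to the 0.05 level. It is a generic illustration, not the computation behind the N = 22,150 reported here; the function names are ours, and deriving each standard normal deviate as Fisher's z times the square root of (n − 3) is an assumption about how the inputs would be obtained.

```python
import math


def normal_deviate(fisher_z, n):
    """Standard normal deviate of one effect size (assumed: z * sqrt(n - 3))."""
    return fisher_z * math.sqrt(n - 3)


def fail_safe_n(deviates, z_crit=1.645):
    """Rosenthal's Fail-safe N.

    deviates: standard normal deviates of the individual effect sizes.
    z_crit:   critical value for a one-tailed test at alpha = 0.05.
    """
    k = len(deviates)
    total = sum(deviates)
    # Solve total / sqrt(k + N) = z_crit for N; never report below zero.
    return max(0.0, (total / z_crit) ** 2 - k)


if __name__ == "__main__":
    # Hypothetical example: ten effect sizes, each with a deviate of 2.0.
    print(fail_safe_n([2.0] * 10))
```

A Fail-safe N that is very large relative to the number of observed effect sizes, as in this study, suggests that unpublished null results are unlikely to overturn the summarized effect.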

CONCLUSION
This research finds a weak and negative association between CEP and CER and concludes that disclosure is not indicative of performance. Compared to previous research, this study presents more comprehensive and up-to-date results, involving an extensive number of primary studies and providing summarized effects with greater generality and validity. For these reasons, the research provides future scholars a concrete starting point to investigate further into this field and other related fields.

Shortcomings of previous research
With regard to theoretical assumptions, most of the previous studies assume that the relationship between CEP and CER is linear. However, there has been increasing evidence implying a more complicated association, for example, a U-shaped relationship (Dawkins & Fraas, 2011b; Meng et al., 2014; Hummel & Schlick, 2016; Li, Zhao, Sun, & Yin, 2017). Furthermore, Patten (2002) suggests that the simple correlation between CEP and CER without controlling for other factors leads to weak or insignificant results. Later studies in the field, while attempting to overcome this issue by including more variables and conducting multiple regressions, have not yet accounted for variable endogeneity (Clarkson et al., 2008; Luo & Tang, 2014). Some studies employ structural equation models instead of regressions but rely only on cross-sectional data and lack a temporal dimension (Hassan & Romilly, 2018). The choices of moderating variables also account for the emergence of disparity between studies and limit the opportunities for generalization.
In terms of definitions and measurements of variables, this research suggests that not all the techniques being used have similar levels of relevance or effectiveness. The subjectivity in current self-developed methodologies also has certain influences on the accuracy and comparability of empirical findings (Patten, 2002). It thus raises the need for a comprehensive, uniform, and comparable method to define and characterize the aspects of performance and disclosure across different research contexts.
Considering sample characteristics, current studies have samples that are either relatively small or lack sufficient diversity to provide meaningful insights (Patten, 2002). For instance, findings from studies that observe only firms listed in the Standard and Poor's (S&P) 500 Index are not a fair representation of smaller firms or those from other countries with different political, economic, social, and technological settings. It is therefore necessary to deploy more substantial, broad, holistic, cross-sectional, and cross-national studies in the field. Timing also plays a role in determining empirical outcomes. Taking into account the possibility that the relationship between CEP and CER changes over time, short-term studies are not the best option for capturing and explaining this phenomenon. In this sense, Hassan and Romilly (2018) point out that if poor performers currently disclose more information that then results in better performance in the future, current study designs might miss out on such a temporal dimension. Against this background, longitudinal studies are a promising option to invest in.

Limitations and implications
For research that follows our topic, the theoretical framework and application of a more specific performance categorization (e.g., carbon-, greenhouse gases-, or toxics-related proxies) could be explored. Different content analysis methods, third-party indexes, as well as their derivatives could also be compared to efficiently assess environmental reports. The categorization of study locations and company types could also benefit from the development and validation of appropriate frameworks to generate more meaningful insights.
Studies that target broader or more specific topics could investigate other characteristics of the CEP-CER association rather than its correlation.
The inclusion of moderating factors should also be highlighted as it promises stronger and more meaningful findings. The influence of disclosure on future performance should also be considered a potential topic for research and discussion. Furthermore, since this study has not extensively addressed the case of a non-linear CEP-CER relationship, we recommend that forthcoming research further investigate the possibility that sociopolitical theories and economics-based theories are not mutually exclusive.
For studies that stand on broader fields, there are numerous topics to follow, for example, the relationship of performance and reporting with regards to the economic, social, and governance dimensions. To reach beyond the scope of the corporate sector, there are also research opportunities in the social and public area, for example, the environmental and sustainability reporting practices of higher education institutions or government bodies.