Current testing programs for pesticides adequately capture endocrine activity and adversity for protection of vertebrate wildlife

The toxicity and ecotoxicity of pesticide active ingredients are evaluated by a number of standardized test methods using vertebrate animals. These standard test methods are required under various regulatory programs for the registration of pesticides. Over the past two decades, additional test methods have been developed with endpoints that are responsive to endocrine activity and subsequent adverse effects. This article examines the available test methods and their endpoints that are relevant to an assessment of endocrine‐disrupting properties of pesticides. Furthermore, the article highlights how weight‐of‐evidence approaches should be applied to determine whether an adverse response in (eco)toxicity tests is caused by an endocrine mechanism of action. The large number of endpoints in the current testing paradigms for pesticides makes it unlikely that endocrine activity and adversity are being overlooked. Integr Environ Assess Manag 2023;19:1089–1109. © 2023 Bayer CropScience and The Authors. Integrated Environmental Assessment and Management published by Wiley Periodicals LLC on behalf of Society of Environmental Toxicology & Chemistry (SETAC).


INTRODUCTION
The widely accepted definition of an endocrine disruptor is "an exogenous substance or mixture that alters the function(s) of the endocrine system and consequently causes adverse health effects in an intact organism, or its progeny, or (sub)populations" (World Health Organisation/International Programme on Chemical Safety [WHO/IPCS], 2002). There are three elements worth highlighting in this definition: (1) an alteration in the function of the endocrine system, that is, "endocrine activity"; (2) an adverse health effect in exposed intact organisms or populations; and (3) a causal link established between the endocrine activity and the adverse effect.
In the past two decades, regulatory bodies around the world have made significant progress toward assessing mechanisms and adverse effects that involve the endocrine system. Accordingly, a number of endocrine-specific test guidelines have been produced, allowing for investigations of adverse effects potentially resulting from endocrine mechanisms of action (MoAs; Organisation for Economic Co-operation and Development [OECD], 2018). Regulatory testing programs for potential endocrine-disrupting properties of chemicals include the US Environmental Protection Agency's (USEPA's) Endocrine Disruptor Screening Program (EDSP); the European Food Safety Authority's (EFSA's) testing requirements for Regulations (EU) No. 528/2012 and (EC) No. 1107/2009 for biocides and pesticides, respectively; and the Japanese Ministry of the Environment's EXTEND2016 program (ECHA/EFSA, 2018; Iguchi et al., 2021; USEPA, 2019). The EDSP, which was initiated in 1998 (USEPA, 1998a), uses a two-tiered approach. Tier 1 screens are used to evaluate chemicals for potential endocrine activity, and Tier 2 tests are used to further evaluate the potential for adverse effects that may result from any endocrine activity identified in Tier 1.
The tests in Tiers 1 and 2 range from quick and relatively simple receptor binding assays to multigenerational life-cycle tests with whole animals that deliver endpoints potentially relevant at the population level (Table 1). The OECD has organized testing approaches to assessing endocrine activity and effects into a five-level conceptual framework. These levels include existing data and nontest information such as model predictions in Level 1. Level 2 includes in vitro assays with data on selected endocrine pathways. Level 3 is characterized by in vivo assays that provide data on selected endocrine pathways. Levels 4 and 5 include in vivo data on adverse effects on endocrine-relevant endpoints, with more comprehensive and/or extensive life-cycle in vivo assays in Level 5 (OECD, 2018). In Levels 3, 4, and 5 assays, some endpoints can be sensitive to more than one endocrine mechanism and may also respond to nonendocrine mechanisms.
Regulatory approaches to managing chemicals that are identified as endocrine disruptors vary in different jurisdictions, with some approaches being based on hazard and others considering exposure and risk. Concerns about implementing a risk-based approach to endocrine disruptors have focused on perceptions that standard test approaches are not done at sufficiently low dose levels, or with appropriately sensitive test organisms and life stages for adequate periods, and/or that a point of departure for risk assessment purposes may not be determined with endocrine-active substances where nonmonotonic response patterns are observed. These concerns have been investigated in a SETAC Pellston Workshop using case studies with known endocrine-active substances; the results were published in a series of articles summarized in a special online issue of Integrated Environmental Assessment and Management (https://setac.onlinelibrary.wiley.com/doi/toc/10.1002/(ISSN)1551-3793.ecotox-haz-assess). In summary, workshop participants determined that if effects (including delayed effects) were investigated with sensitive species, life stages, and at low concentrations, then the environmental risk assessment of endocrine-active substances would be reliable and scientifically sound (Matthiessen et al., 2017).
The purpose of this article is to summarize the endpoints that address endocrine activity and/or endocrine-mediated adversity within the standard tests that are carried out to establish and maintain global pesticide registrations. Additionally, the authors illustrate a weight-of-evidence (WoE) approach to assessing multiple lines of evidence related to endocrine activity and adversity.
The scope of this article is limited to vertebrate species for which certain pathways in the endocrine system are well understood. It should not be assumed that endocrine pathways found in vertebrates are also conserved in invertebrates (Crane et al., 2022; LaLone et al., 2018). The endocrine modalities considered are estrogen, androgen, thyroid, and steroidogenesis (EATS), mediated through the hypothalamic-pituitary-gonadal (HPG) and hypothalamic-pituitary-thyroid (HPT) axes. These pathways have been the primary emphasis of endocrine assessments to date. The focus of this article is on pesticides, but to the extent that regulatory approaches to other types of chemicals are consistent with those for pesticides, the same principles are applicable.
Because the objective of this article is to evaluate standard pesticide test methods for their ability to assess endocrine disruption among vertebrate ecological receptors, the following questions have been considered: (1) Which test methods assess endocrine activity in vertebrate ecological receptors? (2) Which test methods assess adverse effects in vertebrate ecological receptors? (3) What are some considerations in linking endocrine activity and adverse effects in vertebrate ecological receptors? (4) How can WoE approaches be used to integrate all lines of evidence?

TEST METHODS ADDRESSING ENDOCRINE ACTIVITY IN VERTEBRATE ECOLOGICAL RECEPTORS
Test guidelines have been developed to specifically evaluate potential endocrine activity. These include in vitro and in vivo assays. Table 1 lists USEPA and OECD test guidelines that are, or can be, used to evaluate potential endocrine activity and includes a brief summary of the test organisms and endpoints. Some of the endpoints in Table 1 were specifically designed to detect endocrine activity, whereas other endpoints can measure endocrine-mediated adverse effects in addition to other toxicities. Although some endpoints provide stronger evidence of endocrine activity, others only support effects that may or may not be related to a direct endocrine action (Borgert et al., 2014; ECHA/EFSA, 2018). Effects on endocrine organs or pathways may also occur secondarily to nonendocrine-mediated systemic or organ toxicity. For example, effects on the liver can result in reduced functional capacity or an induction of biotransformation processes leading to increased hormone clearance (Wheeler & Coady, 2016). Reduced liver function could lead to a reduction in vitellogenin (VTG) in female fish that is not directly related to an endocrine MoA (Mihaich, Schäfers, et al., 2017).
Information from toxicity testing conducted for both ecological assessments and the protection of human health can be used to elucidate common mechanisms among vertebrate species, including potential endocrine-mediated activity. Taken together, the endpoints used to evaluate endocrine activity include information ranging from receptor binding and the cascade of activation and key events at the subcellular level to changes in clinical biochemistry (hormone levels), organ histology (organ level), and effects at the whole animal level, according to the Adverse Outcome Pathway (AOP) concept (Ankley et al., 2010). Below, we will review the types of tests and data available for the HPG and HPT axes.
Table 1. Endpoints required in the current USEPA and/or OECD test guidelines that relate to the endocrine pathway, or to apical effects that may be manifested due to impacts on the endocrine pathway.
Notes to Table 1: Some endpoints can be sensitive to more than one endocrine mechanism, and all of them are sensitive to systemic and/or generalized toxicity when the Maximum Tolerated Dose or Concentration (MTD or MTC) is reached or exceeded. Effects on survival are considered nonspecific secondary consequences of other toxic effects and, as such, are not considered for the identification of endocrine disruptors as described in Commission Regulation (EU) 2018/605. Outside the range of systemic and/or generalized toxic doses and/or concentrations, endpoints should not be considered in isolation to draw a conclusion on an endocrine mechanism or effect. They should be used instead in combination to define effect patterns that characterize any potential dysfunction of the endocrine pathway. Directionality of the endpoint response should also be considered when interpreting results in accordance with the relevant test guidelines. The OECD did not consider the Avian Two-Generation Toxicity test validated and dropped it from its test guideline program. There is limited validation to support the endocrine specificity of all the effects assessed.

Assays and endpoints for the HPG axis
The primary role of the HPG axis in vertebrates is the control of reproduction, as well as differentiation of sex-specific phenotypes during development. Secondarily, the HPG axis in vertebrates plays crucial roles in other systems, such as metabolism, growth, immune function, and cardiovascular function (Norris & Carr, 2013). The available regulatory guideline tests and associated endpoints for ecological receptors that can be used to elucidate activity within the HPG axis are shown in Table 1. These assays cover a wide range of biological complexity and time scales, from cell-free receptor binding assays to chronic, multigenerational reproduction studies.
A number of validated in vitro assays involving the HPG axis are available. These assays assess the potential for substances to bind with the estrogen receptor (ER) or androgen receptor (AR) as well as to modulate steroidogenesis pathways resulting in agonistic or antagonistic effects (Table 1; ECHA/EFSA, 2018). In addition, ER and AR binding assays using medaka (Oryzias latipes) have been developed under the EXTEND2016 program (Iguchi et al., 2021). Quantitative structure-activity relationship (QSAR) models for ER and AR binding are also available. At this point, QSAR models for predicting ER binding in rats and humans have the most validation data supporting them (NAFTA, 2012). In addition, a systems biology model has been generated by the USEPA (Judson et al., 2015), based on a computational model integrating ToxCast assay endpoints for ER-based pathway activity. ToxCast (https://comptox.epa.gov/dashboard) is a high-throughput toxicity testing program developed by the USEPA that has produced in vitro and in silico data for many chemicals. An AR pathway model is also available (Kleinstreuer et al., 2017) based on the relevant ToxCast assays, and it has undergone extensive peer review. The USEPA has proposed the ER pathway model as an alternative to the ER in vitro assays and the uterotrophic assay (Judson et al., 2015). The potential for such computational models to replace ecotoxicological endocrine screens (e.g., the fish short-term reproduction assay [FSTRA]) remains to be demonstrated (https://www.epa.gov/endocrine-disruption/use-high-throughput-assays-and-computational-tools-endocrine-disruptor). The aforementioned in vitro assays are often used to prioritize substances for potential endocrine activity and are not, in isolation, deterministic of an endocrine-mediated adverse effect. This level of data provides valuable mechanistic information for performing an endocrine assessment.
Taken together, these assays can contribute to an understanding of the stepwise progression of the early key events for direct ER- and AR-mediated activity, as well as for steroidogenesis modulators.
Another class of assays developed more recently uses transgenic fish embryos that are altered to fluoresce when specific endocrine pathways are perturbed. These early developmental stage assays are used increasingly to obtain mechanistic endocrine insight. Several assays have been developed including the REACTIV assay (estrogen and/or steroidogenesis), the RADAR assay (androgen and/or steroidogenesis), and the EASZY assay (estrogen; Lagadic et al., 2019). These assays combine the advantages of using organisms (embryos before free feeding) with at least partially competent metabolic systems and molecular tools that provide useful mechanistic data. Because the embryo is exposed before free feeding, there are ethical advantages, and the EU (and the UK and Switzerland) does not apply the same restrictions as for other animal assays. The EASZY and RADAR assays were recently validated as the OECD TG 250 (2021) and OECD TG 251 (2022), respectively. The REACTIV assay is being validated by the OECD and should be available as an OECD TG in 2023.
In vivo assays using whole animals provide endpoints possibly related to activity in the HPG axis. These assays and their corresponding endpoints are shown in Table 1 and include effects on conception, pregnancy, estrous cycle, and secondary sex characteristics, as well as alterations in size and/or histopathology of male and female reproductive organs. The Hershberger, uterotrophic, pubertal male, and pubertal female assays and the FSTRA are designed to test specifically for activity within the HPG axis.
Some endpoints are more useful than others in determining which chemicals have endocrine activity in the HPG axis. Borgert et al. (2014) ranked the Tier-1 EDSP endpoints based on specificity, sensitivity, interpretability, and influence of confounding factors. Three ranks were given with the Rank-1 endpoints being the most useful, Rank-2 endpoints being useful but less informative than Rank-1 endpoints, and Rank-3 endpoints being relevant in combination with Rank-1 and -2 endpoints. As an example, Rank-1 endpoints for assessing chemicals that act as ER agonists were determined to be significantly increased VTG measurement in male fish in the FSTRA and increased uterus weight in the uterotrophic assay. Rank-2 endpoints for estrogen agonists include ER agonism in the ER transcriptional activation (ERTA) assay, reduced tubercle score in males as well as altered male gonadal histopathology and behavior in the FSTRA, and conversion to estrus in the uterotrophic assay. Rank-3 endpoints for estrogen agonists include ER competitive binding affinity in the ER binding assay (ERBA), the promotion of growth and estrous cyclicity in the pubertal female assay, and fecundity, female behavior, and plasma steroid levels in the FSTRA. Of course, endpoint ranking depends on the endocrine pathway under consideration and the hypothesis being tested. In contrast to the relevance ranking for ER agonism, Borgert et al. (2014) did not consider any of the Tier-1 EDSP test endpoints to be Rank 1 (the most useful information) for ER antagonism, and included ER competitive binding affinity in the ERBA with Rank-2 endpoints (which was considered a Rank-3 endpoint for ER agonism). Apical endpoints, such as growth and fecundity, were typically designated Rank 3 for both ER agonism and ER antagonism because they can be heavily influenced by nonendocrine related pathways.
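The ranking for ER agonism summarized above can be expressed as a simple lookup. The following Python sketch is illustrative only: the endpoint labels are shortened paraphrases of those named in the text, and the names ER_AGONIST_RANKS and best_supporting_rank are hypothetical and are not part of Borgert et al. (2014) or of any regulatory tool. The function reports only the most informative rank among the endpoints that responded; it is not a WoE evaluation.

```python
# Illustrative lookup of EDSP Tier-1 endpoint ranks for ER agonism, as
# summarized in the text (after Borgert et al., 2014). Keys are
# (assay, endpoint) labels paraphrased from the paragraph; the data
# structure and all names are hypothetical.
ER_AGONIST_RANKS = {
    ("FSTRA", "increased VTG in male fish"): 1,
    ("uterotrophic", "increased uterus weight"): 1,
    ("ERTA", "ER agonism"): 2,
    ("FSTRA", "reduced tubercle score in males"): 2,
    ("FSTRA", "altered male gonadal histopathology and behavior"): 2,
    ("uterotrophic", "conversion to estrus"): 2,
    ("ERBA", "ER competitive binding affinity"): 3,
    ("pubertal female", "promotion of growth and estrous cyclicity"): 3,
    ("FSTRA", "fecundity"): 3,
    ("FSTRA", "female behavior"): 3,
    ("FSTRA", "plasma steroid levels"): 3,
}


def best_supporting_rank(responding_endpoints):
    """Return the lowest-numbered (most informative) rank among the
    responding endpoints, or None if none of them is in the lookup."""
    ranks = [
        ER_AGONIST_RANKS[endpoint]
        for endpoint in responding_endpoints
        if endpoint in ER_AGONIST_RANKS
    ]
    return min(ranks) if ranks else None


# Example: only a Rank-3 and a Rank-2 endpoint responded.
observed = [("FSTRA", "fecundity"), ("ERTA", "ER agonism")]
print(best_supporting_rank(observed))
```

A separate table would be needed for each hypothesis (e.g., ER antagonism), because, as noted above, the same endpoint can carry a different rank under a different hypothesis.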
A number of test guidelines in Table 1 were not developed specifically to be used as endocrine screens; however, they can still provide information on effects that might be related to endocrine activity. These assays could be modified to include additional endpoints to yield useful information about the HPG axis, such as measurement of sex steroids and histopathology of sex organs. However, these measurements have not been validated, and the design of these studies may not be amenable to such additions. For example, a recent analysis examining the use of organ gross pathology from the standard avian reproduction assay (OECD 206/OCSPP 850.2300; OECD, 1984; USEPA, 2012) and its utility in endocrine disruption evaluation found that a high incidence of findings in control data confounded the interpretation of the test results (Temple et al., 2020). Similarly, the addition of hormone measurements to fish and amphibian test guidelines has been considered (Martin et al., 2020); however, practical and animal welfare considerations illustrated through power analyses indicate how problematic this may be. The above work underscores the need for robust evaluations of the practicality, reliability, and utility of incorporating additional endpoints into previously validated study designs before conclusions can be made regarding their usefulness.

Assays and endpoints for the HPT axis
The primary function of the HPT axis in vertebrates is controlling growth, development, and metabolism (Zoeller et al., 2007), which are regulated, in part, by the production and control of thyroid hormones. The available guideline tests and associated endpoints reported for ecological receptors that can be used to elucidate interactions with the HPT axis are shown in Table 1. As for the HPG, some of these tests and endpoints are those included in existing programs specifically to evaluate endocrine activity, whereas some may be routinely performed for registration of pesticides and can provide information related to endocrine activity and nonendocrine mediated effects.
High-throughput assays for the thyroid receptor and other thyroid mechanisms are included in ToxCast (https://comptox.epa.gov/dashboard) and other related literature (Buckalew et al., 2020). These assays include thyroid hormone receptor (THR) transactivation assays, yeast and mammalian two-hybrid assays, assays for DNA binding, cell proliferation assays (such as the T-screen), iodide uptake assays, thyroid peroxidase (TPO) inhibition assays, and thyroid hormone binding protein assays (OECD, 2012a). ToxCast/Tox21 screens using the THR transactivation assays have shown minimal hits; in other words, few chemicals are active, reflecting that effects on thyroid systems predominantly act via other molecular mechanisms (Paul Friedman et al., 2019). Therefore, assay development for other key events in thyroid toxicity is an active area of research. To that end, a TPO inhibition assay, a sodium-iodide symporter inhibition assay, and a deiodinase inhibition assay have been performed on large chemical libraries from the ToxCast program and results have been published (Hallinger et al., 2017; Hornung et al., 2018; Olker et al., 2019; Paul et al., 2014; Paul Friedman et al., 2016; Wang et al., 2018). Currently available regulatory assays in the EDSP and OECD programs do not include in vitro tests for the HPT axis.
Several in vivo tests are available that assess a chemical's ability to interact with the HPT axis. The amphibian metamorphosis assay (AMA) is sensitive to thyroid activity because amphibian metamorphosis is controlled by thyroid hormones. The Xenopus Eleutheroembryonic Thyroid Assay (XETA), recently validated as the OECD TG 248 (OECD, 2019b), uses stable transgenic X. laevis embryos containing a genetic construct that includes a gene promoter containing two Thyroid Responsive Elements coupled with a fluorescent reporter protein (green fluorescent protein [GFP]; OECD, 2019b). As stated above, the EU (and the UK and Switzerland) does not apply the same restrictions to this assay as to other animal assays because the embryo is exposed before free feeding. In addition, in vivo assays with fish models (summarized in Table 1) can provide information on interference with the HPT axis. In particular, in vivo tests with zebrafish are promising for assessing thyroid activity (Raldúa & Babin, 2009), and the OECD is currently developing a Detailed Review Paper to more generally assess the possibilities to include thyroid relevant endpoints in fish test guidelines. The mammalian assays summarized in Table 1 as well as additional mechanistic studies, such as the comparative thyroid assay in offspring and maternal animals, can provide additional information. Other in vivo assays should be able to detect effects on the HPT axis because some of the apical effects (e.g., growth and development) are under thyroid control. Borgert et al. (2014) also ranked the Tier-1 EDSP endpoints regarding their thyroid specificity. Rank-1 endpoints for assessing chemicals that act as thyroid agonists were determined to be asynchronous development and thyroid histopathology in the AMA. 
Rank-2 endpoints for thyroid agonists include thyroid weight in the pubertal male assay, thyroid-stimulating hormone (TSH) and thyroxine (T4) levels in the pubertal male and female assays, and advanced developmental stage and hind limb length in the AMA. Rank-3 endpoints for thyroid agonists include apical endpoints, such as growth and developmental ages in the pubertal male and female assays and AMA, as well as blood chemistry and pituitary weight in the pubertal male and female assays (Table 1).
As with the HPG axis, nonendocrine-specific guidelines in Table 1 provide information on apical effects that could result from endocrine activity. Although additional endpoints could be included to provide more endocrine-specific data, the same cautionary statements apply.

TEST METHODS ADDRESSING ADVERSE EFFECTS IN VERTEBRATE ECOLOGICAL RECEPTORS
Generally, protection goals for wildlife focus on populations. Toxicity endpoints for this purpose are typically based on studies that measure survival, growth, reproduction, and development. These are considered apical endpoints, because they integrate processes at the molecular, cellular, tissue, and organ level that may ultimately affect the whole organism in a way that may affect the population.
Test methods covering apical endpoints related to population-level effects include those focused on sensitive early life stages, longer term reproduction tests, repeated dose studies, and chronic toxicity in both mammalian and nonmammalian vertebrates. These types of tests have been required for many years by regulatory authorities around the world as a condition of pesticide approval or registration. Although the specific data requirements vary by the type of product, use pattern, regulatory authority, and other factors, there is a standard suite of tests (for example, those under the US Federal Insecticide, Fungicide and Rodenticide Act at 40 CFR Section 158) that are typically required to determine acute and chronic toxicity endpoints and to assess the ecological risks posed by pesticides. The standard suite of tests used by regulatory agencies to assess hazard includes studies of representative species of various taxa including aquatic and terrestrial primary producers and invertebrates, fish, birds, rodents, and other mammals. These requirements have been described by Day et al. (2018) for pesticide active ingredients in the EU and USA. Several test guidelines, although not specifically developed to detect endocrine-mediated adverse effects, measure endpoints that can add to the WoE for an endocrine assessment. For example, the avian reproduction test (OECD 206/OCSPP 850.2300; OECD, 1984; USEPA, 2012) is commonly required for all outdoor-use pesticides in the USA, Canada, Great Britain, and the EU. It includes the measurement of a number of reproductive, developmental, and growth parameters. Adverse effects seen in these tests may be the result of mechanisms other than endocrine activity (Temple et al., 2020); however, the test endpoints integrate various biological processes that affect reproduction, growth, and development; thus, the endpoints are sensitive to, but not diagnostic of, endocrine activity (ECHA/EFSA, 2018).
Similarly, results from mammalian testing may also indicate changes related to endocrine MoAs that are relevant to other vertebrate species, such as fish and amphibians (McArdle et al., 2020). These endpoints would include, but are not limited to, effects on the thyroid, pituitary, and reproductive organs. An example of this type of study for mammalian toxicity is the two-generation rat reproduction study (OCSPP 870.3800/OECD 416; OECD, 2001; USEPA, 1998b) in which toxicity is evaluated before mating and then through mating, gestation, lactation, and in offspring through two generations.
The assays in Table 1 use mammalian models (rat, mouse), fish models (medaka, fathead minnow, rainbow trout, and zebrafish), an amphibian model (African clawed frog), and bird models (Japanese quail, northern bobwhite quail, and mallard duck). Several of the chronic assays (e.g., larval amphibian growth and development assay, medaka and/or zebrafish extended one-generation reproduction test) are relatively new, and there is limited experience conducting them. As these studies begin to be performed for chemical safety evaluations, more information will be available to fully explore their adequacy and utility.

CHALLENGES LINKING ENDOCRINE ACTIVITY TO POPULATION-RELEVANT ADVERSE EFFECTS
The challenges of identifying endocrine disruptors in a regulatory context have been an important topic in recent discussions (Burden et al., 2022). A key challenge is being able to translate endocrine laboratory data in select species to predictions of adverse effects on wild populations, which is the protection goal. Certain endpoints are more useful in assessing adverse effects that may result from endocrine-active substances than others. For example, responses in in vitro studies do not indicate adverse effects in the whole organism because they do not consider in vivo processes involved in, for example, homeostasis or detoxification processes (absorption, distribution, metabolism, excretion). The USEPA (2018) states regarding ToxCast in vitro assays that the potential for a chemical to elicit adverse health outcomes in living systems is a function of multiple factors and that in vitro assays are not intended to provide predictive details regarding long-term or indirect adverse effects in complex biological systems. However, these assays can aid in the prioritization of chemical selection for more resource-intensive toxicity studies and can also elucidate early key events.
Tests such as those in Tier 1 of the EDSP and in Levels 1-3 of the OECD framework are designed to screen for the potential for endocrine activity as well as add to a mechanistic WoE. They are not designed or intended to be used for the evaluation of adverse effects or risk assessments in isolation. That is not to imply that the Tier-1 in vivo screens (also summarized in Level 3 of the OECD framework) do not provide any adverse-effect information (e.g., reproduction in the FSTRA, development in the AMA). Fish fecundity, for example, is a relevant adverse outcome in several AOPs (e.g., AOPs 23, 25, and 30 in https://aopwiki.org/aops). However, EDSP Tier-1/OECD Level-3 assays were not designed to specifically evaluate or quantify adverse effects and so should be interpreted cautiously. Nevertheless, adapted AMA and FSTRA test protocols can provide consolidated apical endpoints with increased relevance to addressing potential population-level effects. The extended AMA is an example of a modified test design that provides a population-relevant endpoint (i.e., time to metamorphosis) in amphibians by continuing the study until metamorphic climax rather than terminating it at a fixed time (i.e., at 21 days; Ortego et al., 2021).
In contrast to selecting an apical endpoint based on an observed adverse effect, endocrine activity in and of itself does not necessarily lead to an adverse effect in an intact organism. Not every effect, such as upregulation of enzymes or proteins, is adverse, because other compensatory mechanisms may mitigate the molecular responses (Lagadic et al., 2020). Endocrine systems have multiple compensatory mechanisms to maintain homeostasis, and substances that interact with the endocrine system may stimulate modulation in these feedback systems. If this modulation is temporary and/or within the homeostatic capacity of the endocrine system of the exposed organism, the effect of the substance on a certain endpoint might be considered "endocrine modulation" (EFSA, 2013). Alternatively, if the organism is unable to compensate for the induced changes within its limits of homeostasis, then the observed changes may be considered adverse, and there might be a need to evaluate the potential impact on the population. In this context, it is possible for populations to compensate for or recover from endocrine-disrupting effects, as discussed by Crane et al. (2019) in examples that were studied in fish.
Some adverse effects are more meaningful than others for populations, and therefore not every adverse effect necessarily triggers a population-level assessment. Furthermore, not all adverse effects measured in laboratory toxicity studies will necessarily result in an effect at the population level. To determine the magnitude of an effect stemming from an EATS-mediated apical endpoint on a population, it is important to perform a critical evaluation of the extent to which endpoints such as development, growth, or reproductive effects constitute adverse effects at the individual level, which can result in effects at the population level. This evaluation includes the degree to which the loss of age classes as a result of affected growth has an impact on the population; the extent to which adaptation and recovery affect population-level impacts; and the quantitative relationship among initiating events, key events, and adverse population-level effects (Conolly et al., 2017). Indeed, there is much research interest in applying such relationships to predictive ecotoxicology, though we are still some way from having fully operational quantitative AOPs for regulatory purposes (Perkins et al., 2019).
Population effect modeling is a promising approach to extrapolating adverse effects from the individual organismal level to the population level (Crane et al., 2019), and these models can provide additional lines of evidence in the assessments of adverse effects. Various modeling approaches have proven to be useful for predicting population-level effects of endocrine-active chemicals. For example, a population modeling approach was used to demonstrate how changes in fecundity of fathead minnow (Pimephales promelas) exposed to 17β-trenbolone in a short-term laboratory toxicity test translate into alterations in population growth rate (Miller & Ankley, 2004). This approach has the advantage of being directly applicable to the data from the FSTRA. In another example, Hazlerigg et al. (2014) evaluated the population relevance of changes in sex ratio caused by androgenic (dihydrotestosterone) and estrogenic (4-tert-octylphenol) substances in zebrafish (Danio rerio). As these types of models develop in scope and accuracy over time, they will play a more prominent role in the evaluation of population-level adverse effects acting via endocrine pathways.
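To make the link between an individual-level endpoint and a population-level metric concrete, the following Python sketch uses a generic age-structured (Leslie matrix) projection. It is not the model used in the studies cited above, and all fecundity and survival values are hypothetical numbers chosen only for illustration. The sketch shows how a proportional reduction in fecundity, such as might be measured in a reproduction assay, changes the asymptotic population growth rate (lambda).

```python
def population_growth_rate(fecundity, survival, iterations=500):
    """Asymptotic growth rate (lambda) of an age-structured population,
    estimated by repeatedly projecting a Leslie matrix (power iteration).

    fecundity: offspring per individual per time step, one value per age class
    survival:  probability of surviving from age class i to i + 1
               (one value fewer than the number of age classes)
    """
    pop = [1.0] * len(fecundity)  # arbitrary starting age distribution
    growth = 1.0
    for _ in range(iterations):
        newborns = sum(f * n for f, n in zip(fecundity, pop))
        projected = [newborns] + [s * n for s, n in zip(survival, pop[:-1])]
        growth = sum(projected) / sum(pop)
        total = sum(projected)
        pop = [n / total for n in projected]  # renormalize to avoid overflow
    return growth


# Hypothetical life table: three age classes, reproduction in classes 2 and 3.
baseline_fecundity = [0.0, 20.0, 30.0]
survival = [0.05, 0.4]

# Hypothetical 50% reduction in fecundity in exposed animals.
exposed_fecundity = [0.5 * f for f in baseline_fecundity]

lambda_control = population_growth_rate(baseline_fecundity, survival)
lambda_exposed = population_growth_rate(exposed_fecundity, survival)
print(f"control lambda = {lambda_control:.3f}, exposed lambda = {lambda_exposed:.3f}")
```

With these assumed values the control population grows (lambda above 1) and the exposed population declines (lambda below 1). A sketch of this kind ignores density dependence, recovery, and compensation, which the text identifies as important for judging population relevance.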

WEIGHT-OF-EVIDENCE APPROACHES TO INTEGRATE LINES OF EVIDENCE
Responses of isolated parameters measured in test guideline studies (see Table 1) or other effect studies are not sufficient by themselves to determine either endocrine activity or endocrine-mediated adversity. Multiple lines of evidence will need to be integrated to draw conclusions on whether a substance exhibits endocrine activity and whether this activity may result in adverse population-level effects. In some jurisdictions, this will be combined with exposure for ecological risk assessments. Weight-of-evidence approaches have been demonstrated to have great value in evaluating the body of evidence for endocrine modulation (Borgert et al., 2014; Hutchinson et al., 2013; Juberg et al., 2013; Marty et al., 2015; Mihaich, Capdevielle, et al., 2017; de Peyster & Mihaich, 2014); therefore, this article recommends using WoE approaches to establish causal links between endocrine activity and population-level adversity, and thereby to identify chemicals as endocrine disruptors for wildlife. A review of WoE as applied to endocrine-modulating chemicals was published by Gross et al. (2017); thus, we will not review WoE approaches in detail here.
In general, WoE approaches are used to assemble, evaluate, and integrate the results from multiple pieces of information or evidence to reach an overall conclusion on the hypothesis that is being addressed, which may include topics related to MoA, such as exposure, hazard, and risk. Each piece of evidence (e.g., an endpoint in a validated assay) is evaluated for key properties, such as relevance, reliability, and strength of effect. Relevance, in this context, means the test provides data on endpoints that are important to determine whether a chemical can cause endocrine modulation and/or activity or whether a chemical can cause endocrine-related adverse effects (that can affect populations). Reliability is an inherent property that makes evidence convincing (e.g., study design, potential for confounding influences, standardization of the method). Strength of effect means the tests are able to provide data or evidence that the chemical can induce a response that is significantly different from baseline or reference and/or control conditions, or of a magnitude that is significant to the assessment being performed. Ecological significance is an important aspect when considering the strength of an effect. For example, species with regular boom-and-bust cycles in their life history can quickly recover from injury based on high reproductive capacities (i.e., r-strategists). Thus, these r-strategist species can be more resilient to transient injury from chemical exposure than long-lived species with relatively low reproductive capacities (i.e., K-strategists; Raimondo et al., 2006).
A WoE evaluation can assess information either quantitatively (e.g., a numerical score) or qualitatively (e.g., low, medium, or high) which results in assigning a weight (or level of confidence) to each piece of evidence. The collective evidence is then weighted quantitatively (typically by adding up scores) or qualitatively (e.g., using a matrix to visualize the dataset and weightings) to reach a conclusion. Regulatory authorities and international organizations have developed formal guidelines on applying WoE to ecological assessments (Hardy et al., 2017;OECD, 2019a;USEPA, 2016) to make WoE applications more transparent and systematic, including a scoring system based on key properties. Weight-of-evidence approaches may be used in different contexts, for example, to evaluate the body of evidence for a given chemical's ability to modulate endocrine systems or to prioritize chemicals for endocrine testing. The USEPA used a WoE approach to evaluate the results from ToxCast screening tests and Tier-1 tests in the EDSP (USEPA, 2019) as well as Other Scientifically Relevant Information (OSRI) for 52 chemicals (pesticides and inert ingredients). This WoE aimed to determine which EDSP Tier-2 tests would be recommended. Similar to the WoE approach used by the USEPA for the Tier-1 EDSP, the rankings of Borgert et al. (2014) pertain to relevance for identifying potential endocrine activity in endocrine testing programs and do not specifically address potential effects on populations or in ecological risk assessment.
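A quantitative weighting of this kind can be sketched as follows. This is a hypothetical scheme for illustration only and is not the scoring system of USEPA (2016), Hardy et al. (2017), OECD (2019a), or Borgert et al. (2014). The 1 to 3 scales, the choice to multiply relevance by reliability, the low/medium/high cut-offs, and the example scores are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    endpoint: str
    relevance: int    # hypothetical scale: 1 (low) to 3 (high)
    reliability: int  # hypothetical scale: 1 (low) to 3 (high)
    strength: int     # hypothetical scale: 0 (no effect) to 3 (strong effect)

    @property
    def weight(self) -> int:
        # Assumed weighting: relevance and reliability multiply.
        return self.relevance * self.reliability


def weighted_strength(lines):
    """Weight-averaged strength of effect across all lines of evidence."""
    total_weight = sum(e.weight for e in lines)
    if total_weight == 0:
        raise ValueError("no weighted evidence to evaluate")
    return sum(e.weight * e.strength for e in lines) / total_weight


def qualitative(score):
    """Map a 0-3 score onto low/medium/high (cut-offs are assumptions)."""
    if score < 1.0:
        return "low"
    if score < 2.0:
        return "medium"
    return "high"


# Illustrative scores only; they are not taken from any study.
evidence = [
    Evidence("plasma VTG, male fish", relevance=3, reliability=3, strength=0),
    Evidence("plasma VTG, female fish", relevance=2, reliability=3, strength=2),
    Evidence("fecundity", relevance=1, reliability=3, strength=1),
]
score = weighted_strength(evidence)
print(f"{score:.2f} -> {qualitative(score)}")
```

Averaging by weight rather than summing raw scores keeps the result on the same scale as the strength score, so that adding further low-weight endpoints does not inflate the conclusion. A summed score, as mentioned above, would be an equally valid design choice.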
Weight-of-evidence approaches may be used to assess if a substance is active in the endocrine system and to evaluate whether this endocrine activity is connected to adverse effects that are relevant to ecological risk assessment. In this case, at least two null hypotheses are addressed in the approach: (1) H0: the chemical does not exhibit activity in the endocrine system and (2) H0: the chemical does not cause adverse health effects in intact organisms and/or populations via this endocrine activity. Because there are multiple hypothesis statements in this approach, it is advisable to conduct separate WoE evaluations for each hypothesis so that the available evidence can be assessed appropriately, especially regarding relevance and the strength of the effect (Figure 1). Separate WoE approaches are appropriate because the assessment of relevance depends on the specific hypothesis being addressed, and the strength of the effects of endocrine activity can be distinct from adverse effects in organisms.
Collective properties for the entire body of evidence, such as consistency and consilience, are also important to evaluate endocrine-modulating chemicals. For instance, multiple test results are often available through endocrine testing programs, and if one test endpoint found a chemical to be estrogenic but three other test endpoints did not find estrogenic activity, then the lack of consistency would suggest a low likelihood of that chemical possessing estrogenic activity. Consilience (i.e., evidence demonstrated to be consistent with scientific knowledge and theory, particularly with respect to underlying mechanisms; USEPA, 2016) is a key property for establishing the connection between endocrine activity and its potential link to adversity. In short, the mechanistic signs of endocrine activity (e.g., inhibition of aromatase activity) should be compatible with downstream key events in that endocrine AOP (e.g., decreased VTG, altered oocyte development), including the adverse effect on the individuals (e.g., decreased fecundity) that may translate into population-level adverse effects (e.g., declining population density). However, translation of effects observed in individual organisms to impacts on populations is made "by extension" (Ankley et al., 2010). It is therefore misleading to conclude from an AOP based on individual-level data alone that the population trajectory will always decline; this ignores the level of change required and compensatory mechanisms that are known to occur in populations (Lagadic et al., 2020). The directionality of endpoint responses is important to consider in the WoE with regard to consilience with the known or proposed endocrine pathway activity.
In addition, directionality is also important for determining the endocrine specificity of the response (e.g., significantly increased plasma VTG in male fish is considered a more specific indicator of potential endocrine activity in the FSTRA, whereas decreased VTG in female fish in the same assay may indicate endocrine activity or be the response to a more generalized toxicity). Figure 1 illustrates the concept of performing WoE evaluations for these different aspects of potential endocrine disruptors and expresses the results graphically for relevance and reliability as well as measured effects. When integrating the full body of evidence, the information from both the standard suite of (eco)toxicology studies and "endocrine-specific studies" is combined with knowledge of the AOP for specific endocrine pathways, where available. Several outcomes for the WoE evaluations are illustrated in Figure 2. For example, if there is no evidence of adverse effects in organisms but there is strong evidence of endocrine activity, then the information is useful for informing on endocrine activity but suggests that there are no population-relevant adverse effects. Alternatively, if there is strong evidence supporting both endocrine activity and adverse effects in organisms consilient with the observed endocrine activity, then this combination of data is highly useful for investigating population-relevant effects in an ecological risk assessment of endocrine-active substances.
FIGURE 1 Graphical representation of weight-of-evidence (WoE) approaches to assessing chemicals for potential endocrine disruption. (A) WoE graphic that indicates the potential for endocrine activity. Relevance and reliability are scored on the y-axis, and the strength of the effect is scored on the x-axis. (B) WoE graphic that indicates the potential for adverse effects from endocrine activity. Relevance and reliability are scored on the y-axis, and the strength of the adverse effect is scored on the x-axis
Gross et al. (2017) distill some key aspects of WoE approaches to evaluating evidence that supports or refutes a given endocrine MoA; these bear mentioning here. Nonrelevant and unreliable studies should be excluded from the process during the review stage; relevant and reliable data are necessary for such an evaluation. In addition, systemic toxicity or other potentially confounding factors should be carefully reviewed and discussed in the evaluation because other MoAs may be contributing to the responses. Because MoA is an important component of assessments of endocrine-modulating chemicals, WoE approaches should employ some Bradford Hill causality criteria (Becker et al., 2015), such as concordance of dose-response relationships between key events and adverse outcomes, temporal association, consistency and specificity of effects, and biological plausibility. We recommend that a scoring scheme be developed so that relevance, reliability, and the strength of the effects can be systematically applied to the test endpoints in Table 1. The USEPA's WoE guidelines (USEPA, 2016) provide information on scoring schemes, as do Borgert et al. (2014).
In summary, all lines of evidence should be considered, for example, in silico predictions, high-throughput data, information from AOPs, guideline study data, findings from academic research papers, knowledge of conservation of pathways between species, and understanding of the species' population biology and ecology. After careful weighing of each piece of evidence, either a quantitative or qualitative approach should be used to develop conclusions for each hypothesis statement in a transparent manner. By bringing together both WoE assessments, one addressing the potential for endocrine activity and the other addressing adverse effects via an endocrine modality, a final assessment of the plausibility of endocrine disruption can be reached. Adverse Outcome Pathway frameworks and the concept of consilient responses are particularly helpful to determine whether endocrine-mediated activity is linked to population-relevant adverse effects.
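The final integration step can be sketched as a simple decision rule that combines the conclusions of the two separate WoE evaluations into the four outcomes described for Figure 2. The function and class names are hypothetical, and the requirement that adverse effects be consilient with the observed endocrine activity is encoded here as a single yes/no flag, which is a strong simplification of the judgment involved.

```python
from enum import Enum


class Outcome(Enum):
    NEITHER = "not indicative of endocrine activity nor adverse effects"
    ADVERSITY_ONLY = "indicative of adverse effects, but not endocrine activity"
    ACTIVITY_ONLY = "indicative of endocrine activity, but not adverse effects"
    BOTH = "indicative of endocrine activity and adverse effects"


def integrate(activity_supported: bool, adversity_supported: bool) -> Outcome:
    """Combine the conclusions of the two separate WoE evaluations."""
    if activity_supported and adversity_supported:
        return Outcome.BOTH
    if activity_supported:
        return Outcome.ACTIVITY_ONLY
    if adversity_supported:
        return Outcome.ADVERSITY_ONLY
    return Outcome.NEITHER


def disruption_plausible(outcome: Outcome, consilient: bool) -> bool:
    """Assumed rule: endocrine disruption is treated as plausible only when
    both evaluations are positive and the adverse effects are consilient
    with the observed endocrine activity (the causal-link element)."""
    return outcome is Outcome.BOTH and consilient


result = integrate(activity_supported=True, adversity_supported=False)
print(f"Data {result.value}; plausible: {disruption_plausible(result, False)}")
```

Keeping the two evaluations as separate inputs mirrors the recommendation to assess each hypothesis on its own before integrating the conclusions.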

ROBUSTNESS OF CURRENT TESTING METHODS FOR ENDOCRINE DISRUPTION ASSESSMENT
As evident from the number of endpoints that address the various pathways shown in Table 1, the tests currently in use for vertebrate wildlife (fish, birds, amphibians, and mammals) are suitable for detecting and characterizing adverse effects, in terms of both dose-response and endpoint, for ecological risk assessment and regulatory decision-making. Most of these assays are validated via international, interlaboratory testing programs, and thus the study designs are considered reliable and relevant to their intended purpose (either risk assessment, endocrine mode of action elucidation, or both). The studies use standard and documented procedures, incorporate appropriate controls, include analytical verification of dose levels and of the stability of the test material in the test matrix, and are generally conducted in compliance with Good Laboratory Practice. Good Laboratory Practice is a quality control system used in testing laboratories that mandates the organizational processes and conditions for performance, monitoring, and recording of studies such that it is clear how the data were generated. It does not necessarily indicate scientific merit (established by the validation process for development of test guidelines) but offers transparency of data such that they can be sufficiently assured and interrogated for regulatory purposes.
The assays in Table 1 provide coverage across a range of species and biological levels of known EATS mechanisms. These studies are conducted with specific species and strains of vertebrates to provide consistent information that can be compared across different chemicals. The number of studies and the redundancy of the endpoints ensure that coverage is robust, as well as specific, especially for endocrine-mediated effects within the HPG and HPT axes. A number of existing standard test guidelines in Tier 1 of the USEPA EDSP and in Levels 1-3 of the OECD framework are available to assess the potential interaction of a chemical with the endocrine system of nontarget vertebrate ecological receptors. Standard test guidelines in Tier 2 of the EDSP and in Levels 4-5 of the OECD framework also exist to address adverse effects. The results of these studies can be evaluated for their potential to affect population-level endpoints.
FIGURE 2 Integrating the findings of the weight-of-evidence evaluations for endocrine-active substances can result in several outcomes. Data not indicative of endocrine activity nor adverse effects. Data indicative of adverse effects, but not endocrine activity. Data indicative of endocrine activity, but not adverse effects. Data indicative of endocrine activity and adverse effects
Ranking endpoint responses for potential endocrine activity (e.g., as demonstrated by Borgert et al. [2014] for EATS endpoints) provides guidance on which endpoints relate more strongly to endocrine-mediated effects than others that support potential endocrine activity. No single effect in a single study is used to assess mode of action or hazard; rather, the entire database is assessed in a WoE approach. All available information from "standard" and endocrine-specific guideline studies of vertebrate species, together with in silico and in vitro studies, can be tied together through our growing knowledge framed by the AOP concept. Evaluating all relevant studies provides a fingerprint useful for AOPs, including endocrine-mediated effects.

CONCLUSION
Current testing paradigms for pesticides include numerous endpoints addressing endocrine activity and adversity. When evaluated in WoE frameworks, all reliable and relevant evidence from these endpoints for endocrine activity and for resulting adverse effects that are relevant to wildlife populations is integrated to support or refute the potential for endocrine disruption. The redundancy among these endpoints makes it unlikely that population-relevant adverse effects mediated by endocrine activity are being overlooked. This agrees with Matthiessen et al. (2018), who concluded that the current testing and risk assessment schemes for pesticides are functioning in the sense that pesticides approved under current procedures have not been observed to cause population-relevant endocrine-mediated effects in wildlife, which have been observed for legacy chemicals.