Within‐laboratory reproducibility of Ames test results: Are repeat tests necessary?

The Ames test is required by regulatory agencies worldwide for assessing the mutagenic and carcinogenic potential of chemical compounds. This test uses several strains of bacteria to evaluate mutation induction: positive results in the assay are predictive of rodent carcinogenicity. As an initial step to understanding how well the assay may detect mutagens present as constituents of complex mixtures such as botanical extracts, a cross‐sector working group examined the within‐laboratory reproducibility of the Ames test using the extensive, publicly available National Toxicology Program (NTP) Ames test database comprising more than 3000 distinct test articles, most of which are individual chemicals. This study focused primarily on NTP tests conducted using the standard Organization for Economic Co‐operation and Development Test Guideline 471 preincubation test protocol with 10% rat liver S9 for metabolic activation, although 30% rat S9 and 10% and 30% hamster liver S9 were also evaluated. The reproducibility of initial negative responses in all strains with and without 10% S9 was quite high, ranging from 95% to 99% with few exceptions. The within‐laboratory reproducibility of initial positive responses for strains TA98 and TA100 with and without 10% rat liver S9 was ≥90%. Similar results were seen with hamster S9. As expected, the reproducibility of initial equivocal responses was lower, <50%. These results will provide context for determining the optimal design of recommended test protocols for use in screening both individual chemicals and complex mixtures, including botanicals.

| INTRODUCTION
The Ames test was developed in the early 1970s as a relatively simple and sensitive method for identifying mutagenic chemicals (Ames et al., 1975) by measuring gene mutation induction in Salmonella typhimurium and Escherichia coli bacterial strains. Although the test measures mutations in bacterial genes, a positive (mutagenic) response has ≥70% sensitivity for rodent carcinogenicity depending on the chemical classes tested (Zeiger, 1998). The test is required by regulatory authorities worldwide as an initial screen for chemical mutagens and carcinogens. A positive response in the test is considered strong evidence that the substance in question is a presumptive human mutagen and/or carcinogen. Protocol recommendations for the Ames test are found in the Organization for Economic Co-operation and Development (OECD) Test Guideline 471 (OECD, 2020). Detailed information on the molecular basis of the test and a description of its various procedures can be found in Mortelmans and Zeiger (2000) and OECD (2020).
A primary requirement for a valid test is that the results are reliable (i.e., reproducible within and across laboratories). OECD Test Guideline 471 does not require a repeat test for a clear positive result, while negative results should be considered for confirmation on a case-by-case basis. When results are clearly negative, justification for not repeating should be provided. Equivocal results should be repeated, preferably using modified experimental conditions. For this investigation of reproducibility, the freely available Ames test database generated by the U.S. National Toxicology Program (NTP) was used to examine the within-laboratory reproducibility of the various trials with and without metabolic activation (S9). Most tests were conducted on individual chemicals, rather than on undefined mixtures such as botanical extracts. The NTP Ames test database comprises more than 3000 distinct substances from a wide variety of chemical classes generated over a span of several decades.
To address the question of reproducibility, a cross-sector group of experts from the Health and Environmental Sciences Institute's (HESI) Botanical Safety Consortium (BSC) and Genetic Toxicology Technical Committee (GTTC) used this large NTP data set to evaluate how often initial Ames test results repeat in actual practice. Although a database of this size and complexity can be useful for a number of different analyses of the method and its reliability, the sole purpose of this study was to assess the reproducibility of the initial positive or negative test results. The results of this evaluation will provide context to aid in deciding whether repeat testing should be an integral part of a recommended testing protocol, both for single chemicals and for complex mixtures such as botanicals.

| BACKGROUND ON THE DATA SOURCE
In 1975, the National Institute of Environmental Health Sciences (NIEHS) was tasked by the US Congress with developing a testing program to identify mutagenic chemicals because of their potential to also be carcinogenic (Zeiger, 2019). This program began testing chemicals for bacterial mutagenicity in 1979, using the Ames test, among others. In 1980, this new mutagenicity testing program was incorporated into the newly formed NTP. Ames test data were generated in multiple NTP contract laboratories over the years using the same basic preincubation test protocol. The tests evaluated for this project on reproducibility were performed between 1979 and 2020. Although some modified test protocols were used by the NTP (e.g., different tester strains; varying levels and sources of S9), those consistent with OECD Test Guideline 471 (OECD, 2020) conditions (i.e., ±10% induced male Sprague-Dawley rat liver S9, S. typhimurium strains TA97, TA98, TA100, TA1535, TA1537, TA104, and E. coli WP2 uvrA pKM101) were of primary interest for this study. Data from tests employing other S9 sources (e.g., hamster and mouse) and percentages (e.g., 5% and 30%) are included in the NTP database. Although not standard, data from tests employing 30% rat S9 and 10% and 30% hamster S9 were also considered in these analyses to determine if their reproducibility was consistent with, and supportive of, the OECD test conditions. All tests were conducted using a preincubation protocol, and repeat tests used the same vehicles (e.g., water; DMSO) as the original tests. The complete NTP Ames test data in the NTP comprehensive Chemical Effects in Biological Systems (CEBS) database is publicly accessible at https://cebs.niehs.nih.gov/cebs/.
All chemicals were tested under code at NTP contract laboratories and all calls (i.e., positive, negative, or equivocal) were made on the coded chemicals by scientific personnel at the NTP using expert judgment rather than a formulaic 2- (or 3-) fold rule or p value statistic (Haworth et al., 1983; Zeiger et al., 1992). This approach avoided situations where the difference between a judgment of positive or negative for a single trial was based on a strict cutoff value (i.e., ≥2-fold increase or p ≤ .05) (Zeiger, 2023). This approach, therefore, may actually have enhanced reproducibility between repeat trials. Each trial was evaluated independently and overall calls for the test chemical were based on the following criteria (Haworth et al., 1983). For a test substance to be judged as positive, a clear, reproducible, dose-related increase in mutant colonies over a range of five doses in at least one strain and activation condition was required. Negative tests were those in which no increase in mutant colonies was observed in either of the two trials in all strain/activation conditions. Equivocal responses were characterized by small increases in mutant colonies insufficient to support a determination of mutagenicity and/or an increase in the absence of a dose-response.
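The sensitivity of a strict cutoff can be illustrated with a minimal sketch of a 2-fold rule. This is not the NTP approach, which relied on expert judgment; the function name `fold_rule_call` and the colony counts are hypothetical and chosen only to show how one additional colony can change a formulaic call.

```python
def fold_rule_call(control_mean: float, treated_mean: float, fold: float = 2.0) -> str:
    """Formulaic call for one dose: 'positive' when treated >= fold x solvent control.

    Illustration only; the NTP calls discussed in the text were made by expert judgment.
    """
    if control_mean <= 0:
        raise ValueError("control_mean must be positive")
    return "positive" if treated_mean >= fold * control_mean else "negative"


# Hypothetical mean revertant counts per plate (not taken from the NTP database).
print(fold_rule_call(20.0, 39.0))  # just below the 2-fold cutoff
print(fold_rule_call(20.0, 40.0))  # exactly at the 2-fold cutoff
```

Under such a rule, two trials with nearly identical counts can receive opposite calls, which is the situation the expert-judgment approach was intended to avoid.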

| Data source and analysis
The data and mutagenicity decision from each test were extracted from the CEBS database and compiled, organized, and displayed in spreadsheets (Supplemental File 1). For each test, the unique test article name and identifier, study year, bacterial strain used, solvent/vehicle used, activation condition, and individual trial number were included; individual trial results and overall test results were also included. For this study, only within-laboratory strain- and %S9-specific data, and the identity of the solvent/vehicle were used to address the question of the reproducibility of an initial positive, negative, or equivocal response.
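One way such strain- and %S9-specific reproducibility could be tabulated is sketched below. The record layout (keys `chemical`, `strain`, `s9`, `trial`, `call`), the pairing of trial 1 with trial 2, and the example records are all assumptions made for illustration; they do not reproduce the structure of Supplemental File 1 or the exact calculation used in this study.

```python
from collections import defaultdict


def reproducibility_by_condition(trials):
    """Return {(strain, s9, initial_call): (n_pairs, n_same, percent_same)}.

    `trials` is an iterable of dicts with hypothetical keys:
    chemical, strain, s9, trial (1 = initial, 2 = repeat), call.
    """
    # Group calls by chemical/strain/activation so each initial trial
    # can be paired with its repeat under the same condition.
    grouped = defaultdict(dict)
    for t in trials:
        key = (t["chemical"], t["strain"], t["s9"])
        grouped[key][t["trial"]] = t["call"]

    counts = defaultdict(lambda: [0, 0])  # [pairs, pairs with matching calls]
    for (_, strain, s9), calls in grouped.items():
        if 1 in calls and 2 in calls:  # skip conditions that were never repeated
            bucket = counts[(strain, s9, calls[1])]
            bucket[0] += 1
            bucket[1] += int(calls[1] == calls[2])

    return {k: (n, same, 100.0 * same / n) for k, (n, same) in counts.items()}


# Made-up records for illustration only.
example = [
    {"chemical": "A", "strain": "TA98", "s9": "10% rat", "trial": 1, "call": "positive"},
    {"chemical": "A", "strain": "TA98", "s9": "10% rat", "trial": 2, "call": "positive"},
    {"chemical": "B", "strain": "TA98", "s9": "10% rat", "trial": 1, "call": "positive"},
    {"chemical": "B", "strain": "TA98", "s9": "10% rat", "trial": 2, "call": "negative"},
    {"chemical": "C", "strain": "TA98", "s9": "10% rat", "trial": 1, "call": "negative"},
]
result = reproducibility_by_condition(example)
print(result)
```

Grouping by the initial call keeps the reproducibility of positive, negative, and equivocal responses separate, and conditions without a repeat trial simply do not contribute to any percentage.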
The majority of the test data examined in this study were from tests using only TA98 and TA100. This apparent limitation is an artifact of the test protocols used by the NTP, beginning in the mid-1980s, that called for initial evaluations of a test substance in these two strains (Zeiger et al., 1985). If the result in either strain was positive and reproducible, there was no requirement to use additional bacterial strains since the test substance had been determined to be mutagenic. If the initial trials in TA98 and TA100 were negative or equivocal, additional bacterial strains (e.g., TA1535, TA1537, and/or TA97) were tested (Zeiger et al., 1985). If all strains were negative in the initial trials, all strains were repeated. If the initial negative trial was performed with 10% S9, the repeat trial used 30% S9; if the initial negative was with 30% S9, the repeat used 10%. Positive results in any strain required a repeat test at least 1 week after the initial trial using the same test protocol, although the test doses may have been adjusted to focus on a specific region of the response. All equivocal responses required a repeat test unless any of the other strain-S9 combinations were judged positive (e.g., Haworth et al., 1983; Zeiger et al., 1992).
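The repeat-testing rules described above can be summarized as a small decision function. This is a simplified sketch, not NTP software: the function name `repeat_plan` and the returned dictionary keys are hypothetical, the negative branch assumes the caller has already established that all strains were negative, and details the text does not specify (for example, the S9 level used when repeating an equivocal trial) are kept unchanged as an assumption.

```python
from typing import Optional


def repeat_plan(call: str, s9_percent: Optional[int],
                other_condition_positive: bool = False) -> Optional[dict]:
    """Describe the repeat trial for one strain/S9 condition, or None if no repeat.

    s9_percent is None for trials without metabolic activation.
    """
    if call == "positive":
        # Same protocol, at least 1 week later; doses may be adjusted.
        return {"s9_percent": s9_percent, "min_delay_days": 7, "doses_may_change": True}
    if call == "negative":
        # 10% S9 is swapped for 30% and vice versa; other levels are left as they were.
        swapped = {10: 30, 30: 10}.get(s9_percent, s9_percent)
        return {"s9_percent": swapped, "min_delay_days": None, "doses_may_change": False}
    if call == "equivocal":
        if other_condition_positive:
            return None  # another strain/S9 combination already judged positive
        # Assumption: the S9 level is unchanged for an equivocal repeat.
        return {"s9_percent": s9_percent, "min_delay_days": None, "doses_may_change": False}
    raise ValueError(f"unknown call: {call!r}")


print(repeat_plan("negative", 10))
print(repeat_plan("equivocal", 10, other_condition_positive=True))
```

Because negative repeats switched S9 levels under this protocol, a repeat of an initial negative is not a strict replicate of the original activation condition, which is worth keeping in mind when pairing trials.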
Beginning in 2001, based on overviews of the results in several strains of bacteria, the NTP testing program opted to streamline testing by using only S. typhimurium strains TA100 and TA98, and E. coli strain WP2 uvrA pKM101, as these three strains detected the great majority of mutagens (Williams et al., 2019). Under this new approach to testing, all trials were repeated regardless of the initial response, and only 10% rat liver S9 was used to provide exogenous metabolic activation.
Repeat trials in the post-2001 protocol generally used the same testing conditions as the initial trial, although sometimes the doses tested may have been adjusted to focus on a specific portion of the dose-response curve. As before, the data were evaluated using expert judgment rather than a strict fold rule or statistic.

| RESULTS AND DISCUSSION
There are a number of strain/activation combinations in Tables 1 and 2 with very limited data, for example, tests with strain TA104 and the majority of equivocal trials. From a statistical and biological point of view, such low numbers cannot form the basis of a conclusion on the effectiveness of an assay, but the data are included for completeness.
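Why small counts are uninformative can be seen from the width of a confidence interval around an observed proportion. The sketch below uses the Wilson score interval, which is a standard method but was not part of the analysis in this study; the counts are hypothetical and do not come from Tables 1 or 2.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts with the same observed reproducibility (80%).
small = wilson_interval(4, 5)
large = wilson_interval(400, 500)
print("n = 5:   %.2f to %.2f" % small)
print("n = 500: %.2f to %.2f" % large)
```

With only a handful of repeated trials, the interval spans much of the possible range, so an observed percentage for such a category says little about the underlying reproducibility.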
The reproducibility of initial negative responses in all strains, with and without 10% rat S9, was quite high, ranging from 92% to 99% (Table 1). The within-laboratory reproducibility of initial positive responses for strains TA98 and TA100 with and without 10% and 30% rat liver S9 was >92% (Table 1). Reproducibility of positive responses was lower for strains TA1535, TA1537, and TA97 in the absence of S9 (83.9%-88.7%), but with S9, it was still at least 90% (Table 1). Reproducibility of initial positive responses in the E. coli strain with 10% S9 was 84.6%.
The reproducibility of the hamster S9 tests tended to be slightly lower than that of the tests conducted with rat S9 (Table 2). As seen with rat S9, the reproducibility of initial negative responses with and without 10% hamster S9 in all strains was ≥95% (Table 2). Reproducibility of positive responses with and without S9 was ≥95% in strains TA98 and TA100; the lowest reproducibility, 82.5%, was seen in strains TA97 and TA1537 with 10% hamster S9.
As anticipated, the reproducibility of equivocal responses, which indicate possible weak activity of the test chemical, was much lower in all strains and activation conditions (Tables 1 and 2). The number of samples for which data were available is too low in most cases to be meaningful. This observation underscores the difficulty in using set cutoffs for response characterization because a few colonies more or less can be the difference between a response judged "equivocal" and one judged "positive" or "negative." All initial equivocal tests required repeating.
One limitation of this study is that the potencies of the initial positive and equivocal responses were not factored into the calculations of reproducibility. It can be presumed that the more potent the initial positive response, the more likely it is to repeat. The same can be presumed for negative responses that have slopes of 0. This question of potency is important for chemical testing of complex mixtures such as botanicals. Mixtures can have numerous, often unknown constituents, each representing some fraction of the whole mixture that may contain a mutagenic substance. Thus, relatively weak positive responses may tend to be produced at low concentrations of the mutagens in the mixture and therefore, may be less likely to reproduce than if the same (single) chemicals of high purities were being tested, as was done in the majority of the NTP tests. Another consideration that would be expected to affect the reproducibility of weak responses is the challenge of obtaining a good solution or stable suspension. Some botanicals, for example, are difficult to get into solution, which could lead to inconsistencies from day to day or batch to batch in the relative proportions of the various constituents. Another factor that could affect reproducibility is the experience and competence of the testing laboratory.
Data evaluation using expert judgment may have increased the reproducibility value because the test conclusion is not dependent on where the mutant colony count falls with respect to the strict, but not biologically relevant, fold-increase, p value, or historical control range methods of response characterization. Expert judgment also considers the inter-plate range of responses at each dose (Zeiger, 2023) which would lead to a less conservative evaluation than a fold rule. The data presented here may be useful in any future attempts to revise the OECD Test Guideline 471, when considered along with the analyses published by Williams et al. (2019) and Gatehouse et al. (1994).
The high level of reproducibility of the initial negative and positive responses in our study, coupled with the results of previous analyses by Gatehouse et al. (1994) and Levy et al. (2019) showing the same pattern [...] equivocal or questionable regardless of the number of times the experiment is repeated." The BSC is currently testing 13 well-characterized botanical extracts, as case studies, in the standard OECD Ames test protocol that includes repeat testing of all responses. The results from this exercise, along with the results from the NTP Ames test data analysis, will help in designing a recommended protocol for use in the routine testing of complex mixtures including botanical extracts.
Table 1. Initial test result reproducibility by strain and rat S9 activation.
Note: NA, not activated (no S9). ᵃ Too few chemicals in this category for consideration.