Bateman's Data: Inconsistent with “Bateman's Principles”

Abstract
A.J. Bateman (1948) hypothesized that a metric of sexual selection is in sex differences of intrasexual variance in number of mates (V NM). AJB predicted that (a) males have greater variance in reproductive success (V RS) than females; (b) males have greater V NM than females; and (c) a positive relationship between V NM and V RS is stronger among males. AJB used phenotypically observable mutations in offspring to identify parents and to count subjects' NM and RS. AJB's conclusions matched his predictions, later called “Bateman's Principles.” Empirical challenges to his conclusions guided analyses herein. (a) AJB's analysis pseudo‐replicated sample sizes, violating an assumption of sexual selection: individuals must be in the same population to choose and compete. (b) AJB's methods overestimated the number of subjects with no mates while underestimating the number with one or more. (c) A replication (Gowaty et al., 2012) showed that offspring inheriting nametags from both parents often died before expressing adult phenotypes, showing that some of AJB's methods produced biased data. Science historian Thierry Hoquet located AJB's archived, handwritten laboratory notes and photocopied and transcribed them. We tested each of the 65 unique populations for expected combinations in offspring of parental mutations. Of these, 41.5% failed Punnett's tests: offspring carrying nametags simultaneously from both parents were missing, showing that estimates of parents' NM and V NM were undercounted. The remaining 58.5% of populations met Punnett's expectations, providing an unparalleled opportunity to re‐evaluate AJB's predictions. Of these 38 unbiased populations, 34 had no sex differences in V RS and 37 had no sex differences in V NM. No sex differences in slopes of RS and NM occurred in any unbiased population. Regressions showed weak, positive, significant associations between V NM and V RS for both females and males, contrary to AJB's prediction that the relationship would be positive in males but not in females.
AJB's laboratory data are inconsistent with “Bateman's Principles.”


| INTRODUCTION
Bateman (1948) was the first laboratory experimental test of a component of sexual selection, and it is among the most cited papers in modern sexual selection research. Inspired by Fisher's (1930) fundamental theorem, Bateman (1948) hypothesized that a measure of sexual selection was in the sex differences in intrasexual variances in number of mates (NM). To experimentally test this idea, he organized populations of Drosophila melanogaster (Figure 1) to evaluate what became known as Bateman's Principles, which are as follows: (a) males have greater variance in reproductive success (V RS) than females; (b) variances in number of mates (V NM) for males are greater than for females; and (c) the positive relationship between V NM and V RS is stronger for males than females.
Here, we use the data from the handwritten laboratory notes of Angus J. Bateman (AJB), which were the basis for his published results. We use the handwritten data to re-evaluate Bateman's predictions about sex differences in number of mates (NM), reproductive success (RS), variance in number of mates (V NM), and variance in number of offspring (V RS). Before describing our analysis methods, we review Bateman's original methods in the section "What Did Bateman Set Out to Study and What Did He Do?" Then, in "Flies in the Ointment: Modern Challenges to Bateman (1948)," we describe the literature of alternative explanations for his results, methodological errors in his published methods, and modern concerns over the implications of his conclusions, and we emphasize that the modern challenges informed our analysis methods of his laboratory notes. Thus, we did not attempt to replicate Bateman's (1948) original analysis methods because of previously identified (Gowaty, Kim, & Anderson, 2012; Snyder & Gowaty, 2007) errors in his published analysis. The last part of this section characterizes the creativity of AJB's basic experiments.

| What did Bateman set out to study and what did he do?
AJB designed his experiment to evaluate the hypothesis that a measure of sexual selection was in the sex differences in intrasexual variance in number of mates. From this logic, he predicted that a sex difference in variance of fertility (number of offspring) was a direct measure of the sex difference in the intensity of selection (measured in terms of within-sex differences in variances in NM and RS).
AJB's experiment to test the sex difference in the intensity of selection depended on complex and difficult culturing of 10 mutant fly lines to produce 10 types of heterozygote dominant subjects, each of which carried a unique identifying phenotypic marker, a "nametag," which, when expressed in offspring, would identify the parents in each population. Table 1 is an example. It shows the "nametag" alleles of six subjects (three males and three females), illustrating that each heterozygote subject had a unique phenotypically expressed allele, a "nametag," which when inherited in offspring would identify its parent. This method of parentage assignment provided an estimate of each subject's NM and RS. He said, "In this way, assuming the complete viability of all the markers half the progeny of each fly would be identified" (p. 353, Bateman, 1948), something that can readily be inferred from a Punnett square analysis (Table 2). Furthermore, AJB noted that using his method would mean that one quarter of the offspring should inherit simultaneously markers from both parents, thus providing the only estimates of the relative NM and the V NM for male and female subjects in each population (see Figure 2a,b).

FIGURE 1
A photograph of Drosophila melanogaster.

TABLE 1
Parental genotypes at "nametag" loci for three subjects of each sex in a sample population (redrawn from SI in Gowaty et al., 2012). The genotypes are defined by six "nametag" marker loci (Sb, Pm, H, LCy, Cy, Mc).
Note: Each subject was genetically and phenotypically distinct so that the dominant mutation it carried was a heritable "nametag" that, when inherited in offspring, indicated the identity of their parents. "+" indicates wildtype alleles.

TABLE 2
A stylized Punnett square shows combinations of nametag markers in offspring that must occur when each parent is a heterozygote dominant at a different locus.

                        Mother's genotype
Father's genotype       D2            +2
D1                      D1 D2         D1 +2
+1                      +1 D2         +1 +2

Note: The subscripts indicate the unique nametag loci of the parents. When each parent is a heterozygote dominant at a unique nametag locus, the expected frequency of offspring in each cell of the Punnett square is 1/4. If the frequency of offspring in the cell for offspring inheriting dominant phenotypes from both parents (i.e., D1 D2) is significantly less than the expected 1/4, estimates of both NM and V NM would be misidentified, as it was only the D1 D2 offspring that provided estimates of an individual's NM and the within-population V NM.
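The expected frequencies behind the nametag method can be checked by enumerating the gametes of two heterozygote parents. The following is a minimal sketch, not AJB's procedure: the allele labels and the function name `punnett_frequencies` are hypothetical, and the only assumption is Mendelian segregation with each allele transmitted with probability 1/2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def punnett_frequencies(father=("D1", "+1"), mother=("D2", "+2")):
    """Expected frequency of each offspring marker combination.

    Each parent is heterozygous at its own nametag locus, so it transmits
    either its dominant marker or its wildtype allele with probability 1/2.
    The allele labels are illustrative placeholders.
    """
    counts = Counter(product(father, mother))
    total = sum(counts.values())
    return {combo: Fraction(n, total) for combo, n in counts.items()}


freqs = punnett_frequencies()

# Offspring carrying the father's marker identify him: expected 1/2.
father_identified = sum(f for (pa, _), f in freqs.items() if pa == "D1")

# Offspring carrying both markers are the only class that reveals a
# mating pair: expected 1/4.
both_identified = freqs[("D1", "D2")]
```

Under these assumptions every cell has frequency 1/4, which is the benchmark against which a deficit of double-marker offspring is judged.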
AJB's chosen nametag genes produced dramatic phenotypic markers: AJB noted that 7 of 10 of the mutant lines were "homozygote lethal" (see Bateman, 1948, p He then pooled all the populations to produce a single analysis of variance. At the end of the paper, he presented two graphs, showing the "relative fertility" (RS) of females and males as a function of their numbers of mates (NM). AJB justified making two graphs of sex differences in "relative fertility" saying the populations in "series 5 and 6 differed somewhat from the rest" (Bateman, 1948, p. 361 in males between number of mates and fertility. This is the cause of intra-masculine selection" (Bateman, 1948, p. 362)."

| Flies in the ointment: Modern challenges to Bateman (1948)
Scholarly interest in Bateman (1948) has been a key influence on modern sex differences research, stimulating arguments claiming modern empirical consistency with AJB's conclusions despite the concerns that propelled original and critical discussions about its predictions and alternative explanations for patterns. For example, Sutherland (1985) showed that chance could explain Bateman's data, a hypothesis seldom considered in recent studies of variation in NM, V NM, RS, and V RS (however, see Hubbell & Johnson, 1987; Gowaty & Hubbell, 2005). In addition, Bateman's conclusions and the implications of his conclusions have been questioned for more than 35 years (Altmann, 1997; Gowaty & Hubbell, 2005; Hrdy, 1981, 1985, 1986, 1990; Hubbell & Johnson, 1987; Sutherland, 1985; Tang-Martínez & Ryder, 2005). In more recent years, the quality of empirical support of Bateman's Principles has been evaluated and discussed in relation to confirmation biases and theory tenacity (Gowaty, 2018; Tang-Martínez, 2012, 2016). The first paper to critically evaluate Bateman's methods (Snyder & Gowaty, 2007) identified the deficit in double-mutant offspring that was obvious in Figure 4 of Bateman (1948) (Figure 2). Snyder and Gowaty (2007) then speculated that AJB's methods may have seriously miscalculated V NM. To find out, Gowaty et al. (2012) and Gowaty, Kim, and Anderson (2013) replicated AJB's original study using the same fly lines AJB had used. The replication (Gowaty et al., 2012, 2013) … were w♀M♂ and called here "+D offspring" (meaning that from their mother, offspring received only her wildtype gene, but from their father his nametag gene); 2,102 (26%) were M♀w♂ and called here "D+ offspring" (meaning they received their mother's nametag gene and their father's wildtype gene); and 1,247 (15%) were M♀M♂ and called here "DD offspring" (meaning that these offspring received both mother's and father's nametag genes).
These frequencies were a departure from the expected 1/4 (likelihood ratio χ² = 463.1, df = 3, p < .0001), with the biggest contribution to chi-square coming from the double-mutant (M♀M♂) category.
AJB assumed "the complete viability of all the markers" (p. 353, Bateman, 1948), yet there is no evidence in his paper or in the laboratory note data that he tested the viability of the markers, either by calculation of expected distributions of types of offspring (see … Figure 2a,b). The inference from the large repetition and the monogamous control experiments (Gowaty et al., 2012, 2013) is that offspring inheriting both parental nametags often died before eclosion, when parental nametags would express, thereby biasing estimates of NM and, critically, V NM. The repetition showed that AJB's assumption of "complete viability of all the markers" was false.
Consideration of the missing offspring in the critical category of double-mutant offspring in the repetition also showed that AJB's method overestimated the number of individuals with zero mates while underestimating the number of individuals with one or more mates, which biases estimates of sex differences in V NM. In other words, the repetition showed that missing double-mutant offspring would produce biases in inferences about a critical parameter of Bateman's study, that is, V NM.

| AJB's handwritten lab notes showcase simplicity and elegance in his basic experiments
Despite criticisms of Bateman's study, it was ambitious and it remains perhaps the largest ever on sexual selection. His handwritten laboratory notes consist of 65 explicit populations with tables similar to those in Figure 2 showing the counts of inherited offspring phenotypes that identified a parent's NM and their RS. His famous text (Bateman, 1948) mentioned 64, with 63 populations included in his published analyses (TH and PAG pers. obs.). His handwritten data show explicitly that he set out to study NM, RS, and sex differences in V NM and V RS in each population. AJB distributed his cultured subjects, the heterozygote dominant adults, into populations so that each adult subject in a particular population expressed a unique-in-that-population nametag phenotype coded by a unique dominant allele at a unique locus (Table 1). AJB recorded for each population a specific table characterizing the telltale phenotypes of all offspring expressing one or more nametags or none (Figure 2a,b).
For its day, AJB's culturing method, which fashioned his ability to link some resulting offspring to one or both parents, was potentially a creative way to empirically test hypotheses about sex differences in RS, NM, V RS, and V NM. However, the reliability of AJB's method of parentage assignment, just as in modern molecular genetic methods, depended on the absence of biasing factors that can be an intrinsic result of the genes offspring inherit (Gowaty et al., 2012).

| Unbiased observations allow unbiased analysis of Bateman's hypotheses
Because Bateman archived his data and because TH located the handwritten laboratory notes, we were able to perform tests in each population in his laboratory notes of the fit of expectations of frequencies of offspring types and AJB's predictions. Our analysis herein was guided by the insights of previous evaluations and repetition of Bateman's study that we discussed in the preceding sections of this introduction. In the methods section, we further characterize the steps we took in our reanalysis of AJB's data. Our use of the original data from the John Innes Archives is courtesy of the John Innes Foundation, used under CC-BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Word text files. We were unable to find in the original publication (Bateman, 1948) any data on two of the 65 populations in the laboratory notes (Population #s 43 and 65) in the list of populations with female and male parental NM, RS, which appears in the Results section as Table 3.

| Computerized data files
PAG recorded into JMP © data files each population from the transcribed laboratory book noting the variables Bateman (1948) listed as "distinctive features" of each population. The primary data set we constructed summarizes observations AJB reported by hand in his laboratory notes as a set of 65 tables, each representing a unique population.
The observations included the observable phenotypes of 20,417 adult offspring from the 65 populations, representing 1,300 parental nametag combinations. We devised unique names for each population using the "distinctive characteristics of each population" as described in Bateman (1948).

| The basis of our analyses
Our analysis tactics were inspired by the multiple challenges to AJB's methods noted in Sutherland (1985), Snyder and Gowaty (2007), and Gowaty et al. (2012, 2013).

| The steps in our analyses
Step 1: How we proved which of AJB's populations were robust to evaluation of subjects' NM, V NM , RS, and V RS .
To determine whether the data in each population reliably informed questions about the NM and RS of each subject, we used likelihood ratio chi-square tests in each population to evaluate consistency with the expectations from a Punnett square (Table 3) of the frequencies of offspring phenotypes given possible parental genotypes/phenotypes (Table 1) (see discussion in Gowaty et al., 2012, 2013). In addition, for each population, we tested whether the numbers of assigned mothers and fathers were statistically similar (Table 3), as they must be in diploid species. The outcomes of these analyses are in the Results section.
AJB could infer if a subject mated with other subjects only from offspring simultaneously inheriting both of its parents' nametag genes, which we call DD offspring ( Thirty-eight of the 65 populations fit Punnett's expected offspring frequencies given parental genotypes/phenotypes (Table 3). Using these 38 Punnett-consistent populations, we tested Bateman's predictions.
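The likelihood ratio chi-square (G) statistic used in Step 1 can be sketched as follows. This is an illustrative sketch rather than the JMP analysis: the counts are hypothetical, the function names are invented for the example, and the decision rule (compare G with the tabled 5% critical value for df = 3) is an assumption about how one might flag a population, not a restatement of the criteria in Table 3.

```python
import math

# Chi-square critical value for df = 3 at alpha = 0.05 (standard table value).
CHI2_CRIT_DF3 = 7.815


def g_statistic(observed, expected_props):
    """Likelihood ratio chi-square: G = 2 * sum(O * ln(O / E))."""
    n = sum(observed)
    total = 0.0
    for count, prop in zip(observed, expected_props):
        if count > 0:  # an empty class contributes nothing to the sum
            total += count * math.log(count / (n * prop))
    return 2.0 * total


def fits_punnett(counts):
    """counts: offspring numbers in the (++, +D, D+, DD) classes."""
    g = g_statistic(counts, [0.25] * 4)
    return {
        "G": g,
        "dd_share": counts[3] / sum(counts),
        "consistent": g <= CHI2_CRIT_DF3,
    }


# Hypothetical counts, not taken from the laboratory notes.
balanced = fits_punnett([26, 24, 25, 25])
dd_deficit = fits_punnett([30, 28, 27, 10])
```

In the second hypothetical population the DD class holds roughly a tenth of the offspring rather than a quarter, and G exceeds the critical value, which is the kind of pattern that would mark a population as biased.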
Step 2: How we evaluated sex differences in NM, RS, V NM, and V RS in each population.
We a priori assumed that evaluation of Bateman's Principles should occur within each population, because mate choice and within-sex rivalries could not have occurred between individuals in different populations. Therefore, we tested the first two hypotheses using the 38 unbiased populations by determining if there were sex differences in V RS and V NM with two-tailed F tests (Tables 4 and 5). Whenever the F test was not computable because the variance in one sex was zero, we indicated in Table 5 that the two-tailed F test was not applicable (NA).
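A within-population two-tailed F test of a sex difference in variance might look like the sketch below. It is not the authors' JMP code. It assumes three or five subjects per sex (so the degrees of freedom are 2 or 4) and relies on the fact that, for even degrees of freedom, the F distribution function reduces to a finite binomial sum. The function names and the data are hypothetical.

```python
import math
from statistics import variance


def f_cdf_even_df(x, df1, df2):
    """CDF of the F(df1, df2) distribution for even df1 and df2.

    For integer a = df1/2 and b = df2/2, the regularized incomplete beta
    function I_t(a, b) equals P(Binomial(a + b - 1, t) >= a).
    """
    if df1 % 2 or df2 % 2:
        raise ValueError("this closed form needs even degrees of freedom")
    a, b = df1 // 2, df2 // 2
    t = df1 * x / (df1 * x + df2)
    n = a + b - 1
    return sum(
        math.comb(n, j) * t**j * (1 - t) ** (n - j) for j in range(a, n + 1)
    )


def two_tailed_f_test(males, females):
    """Return (F, p) for male vs. female variance, or None when NA."""
    v_m, v_f = variance(males), variance(females)
    if v_m == 0 or v_f == 0:
        return None  # reported as NA: the ratio is undefined or zero
    f = v_m / v_f
    cdf = f_cdf_even_df(f, len(males) - 1, len(females) - 1)
    return f, min(1.0, 2 * min(cdf, 1 - cdf))


# Hypothetical RS values for three males and three females.
result = two_tailed_f_test([0.0, 10.0, 50.0], [18.0, 20.0, 22.0])
```

With only two degrees of freedom per sex, the variance ratio has to be very large before the two-tailed p-value drops below 0.05, which illustrates how little power each population carries on its own.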
We tested Bateman's third hypothesis using each of the 38 unbiased populations. We compared female and male linear slopes for the relationship between RS and NM. For each population, we estimated a model that related RS to NM, sex and the NM * sex interaction. We specifically included the NM * sex interaction term of each population to allow for different slopes for females and males. We then used ANOVA to produce F-statistics to test the model terms in each population. Statistically significant within-population NM * sex interaction terms would indicate support for Bateman's third prediction.
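The test of the NM * sex interaction can be sketched as an extra-sum-of-squares F test that compares a model with separate slopes for each sex against a model with a common slope and separate intercepts. This is a sketch under stated assumptions, not the authors' JMP analysis: the function names and data are hypothetical, and the critical values are standard 5% table values for F(1, df2).

```python
# Upper 5% critical values of F(1, df2) from standard tables:
# df2 = 2 for 3 + 3 subjects and df2 = 6 for 5 + 5 subjects.
F_CRIT_05 = {2: 18.51, 6: 5.99}


def _sums_of_squares(x, y):
    """Return (Sxx, Sxy, Syy) about the group means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy


def slope_interaction_f(nm_f, rs_f, nm_m, rs_m):
    """F statistic (1, df2) for a sex difference in the RS-on-NM slope."""
    groups = [_sums_of_squares(nm_f, rs_f), _sums_of_squares(nm_m, rs_m)]
    if any(sxx == 0 for sxx, _, _ in groups):
        return None  # no variation in NM within a sex: slope undefined
    # Full model: one slope and one intercept per sex.
    rss_full = sum(syy - sxy**2 / sxx for sxx, sxy, syy in groups)
    # Reduced model: common slope, separate intercepts.
    sxx_p = sum(g[0] for g in groups)
    sxy_p = sum(g[1] for g in groups)
    syy_p = sum(g[2] for g in groups)
    rss_reduced = syy_p - sxy_p**2 / sxx_p
    df2 = len(nm_f) + len(nm_m) - 4
    if rss_full <= 1e-12:
        return None  # perfect fit: the F ratio is not computable
    return (rss_reduced - rss_full) / (rss_full / df2), df2


# Hypothetical subjects: number of mates and offspring counts.
parallel = slope_interaction_f([0, 1, 2], [10, 21, 29], [0, 1, 2], [5, 16, 24])
diverging = slope_interaction_f([0, 1, 2], [10, 12, 11], [0, 1, 2], [0, 20, 41])
```

A population would count as supporting the third prediction only if its F statistic exceeded the critical value for its residual degrees of freedom and the male slope were the steeper one.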
Step 3: How we evaluated the within-sex relationship of V NM to V RS across populations.
We further tested Bateman's third hypothesis using two analyses shown in Figure 5a … were similar to the regressions using the untransformed data and so we report the figure and analyses in original scale for ease of interpretation. Table 3 … Figure 4 shows the male and female slopes relating NM and RS for each of the Punnett-consistent populations. Table 6 contains the F-ratio and the probability of a greater F for comparing female and male slopes (AJB's third prediction) for each population displayed in Figure 4. The meta-analysis of the dependence of V RS on V NM for females is in Figure 5a and for males in Figure 5b.
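The across-population analysis in Step 3 treats each population as one data point: a within-sex V NM paired with a within-sex V RS. A minimal sketch of that idea, for one sex and with hypothetical data and function names, is:

```python
from statistics import mean, variance


def simple_regression(x, y):
    """Least-squares slope, intercept, and r^2 of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in x or y across populations")
    slope = sxy / sxx
    return slope, my - slope * mx, sxy**2 / (sxx * syy)


# Hypothetical populations: (NM, RS) for each subject of one sex.
populations = [
    [(0, 5), (1, 20), (2, 41)],
    [(1, 18), (1, 22), (2, 30)],
    [(0, 0), (2, 35), (2, 50)],
    [(1, 12), (1, 15), (1, 19)],
]

# One variance pair per population.
v_nm = [variance([nm for nm, _ in pop]) for pop in populations]
v_rs = [variance([rs for _, rs in pop]) for pop in populations]

slope, intercept, r2 = simple_regression(v_nm, v_rs)
```

Running the same calculation separately for females and males gives the two regressions whose slopes and r² values can then be compared between the sexes.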

| Tests of Punnett's expectations about DD offspring
Thirty-eight of the 65 populations unambiguously fit Punnett's expected offspring frequencies given parental genotypes/phenotypes (Table 3).

| Tests of the assumption that assigned mothers and fathers were statistically equal
In diploid species, all offspring have both a mother and a father: When the frequencies of offspring that were D+ (indicating mother's identity only) and +D (indicating father's identity only) were statistically different, the estimates of RS by sex would have been inaccurate. A significant difference in the number of assigned mothers and assigned fathers could occur if the dramatic phenotypes of the nametag alleles inherited from one sex of parent were more likely to be lethal in offspring than when inherited from the other parent (Gowaty et al., 2012). Four of the 65 populations in Bateman's laboratory notes had statistically significantly different numbers of assigned fathers than mothers indicating failure to meet expectations from diploid parentage.
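The check that assigned mothers and fathers are equally frequent is a two-class likelihood ratio test against a 50:50 expectation. The sketch below is illustrative only: the counts and the function name are hypothetical, and the p-value uses the closed form for a chi-square distribution with one degree of freedom.

```python
import math


def assigned_parent_test(n_mother_only, n_father_only):
    """G test of D+ vs. +D offspring counts against a 50:50 expectation.

    Returns (G, p) with df = 1. For one degree of freedom the chi-square
    upper tail is erfc(sqrt(G / 2)).
    """
    expected = (n_mother_only + n_father_only) / 2
    g = 0.0
    for count in (n_mother_only, n_father_only):
        if count > 0:
            g += 2 * count * math.log(count / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    return g, math.erfc(math.sqrt(g / 2))


# Hypothetical counts of offspring identifying only a mother or only a father.
near_equal = assigned_parent_test(52, 48)
skewed = assigned_parent_test(70, 30)
```

A small p-value for a population would indicate that nametags inherited from one sex of parent were recovered less often than those from the other, which is the failure described in the text.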

| Tests of the first and second of AJB's predictions
In four of the 38 unbiased populations, there were significant sex differences in parental V RS (Table 4), that is, fewer than 11% of unbiased populations showed the predicted sex differences in V RS .
In only one of the 38 unbiased populations were V NM statistically different between the sexes at the P < 0.05 level (Table 5), thus rejecting Bateman's prediction about sex differences in V NM .

| Did the "Bateman gradients" show a stronger relationship of NM variances on RS variances for males than females?
There were no significant sex differences in the slopes of the Bateman gradients (Table 6 and Figure 4) in any of the 38 Punnett-consistent populations. Thus, there was no evidence in any of these populations that multiple mating by males had a greater effect on RS than it did for females, providing no support for AJB's prediction.

… Table 2). The light grey shading at the top of each vertical bar shows graphically that the frequencies of DD offspring were often less than the expected ¼.

FIGURE 4
Bateman gradients for the 38 fair populations. None show significant sex differences in slopes between females and males. See statistical tests in Table 6.

| Is there a dependence of RS variances on NM variances for males but not for females?
Because variances are population-level metrics, to evaluate a dependency of V RS on V NM for each sex, one needs to evaluate the relationship across populations separately for females and males. Bateman expected that V RS depended on V NM among males but not among females. Figure 5 shows that the relationship was significantly positive for both females (panel a) and males (panel b).

| What if most variance differences seemed higher in males?
Given the way Bateman (1948) analyzed his data, it is tempting to consider a combined global analysis across populations of sex differences in V NM and V RS, similar to his original analysis of variance (but see Snyder & Gowaty, 2007). Alternatives that one might find inviting for evaluating differences in V NM and V RS across populations might be a sign test or an ANOVA with population as a random effect. However, rather than a global test, we statistically tested sex differences in V NM and V RS for each population and reported the results in Tables 4 and 5. The format of Bateman's laboratory notes, a set of stand-alone tables describing for each population the NM and RS for each subject, had reinforced our insight that we should evaluate within-population sex differences in V NM and V RS rather than pooling the data from different populations. The basic insight is that individuals must be in the same population to choose among potential mates or to compete with rivals. Thus, combining data across different populations is inconsistent with the fundamental assumptions of sexual selection.
Had we done a global analysis, we would have nevertheless needed to point out that there are some populations in which there are minimal or no sex differences in V NM and V RS estimates. Such populations (see Table 4) are evidence of inconsistency with Bateman's Principles, namely that the key variance differences in his study (V RS and V NM) would, he said, always be greater in males than females. We also note that an overall analysis based on pooled populations can be misleading. An analysis combining the populations as Bateman (1948) did to test sexual selection hypotheses fails to link all the components in the inferential chain of sexual selection (as emphasized earlier). Sexual selection is a within-population, within-sex process in which within-sex trait variation is associated with variances in some components of reproductive success because of mate choice or behavioral or physiological "competition" among same-sex rivals that affect individual reproductive success in terms of either the numbers of offspring or the quality (viability) of offspring (Altmann, 1997). Furthermore, any "patterns" in a global analysis would potentially produce "statistical traps" because an overall conclusion about within-population sex differential variances based on, for example, difference scores across populations only works if the populations truly represent a random sample of populations and the underlying phenomenon of interest is consistent across the populations. In other words, the assumptions necessary for a global test of Bateman's sex-differences hypotheses using all of his populations combined were not met in his experiment.

FIGURE 5
Meta-analyses of the relationships between V NM and V RS for females (panel a) and for males (panel b), using the 38 unbiased populations from AJB's laboratory notes. In (a) for females, r² = 0.25, N = 38, df = 37, p < .0013; in (b) for males, r² = 0.185, N = 38, df = 37, p < .0070.

TABLE 3
Tests in each of the lab note populations of the fit to Mendel's expectations that (a) the frequency of DD offspring is equal to ¼ (bold values show DD significantly < 25%); (b) the number of assigned fathers and mothers is statistically similar (bold values are statistically different from 50%).
Notes: 1 Series 1 and 2 had five female and five male subjects. Series 3-6 had three subjects of each sex. 5 Likelihood ratio chi-square test statistic indicates the deviation from 50% of assigned mothers and assigned fathers. 6 Likelihood ratio chi-square test statistic indicates deviation of DD from ¼, df = 3, or as noted df − 1. 7 Tests of Mendel's assumptions "failed" if the number of assigned mothers and fathers were not equal or if the frequency of DD offspring was significantly less than 25% or both. Additionally, if the DD frequency was < 21%, we flagged the population as questionable and did not include it in the "unambiguously unbiased" populations.

TABLE 4
Reproductive success means and variances by sex in 38 unbiased populations.

We note that if larger sample sizes were tested in each of the 38 unbiased populations, and the tests for sex differential variances had adequate statistical power within each of the populations, then Bateman's data would provide more convincing evidence for his hypotheses. In addition, we also note that there are many possible alternative explanations to the patterns Bateman claimed, including potential observer bias, stochastic processes, physiological mechanisms of sexual conflict that can modify behavior of either sex, naturally occurring schedules between copulations, or the short durations of each population.

| DISCUSSION
… Figure 4); and the correlations between V NM and V RS from the 38 unbiased populations were significantly positive both for females (Figure 5a) and for males (Figure 5b).
That AJB's laboratory note data are largely inconsistent with his predictions does not imply that tests in other species or even other tests with D. melanogaster would be inconsistent. One simple explanation for the results we report is that the sample sizes in each population were too small to expose sex differences in fitness measures.
The small sample sizes of adult subjects (3 females and 3 males or 5 females and 5 males) suggest that other studies using larger within-population sample sizes of subject females and males may indeed show sex differences. This would be particularly true if genetic parentage identification methodologies are not associated with differentially killing some offspring, as they were in Bateman's study and in the recent repetitions that used fly lines carrying the same dramatic mutations as in Bateman's original study (Gowaty et al., 2012, 2013).
There is no evidence (PAG pers. obs.) in Bateman's 1948 publication or his laboratory notes (PAG and TH pers. obs.) that mating rate of either sex is heritable or that fertility associated with mating rate was heritable. If individual mating success is stochastic (Sutherland, 1985), the variances in NM are uninteresting relative to sexual selection acting on traits affecting the mating rate of either sex. In fact, the significant results of Bateman (1948) might be attributed to stochastic demography, as previously demonstrated in re-analyses of Bateman's original paper (Snyder & Gowaty, 2007; Sutherland, 1985). Even if one observes that mating rate is heritable, the associated fitness variances would require partitioning to account for chance effects that inevitably occur along with any deterministic effects (Hubbell & Johnson, 1987).
A further critical perspective says it is not clear why, as AJB posited, an essential "sign" of sexual selection acting on males is lower V RS or lower V NM in females. Because sexual selection occurs within a sex, the V RS or V NM of one sex need not say anything about sexual selection in the opposite sex. This is particularly easy to justify if sexual selection works through different fitness components for females and males, as hypothesized 25+ years ago (Altmann, 1997). That is, female rivals may compete over mate quality rather than quantity, and female rivalries may act through different components of RS than do the rivalries of males. For example, selection may act to favor females that increase the viability of their offspring through increased access to diverse male haplotypes complementary to their own (Gowaty, 2008), something associated with enhanced offspring immunity.

| CONCLUSION
The fact that AJB's original handwritten data fail to support his paradigmatic predictions of sexual selection suggests that it might be time for a re-assessment of how to study sexual selection. Altmann (1997) contested the conclusion of Bateman (1948) that sex differences in V NM and V RS have any implications for an understanding of sexual selection in either sex, and now may be the time to take Altmann's hypothesis seriously and to test it. There is no reason within-sex selection need act the same way in both sexes, and whenever it does not, there is weak justification for inferring sexual selection acting on either sex by comparison of within-population sex differences in fitness variances. In other words, it is valid and likely preferable to study sexual selection within each sex separately to identify the potentially different fitness components operating on individuals by sex (Gowaty, 2015, 2017). Future tests of sex-dependent selection may profit by considering patterns of variation within sexes and between populations that differ in trait distributions, in processes of between-sex mating choices and within-sex behavioral or physiological rivalries, in population sizes, and in other ecological and demographic constraints that individuals experience (Gowaty & Hubbell, 2009), as well as in fitness components.

ACKNOWLEDGMENTS
We thank the John Innes Foundation Archives for providing access to A.J. Bateman's laboratory note data and for allowing us to use the data under CC-BY 4.0 (http://creativecommons.org/licenses/by/4.0/). We also thank Stephen P. Hubbell for advice and moral support throughout the considerable time it took to organize and analyze the data and write the paper.