
### Keywords:

• Teaching;
• Simple linear regression;
• Classroom activity;
• Variable selection

### Summary

This article describes a bivariate data set that is interesting to students. Indeed, this particular data set, which involves twins and IQ, has sparked more student interest than any other set that I have presented. Specific uses of the data set are presented.

### Introduction

Data sets that appeal to almost all students are to be treasured. Often, students in statistics courses show little interest in sets of observations, even when the set arises in their major field of study. Also, a statistics professor can appear to be less informed than the students about the background of a particular set, the set can be old-fashioned, or the students can come from many majors, which makes it harder to find a set that suits them all.

One data set that I have used in many ways when covering simple linear regression is Shields' (1962) data on twins who were raised apart. I have presented the set in PowerPoint slides and used it for quiz and test questions, review assignments and class discussions. It appears that almost everyone knows or has known at least one set of identical (monozygotic) twins. Students relate to the observations. The set is very crisp, as we shall see.

The observations are presented in the next section, which is taken from my slide presentations to some classes. A set of 23 questions that I have used extensively is in the following section. Annotated solutions to the questions appear in the Appendix. The next three sections contain items that dig a little deeper and are also from my lectures. The last section offers conclusions and final thoughts.

### The Data

The data are the IQs of monozygotic twins raised apart, taken from Monozygotic Twins by James Shields (1962). In the pairs, the predictor variable is the IQ of the First Born twin, and the response variable is the IQ of the Second Born twin. The set is often cited, as in the work of Jensen (1972). Anderson and Finn (1996, pp. 592–593) select a subset of 34 of Shields' pairs: those with no missing test scores, known birth order and no irregularities in the test sessions. We use that subset. Anderson and Finn give the observations and ask the reader to compute $r$ and $100r^2$.

The IQ scores from Shields (1962) that Anderson and Finn use are Dominoes Intelligence test scores. The test consisted of 48 questions carrying one point credit for a correct response. The mean score for the population of all test takers is 28, which is analogous to 100 on the more familiar scale. The scores are presented in figure 1.

For the 34 pairs, the equation of the least-squares regression line is $\hat{y} = 1.839 + 0.8499x$, or as Minitab might write it, IQ of Second Born = 1.839 + 0.8499 IQ of First Born, and $r^2 = 0.628$. Figure 2 contains the regression line plotted on the scatter diagram, which indicates a linear scatter. Figures 3 and 4 show that the residuals appear to be normally distributed and have no unusual pattern.
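The 34 pairs themselves are in figure 1 and are not reproduced here. As a rough sketch of how a least-squares line and $r^2$ are obtained from such pairs, the Python below fits a line to a handful of hypothetical scores. The function name `fit_line` and the six data pairs are illustrative assumptions, not Shields' observations, so the printed coefficients will not match the values quoted above.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # forces the line through the point of means
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical scores for illustration only; NOT the Shields data.
first_born = [18, 22, 25, 27, 30, 33]
second_born = [20, 19, 26, 24, 29, 31]

a, b, r2 = fit_line(first_born, second_born)
residuals = [y - (a + b * x) for x, y in zip(first_born, second_born)]

print(f"y-hat = {a:.3f} + {b:.4f} x,  r^2 = {r2:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With the real data in a statistics package, the same quantities come directly from the regression output; the hand computation is shown only to make the definitions concrete.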

The coefficient of determination $100r^2$ is called the heritability coefficient. One can estimate that about 63% of the variation of IQ is from heredity, genes or nature, and the other 37% is from the environment or nurturing. These figures are typical for nature versus nurture for many variables; see Miller (2012).

### Twenty-Three Questions

The 23 questions below might be used in a variety of ways. I have distributed them for a review assignment and used a subset on quizzes and exams. The most successful use has been to distribute all of them to the class and, going around the room, ask each student in turn either to answer a question or to say whether he or she agrees with the previous student's answer. Sometimes, as many as five students are asked to comment on a question. The following order seems to optimize student interest. Extensively annotated solutions to these questions are in the Appendix. Many of the solutions indicate the substantive meaning of the questions. There are analogies in other bivariate data. Questions 22 and 23 are beyond the scope of many elementary courses in statistics, but I have occasionally used them.

Using the least-squares criterion, the regression line $\hat{y} = 1.839 + 0.850x$ was fitted to 34 (x, y) pairs of data from monozygotic twins raised apart. The X-variable is the IQ of the First Born, and the Y-variable is the IQ of the Second Born in each pair of siblings. The correlation coefficient for these observations is 0.792. These IQs are on a different scale than the more familiar one, which has mean 100.

1. What is the numerical value of the slope of the regression line?
2. What are the units of 1.839 in the regression line?
3. What are the units of the correlation coefficient 0.792?
4. We are told that the sample mean of the 34 x-values that were used to compute the regression line is 24.59. What is the numerical value of the sample mean of the 34 y-values in the sample?
5. What is the predicted IQ for a second born twin whose first born twin sibling has IQ of 30?
6. A friend computed the sum of all 34 residuals. What numerical value would you expect to be obtained?
7. Given that the least-squares regression line is $\hat{y} = 1.839 + 0.850x$, could r have had the value −0.792?
8. Given that the least-squares regression line is $\hat{y} = 1.839 + 0.850x$, could r have had the value 1.839?
9. Given that the least-squares regression line is $\hat{y} = 1.839 + 0.850x$, could r have had the value 0.850?
10. A particular data point has the residual −2.5. On a graph of the regression line, is that data point above, on or below the line?
11. On the basis of this analysis, what percent of the variation in the IQ of the Second Born is explained by the regression line?
12. Is the variable X the explanatory variable, the dependent variable, the output variable or the response variable?
13. Would the numerical value of the correlation coefficient be larger, smaller or unchanged if the units of x were unchanged and the units of y were changed to the usual scale for IQs? Use the conversion that a usual IQ is 3.57 times the IQ measurements, which are the basis of the least-squares regression line $\hat{y} = 1.839 + 0.850x$. (The original IQ scale is used in all of the other questions.)
14. Is the variable Y the explanatory variable, the independent variable, the input variable, the response variable or the predictor variable?
15. On the basis of the regression line, on average, by how many IQ units does the IQ of the second born twin increase with each additional two IQ units in his or her first born twin sibling's IQ?
16. One observation is (32, 28). What is its residual in these IQ units?
17. A certain data point has the residual 0.00. On a graph of the regression line, is that data point above, on or below the line?
18. Which one of the following is not an interpretation of the slope of the line?
    - (a) The rise over the run
    - (b) The amount on average that y increases if x is increased by one x-unit
    - (c) The tangent of the angle that the line makes with the x-axis
    - (d) The marginal increase in the value of y at each x-value
    - (e) The value of y at x = 0
19. Which of the following is an interpretation of the coefficient 1.839 in the regression line's equation?
    - (a) It is the slope.
    - (b) It is the value of x at y = 0.
    - (c) It is the value of y at x = 0.
20. On a graph of the regression line, is the data point (25, 20) above, on or below the line?
21. We decide to reverse the roles of the two variables; that is, we construct a new least-squares regression line where the X-variable is the IQ of the Second Born twin and the Y-variable is the IQ of the First Born twin. We are thinking about predicting the IQ of the First Born twin from the IQ of the Second Born twin. We use the same 34 observations in the original units. Which one of the following is true?
    - (a) The slope of the new line will be negative.
    - (b) The new correlation coefficient will be zero.
    - (c) The new correlation coefficient could be any number between −1 and +1, inclusively.
    - (d) The new correlation coefficient will be −0.792.
    - (e) The new correlation coefficient will be +0.792.
22. Is it reasonable to believe that the population correlation coefficient is truly non-zero? That is, perform a model utility test for the linear regression line.
23. What is a 95% confidence interval for the true slope, and what is its interpretation?

### Going Deeper – Altering the Measure of IQ

In this section, we alter the IQ measure. For each twin, Shields (1962) gives the score on the Synonyms part of the Mill Hill Vocabulary test. Anderson and Finn (1996) ignore those scores. Jensen (1972) separately standardizes the Dominoes scores and Mill Hill scores and uses the mean of the standardized scores for each twin. Shields suggests adding twice the Mill Hill score to each Dominoes score. The outcomes of those sums are in figure 5. Shields' method has more transparency and might arouse less suspicion in students than two standardizations.

Since the twin pairs' IQ results are presented in the same order in figures 1 and 5, the Mill Hill score of each individual can be found by subtracting the entry in figure 1 from the corresponding entry in figure 5 and dividing by 2. Figures 6–8 show that the combined scores give results similar to those obtained with the Dominoes test scores alone.
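The recovery of a Mill Hill score simply inverts Shields' rule, combined = Dominoes + 2 × Mill Hill. A minimal sketch in Python follows; the function name `mill_hill_score` and the two numbers are hypothetical and are not taken from figures 1 or 5.

```python
def mill_hill_score(combined, dominoes):
    """Invert combined = dominoes + 2 * mill_hill to recover the Mill Hill score."""
    return (combined - dominoes) / 2

# Hypothetical entries for one twin; NOT values read from figures 1 or 5.
dominoes = 27   # entry as it would appear in figure 1
combined = 67   # entry as it would appear in figure 5

print(mill_hill_score(combined, dominoes))  # 20.0
```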

### Going Deeper – Reversing the Role of the Birth Order

Figure 9 shows the line and $100r^2$ for the fit using the IQ of the Second Born twin as the predictor variable and the IQ of the First Born twin as the response variable for the data in figure 1. Class discussion topics might include the reason that the two lines are different, how the lines differ and the fact that the numerical value of $100r^2$ is unchanged from figure 2.
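For a class discussion, it may help to see both fits computed side by side. The sketch below is an illustration in Python on a few hypothetical scores (not the Shields data; the helper `least_squares` is an assumed name). It fits the line in each direction, shows that the two lines are not simply inverses of one another, and shows that $r^2$ is the same for both. It also prints the product of the two slopes, which equals $r^2$ for any data set.

```python
from statistics import mean

def least_squares(pred, resp):
    """Fit resp = a + b * pred by least squares; return (a, b, r_squared)."""
    p_bar, q_bar = mean(pred), mean(resp)
    spp = sum((p - p_bar) ** 2 for p in pred)
    sqq = sum((q - q_bar) ** 2 for q in resp)
    spq = sum((p - p_bar) * (q - q_bar) for p, q in zip(pred, resp))
    b = spq / spp
    return q_bar - b * p_bar, b, spq ** 2 / (spp * sqq)

# Hypothetical scores for illustration only; NOT the Shields data.
first_born = [18, 22, 25, 27, 30, 33]
second_born = [20, 19, 26, 24, 29, 31]

a1, b1, r2_1 = least_squares(first_born, second_born)   # Second Born on First Born
a2, b2, r2_2 = least_squares(second_born, first_born)   # First Born on Second Born

print(f"Second on First: y-hat = {a1:.3f} + {b1:.3f} x, 100r^2 = {100 * r2_1:.1f}")
print(f"First on Second: x-hat = {a2:.3f} + {b2:.3f} y, 100r^2 = {100 * r2_2:.1f}")
print(f"product of the two slopes = {b1 * b2:.4f}, r^2 = {r2_1:.4f}")
```

Because each fit minimizes squared errors in a different direction (vertical in one case, horizontal in the other), the second slope is not the reciprocal of the first unless the points lie exactly on a line.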

### Going Deeper – Further Study

Twins data are fertile ground for further study in the classroom or as independent work. There are other studies besides Shields (1962); see Jensen (1972). The book by Segal (2012) contains information about those and the large Minnesota Twin Study. There is considerable literature about dizygotic twins of the same and different sexes, both raised together and raised apart. Perhaps most interesting for statisticians are the different statistics used in genetics and in testing. For example, there is a wide range of correlation coefficients (Jensen 1972). Jensen devotes a brief chapter to alerting readers to their different characteristics.

### Conclusion and Final Thoughts

We have described a bivariate data set that is interesting to students and given specific uses of it, including a suggested handout or worksheet. The observations on twins have a universal appeal and do not depend upon students' field of study for that interest. Most statisticians might say that no statistical study should be limited to just two variables, since the response variable and even the predictor variable probably depend upon many other variables. In our twins and IQ example, the ages of the twin pairs might be addressed. The ages can be obtained from Shields. The IQ test results may not depend upon age. Shields contains many pages of information on each pair, such as psychiatric histories. The possible projects for further study are numerous.

### Appendix: Annotated Answers to the 23 Questions

1. 0.850. See Question 15's answer for an interpretation of slope.
2. IQ units.
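3. There are no units. One of the many equivalent formulae for $r$ is $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, which is in terms of products of z-scores, which have no units. Also, since slope = $r(s_y/s_x)$ and slope has units of y-units/x-units, $r$ has no units.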
4. Since the point $(\bar{x}, \bar{y})$ is on the regression line, $\bar{y} = 1.839 + (0.850)\bar{x} = 1.839 + (0.850)(24.59) \approx 22.74$. The fact that the point of means is on the line is part of the derivation of the least-squares regression line. Certainly, if that point were not on the line, we would question the validity of the fitted line.
5. $\hat{y} = 1.839 + (0.850)(30) = 1.839 + 25.5 \approx 27.34$. Prediction of y-values from x-values is one utility of the line.
6. The sum of the residuals is always zero or approximately zero depending on rounding error. This comes from the least-squares theory for lines in the same manner as for the sum of the residuals or deviations from the sample mean for univariate data.
7. The answer is ‘no,’ since the slope and r always have the same sign, and the slope is positive.
8. The answer is ‘no,’ since r is always between −1 and +1, inclusively.
9. Yes.
10. The answer is ‘below.’ Since an observation's residual is its y-value minus the predicted value for the observation's x-value (a kind of remainder in the data after fitting the line), a negative residual results when the predicted value is greater than the observation's y-value. Observations with negative residuals are below the line.
11. $100r^2 = 100(0.792)^2 \approx 62.73\%$. This is a frequently used interpretation of $100r^2$, which is called the coefficient of determination for this reason. In this case, we have nature (62.73%) versus nurture (100 − 62.73 = 37.27%) for our data as indicated in the main text.
12. It is the explanatory variable, since the direction of estimation is to select a value for x and find the estimated value for y from the line.
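13. It is unchanged, since linear transformations do not change z-scores, and $r$ is composed of a sum of products of z-scores: $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$.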
14. It is the response variable. See Question 12's answer.
15. For a regression line with equation $\hat{y} = a + bx$, each 1 added to an x-value produces a change in $\hat{y}$ numerically equal to the slope, since $a + b(x + 1) = (a + bx) + b = \hat{y} + b$. For this example, two times the slope is 2(0.850) = 1.70.
16. The residual is 28 − [1.839 + (0.850)(32)] = 28 − [1.839 + 27.2] = 28 − 29.039 ≈ −1.04. See Question 10's answer for more.
17. It is on the line. Since the residual is zero, the predicted value and the observation's y-value are the same.
18. The answer is (e), which is an interpretation of the y-intercept.
19. The answer is (c).
20. The point is below the line's graph, since the fitted or predicted value of y is 1.839 + (0.850)(25) ≈ 23.09, which is greater than 20. Also, the residual is 20 − 23.09 = −3.09, which is negative.
21. The new correlation coefficient will be +0.792, since the correlation coefficient's formula is symmetric in the x and the y terms. Because of the commutative property of multiplication, reversing each pair of terms in parentheses in the summation in the formula for $r$ does not change the value of $r$.
22. The model utility test has the null hypothesis that the population's correlation coefficient is zero. The test statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.792\sqrt{32}}{\sqrt{1-(0.792)^2}} \approx 7.34$ with $n - 2 = 32$ degrees of freedom, leading to rejection of a zero correlation coefficient and rejection of a zero slope in the population for any of the usual levels of significance. Along with the data's linear-appearing scatter in figure 2, this reinforces our use of a linear model. See Question 23's answer for more on testing.
23. One way of writing the interval using the information that we have been given is $b \pm t^{*}\,\frac{b\sqrt{1-r^2}}{r\sqrt{n-2}}$, where $b$ is the slope of the regression line and $t^{*}$ is the critical value of the t-distribution with $n - 2$ degrees of freedom. With $t^{*} \approx 2.037$ for 32 degrees of freedom, the 95% interval for the slope is $0.850 \pm (2.037)(0.1158)$, or about (0.61, 1.09). Our expected value of the slope is not really zero, as Question 22 might imply. One purpose of Question 22 is to rule out a slope of zero, which is outside the interval. Actually, we might have surmised that the slope is closer to 1.
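The arithmetic in the answers above can be checked from the summary values given with the questions. The short Python sketch below is one way to do so; the helper names `predict` and `residual` are illustrative, and the test statistic for Question 22 is assumed to be the usual $t = r\sqrt{n-2}/\sqrt{1-r^2}$.

```python
from math import sqrt

# Summary values stated with the questions.
a, b = 1.839, 0.850   # intercept and slope of the fitted line
r, n = 0.792, 34      # correlation coefficient and number of twin pairs

def predict(x):
    """Predicted IQ of the Second Born for a First Born IQ of x."""
    return a + b * x

def residual(x, y):
    """Observed y minus predicted y; negative means the point is below the line."""
    return y - predict(x)

y_bar = predict(24.59)                        # Question 4: point of means is on the line
pred_30 = predict(30)                         # Question 5
pct_explained = 100 * r ** 2                  # Question 11
two_unit_change = 2 * b                       # Question 15
res_16 = residual(32, 28)                     # Question 16
res_20 = residual(25, 20)                     # Question 20
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # Question 22 (assumed form of the statistic)

print(f"Q4  mean of y            = {y_bar:.2f}")
print(f"Q5  prediction at x = 30 = {pred_30:.2f}")
print(f"Q11 percent explained    = {pct_explained:.2f}")
print(f"Q15 change for 2 units   = {two_unit_change:.2f}")
print(f"Q16 residual of (32, 28) = {res_16:.2f}")
print(f"Q20 residual of (25, 20) = {res_20:.2f}")
print(f"Q22 t statistic, {n - 2} df = {t_stat:.2f}")
```

Students who have the worksheet can instead obtain these quantities from the regression output; the point here is only that each answer follows from the four numbers a, b, r and n.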

### Acknowledgement

The author wishes to thank Professor Patricia M. Burgess of Monroe Community College for her many insightful and valuable comments on this topic.

### References

• Anderson, T. W. and Finn, J. D. (1996). The New Statistical Analysis of Data. New York, NY: Springer-Verlag.
• Jensen, A. R. (1972). Genetics and Education. New York, NY: Harper & Row.
• Miller, P. (2012). A thing or two about twins. National Geographic, 221(1), 38–65.
• Segal, N. L. (2012). Born Together – Reared Apart: The Landmark Minnesota Twin Study. Cambridge, Massachusetts: Harvard University Press.
• Shields, J. (1962). Monozygotic Twins: Brought Up Apart and Brought Up Together. London: Oxford University Press.

### Supporting Information

Additional Supporting Information may be found in the online version of this article at the publisher's web-site.

| Filename | Format | Size | Description |
| --- | --- | --- | --- |
| test12046-sup-0001-MINITAB_WORKSHEET.CSV | application/unknown | 0K | Supporting info item |
