The classical birthday problem is well known among statisticians and even among some school children. It often comes as one of their early lessons in probability – perhaps because it involves birthdays, which all children like, and it has a somewhat unexpected answer. It is also fairly simple to solve. It can be expressed as follows: “How large a group of people do you need to make it more likely than not that two of them share a birthday?” Or, in a classroom: “How big does this class have to be before two children are likely to have the same birthday?” Most children (and most adults) would guess the answer to be around half of 365 – call it 183, since you cannot have half-people; of course they are way out. The correct answer is 23 – see the box. In a class of 23 children, it is more likely than not that two of them have the same birthday. On a football pitch there are 23 people if you count the referee; in about half of all football matches there will be at least one shared birthday on the pitch.
How many children do you need in a class before two of them are likely to have the same birthday?
The way to work this out is to calculate the probability that the children all have different birthdays, and then subtract that answer from 1. Suppose the class announce their birthdays in turn. Child A has 365 possible days for his birthday (we will ignore leap years, here and throughout, as one complication too many); suppose it is June 27th. Child B has one chance in 365 of having the same birthday – so 364 chances in 365 of having a different one. Say hers is April 3rd. If we have got this far without a shared birthday, Child C announces his birthday (it is November 5th); the probability that it is neither of those two dates is 363 out of 365. So the probability of the first three children all having different birthdays is 364/365 × 363/365, which is about 0.99179.
Child D has her turn. Hers must not be on any of the three dates mentioned so far, which leaves 362 possible days out of 365; so the probability that all four birthdays are different is the previous answer multiplied by 362/365 (which comes to 0.98364); and so on down the rest of the class. By the time the 23rd child gives his birthday, the probability that all of them so far are different is 364/365 × 363/365 × 362/365 × 361/365 × … × 343/365. This comes out to 0.493, which is a little under a half. In other words, the chance that all 23 birthdays are different is just under 50–50.
And that in turn means that with 23 children the chance that not all the birthdays are different, so that at least two of them share the same day, is just over 50–50 (about 0.507).
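The arithmetic in the box can be checked with a few lines of Python. This is only a minimal sketch under the same assumptions as the box (365 equally likely days, independent birthdays); the function names `prob_all_different` and `prob_shared` are illustrative and not part of the original article.

```python
from fractions import Fraction


def prob_all_different(n, days=365):
    """Probability that n people all have different birthdays,
    assuming `days` equally likely, independent birthdays."""
    p = Fraction(1)
    for k in range(n):
        # The (k+1)-th person must avoid the k days already taken.
        p *= Fraction(days - k, days)
    return p


def prob_shared(n, days=365):
    """Probability that at least two of n people share a birthday."""
    return 1 - prob_all_different(n, days)


# The three group sizes worked through in the box.
for n in (3, 4, 23):
    print(n, round(float(prob_all_different(n)), 5), round(float(prob_shared(n)), 5))
```

Exact fractions are used so that no rounding creeps in before the final conversion to a decimal; ordinary floating-point numbers would do equally well for class-sized groups.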
So far so simple. And we can continue beyond 50–50 chances: in a party of 57 people there is a 0.99 probability of a shared birthday; 70 people are enough to give a 0.999 probability¹. Even for a group of 100 people the probability is 0.999999693, which is as close to one as makes no real difference. And of course if you have 366 people in a room (or 367 if you really want leap years), a shared birthday is certain, since there are more people than possible birthdays.
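The same product can be turned around to ask how many people are needed for a given level of confidence. The sketch below again assumes 365 equally likely, independent birthdays; `smallest_group` is a hypothetical helper name, and it is meant for targets strictly below 1, because floating-point rounding cannot distinguish "almost certain" from "certain".

```python
def prob_shared(n, days=365):
    """Probability that at least two of n people share a birthday (uniform case)."""
    q = 1.0  # probability that all birthdays so far are different
    for k in range(n):
        q *= (days - k) / days
    return 1.0 - q


def smallest_group(target, days=365):
    """Smallest n for which the shared-birthday probability reaches `target` (< 1)."""
    n, q = 0, 1.0
    while 1.0 - q < target:
        q *= (days - n) / days  # add one more person to the group
        n += 1
    return n


for target in (0.5, 0.99, 0.999):
    print(target, smallest_group(target))
print(100, prob_shared(100))
```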
However, we have made two assumptions in all of this. Firstly, we supposed that birthdays are independent – that the date of one birthday does not affect the date of another. This would not be true if we had done the experiment at a convention of twins and triplets – one twin's birthday would determine the birthday of someone else in the room. We have also assumed equiprobability: that birthdays are evenly distributed throughout the year – that January 1st, April 23rd and September 19th are all equally likely as birthdays, each with a probability of 1/365.
The assumption of uniform birth dates simplifies these calculations considerably; however, it is not actually true. There is much evidence that it does not hold for real human populations: birthday distributions actually depend on social, religious, economic and environmental factors. Figure 1 illustrates these patterns for children born in 2011 in four European and four American countries. It uses the adjusted proportions of months of birth and approximate month of conception, and takes into account the different number of days in months.
The countries shown have substantial variations in the months when peaks occur, in their amplitudes, and in the deviations from the hypothesis of uniformity. Western Europe has birth-peaks from July to October, with spring troughs; in the Americas, Brazil's births are distributed in an opposite pattern. These differences may reflect environmental factors, notably temperature, which has an effect on human fertility and on human desires². However, aggregated national monthly figures like these conceal effects resulting from (i) the well-known variability of the number of births by day of the week (hospitals discourage weekend births); (ii) public holidays, which may lead to increases in conceptions; and (iii) regional variations. What effect does this non-uniform spread of birthdays have on our calculations? Does it increase the probability of common birthdays or decrease it?
It has been shown³ that any deviation from the equiprobable birthday model increases the probability of finding at least two individuals with a common birthday in a group of size n. The probabilities calculated above are therefore lower bounds for any real-life, non-uniform distribution, and the group sizes we found (23 for a 50–50 chance, for example) are upper bounds on the number of people needed. It is also known that the adjustments required to account for such empirical distributions tend to be small: too small to change the results from the equiprobability assumption, even by one person⁴.
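For readers who want to experiment, the classical problem can be solved exactly for any set of daily probabilities: the chance that n birthdays are all different is n! times the n-th elementary symmetric polynomial of the probabilities, which a short recurrence computes. The sketch below is an illustration only. The seasonal distribution in it is hypothetical (a 10% sinusoidal swing over the year), not the data of Figure 1, and the function name is an assumption of this example.

```python
import math


def prob_shared_general(n, probs):
    """P(at least two of n people share a birthday) for arbitrary,
    independent day probabilities `probs` (which must sum to 1)."""
    # f[k] = probability that k people all have different birthdays,
    # counting only the days processed so far.
    f = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):
            # Any one of the k people may take the new day; the other
            # k - 1 must already be on distinct earlier days.
            f[k] += k * f[k - 1] * p
    return 1.0 - f[n]


days = 365
uniform = [1 / days] * days

# HYPOTHETICAL seasonal pattern, for illustration only.
raw = [1 + 0.1 * math.sin(2 * math.pi * d / days) for d in range(days)]
seasonal = [w / sum(raw) for w in raw]

print(prob_shared_general(23, uniform))
print(prob_shared_general(23, seasonal))
```

With a modest seasonal swing of this kind the second figure should come out slightly larger than the first, in line with the result quoted above, while the difference stays small.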
The strong birthday problem
There are many generalisations and variants of the birthday problem. For example, one can study multiple coincidences (for example, more than two people having the same birthday: if you want three children to be likely to share a birthday you need a class of 88), or almost-coincidences (the number of people required to have at least two persons whose birthdays fall at most one or two or three days apart), or birthdays shared among people of different characteristics (e.g., males sharing their birthday with females in a population). Two excellent reviews of such generalisations and methods to study them are available⁵,⁶. In this article we review what is known as the “strong birthday” problem as seemingly first defined by Anirban DasGupta⁵ in 2005. It refers to the probability that not just one person but everybody in a group of n individuals has a birthday shared by someone else in the group. Nobody in the group has a lone birthday. Clearly you need more people for this than for the classical birthday problem: n must be greater than 23, but how much greater?
Suppose that there are m classes (for example, m = 365 for daily dates of birth), into which each of a group of n people falls with equal probability p_r = 1/m for r = 1, …, m. Let N be a random variable which counts the number of unique individuals (those whose birthdays are not shared by anybody else) out of a group of size n; N must lie between 0 and whichever of m or n is smaller. It can be shown⁵ that
and using this expression for N = 0, the probabilities of everybody in a group of n individuals having a shared birthday are given in Table 1.
Table 1. Probabilities of no lone birthdays for the strong birthday problem
2 × 10⁻¹⁹
Note that if n = 1 the probability of no lone birthdays is trivially 0; for n = 2 it is 1/365 = 0.0027; it then decreases, reaching its minimum around n = 365, and rises thereafter, first at a very slow rate and then quickly gathering pace for n > 2000.
With 23 people, the probability of no lone birthdays is vanishingly small. Even with around 2200 people, it is only 0.004 – around 1 chance in 250 that there is no lone birthday. The largest West End theatres seat about 2200 people, so if a pantomime offered a prize to any lucky person with a lone birthday in its audience it would have to pay out almost every time. When you get more than 2300 people together, however, the probability of no lone birthdays grows rapidly. It hits 50% at 3064 individuals – say, the number staying overnight at a very large hotel.
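Figures of this kind can be reproduced with a standard inclusion-exclusion argument over the people who hold a day entirely to themselves. The sketch below is my own illustration under the equal-probability assumption and is not a transcription of the expression cited in the text; the function name is hypothetical. It works in exact integers because the alternating terms are enormous and would swamp floating-point arithmetic.

```python
from math import comb, perm


def prob_no_lone_birthday(n, m=365):
    """P(every one of n people shares a birthday with someone else),
    assuming m equally likely, independent birthdays."""
    total = 0
    for i in range(min(m, n) + 1):
        # Choose i people, give them i distinct days of their own, and
        # let the other n - i people fall anywhere on the remaining days.
        term = comb(n, i) * perm(m, i) * (m - i) ** (n - i)
        total += -term if i % 2 else term
    # Exact integer count of favourable outcomes over all m**n outcomes.
    return total / m ** n


for n in (2, 23, 2200, 3064):
    print(n, prob_no_lone_birthday(n))
```

As a small sanity check, with only two possible days and three people the function returns 0.25, which matches the two favourable outcomes (all three on the same day) out of eight.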
The strong birthday problem with unequal probabilities
The simple birthday problem was very easy. The strong birthday problem with equal probabilities for every birthday was more complex. The strong birthday problem for no lone birthdays with an unequal probability distribution of birthdays is very hard indeed.
[Photograph: Two of the players will probably share a birthday. Credit: Hieu Le/iStock/Thinkstock]
The following expression is available⁵, where m is the number of classes (again 365):
Though the mathematics is straightforward, the computational cost of evaluating it is fiendish, indeed infeasible. Those interested can see it investigated on the Significance website at www.significancemagazine.com.
However, as in the classical birthday problem, an unequal distribution of birthdays makes only a small difference: using a uniform distribution of birthdays approximates well to most real-life cases⁴, unless the birthdays are concentrated on a few high-frequency days with the other days occurring very rarely. (It would not work for a gathering of astrologers born under Aries, for example.) Using equation (1) I calculated the probability of no lone birthdays for the real-life birthday distributions shown in Figure 1, as well as for several other empirical distributions. I did it for gatherings ranging from 1 to 5000 people and compared these probabilities to those from the uniform case; all along the range the absolute differences were less than 1%. Besides this empirical evidence, it is possible to develop a formal argument⁴ to show that the adjustments required to cope with unequal probabilities are as small as those in the classical birthday problem.
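A reader without access to the exact machinery can still get a feel for the unequal-probability case by simulation. The sketch below is a plain Monte Carlo estimate, which is a different technique from the calculation described above, and it gives only approximate answers whose accuracy depends on the number of trials. The function name, the default number of trials and the seed are all assumptions of this example; any list of non-negative daily weights can be supplied.

```python
import random
from collections import Counter
from itertools import accumulate


def estimate_no_lone_birthday(n, weights, trials=1000, seed=0):
    """Monte Carlo estimate of P(no lone birthday) in a group of n people,
    where day d is drawn with probability proportional to weights[d]."""
    rng = random.Random(seed)
    days = range(len(weights))
    cum = list(accumulate(weights))  # cumulative weights speed up sampling
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.choices(days, cum_weights=cum, k=n))
        # Success if every birthday that occurs is held by at least two people.
        if all(c >= 2 for c in counts.values()):
            hits += 1
    return hits / trials


uniform = [1] * 365
print(estimate_no_lone_birthday(3064, uniform, trials=400))
```

Swapping `uniform` for an empirical set of daily birth counts gives a rough estimate for that population; with a few hundred trials the estimate is only good to a few percentage points, so many more trials would be needed to resolve differences of less than 1%.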
The answer to the strong birthday problem is 3064. This is the number of people that need to be gathered together before there is a 50% chance that everyone in the gathering shares a birthday with at least one other. A non-uniform spread of birthdays through the year has very little effect on the strong birthday problem; so it seems that 3064 is very close to reality for virtually all populations of interest. Apart from the obvious areas in which this solution might be seen as useful – such as a particularly brilliant answer in the University Challenge TV quiz, or as a fascinating topic of pub conversation – there are actually some potentially important applications. DasGupta⁵ has mentioned using the strong birthday problem methodology in criminology, for instance in calculating the expected number of matched physical features in lookalikes. Another application is in record linkage problems, where the distribution of the number of non-matched individuals with data in two files is of interest. The methodology can also be used in cryptographic attacks on digital signatures, for example. These applications lie beyond the scope of this brief review; but happy birthdays do not.