An index of Chinese surname distribution and its implications for population dynamics

Abstract. Objective: We propose an index to characterize the key feature of Chinese surname distributions and investigate its implications for population structure and dynamics. Materials and methods: The surname dataset was obtained from the National Citizen Identity Information Center and contains 1.28 billion Chinese citizens enrolled in 2007, excluding those of Hong Kong, Macao, and Taiwan. An index, the coverage ratio of stretched exponential distribution (CRSED), is proposed based on the crossover point of the stretched exponential truncated power-law distribution, where the stretched exponential term and the power-law term contribute equally. We use the multidimensional scaling technique to demonstrate how the similarity of one prefecture to the others depends on the CRSED. Results: The CRSEDs of 362 prefectures exhibit an uneven distribution. The consistency of this index is evidenced by strong positive correlations of CRSEDs at the three administrative levels. This new index has a strong negative correlation with the proportion of rare surnames. The prefectures with similar CRSEDs tend to adjoin each other on the administrative map, resulting in several distinct regions, each of which shares similar terrain features or historical migrations. The prefectures with lower CRSEDs are more dissimilar to the other prefectures, while the ones with higher CRSEDs are more similar to the others. Discussion: The population dynamics of the prefectures with higher CRSEDs are more likely dominated by migratory movements, whereas the dominant evolutionary forces in the prefectures with lower CRSEDs can be attributed to drift and mutation.

In most countries and regions, surname distributions are found to follow a power law in their frequency distribution, cumulative distribution, or Zipf plot (Baek, Kiet, & Kim, 2007; Miyazima, Lee, Nagamine, & Miyajima, 2000; Zanette & Manrubia, 2001). However, other kinds of surname distributions have also been observed. For example, the logarithmic form of the cumulative surname distribution of Korea has remained unchanged for five centuries (Kim & Park, 2005). Similarly, the top 100 most popular surnames in China exhibit an exponential Zipf plot, which has been maintained since the Song dynasty (Baek et al., 2007; Yuan & Zhang, 2002). Nevertheless, the cumulative distributions of surnames at all three levels of province, prefecture, and county in China were found to follow a unified form of stretched exponential truncated power-law (Chen, Chen, Liu, Wang, & Wang, 2011).
Surname distribution, as an integrated result of evolutionary forces such as drift, mutation, and migration, contains important information about population dynamics. For example, Pavesi et al. studied the surname distributions of 312 communes in Sicily, all of which could be regarded as power-law type. However, the fitted exponents, which varied from 0.46 to 1.83, appeared to be associated with the level of isolation, indicating that the relative strength of migratory movements in these communes may govern the population dynamics (Pavesi, Pizzetti, Siri, Lucchetti, & Conterio, 2003). This result raises the question of whether any other key features of surname distribution can also be regarded as indicators of population dynamics. In this article, we employ surname distributions in China to address this question.
Chinese surnames are quite suitable for investigating the implications of surname distribution for population dynamics (Chen et al., 2011; Liu, Chen, Yuan, & Chen, 2012; Shi et al., 2018; Shi et al., 2019). Chinese surnames have been well preserved through generations due to the prevalence of Confucian culture, in which people do not change their surnames unless they have to (Du & Yuan, 1995; Du, Yuan, Hwang, Mountain, & Cavalli-Sforza, 1992). This has allowed long-term random drift to operate for more than 4,000 years. During this process of random drift, there were also many large-scale migratory movements in the history of China. As a result, Chinese surnames have experienced long-term integration between locals and migrants. However, the scale of these immigrations and their effects on the local population differ considerably from region to region, and such variations will definitely be embodied in surname distributions. In fact, the cumulative distribution of Chinese surnames follows a unified form of stretched exponential truncated power-law, but the fitted parameters vary greatly across regions (Chen et al., 2011), which must be associated with different migratory movements.
In this article, a new index of surname diversity, the coverage ratio of stretched exponential distribution (CRSED), is put forward in Subsection 2.3 to characterize the importance of the stretched exponential term relative to the power-law term in this kind of surname distribution.
That is, a surname distribution with a higher CRSED corresponds to a more stretched-exponential-like distribution, while that with a lower CRSED corresponds to a more power-law-like form. The implications of CRSED for population structure are thoroughly investigated at the level of prefecture in Section 3.1. Then, three aspects of CRSED are investigated, including the consistency of CRSED at the three administrative levels in Section 3.2, the spatial distribution of CRSED and the corresponding features of geographic environment and historical migratory movements in Section 3.3, and the relevance of CRSED for each prefecture to its degree of surname similarity with other prefectures in Section 3.4. Based on the results, a hypothesis on the relationship between CRSED and population dynamics is put forward and qualitatively explained in Section 4.

| Data and materials
The surname dataset in this article was obtained from China's identity information system, which was constructed by the National Citizen Identity Information Center. The data contain 1.28 billion people who were enrolled in 2007 and who live in mainland China, excluding Hong Kong, Macao, and Taiwan. In the Chinese naming system, most surnames consist of a single Chinese character that precedes the given name, so the first Chinese character of one's name is taken as his/her surname. However, a few surnames, such as Ouyang (欧阳), Zhuge (诸葛), and Linghu (令狐), consist of multiple characters. In these cases, taking the first Chinese character as the surname may result in some inaccuracy. However, this inaccuracy is so slight that it can be dismissed, given the rarity of such surnames. Further preprocessing is necessary, including removing non-Chinese-character surnames and merging the surnames expressed in traditional characters into the corresponding simplified ones. After these operations, we obtain a total of 7,184 surnames, which are used in the following analysis.
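The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the names, the tiny traditional-to-simplified mapping `TRAD_TO_SIMP`, and the Unicode-range test in `is_cjk` are hypothetical stand-ins, not the procedure actually applied to the dataset.

```python
from collections import Counter

# Hypothetical excerpt of a traditional-to-simplified mapping; a real one would be far larger.
TRAD_TO_SIMP = {"劉": "刘", "陳": "陈", "張": "张"}

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (an assumed filter)."""
    return "\u4e00" <= ch <= "\u9fff"

def surname_counts(full_names):
    """Take the first character of each name as the surname and count head sizes."""
    counts = Counter()
    for name in full_names:
        if not name or not is_cjk(name[0]):
            continue  # drop non-Chinese-character surnames
        first = name[0]
        counts[TRAD_TO_SIMP.get(first, first)] += 1  # merge traditional into simplified
    return counts

# Made-up names for illustration only
names = ["王伟", "劉洋", "刘芳", "欧阳明", "Smith"]
counts = surname_counts(names)
```

Note that the two-character surname 欧阳 is counted under 欧 here, which is the slight inaccuracy mentioned above.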
China is an integrated country of multiple ethnic groups, with the Han as the largest one accounting for 91.4% of the total population and with 55 ethnic minority groups. The naming systems of most ethnic minorities are the same as that of the Han. However, some ethnic minorities have different naming systems or even have no surname at all (Qian, 1989). In the latter case, surnames have been assigned using the first character of their names so that all the surnames can be treated in a consistent way. As a result, surname distributions in the prefectures with a high proportion of these ethnic minorities may be extraordinary.

| Previous index on surname structure
Isonomy is one of the most commonly used indices in surname research. The isonomy within a region $i$ is defined as $I_i = \sum_{k=1}^{S} p_{ki}^{2}$, where $p_{ki}$ is the proportion of the population with surname $k$ in the entire population of region $i$, and $S$ is the total number of surnames. The isonomy between two regions $i$ and $j$ is defined as $I_{ij} = \sum_{k=1}^{S} p_{ki} p_{kj}$. The isonomy within a region characterizes the within-population structure, while the isonomy between two regions reveals another aspect of population structure, the between-population similarity.
The difference in population structure between any two regions can be measured by surname distance. There are several definitions of surname distance, such as Lasker's distance (Rodriguez-Larralde et al., 1998), Euclidean distance, and Nei's distance (Cavalli-Sforza & Edwards, 1967). Nei's distance, which can also be taken as a specially normalized form of the isonomy between two regions, is commonly used in relevant works and will be adopted in this article. Specifically, Nei's distance between regions $i$ and $j$ is defined as $N_{ij} = -\log\left(I_{ij} / \sqrt{I_i I_j}\right)$ (Nei, 1972).
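As a minimal sketch of these definitions, the following Python snippet computes the isonomy within a region, the isonomy between two regions, and Nei's distance from surname counts. The surname counts and function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified here.

```python
import math

def proportions(counts):
    """Convert {surname: head count} into {surname: proportion of the population}."""
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def isonomy_within(p):
    """I_i = sum over surnames k of p_ki squared."""
    return sum(v * v for v in p.values())

def isonomy_between(p, q):
    """I_ij = sum over surnames k of p_ki * p_kj; absent surnames contribute zero."""
    return sum(v * q.get(name, 0.0) for name, v in p.items())

def nei_distance(p, q):
    """N_ij = -log(I_ij / sqrt(I_i * I_j)), with the natural logarithm (an assumption)."""
    i_ij = isonomy_between(p, q)
    if i_ij == 0.0:
        return math.inf  # the two regions share no surname
    return -math.log(i_ij / math.sqrt(isonomy_within(p) * isonomy_within(q)))

# Hypothetical toy counts for two regions
region_a = proportions({"Wang": 50, "Li": 30, "Zhang": 20})
region_b = proportions({"Wang": 20, "Li": 30, "Chen": 50})
distance_ab = nei_distance(region_a, region_b)
```

Because the squared proportions enter the sums, the most popular surnames dominate both isonomies, which is the limitation of isonomy-based measures discussed in this subsection.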
The isonomy analysis is helpful for measuring the structure and regional consanguinity of the Chinese population, as shown in previous studies (Du et al., 1992;Yuan, Jin, & Zhang, 1999;Yuan, Zhang, Ma, & Yang, 2000). However, the definition of isonomy implies that the popular surnames have absolute dominance over the less popular ones, so the information contained in the less popular surnames can-

| A new index of surname distribution: CRSED
The cumulative distribution function (CDF) of Chinese surnames for all the administrative levels can be fitted with a stretched exponential truncated power-law function (Chen et al., 2011), that is,
$$P(n) \propto n^{-b} e^{-(n/c)^{d}}, \qquad (1)$$
where $P(n)$ represents the proportion of surnames whose sizes are no less than $n$, $b$ is the power exponent, $c$ is the cutoff size of the power-law part, and $d$ is the stretch parameter in the stretched exponential function (Bonabeau, Dagorn, & Fréon, 1999).
Actually, the function exhibits a crossover from the power-law form to the stretched exponential one. Specifically, the function looks like power-law in the domain where the value of n is small enough, while it will transform into a stretched exponential function when n is large enough. Although the parameter c is commonly taken as the cutoff size of power-law, a more justified crossover point will be defined as follows.
The right side of Equation (1) is the product of two terms, the power-law $n^{-b}$ and the stretched exponential $e^{-(n/c)^{d}}$; thus the first derivative of the function contains two parts, one from the first derivative of $n^{-b}$ and one from that of $e^{-(n/c)^{d}}$:
$$\frac{dP(n)}{dn} \propto -b\, n^{-b-1} e^{-(n/c)^{d}} - \frac{d}{c}\left(\frac{n}{c}\right)^{d-1} n^{-b} e^{-(n/c)^{d}}. \qquad (2)$$
The relative importance of the power-law term and the stretched exponential term in this function can be determined by the relative magnitude of their counterparts in the first derivative. According to Equation (2), the crossover point can be reasonably defined as the point where the two derivative parts are equal to each other, that is, $b/n = (d/c)(n/c)^{d-1}$, which gives the expression of the crossover point as follows:
$$n_0 = c\left(\frac{b}{d}\right)^{1/d}. \qquad (3)$$
Specifically, the power-law form dominates in the domain of $n < n_0$, while the stretched exponential form dominates in the domain of $n > n_0$ (Chen et al., 2011). Combining the definition of $n_0$ in Equation (3) and the definition of $P(n)$ as the proportion of surnames whose sizes are no less than $n$, the value of $P(n_0)$ represents the proportion of surnames that fall into the domain of the stretched exponential form.
Thus, $P(n_0)$ is termed the coverage ratio of stretched exponential distribution (CRSED), which can be used as an index to characterize the key feature of the stretched exponential truncated power-law distribution. Generally speaking, a higher CRSED corresponds to a more stretched-exponential-like distribution, while a lower CRSED corresponds to a more power-law-like one.
In order to estimate the CRSED of a surname distribution, the CDF profile should be used instead of the fitted curve. Specifically, the value of $n_0$ on the fitted curve will be estimated with the fitted parameters $b$, $c$, and $d$ according to Equation (3). Then the crossover point has to be set as the ceiling of $n_0$, and the CRSED on the actual CDF will be the proportion of surnames whose sizes are no less than this crossover point. In the extreme case, if the estimated value of $n_0$ on the fitted curve is less than one, it has to be set as one and then the CRSED will be set to be 100%. For simplicity, the same symbol $n_0$ is used to represent the fitted value and its ceiling, and the same index CRSED is used to represent $P(n_0)$ of the fitted curve and the real data.
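The estimation procedure can be sketched in a few lines of Python. The sketch assumes that the crossover point takes the form $n_0 = c\,(b/d)^{1/d}$, which is what equating the two derivative parts of $n^{-b} e^{-(n/c)^{d}}$ yields; the function names, parameter values, and surname sizes are hypothetical.

```python
import math

def crossover_point(b, c, d):
    """Crossover n0 on the fitted curve, assuming n0 = c * (b / d) ** (1 / d)."""
    return c * (b / d) ** (1.0 / d)

def crsed(surname_sizes, b, c, d):
    """Proportion of surnames whose size is no less than the ceiling of n0."""
    n0 = crossover_point(b, c, d)
    if n0 < 1:
        return 1.0  # extreme case: n0 is set to one, so the CRSED is 100%
    threshold = math.ceil(n0)
    covered = sum(1 for size in surname_sizes if size >= threshold)
    return covered / len(surname_sizes)

# Hypothetical surname sizes (head counts) and fitted parameters
sizes = [1, 1, 1, 2, 3, 5, 8, 40, 100, 500]
value = crsed(sizes, b=0.5, c=100, d=0.5)  # n0 = 100, two of ten surnames qualify
```

The proportion is counted on the actual surname sizes rather than read off the fitted curve, in line with the procedure described above.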
The intuitive meaning of CRSED can be illustrated with two typical prefectures, Nanjing and Guangzhou, as shown in Figure 1a.
The correlations between the fitted parameters in Equation (1) and the CRSED or $n_0$ are also important for understanding the new index, especially the necessity of introducing $n_0$. According to Equation (3), $n_0$ is determined by the three fitted parameters $b$, $c$, and $d$.
There is a significantly positive correlation between $n_0$ and the cutoff size $c$, while the former is roughly one order of magnitude smaller than the latter, as shown in Figure 3a.
[...] distributions at the prefecture level mentioned above, the distributions at the other two levels will also be investigated.
The histogram of CRSEDs for all the 31 provinces (or municipalities, autonomous regions, or special administrative regions) is shown in Figure 5a. There are nine provinces whose CRSEDs are <0.04 and only two provinces whose CRSEDs are 1. By comparing the histogram in Figure 5a with that in Figure 2b, it can be inferred that the histogram of CRSED at the province level is more concentrated at the lower part, implying that the surname distributions are more power-law-like than those at the prefecture level. In contrast, the opposite holds at the county level, as shown in Figure 4c, where the CRSED histogram of the 2,832 counties is more concentrated at the higher part.
There are 63 counties whose CRSEDs are <0.04 and 1,360 counties whose CRSEDs are 1. Thus, there is an evident trend that the CRSEDs at a higher administrative level are relatively lower, implying that the surname distributions at the higher level are more power-law-like.
In order to check the consistency of CRSED between the province level and the prefecture level, the average CRSED of the prefectures within each province is calculated. There is a positive correlation between the CRSEDs at the province level and the average CRSEDs at the prefecture level as shown in Figure 5b. The Spearman correlation coefficient is .79. Thus, the prefectures within a province with a lower CRSED are more likely to have relatively lower CRSED and vice versa.
This indicates a consistency of CRSEDs at the provincial and prefectural levels. Such consistency can be further confirmed by the comparison between the prefecture level and the county level. As shown in Figure 5d, there is also a positive correlation between the CRSEDs at prefecture level and the average CRSEDs of counties within each prefecture and the Spearman correlation coefficient is .88.
Overall, it can be concluded that the CRSEDs are qualitatively consistent at the three administrative levels and thus the CRSED can be regarded as a valid index in characterizing surname distributions in China.
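The consistency check described in this subsection can be sketched as follows: average the CRSEDs of the lower-level units within each higher-level unit, then compute the Spearman coefficient as the Pearson correlation of the ranks. All values and unit names are hypothetical toy data, and the helper functions are illustrative rather than the code used for the reported coefficients.

```python
from collections import defaultdict

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_by_parent(child_crsed, parent_of):
    """Average the CRSEDs of the lower-level units within each higher-level unit."""
    groups = defaultdict(list)
    for unit, value in child_crsed.items():
        groups[parent_of[unit]].append(value)
    return {parent: sum(v) / len(v) for parent, v in groups.items()}

# Hypothetical toy values, not taken from the dataset
province_crsed = {"P1": 0.02, "P2": 0.10, "P3": 0.60}
prefecture_crsed = {"a": 0.05, "b": 0.07, "c": 0.30, "d": 0.20, "e": 0.90, "f": 1.00}
parent_of = {"a": "P1", "b": "P1", "c": "P2", "d": "P2", "e": "P3", "f": "P3"}

means = mean_by_parent(prefecture_crsed, parent_of)
names = sorted(province_crsed)
rho = spearman([province_crsed[n] for n in names], [means[n] for n in names])
```

A rank-based coefficient suits this comparison because many CRSEDs sit at the boundary value of 1, which produces ties and a strongly non-normal distribution.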

| Geographical representation of CRSED
The geographical distribution of the CRSEDs at the prefecture level on the Chinese administrative map is represented in Figure 6. Group I represents the prefectures whose surname distributions look most like a power-law function, while Group IV represents the ones whose surname distributions are almost a stretched-exponential function.
The spatial association of the CRSEDs can be easily obtained in this way. Strikingly, an explicit pattern appears: the prefectures in the same group tend to adjoin each other geographically. As a result, the prefectures in each of the four groups form a distinct geographical region with only a few outliers. Furthermore, Groups I, II, III, and IV are located on the map sequentially from the south to the north, with the CRSEDs of the corresponding prefectures increasing gradually in this direction.
Next, the general features of each group will be explained, including geographic environment and historical background, especially long-term migratory movements.
The prefectures in Group I are mainly located in the south and west of China. In most of these prefectures, the terrain is mountainous or hilly, the population contains a relatively high proportion of ethnic minorities, and the local language embodies a specific dialect.
Therefore, the people in each of these prefectures are relatively isolated from those living in neighboring prefectures. As a consequence, there have been relatively few migratory movements between these prefectures and others according to the historical records. Thus, the prefectures in Group I can be taken as the "Isolated Region" hereafter.
The Yangtze River basin is a land abundant in water resources and products, making it very suitable for humans to live in. For long stretches of history, especially after the Song dynasty in the 11th century, people in northern China continued to move from the Yellow River basin to the Yangtze River basin, forming another center of the population there (Tian, 1998). After these long-term, continuous immigrations, a mixture of populations with different origins had accumulated in the Yangtze River basin, resulting in the highest level of surname diversity (Liu et al., 2012). Therefore, this region can be named the "Immigration Region."
Most prefectures in Group III are situated in central and northern China along the Yellow River basin. It is well known that the Yellow River basin was the core birthplace of Chinese civilization, as it housed most of the capital cities of ancient empires before the Song dynasty, including Xi'an, Luoyang, and Kaifeng. However, since the Song dynasty, there have been continuous and massive emigrations to other areas, such as "Moving the capital to Lin'an" during the buildup of the Southern Song dynasty. Therefore, this region can be regarded as the "Emigration Region."
The prefectures in Group IV are mainly located in the northeast and northwest of China. Due to the frigid climate, there was a rather small population in northeast China until the Qing dynasty, and most of the current inhabitants came from the Yellow River basin during the migratory movement of "Braving the journey to the northeast of China" or "Rush to Northeast" in the last two centuries (Fan, 2005).
Regarding the northwest of China, although the famous Silk Road was there, its population has also remained small due to the desert climate. However, the

| CRSED and surname distance
All the above analysis on CRSED characterizes the surname structure within a given prefecture. Next, the relevance of CRSED for each prefecture to its degree of surname similarity with other prefectures will be addressed.

FIGURE 6 The geographic representation of the CRSEDs for 362 prefectures. CRSED, coverage ratio of stretched exponential distribution

The (dis)similarity of surname structure between any two prefectures can be measured by their surname distance. In order to show surname distances among all the prefectures graphically, a nonlinear dimensionality reduction technique, multidimensional scaling (MDS), will be used.
The MDS technique places each object in a low-dimensional space while preserving the between-object distances as well as possible (Kruskal, 1964).
With Nei's surname distance matrix among the prefectures as the input, each prefecture will be represented as an object on a two-dimensional space by MDS technique so that the prefectures with smaller Nei's distances are more likely to be close to each other.
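To illustrate how a distance matrix is turned into two-dimensional coordinates, the sketch below uses classical (Torgerson) MDS, which is a simpler variant swapped in for illustration and not the nonmetric procedure of Kruskal (1964) cited above. It assumes NumPy is available; the 4x4 matrix is a hypothetical stand-in for a matrix of Nei's distances.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed objects in n_components dimensions from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram (inner-product) matrix
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical distance matrix (the corners of a 3-by-4 rectangle)
toy = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])
coords = classical_mds(toy)  # one row of (x, y) coordinates per object
```

For a real matrix of Nei's distances the embedding is only approximate, so objects with small mutual distances end up close together without the distances being reproduced exactly.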
There is an evident feature on the two-dimensional space of Nei's distance, as shown in Figure 7a. Nei's distances for each group are shown in Figure 7b.

| Qualitative explanation
The relationship between CRSED and population dynamics in China can be qualitatively interpreted by the simple model of population dynamics proposed by Baek et al. (2007), who argued that the difference in surname distributions may originate from the difference in the appearance of new surnames. That is, if the number of new surnames generated per unit of time is proportional to the population size, the power-law distribution can be derived, whereas if new surnames appear linearly in time irrespective of the total population size, the logarithmic distribution of surnames can be obtained.
More specifically, for prefectures in Group I (or Isolated Regions), whose surname distributions look most like a power-law function, the main source of new surnames should be mutation among local residents.
Thus, it is reasonable to presume that the rate of appearance of new surnames is proportional to the population size, resulting in a power-law-like surname distribution in these prefectures according to the model. In contrast, for prefectures in Group IV (or Reclaimed Regions), whose surname distributions are almost a stretched-exponential function, the population includes a large portion of migrants who could bring new surnames into the area. Since most migratory movements were driven by external forces, the rate of new surnames from migrants should be independent of the local population size; thus, the prerequisite for a power-law surname distribution is violated. Additionally, most immigrations at the prefectural level were discontinuous; thus, new surnames likely appeared nonlinearly in time, violating the prerequisite for a logarithmic surname distribution as well. As a result, the surname distributions in these areas must follow a new kind of function. This raises the question of why the surname distributions in Reclaimed Regions follow a stretched-exponential function. This issue is too complicated to be simply modeled because migrants may not only bring new surnames but also increase the population size of some existing surnames, and both effects are unrelated to the local surname composition.
It has to be pointed out that the relationship between CRSED and population dynamics can only be taken as a hypothesis at this stage. It seems to hold in China, but before it can be regarded as a general rule, much more convincing evidence and further theoretical work are required.