Regional surname affinity: A spatial network approach

Abstract Objective We investigate surname affinities among areas of modern‐day China by constructing a spatial network and performing community detection. The study reports a geographical genealogy of the Chinese population that results from population origins, historical migrations, and societal evolution. Materials and methods We acquire data from the census records supplied by China's National Citizen Identity Information System, including the surname and regional information of 1.28 billion registered Chinese citizens. We propose a multilayer minimum spanning tree (MMST) to construct a spatial network based on the matrix of isonymic distances, which is often used to characterize the dissimilarity of surname structure among areas. We use the fast unfolding algorithm to detect network communities. Results We obtain a 10‐layer MMST network of 362 prefecture nodes and 3,610 edges derived from the matrix of the Euclidean distances among these areas. These prefectures are divided into eight groups in the spatial network via community detection. We evaluate the partition by comparing the inter‐community and intra‐community distances and obtain a meaningful regional ethnicity classification. Discussion The visualization of the resulting communities on the map indicates that the prefectures in the same community are usually geographically adjacent. The formation of this partition is influenced by geographical factors, historic migrations, trade and economic factors, as well as isolation of culture and language. The MMST algorithm proves to be effective in geo‐genealogy and ethnicity classification because it retains essential information about surname affinity and highlights the geographical consanguinity of the population.

At the beginning of this century, an innovative clustering technique called "community detection" was proposed by researchers working in network science (Girvan & Newman, 2002). With the discovery of the "small world" and "scale free" phenomena (Barabási & Albert, 1999; Watts & Strogatz, 1998), complex networks have attracted much attention as a new systematic way of modeling complex systems (Barabási, 2016). They can also be used to study the relationships among individuals, groups, and organizations in human societies (Borgatti, Brass, & Halgin, 2014; Szell, Lambiotte, & Thurner, 2010).
Some researchers have used a network approach to examine naming connections and have used community detection to identify naming communities. Mateos, Longley, and O'Sullivan (2011) took the lead when they constructed a two-mode (bipartite) network of forename and surname associations and two one-mode networks of surnames and forenames by using a large population sample from 17 countries. Kowalska, Longley, and Musolesi (2015) expanded this work to 23 countries on four continents. Another approach to constructing surname networks uses similarities among surnames. Novotný and Cheshire (2012) built a Czech surname network using a similarity based on the pairwise probabilities of the co-occurrence of surnames. By applying the well-developed techniques of community detection to a naming network, these pioneering researchers found that the network representation clearly defines ethno-cultural boundaries. As far as we know, there is still little research in surname studies that concerns spatial networks and community detection in them.
Chinese surnames are a significant and remarkable data source in surname studies. The cultural continuity in China is one of the oldest in the world, and its hereditary surname history dates back approximately 5,000 years (Hanks, 2003). In traditional Chinese society, the effect of the small-scale peasant economy was such that people seldom moved from their homeland of origin (Lee, Fok, & Zhang, 2008).
Families sharing the same surname tended to live together, especially in the villages (Wu, 1927), and thus the Chinese regional surname structure strongly reflects regional consanguinity and ethnicity. Chinese culture is dominated by Han culture, in which the concept of the patrilineal surname is deeply rooted in people's minds.
Hence, Chinese people attach great importance to the surname and its inheritance. People cling to their surnames loyally and do not change them except in special circumstances, such as receiving a noble surname from the emperor or being adopted. Moreover, women do not change their surnames after marriage. As a result, Chinese surnames are paternally inherited in a stable and continuous way (Yuan & Zhang, 2002). Using sample data of surnames and short tandem repeats on the Y-chromosome (Y-STR) in Shandong province, Shi et al. (2018) found that Chinese surnames can be inferred from Y-STR profiles, indicating that Chinese surnames are an accurate data source for studying geo-genealogy and ethnicity classification.
We here apply a community detection algorithm to a spatial network to analyze surname affinities among geographic areas and create a regional surname geography. We use a large sample of Chinese surname data to construct the spatial networks. The network nodes are administrative regions at the prefectural level, and the edges are defined by isonymic distances. To guarantee that there are no isolated nodes in the network, we can construct it using the minimum spanning tree (MST) algorithm (Prim, 1957). Although the most essential edges are retained in the MST network, many important links are lost. To remedy this weakness, we modify this algorithm to retain as many essential edges as possible. The new algorithm, the multilayer minimum spanning tree (MMST), is an enhanced and expanded MST. We use the MMST to construct a spatial network with a topology that allows the implementation of community detection. As we will show, without any prior knowledge of the geographical information of the regions concerned, this method is able to produce a clear community structure in both the network topology and the geographical connections.

| Data and materials
There are 1.28 billion people listed in the data set, and they share 7,184 different surnames. Rodriguez-Larralde et al. (2011) compared the surname data of eight European countries, three South American countries, the United States, and Yakutia, and found that the number of different surnames in most countries exceeds 100,000, which is far more than in China. China has a relatively small number of different surnames even though its population is very large. This fact is attributed to the specific history and culture of Chinese surnames. Chinese surnames possibly originated about 5,000 years ago (Hanks, 2003). At the very beginning, a surname was a symbol of social status and nobility, and it included a clan name and a lineage name (He, Hu, Zhu, Xia, & Huang, 2016).
In the Han dynasty (206 BC-220 AD), lineage names became indistinguishable from clan names, and both evolved into modern Chinese surnames. Moreover, the patrilineal inheritance of Chinese surnames has been strongly maintained and reinforced by cultural constraints. Most Chinese surnames in use today have existed for about 2,000 years, as claimed by Yuan, Zhang, Ma, and Yang (2000). They compared the distributions of the 100 most common Chinese surnames in the Song dynasty (960-1279 AD), the Ming dynasty (1368-1644 AD), and the present day, and found that the three distribution curves nearly overlap. The stable distribution of surnames indicates that Chinese surnames have been well preserved over a long period. However, the Chinese population has risen almost twentyfold since the Han dynasty. As a result, the number of different surnames in China is far lower than in most countries in the world.
China is a multiethnic country with 55 minorities. The dominant Han people comprise 91.5% of the Chinese population. Because of the differences among surname cultures, the surname structure of some minority regions differs from that of the predominantly Han regions. In particular, people in some ethnic minority groups, such as the Yi, Miao, Tibetan, and Mongolian, do not use surnames. Unlike in most Western countries, Chinese people put their surnames before their given names. For the small population of citizens without surnames, NCIIC took the first Chinese character of their names as their surnames. Although these surnames are not paternally inherited, they are still influenced by regional naming cultures.

| Isonymic distance
Isonymy measures how frequently the same surnames are shared by two geographical areas (Lasker, 1977). It can also indicate the inbreeding frequency and biological relatedness within a given area (Crow, 1980; Crow & Mange, 1965; Lasker & Mascie-Taylor, 1985). The isonymy between areas $i$ and $j$ can be defined as $I_{ij} = \sum_{k=1}^{S} p_{ki} p_{kj}$, where $S$ is the total number of surnames in both areas, and $p_{ki}$ and $p_{kj}$ are the relative frequencies of surname $k$ in areas $i$ and $j$, respectively. Barrai et al. (1996) defined the inverse of isonymy in one area, $1 / \sum_{k=1}^{S} p_k^2$, as alpha ($\alpha$), which is called the effective surname number (Herrera Paz et al., 2014).
In surname studies, the isonymic distance measures the dissimilarity of surname structure between two areas. A small isonymic distance between two areas indicates that their surname structures are strongly similar. There are three ways of calculating the isonymic distance: Lasker's distance, Nei's distance, and the Euclidean distance. Lasker's distance is defined by $LD_{ij} = -\log(I_{ij})$ (Rodriguez-Larralde et al.), and Nei's distance follows Nei (1972). The matrix of isonymic distances can be used to express the multilateral dissimilarities of surname structures among different areas. It is typically used as the input data in a name-based ethnicity classification.
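The isonymy measures and the three distances can be illustrated with a minimal Python sketch. Lasker's distance follows the definition in the text. The forms used for Nei's distance, $ND_{ij} = -\log(I_{ij} / \sqrt{I_{ii} I_{jj}})$, and for the Euclidean distance, $ED_{ij} = \sqrt{1 - \sum_k \sqrt{p_{ki} p_{kj}}}$, are the ones commonly used in the isonymy literature and are assumed here rather than taken from this text. The function names and the toy frequencies are hypothetical.

```python
import math

def isonymy(p_i, p_j):
    """I_ij = sum_k p_ki * p_kj for two dicts {surname: relative frequency}."""
    shared = set(p_i) & set(p_j)
    return sum(p_i[k] * p_j[k] for k in shared)

def alpha(p_i):
    """Effective surname number: inverse of the within-area isonymy."""
    return 1.0 / isonymy(p_i, p_i)

def lasker_distance(p_i, p_j):
    """LD_ij = -log(I_ij); infinite when no surname is shared."""
    i_ij = isonymy(p_i, p_j)
    return math.inf if i_ij == 0 else -math.log(i_ij)

def nei_distance(p_i, p_j):
    """Assumed form: ND_ij = -log(I_ij / sqrt(I_ii * I_jj))."""
    i_ij = isonymy(p_i, p_j)
    if i_ij == 0:
        return math.inf
    return -math.log(i_ij / math.sqrt(isonymy(p_i, p_i) * isonymy(p_j, p_j)))

def euclidean_distance(p_i, p_j):
    """Assumed form: ED_ij = sqrt(1 - sum_k sqrt(p_ki * p_kj))."""
    shared = set(p_i) & set(p_j)
    cos_theta = sum(math.sqrt(p_i[k] * p_j[k]) for k in shared)
    return math.sqrt(max(0.0, 1.0 - cos_theta))  # guard against rounding

# Toy relative frequencies for three hypothetical areas.
area_a = {"Wang": 0.5, "Li": 0.5}
area_b = {"Wang": 0.5, "Li": 0.25, "Chen": 0.25}
area_c = {"Chen": 1.0}

for name, other in (("a-b", area_b), ("a-c", area_c)):
    print(name, lasker_distance(area_a, other),
          nei_distance(area_a, other), euclidean_distance(area_a, other))
```

In this toy case areas a and c share no surname, so Lasker's and Nei's distances are infinite while the Euclidean distance stays bounded at 1, which illustrates why a bounded distance is convenient when few surnames are shared.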

| Building spatial network
A network is a collection of edges that connect nodes, and can be defined as a graph G(V, E) with a node (or vertex) set V and an edge (or link) set E. In our spatial network, the node set represents prefectures, and the edges are defined by the isonymic distances between them.
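One way to build a multilayer minimum spanning tree can be sketched as follows. This is an assumption about the procedure, not a transcription of it: each new layer is taken to be the MST of the complete distance graph after the edges of all previous layers have been removed, and the network is the union of the layers. That reading is at least consistent with the reported size of the network, since 10 layers over 362 nodes give 10 × 361 = 3,610 edges. The function names are hypothetical, and Prim's algorithm is written out on a dense matrix for clarity rather than speed.

```python
import math

def minimum_spanning_tree(dist, excluded):
    """Prim's algorithm on a complete graph, ignoring edges in `excluded`.

    dist is a symmetric matrix (list of lists); edges are (u, v) with u < v.
    Returns a list of tree edges, or None if the remaining graph is disconnected.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        if best[u] == math.inf:
            return None        # no allowed edge reaches the remaining nodes
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            if in_tree[v] or (min(u, v), max(u, v)) in excluded:
                continue
            if dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def mmst(dist, layers):
    """Union of up to `layers` edge-disjoint spanning trees (assumed MMST)."""
    used = set()
    for _ in range(layers):
        tree = minimum_spanning_tree(dist, used)
        if tree is None:       # stop when no further spanning tree exists
            break
        used.update(tree)
    return sorted(used)

# Toy example: five hypothetical areas placed on a line.
positions = [0, 1, 2, 4, 8]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(mmst(toy_dist, 1))   # the ordinary MST
print(mmst(toy_dist, 2))   # MST plus a second layer of next-best links
```

Because every layer is itself a spanning tree, the union stays connected, and adding layers only ever adds edges.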

| Quantifying network dissimilarity
To investigate how the network topology changes as the number of layers increases, we calculate the network dissimilarity between two adjacent networks in the network-building procedure. We use the D-value proposed by Schieber et al. (2017) to quantify network dissimilarity. Based on standard information-theoretic metrics, the D-value quantifies the differences in topological structure between two networks $G$ and $G'$ with a three-term function. The first term focuses on the network distance distribution, $\mu$, in which $J$ is the Jensen-Shannon divergence. The second term characterizes the node heterogeneity, in which NND is the network node dispersion, defined for a network of a given diameter. The third term captures the difference in node centrality, in which $P_{\alpha G}$ is the alpha-centrality distribution of network $G$ and $G^c$ is the complement of $G$.
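For reference, the three-term function can be written in the form given by Schieber et al. (2017). The expression below is reproduced from that work as an assumption. The weights $w_1$, $w_2$, $w_3$, the node distance distributions $P_i$, the number of nodes $N$, and the diameter symbol $d$ are notation introduced here and are not stated in this text.

```latex
D(G, G') = w_1 \sqrt{\frac{J(\mu_G, \mu_{G'})}{\log 2}}
         + w_2 \left| \sqrt{\mathrm{NND}(G)} - \sqrt{\mathrm{NND}(G')} \right|
         + \frac{w_3}{2} \left( \sqrt{\frac{J(P_{\alpha G}, P_{\alpha G'})}{\log 2}}
         + \sqrt{\frac{J(P_{\alpha G^c}, P_{\alpha G'^c})}{\log 2}} \right),
\qquad
\mathrm{NND}(G) = \frac{J(P_1, \ldots, P_N)}{\log(d + 1)}
```

Each term is normalized so that it lies between 0 and 1, which makes D-values comparable across the successive pairs of $k$-layer and $(k+1)$-layer networks.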

| Network community detection
Network communities or clusters are groups of nodes with dense internal connections. To measure the effectiveness of community detection, modularity has been proposed (Newman, 2006). High modularity levels indicate good partitioning. Modularity is defined to be the fraction of edges within the given group minus the fraction expected if edges were randomly distributed.
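The verbal definition of modularity above can be made concrete with a short Python sketch for an unweighted, undirected network. It sums, over communities, the fraction of edges inside the community minus the squared fraction of edge ends attached to it. Treating the network as unweighted is an assumption made for illustration, and the function name is hypothetical.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = {}      # number of edges with both ends in the community
    degree_sum = {}    # total degree of the nodes in the community
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    # Observed fraction of internal edges minus the fraction expected at random.
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
print(modularity(toy_edges, split), modularity(toy_edges, merged))
```

Separating the two triangles gives a modularity of 5/14, whereas placing all nodes in a single community gives 0, so the split is the better partition under this measure.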
There are many clustering methods that can be applied to various types of networks (Fortunato, 2010). We here use the fast unfolding algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) on an MMST network of isonymic distances to classify its nodes. This algorithm starts from an initial partition and iteratively updates it until there is no further improvement in the modularity. The modularity is a scalar value that compares the actual density of edges inside communities with that of the corresponding random case (Newman, 2006). This algorithm is implemented in Gephi 0.9.2, in which the resolution γ is an adjustable parameter that controls the size of the communities. When γ → 0, each node is a separate community. In this article, we adjust the parameter γ to obtain eight communities in every network community detection. Accordingly, we set γ to 0.75, 0.7, and 0.6 for the 9-layer, 10-layer, and 11-layer MMST, respectively.

| Distribution of isonymic indexes
The Supporting Information Table S1 shows

| Statistical property of isonymic distances
As mentioned above, there are three kinds of isonymic distances. We calculate these distances respectively and analyze their distribution characteristics. As shown in Figure 1, none has a normal distribution.
The ED curve has two obvious peaks, and those of LD and ND have fat tails. The minor peak of the ED curve and the tails in LD and ND Figure 1a show that ED is a mixture of normal distributions in which the left part of the distribution is dispersed and distinguishable. As Rodriguez-Larralde, Gonzales-Martin, Scapoli, and Barrai (2003) argue, ED has an advantage over the other isonymic distances when few surnames are shared between two groups. Therefore, we choose the matrix of ED as the input data for building the spatial network.

| Determination of network layer
When we add additional layers to the original MST network, the MMST recovers some important lost edges, but an excess of additional layers can produce redundant information. An appropriate number of layers should be large enough to retain the most valuable information but small enough to minimize redundant edges. Figure 2 shows the variation of the D-value with k (the MMST layer), where the D-value at k is calculated using the k-layer MMST and the (k + 1)-layer MMST. When the iterative process of adding layers ends, the maximum number of layers that the
Figure S1 and S2, respectively). Figure 3 shows a spatial network generated using the Fruchterman-Reingold algorithm (Fruchterman & Reingold, 1991)
To determine the accuracy of the community detection based on the network topology, we calculate the average ED (denoted by AED) between any two communities and within a community. The former is the average of the Euclidean distances between all possible two-community prefecture pairs, and the latter is the average of all Euclidean distances between any two prefectures within the corresponding community. Figure 4 shows the AED matrix as a heat map in which its
This is in accord with the finding of Yuan and Zhang (2002) that the flat landscape of northern China has made the population there more mobile than that of southern China. However, we can distinguish communities A-D by detecting communities in the spatial network even though they are very similar to each other in surname structure.
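The AED comparison between and within communities can be sketched in a few lines of Python. The sketch assumes a symmetric distance matrix and one community label per prefecture; the function name and the toy data are hypothetical.

```python
from itertools import combinations

def aed_matrix(dist, labels):
    """Average distance between and within communities.

    dist: symmetric matrix (list of lists) of pairwise distances.
    labels: community label of each area, in the same order as dist.
    Returns {(a, b): average distance} with a <= b; (a, a) is the
    intra-community average, and None marks a pair with no area pairs.
    """
    groups = sorted(set(labels))
    sums = {(a, b): 0.0 for a in groups for b in groups if a <= b}
    counts = {key: 0 for key in sums}
    for i, j in combinations(range(len(labels)), 2):
        key = tuple(sorted((labels[i], labels[j])))
        sums[key] += dist[i][j]
        counts[key] += 1
    return {key: (sums[key] / counts[key] if counts[key] else None)
            for key in sums}

# Four hypothetical areas on a line, forming two tight pairs.
positions = [0, 1, 10, 11]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(aed_matrix(toy_dist, ["A", "A", "B", "B"]))
```

A partition is supported when the diagonal entries (intra-community averages) are clearly smaller than the off-diagonal ones, as in this toy case.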

| DISCUSSION
The community structure in the MMST, as shown in Figure 3, is obtained by computing the similarity of surname structure between prefectures. Only the topological properties of the network, not the geographical ones, are taken into account in this procedure. The community partition is evaluated by comparing the inter-community distances with the intra-community ones, and the results are presented in Figure 4. To justify the efficacy of this partition, we also need to evaluate the closeness of the prefectures within each community shown in Figure 3 from a geographical viewpoint. Everything is related to everything else, but nearby things are more closely related than distant things (Tobler, 1970). Note also that several provinces are separated into different communities, indicating that some prefectures of one province are more similar to certain prefectures of other provinces, and that the resulting ethnicity classification does not coincide with the administrative divisions.
Some geographical factors affect the community detection results. Waterways promote human mobility, and mountains hinder it. As shown in Figure 5, the Yangtze River cuts through three communities, while the Yellow River cuts through four communities and enters community D twice. The Yangtze River and the Yellow River are the two longest rivers in China, and both cross China from west to east. The difference between them is that the former is a navigable waterway characterized by a large water discharge and a deeply incised valley, whereas the latter is not an efficient means of transportation and is characterized by a huge sediment discharge and a steep longitudinal profile (Saito, Yang, & Hori, 2001). Because the Yangtze River is much more convenient for human mobility than the Yellow River, its upstream and downstream areas are more likely to be grouped into the same community. In particular, community F is located along the Yangtze River and envelops a large part of it. We can also see that the Hengduan Mountains define the geographical boundary between communities G and F.
A similar blocking effect can be found at the boundaries between other communities, which are defined respectively by the Qilian Mountains (between D and G), the Helan Mountains (between C and D, G), the Taihang Mountains (between B and C), the Yimeng Mountains (between A and B in Shandong province), the Qinling Mountains (between D and F), and the Nanling Mountains and Wuyi Mountains (between F and H).
Historic migration can also produce separations within provinces. The division of the prefectures within Shandong province reflects a historic migration from Shandong to the three northeastern provinces (Heilongjiang, Jilin, and Liaoning), the well-known "Rush to Northeast". This mass migration lasted for more than 300 years, from the early Qing dynasty to the end of the last century. Natural disasters and excessive population density prompted the people of Shandong to leave their homeland and make their living elsewhere.
Because some prefectures in Shandong border the Bohai Sea, some emigrants there used the seaway to move to the northeastern provinces.
to differ from those of nearby prefectures (Liu, 2007). These three outlier prefectures are small but appear as high-degree nodes in the network graph (see Figure 3) because the surname structure here is similar to that of the immigrants' former prefectures. We also find that
(Batty, 2006), and the latter one is the commercial and financial center of China nowadays. A long period of regional economic development has promoted population mobility within the Yangtze Delta economic zone. The formation of communities G and H is typically caused by isolation of culture and language. They are located in the westernmost and southeastern parts of China, respectively. The people living in both areas have formed their own regional cultures and languages, which are quite different from those elsewhere.
In fact, the factors mentioned above sometimes act together. For example, the "Hexi Corridor" is an important trade route whose course was determined by its geographical conditions. There are many oases along the path, and it borders the Qilian Mountains to the south and the Gobi desert to the north. Some historical migrations were also partly driven by economic factors. A typical case is "Going to the West Gate".
In contrast to the clustering result derived from k-means clustering (see Figure S4 in Supporting Information), the eight communities derived from the MMST network (see Figure 5) are not only clear and intuitive but also sensible and reasonable. In the k-means clustering algorithm, prefectures with small isonymic distances are grouped into one cluster. In our approach, however, prefectures are in a community only if they are densely connected with one another in the MMST network, in which every prefecture keeps only its few most relevant connections. Two prefectures that are grouped into one cluster by k-means may therefore be far away from each other in our network topology. The prefectures of some clusters are scattered over the map in Supporting Information Figure S4, while those in Figure 5 are basically contiguous and complete. Western China is densely populated by ethnic minority groups, among whom some people do not even have an inheritable surname, and thus its surname structure differs from those in other places. As shown in Figure 5, the prefectures in western China are mainly in community G, while they are partitioned into several groups in Supporting Information Figure S4.
In summary, constructing an MMST network and detecting communities in it is an effective approach to studying regional surname affinities. The algorithm guarantees the connectivity of the network, in which no nodes are isolated. It also ensures that every node keeps at least L of its strongest links, where L is the number of layers in the MMST. The network-topology viewpoint is the most essential merit of the MMST and is what distinguishes this algorithm from traditional clustering techniques. In other words, traditional clustering techniques are prone to offer biased results of ethnicity classification because they are equivalent to community detection in a fully connected weighted network, which overemphasizes surname kinship and downgrades geographical consanguinity. In contrast, the topology of the MMST spatial network yields meaningful results for geo-genealogy and ethnicity classification. The MMST algorithm can also be used to filter relevant information in many other problems. Our work here calls for deeper mining of the spatial and surname networks to reveal more hidden patterns in the surname data set.