Keywords:

  • web mining;
  • webometrics

Abstract

  1. Top of page
  2. Abstract
  3. Background of the Study
  4. Literature Review
  5. Features of Google Trends and Baidu Index
  6. Methods
  7. Results
  8. Discussion and Conclusions
  9. Acknowledgment
  10. References
  11. Appendix 1. Universities in the Study
  12. Appendix 2. Companies in the Study

Numerous studies have explored the possibility of uncovering information from web search queries but few have examined the factors that affect web query data sources. We conducted a study that investigated this issue by comparing Google Trends and Baidu Index. Data from these two services are based on queries entered by users into Google and Baidu, two of the largest search engines in the world. We first compared the features and functions of the two services based on documents and extensive testing. We then carried out an empirical study that collected query volume data from the two sources. We found that data from both sources could be used to predict the quality of Chinese universities and companies. Despite the differences between the two services in terms of technology, such as differing methods of language processing, the search volume data from the two were highly correlated and combining the two data sources did not improve the predictive power of the data. However, there was a major difference between the two in terms of data availability. Baidu Index was able to provide more search volume data than Google Trends did. Our analysis showed that the disadvantage of Google Trends in this regard was due to Google's smaller user base in China. The implication of this finding goes beyond China. Google's user bases in many countries are smaller than that in China, so the search volume data related to those countries could result in the same issue as that related to China.


Background of the Study

Web search engines are an indispensable tool for searching for information. Users enter billions of queries into search engines every day. The subjects of these queries mirror issues that interest or concern the users. Web queries are thus a treasure house for web data mining. Google started digging into this data source when it offered the service Google Trends at www.google.com/trends/. Google Trends reports search volume of terms based on queries that people entered into the Google search engine. Google used to have a related service called Google Insights for Search. This service was merged into Google Trends in 2012 (Matias, 2012). Numerous studies have been done that analyzed data from Google Trends and Google Insights for Search. These studies uncovered various types of information, ranging from economic activities (Choi & Varian, 2009) and politics (Reilly, Richey, & Taylor, 2012) to disease trends (Ginsberg et al., 2009). Needless to say, information obtained this way is timelier and potentially more credible than what could be obtained from traditional sources such as public opinion polls or surveys.

A Chinese counterpart of Google Trends is Baidu Index at http://index.baidu.com/, which is based on queries users entered into the Baidu search engine at www.baidu.com. Baidu is the dominant search engine in China and the third most popular search engine globally (Google and Yahoo! are the top two), according to Alexa Internet's November 1, 2013 data. However, very few studies have used the rich data from Baidu Index. The few that did (e.g., Liu, Lv, Peng, & Yuan, 2012; Zhang, Shen, Zhang, & Xiong, 2013) used the data without examining the features or functions of the service. It is not clear how Google Trends and Baidu Index differ or which one should be used under what circumstances. More importantly, it is not clear what factors affect the search volume data reported by different services.

We carried out a study to address these issues. We first compared and tested features and functions of the two services. We then collected search volume data from the two services to determine if the two could be used to mine information on Chinese universities and companies and to find out how the results from the two data sources differ. We tried to relate the differences in the search volume data to the differences in the two web search engines. The significance and value of our study lie not only in determining which service is preferable for Chinese web query mining but also, more importantly, in gaining an understanding of theoretical issues such as what affects the quality of search volume data. For example, Google Trends and Baidu Index differ in the method of term matching. Findings from the study can potentially contribute to our knowledge of language processing, an issue that goes beyond Chinese. Since the Google and Baidu user bases in China are of very different size, comparing Google Trends and Baidu Index offers us the opportunity to examine the effect of user base size on search volume data, an issue that has implications for other countries as well.

As will be described in the Literature Review and the Methods sections, the current study used methods developed in earlier studies, particularly those reported in Vaughan and Romero-Frías (2014) and Vaughan (in press), to collect data from Google Trends and correlate the search volume data with the university and company performance data. However, the purpose of the study is not simply to apply the methods developed in earlier studies to a different country, that is, China, although testing the method in China is necessary, not least because of the unique characteristics of the Chinese language and the language-dependent nature of web queries. A more important goal of the study was to address the theoretical question of what factors affect search volume data, a significant question for web data mining in general. Further, comparing Google Trends with Baidu Index is not just to learn if and how the two services differ but, more importantly, to find out if and how the search volume data are affected by the underlying search engines.

It should be noted that the global version of Google Trends at www.google.com/trends/ has a Chinese version at www.google.com.hk/trends/, just as the global version of Google at www.google.com has a Chinese version at www.google.com.hk (the Google China Service). Our extensive tests showed that the two versions of Google Trends reported identical search volume data. We collected our data mainly from www.google.com.hk/trends/ and used www.google.com/trends/ on occasion when the former was not working. The results would be identical if we had used only one version of Google Trends.

Literature Review

Since the birth of Google Trends, numerous studies have been carried out to uncover various types of information from this rich data source. A pioneering paper in this line of research is Choi and Varian (2009). Their study showed that Google Trends data could be used to make short-term predictions of economic activities including retail sales, automotive sales, home sales, and travel. Another influential study (Ginsberg et al., 2009), published in Nature, demonstrated that Google Trends data could be used to detect influenza epidemics. Cole (2012) analyzed the relationship between search engine queries for a brand and the brand value, while Jun, Yeom, and Son (2013) examined the relationship between sales volumes and web searches of brand names. Google query data have also been used to analyze the movie industry. Goel, Hofman, Lahaie, Pennock, and Watts (2010) used query volume to predict box-office revenue, while Kim (2013) found a positive correlation between movie quality and the search volumes of the movies. The value of Google Trends data for predicting stock market variables has also been explored. Preis, Reith, and Stanley (2010) found a correlation between the search volume for company names and the transaction volumes of the companies in the S&P 500. Da, Engelberg, and Gao (2011) analyzed a sample of Russell 3000 stocks and showed that search volumes could be a measure of investor attention.

In information science, there have been many studies that analyzed query logs of a specific website (e.g., Ortiz-Cordova & Jansen, 2012; Ravid, Bar-Ilan, Rafaeli, & Baruchson-Arbib, 2007) or query logs of a web search engine during a particular time (e.g., Spink, Jansen, Wolfram, & Saracevic, 2002; Spink, Wolfram, Jansen, & Saracevic, 2001). Studies that used aggregated web query data such as that provided by Google Trends are very few, but two recent papers (Vaughan & Romero-Frías, 2014; Vaughan, in press) reported this kind of study. These two studies followed the tradition of webometrics research in that they carried out correlation studies to determine if the web data correlated with offline measures such as university rankings and business performance data. Vaughan (in press) found that web search volume of the company names could be used to estimate companies' business performance and position data. Findings from Vaughan and Romero-Frías (2014) parallel that from inlink studies (Aguillo, Granadino, Ortega, & Prieto, 2006; Smith & Thelwall, 2002), URL citation analysis (Thelwall, 2011; Vaughan & Yang, 2012) and analysis based on organization title mentions (Thelwall & Sud, 2011), in that university rankings based on search volume data correlate with offline university rankings based on teaching and research. In fact, Vaughan and Romero-Frías (2014) directly compared Google Trends data with inlink data and found that the former correlated with the offline university ranking slightly better than the latter did for the U.S. universities, while the reverse is true for Spanish universities.

While studies on Chinese search queries started later than those on English queries, there have been many papers on the subject in recent years. Liu et al. (2012) explored the use of web search queries to predict Chinese stock markets, while Zhang et al. (2013) examined the possibility of using search frequency of stock names as a direct proxy for investor attention. Both studies used query data from Baidu Index instead of Google Trends, arguing that Baidu has a much larger user population in China. In contrast, Wu and Deng (2012) used Google Trends to study Chinese housing markets. They explained that they preferred Google Trends over Baidu Index because, for most Chinese provinces, Baidu Index data dated back to 2008, while Google Trends data were available from 2006 on. The authors also stated that another reason they did not use Baidu Index was that its calculation formula was opaque.

Bao, Lv, Peng, and Li (2013) examined the relationship between gonorrhea incidence and Baidu Index search data. They did not use Google Trends data and did not explain why they used only Baidu Index. In contrast, Gawlik, Kabaria, and Kaur (2011) collected data from Google Insights using both English and Chinese search queries to predict Hong Kong's tourism trends. They did not use Baidu Index nor did they explain why they did not. Hutzler and Linton (2012) used data from Google Insights and Baidu Index to estimate the geographic and demographic profile of potential business software piracy in China. Google Insights data were used to obtain the relative popularity of terms and the geographic location, while Baidu Index data were used to construct demographic profiles. Huang, Zheng, and Emery (2013) used both Google Trends and Baidu Index to investigate changes in online search behavior among Chinese Internet users in response to the adoption of the national indoor public place smoking ban. Although these two studies used both Google Trends and Baidu Index, they did not make an examination or direct comparison of the two data sources. We need information on the relative merits of the two sources so future studies can make an informed decision on what source(s) to use and how to interpret results correctly. Our study aims to serve this need.

Features of Google Trends and Baidu Index

Neither Google Trends nor Baidu Index states when it started. The earliest recorded dates for the two sites on the Internet Archive were December 2005 and August 2006, respectively. Both Google Trends and Baidu Index are based on search queries that users enter into the respective search engines. Both services report search volume and related news for specific term(s). However, the two services differ in several ways. To compare the features of Google Trends and Baidu Index, we consulted the “help” documents of the two services at https://support.google.com/trends/?hl=zh-Hans and http://www.baidu.com/search/index_help.html, respectively. We also conducted extensive tests to explore some features that are not explicitly documented. Table 1 summarizes the comparisons.

Table 1. Comparing features of Google Trends and Baidu Index
| Search Feature | Google Trends | Baidu Index |
| --- | --- | --- |
| Limit to a specific country | Yes | No. China only |
| Limit to a specific region within a country | Yes | Yes |
| Limit to a specific time period | Yes | Yes |
| Limit to a specific category | Yes | No |
| Maximum number of terms that can be compared at once | 5 | 3 |
| Search volume reported | Relative volume | Absolute volume |
| Average search volume | Report numbers when multiple terms are compared. No average is reported if a single term is specified | Show line(s) of average without reporting specific number(s) regardless of whether a single term or multiple terms are specified |
| Report the total search volume for a group of terms | Yes | Yes |
| Method of term matching | Partial match | Complete match |

In Table 1, “Limit to a specific country” in the “Search Feature” column means limiting the search volume data reported to a specific country. For example, if one chooses to limit the search to the country China, then the search volume data that Google Trends reports will only include searches carried out by users from China. Likewise, “Limit to a specific time period” means only searches carried out in a specific time period are included in the search volume data reported. Google Trends offers the option to limit the search term to a specific category. This is useful to disambiguate terms that have multiple meanings. For example, if you limit the term “Apple” to the category “Computers & Electronics,” you can focus on the web queries that were about the company Apple and exclude searches that were about the fruit apple. Google Trends determines the category by examining the terms the user searched for immediately before and after the term in question (Google, 2013). Google Trends also offers the option of limiting the search volume to subcategories, for example, software is a subcategory of category “Computers & Electronics.”

Google Trends reports relative, not absolute, search volumes; thus, the search volume data for a term can change depending on what terms are being compared. For example, the relative search volumes for “IBM” and “Intel” were 43 to 52, while those for “IBM” and “Microsoft” were 13 to 57. In contrast, Baidu Index reports absolute search volumes. Baidu Index's help document at http://www.baidu.com/search/index_help.html did not indicate whether the search volume reported was absolute or relative. Zhang et al. (2013) reported that it was an absolute number. Our extensive tests confirmed this: the search volume number for a specific term remained the same regardless of the terms being compared with it.
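The difference between relative and absolute reporting can be illustrated with a toy model. The sketch below assumes that a relative service rescales every comparison so that the largest term equals 100; this scaling rule and the query counts are assumptions made for illustration, not the documented formula of Google Trends, and they do not reproduce the figures quoted above.

```python
def relative_volumes(absolute):
    """Rescale counts so that the largest term in the comparison set is 100."""
    peak = max(absolute.values())
    return {term: round(100 * count / peak) for term, count in absolute.items()}


# Hypothetical absolute query counts, invented for illustration only.
counts = {"IBM": 400, "Intel": 500, "Microsoft": 2000}

pair_a = relative_volumes({t: counts[t] for t in ("IBM", "Intel")})
pair_b = relative_volumes({t: counts[t] for t in ("IBM", "Microsoft")})

print(pair_a)  # value reported for IBM when compared with Intel
print(pair_b)  # value reported for IBM when compared with Microsoft
```

Under this assumed rule, the number attached to “IBM” depends on its comparison partner, whereas the absolute count in `counts` never changes. This is the property we relied on when testing whether Baidu Index reports absolute volumes.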

Both Google Trends and Baidu Index have the ability to report the total search volume for a group of terms. This is very useful for finding the total search volume for a particular concept that can be expressed by multiple terms or a particular organization that has multiple names. For example, one can ask for the total search volume of “Beijing University” and “BeiDa” (the acronym of the university). The query syntax is “Beijing University + BeiDa” and this is the same for both Google Trends and Baidu Index.
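As a small illustration of the grouping syntax, the sketch below builds such a query string from a list of name variants. The helper `build_group_query` is hypothetical and is not part of either service; only the “ + ” separator comes from the syntax described above.

```python
def build_group_query(names):
    """Join name variants with ' + ' so that their search volumes are totalled."""
    cleaned = [name.strip() for name in names if name and name.strip()]
    unique = list(dict.fromkeys(cleaned))  # drop duplicates, keep input order
    return " + ".join(unique)


query = build_group_query(["Beijing University", "BeiDa"])
print(query)
```

The resulting string would then be entered by hand into the search box of either service.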

The last line of Table 1, the method of term matching, needs more explanation because of the special characteristics of the Chinese language. In English, a search phrase that a user enters into a search engine can be broken down into individual words. For example, the phrase “Harvard University” can be broken down into two words, “Harvard” and “university.” This can be done easily by a computer program because there are symbols such as the space that separate words. In the Chinese language, however, there is no space between characters, so breaking a search phrase into meaningful parts by a computer program is not easy. For example, four Chinese characters form the term “Tsinghua University.” The first two mean Tsinghua while the last two mean university. The combination of the second and the third characters is not meaningful. When a Chinese term is entered into Google Trends or Baidu Index, there are different ways to match the term with the queries that users entered into the search engine. Neither Google Trends nor Baidu Index indicated through their help documents what matching method they used. We conducted extensive tests and found that Baidu Index requires a complete match between the term that is entered in Baidu Index and the term that is entered into the Baidu search engine, while Google Trends makes a partial match. This is shown by search volumes that the two services reported. For example, Baidu Index reported higher search volume for the term “Tsinghua University” than for the term “Tsinghua,” suggesting that more people searched for Tsinghua University than Tsinghua. In contrast, Google Trends reported higher search volume for “Tsinghua” than “Tsinghua University” because Tsinghua makes a partial match to many terms such as Tsinghua University, Tsinghua campus, or Tsinghua alumni. This method of partial match is consistent with the way Google Trends reported search volume for English phrases; for example, the search volume for Harvard is higher than that of Harvard University.
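The effect of the two matching methods on reported volumes can be sketched with a small, invented query log. The counts and the function names below are hypothetical, and substring containment is only an assumed stand-in for whatever partial matching Google Trends actually performs.

```python
from collections import Counter

# Hypothetical query log: query text -> number of times it was entered.
query_log = Counter({
    "tsinghua university": 50,
    "tsinghua": 30,
    "tsinghua campus": 15,
    "tsinghua alumni": 10,
})


def complete_match_volume(term, log):
    """Count only the queries that are identical to the term."""
    return log.get(term, 0)


def partial_match_volume(term, log):
    """Count every query that contains the term as a substring."""
    return sum(count for query, count in log.items() if term in query)


for term in ("tsinghua", "tsinghua university"):
    print(term,
          complete_match_volume(term, query_log),
          partial_match_volume(term, query_log))
```

With this log, complete matching ranks the full name above the short name, while partial matching reverses the order because the short name also collects the campus and alumni queries. This mirrors the pattern we observed in the two services.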

In addition to the technical differences discussed earlier, Google Trends and Baidu Index differ significantly in terms of the search engine upon which their search volume data are based. Baidu is the dominant search engine in China. Traffic to Baidu was number one in China while that for Google China (www.google.com.hk) was 39th, when we checked Alexa Internet on October 28, 2013. Because search volume data are based on queries entered into the corresponding search engine, it is plausible that the number of queries entered into the search engines affects the reliability of the search volume data.

Given all the differences between the Google Trends and Baidu Index discussed earlier, it is useful to find out if and how the search volume data they report differ. Accordingly, our research questions are:

  1. Do data from Google Trends and Baidu Index contain information on the quality of universities and companies, that is, can they be used to estimate quality ranking of universities/companies?
  2. Do data reported by the two services differ? If yes, which one is more useful in mining university/business information?
  3. What factors affect the performance of the two services?

We took an empirical approach to compare the two services. We selected a group of Chinese universities and companies, collected data on the search volume of the names of these universities/companies from Google Trends and Baidu Index, and correlated the search volume ranking with the quality ranking of the universities/companies to address research question 1. We compared the results from the two services to address research questions 2 and 3. The significance of the study is twofold. On a practical level, our findings will provide guidance for future data mining studies that use Chinese search engine query data. On a theoretical level, our study will provide information on factors that affect search volume data, a question that goes beyond the Chinese context.

Methods

Organizations in the Study

The Institute of Higher Education, Renmin University of China, compiles an annual ranking of the top 50 universities in China. We included all 50 universities in the 2013 ranking (available at http://heso.ruc.edu.cn/102894/) in our study. For each university in the study, we also collected data on the number of undergraduate students. These data were collected from the university websites but we were unable to obtain this number for five universities in the study. The data of undergraduate population size were used to normalize the search volume data (details in the Results section). Appendix 1 shows the universities and the related data.

China's Ministry of Industry and Information Technology makes an annual ranking of China's top 100 Information Technology (IT) companies. The 2013 version of this authoritative ranking is available at http://cyyw.cena.com.cn/2013-07/31/content_195822.htm. We did not find an official statement on the 2013 ranking criteria. The official web page http://www.miit.gov.cn/n11293472/n11293832/n12843926/n13917087/14765388.html that announced the 2012 ranking result said that the ranking was based on a weighted calculation of many measurements, including assets, revenue, profitability, investment in research and development, and the number of patents. Since this annual ranking has been compiled for more than 20 years, it is safe to assume that there is some continuity in ranking criteria and that the 2013 ranking was based on multiple measurements as well. We selected the top 50 companies in the 2013 ranking for the study. Two companies among the top 50 are conglomerates that have multiple subsidiaries. For such a conglomerate, it is neither possible nor appropriate to choose a single company name to search for in Google Trends and Baidu Index, so these two companies had to be excluded from the study. Appendix 2 shows the remaining 48 companies and their related data.

Collecting Search Volume Data

For each university or company in the study, we searched the organization's Chinese name in Google Trends and Baidu Index and recorded the search volume data. If an organization has multiple names, such as a full name and an acronym, we included all variations of names and joined them by the “+” sign, which is equivalent to the Boolean OR operator. The query syntax is the same in Google Trends and Baidu Index. This will capture all searches for the organization name in either the Google or Baidu search engine.

An alternative to searching for company name is to search for company stock ticker. The latter has the advantage of being unique and precise and some earlier studies (e.g., Da et al., 2011) took this approach. However, this approach did not work for this study because not all companies in the study are publicly traded. For example, the top-ranked company, Huawei Technologies Co., Ltd., is privately owned. Even if all companies in the study were publicly traded, we would opt to search for names rather than stock tickers because consumers of these companies' products or services were likely to search for company names rather than stock tickers. If the study were to correlate search volume data with stock exchange data, then searching for stock ticker would be a better option.

It was fairly straightforward to record data from Baidu Index because it provided absolute search volume data, which allowed us to rank the search volumes of all universities or companies directly. Google Trends reported relative search volumes when multiple terms were entered, and up to five terms could be compared at once. We used the following method to determine the relative rankings of the organizations in the study. We searched for the names of five organizations and recorded the relative ranking of the search volumes. We then entered the names of another five organizations with some overlap with the first group. We then tried to make a relative ranking of all these organizations and conducted additional searches if needed to clarify the relative ranking. This process continued until all organizations had been compared and their relative search volumes ranked. This is a time-consuming process; we therefore consider Baidu Index's absolute data an advantage over Google Trends's relative data. The relative search volume data of universities and companies are shown in Appendixes 1 and 2, respectively.
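The merging procedure described above was carried out by hand. One way it could be automated is sketched below, under the assumption that each comparison yields a strict ordering with no ties. The function name, the placeholder organization names, and the use of a topological sort are our illustration here, not the procedure actually used to collect the data. Pairs returned in `unresolved` are adjacent in the merged order but were never compared directly, so they mark where an additional search is needed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import combinations


def merge_group_rankings(groups):
    """Merge overlapping rankings (each listed from highest to lowest volume).

    Returns (order, unresolved). `order` is one ranking consistent with all
    comparisons; `unresolved` lists adjacent pairs never compared directly.
    graphlib.CycleError is raised if the comparisons contradict each other.
    """
    ranked_above = {}  # name -> set of names observed to rank above it
    for group in groups:
        for name in group:
            ranked_above.setdefault(name, set())
        for above, below in combinations(group, 2):
            ranked_above[below].add(above)
    order = list(TopologicalSorter(ranked_above).static_order())
    unresolved = [(a, b) for a, b in zip(order, order[1:])
                  if a not in ranked_above[b]]
    return order, unresolved


# Hypothetical comparisons of at most five names each; "C" and "E" overlap.
first_pass = [["A", "B", "C", "D", "E"], ["C", "F", "E", "G", "H"]]
order_first, unresolved_first = merge_group_rankings(first_pass)
print(unresolved_first)  # the pair that still needs a direct comparison

# An additional search comparing "F" and "D" settles the ranking.
order_final, unresolved_final = merge_group_rankings(first_pass + [["F", "D"]])
print(order_final, unresolved_final)
```

The ranking is fully determined once `unresolved` is empty, which corresponds to the point at which no further clarifying searches were required.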

Because the 2013 rankings of the universities and companies used in the study are likely to be compiled based on the 2012 performance data, we limited our searches in Google Trends and Baidu Index to the year 2012 to better match the ranking data. We limited the Google Trends search to the country of China so that the data are more comparable to that from Baidu Index, which is mainly based on searches carried out in China. Our data collection took place in late September and early October of 2013.

As discussed earlier, Google Trends allowed the search to be limited to a specific category, while Baidu Index did not have this function. To test the usefulness of Google Trends's category function, we collected data both with and without the category limit and compared the results. For the university data set, we used the category “Jobs & Education” and the subcategories of “Education” and “Colleges & Universities.” For the company data set, we used the category “Business & Industrial.” This category was chosen because the two alternative categories suggested by Google Trends were “Computers & Electronics” and “Internet & Telecom.” Each of them fits some companies but not others. No subcategory was used, again because none fits all companies in the study.

Results

Availability of Search Volume Data

Both Google Trends and Baidu Index were able to report search volume data for all 50 universities in the study. However, Baidu Index could not report search volume data for 3 of the 48 companies in the study, while Google Trends failed to report data for 14 companies, including the 3 for which Baidu Index had no data. Omitting these 14 companies left 34 companies in the final analysis.

When Google Trends could not report search volume for a term, the message displayed was “Not enough search volume to show graphs.” After Google moved its China search service to Hong Kong in March 2010 (Oreskovic, 2010), its user base in mainland China declined. We hypothesized that the larger user base prior to the move would allow Google Trends to report search volumes for more terms. To test this hypothesis, we limited the search to the year 2009 instead of 2012. Of the 14 company names for which Google Trends could not report search volumes for 2012, 5 had search volume data for 2009. This suggests that the smaller user base of the Google search engine in mainland China had a negative impact on Google Trends.

Correlations Between Search Volumes and the Quality Ranking of the Organizations

Correlation tests were used to find out if the search volumes are related to the quality rankings of the organization. Because the rankings are ordinal data, Spearman, rather than Pearson, correlation tests were carried out. Table 2 summarizes the test results. When the size of the universities was not taken into account (the columns “not normalized” in Table 2), there were highly significant correlations (p < .01) between the search volume ranking (both Google Trends and Baidu Index) and the quality ranking of the universities. The correlations mean that there were more web searches, whether in Google or Baidu, of universities that were ranked higher in quality.
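For readers who wish to reproduce this kind of test, a minimal sketch of the Spearman coefficient follows. It ranks both variables, giving tied values their average rank, and then computes the Pearson correlation of the ranks. The two example rankings are invented, the function names are ours, and the sketch does not compute the significance level reported in Table 2.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical rankings of five organizations (1 = best).
quality_rank = [1, 2, 3, 4, 5]
search_volume_rank = [2, 1, 3, 5, 4]
rho = spearman(quality_rank, search_volume_rank)
print(round(rho, 2))
```

A positive coefficient here means that organizations ranked higher in quality also tend to rank higher in search volume.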

Table 2. Correlations between search volumes and the quality ranking of the organizations
| Quality Ranking | Google Trends: Not normalized | Google Trends: Normalized | Baidu Index: Not normalized | Baidu Index: Normalized |
| --- | --- | --- | --- | --- |
| University quality ranking | 0.59** | 0.52** | 0.63** | 0.64** |
| Company quality ranking | 0.61** | N/A | 0.61** | N/A |

Note. **Correlation significant at the 0.01 level (2-tailed).

When collecting search volume data, we observed pronounced spikes in July/August when the data were graphed over a year. Entrance to Chinese universities is through national exams. Students sit exams in June. They apply to universities in July and the admission letters are sent out in August. The higher search volumes in July and August are likely the result of students or parents searching for university information. Because of this, there may be more searches for universities with larger undergraduate populations. To determine if the significant correlation between university quality and the search volume reported is spurious due to the student population size, we normalized the search volume data by the size of the undergraduate population. We then carried out correlation tests on the normalized data and the results are shown in the columns “normalized” in Table 2. It is clear that the correlation coefficients are still highly significant (p < .01). This suggests that the correlation between quality and search volume is genuine.
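The normalization step can be sketched as follows. The sketch assumes that normalizing means dividing each search volume by the undergraduate population and that universities without a known population are left out of the normalized ranking; both choices, the function names, and all figures are assumptions made for illustration.

```python
def normalize_by_size(volumes, undergraduates):
    """Return searches per undergraduate for universities with a known size."""
    return {
        name: volumes[name] / undergraduates[name]
        for name in volumes
        if undergraduates.get(name)  # skip missing or zero population sizes
    }


def rank_descending(scores):
    """Map each name to its rank, where 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical search volumes and undergraduate populations.
volumes = {"U1": 9000, "U2": 8000, "U3": 3000, "U4": 5000}
undergraduates = {"U1": 30000, "U2": 16000, "U3": 5000, "U4": None}

raw_rank = rank_descending(volumes)
normalized_rank = rank_descending(normalize_by_size(volumes, undergraduates))
print(raw_rank)
print(normalized_rank)
```

In this invented example the largest university leads the raw ranking but falls to last place after normalization, which is the kind of size effect the normalized correlations are meant to rule out.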

Parallel to the findings for the universities in the study, there is a highly significant (p < .01) correlation between the search volume of the company names and the quality of the companies. Higher ranking companies attracted more web searches for their names.

Examining the “Category” Function of Google Trends

Correlation tests were also carried out on Google Trends data that were collected when the search was limited to certain categories. For the university data set, we selected the category “Jobs & Education” and the subcategories of “Education” and “Colleges & Universities.” The correlation coefficients were .48, .51, and .43, respectively, all significant (p < .01). However, they are lower than the correlation coefficient of .59 (p < .01) when data were collected without limiting to a specific category.

For the company data set, the only appropriate category that could be applied is “Business & Industrial.” Google Trends could report search volume data for 34 companies when not limiting to a specific category but was able to provide the search volume data for only 15 of these companies when the search was limited to the category “Business & Industrial.” For these 15 companies, the correlation coefficient is .93 without the category limit and .87 when limiting to the category, both significant at the .01 level. The latter is lower than the former, again suggesting that limiting to the category did not improve the correlation. Note that the very high correlation coefficients here should be interpreted in the context of the small sample size of 15.

Relationship Between Google Trends and Baidu Index

We examined the relationship between Google Trends and Baidu Index by testing the correlation between the search volume data from the two sources. The correlation coefficients were .84 and .78, respectively for the universities and the companies, both significant at the .01 level. The very high correlation between the two suggests that they contain similar information, at least in the case of quality ranking of universities and companies. We then combined the data from the two sources by taking an average of the two sets of search volume rankings and carried out the correlation tests between the combined search volume rankings and the quality rankings of the universities/companies. Table 3 presents the correlations with individual search volume data and the combined data. It is clear that combining the two data sources made little difference in the correlation coefficients. This further suggests that the two data sources contain essentially the same information and that there is not much benefit in using two sources instead of one.

Table 3. Correlations with individual search volume data and the combined data

              Google Trends   Baidu Index   Google Trends and Baidu Index Combined
University    0.59**          0.63**        0.63**
Company       0.61**          0.61**        0.63**

Note. **Correlation significant at the 0.01 level (2-tailed).

Discussion and Conclusions

The study found significant correlations between the quality rankings of Chinese universities and companies and the volumes of web searches for the names of these organizations in the two most popular search engines in China, Baidu and Google. In the case of the universities, the correlation is significant even after the size of the universities (measured by the size of the undergraduate populations) is taken into account. These findings parallel those of earlier studies that reported such correlations for U.S. universities (Vaughan & Romero-Frías, 2014) and U.S. companies (Vaughan, in press). Thus, it can be concluded that data from Google Trends and Baidu Index contain information on the quality of universities and companies and that the search volume data can be used to estimate the quality of universities and companies in both China and the U.S. In fact, the correlation is higher in the Chinese case than in the U.S. one. For universities, it is around .6 for China versus around .5 for the U.S. In the case of companies, the correlation of .6 for China is much higher than that for the U.S. (between .31 and .35, depending on the business performance and position measures used). It should be noted that the U.S. companies in the study of Vaughan (in press) were a heterogeneous group from various industries, while the Chinese companies in the current study were from a single industry. This suggests that predicting business quality based on search volume data would be more reliable when limited to a single industry.

The study found that a higher-quality organization attracted more web searches. While it is possible that a negative event, for example, a scandal related to the organization, could have generated a lot of search interest, the correlation found in the study shows that this kind of negative search was not dominant; otherwise there would be a negative correlation between quality rankings and search volumes (i.e., better organizations attracted fewer searches). This parallels the situation of citations to academic papers. Although there are negative citations, that is, citing a paper to discredit it, the overall pattern is that better papers receive more citations on average. It is therefore safe to conclude that on a macro level (in this study, the institution as a whole over a year), high search volume is a positive indicator of the quality of the organization, although on a micro level (for example, a specific event), high search volume could be a negative signal.

It is important to note that web search queries and the search volume data derived from them are language-dependent. The fact that the findings are consistent in English and Chinese, two very different languages, suggests that the same may be applicable to other languages. In fact, the study by Vaughan and Romero-Frías (2014) had similar findings for Spanish and U.S. universities, although there were some differences between the two countries. Google Trends and Baidu Index use very different methods for term matching (partial match vs. complete match, as discussed earlier) and yet the search volume data from the two sources are highly correlated. This is perhaps a sign of the robustness of the search volume data in terms of language processing.

A feature that Google Trends had but Baidu Index did not is the ability to limit the search volume data to a particular category or subcategory. This is potentially a very useful tool that could help remove noise in the search volume data. For example, if you are looking for the search volume of the company Apple, you can remove the noise (apple meaning fruit) by limiting to the category “Computers & Electronics.” However, in our study, limiting to a category or subcategory did not improve the predictive power of the search volume data because the correlation coefficients were lower when a category limit was applied. Although it is not clear what caused the lack of success of the category function in these cases, we suspect that it may be the result of a lack of sufficient web queries. Google Trends determines the category of a search by looking at the queries entered immediately before and after the query in question (Google, 2013). A large amount of query data would be needed for this algorithm to work effectively, and the following observation might confirm this. Google Trends provided search volume data for 34 companies in the study when no category limit was applied. However, it could not report the search volume data for 19 of these companies when the category “Business & Industrial,” a fairly broad category, was applied (the message displayed was “Not enough search volume to show graphs”).

Despite the differences between Google Trends and Baidu Index discussed in the section Features of Google Trends and Baidu Index, data from the two services are highly correlated. Further, they contain essentially the same information because combining the two sources did not produce a substantially different result than using one of them. However, there is a major difference between the two sources in terms of the availability of search volume data. Baidu Index was able to provide search volume data for more organization names in the study. The two sources did not differ in this regard for universities in the study because there was likely a very large number of web searches for these top Chinese universities. However, they differ significantly for companies in the study. It is very likely that there were fewer web searches for these companies than for the universities. Note that these are top IT companies in China. If we were looking for search volume data for smaller companies, the availability of data would likely be even poorer in Google Trends.

An important observation is that data availability in Google Trends was better in 2009, before Google moved its China search service to Hong Kong, resulting in a decline of its user base in mainland China. The smaller user base of the Google China search engine affected Google Trends' function of categorizing web queries, as discussed earlier. All of this suggests the importance of user base size. This leads us to reason that the performance of Google Trends' data related to China may decline even further if Google China's user base continues to shrink. In fact, there was a major decline in traffic to the Google China search engine in 2013, as shown in Figure 1, which was retrieved from Alexa Internet on October 23, 2013. When we checked Alexa traffic rankings, Google China was ranked number 7 within China on June 5, 2013 but number 11 on October 23, 2013, and 39th on October 28, 2013.

Figure 1. Traffic to Google China service at www.google.com.hk; graph retrieved from Alexa Internet Oct. 23, 2013. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

In conclusion, the differences between Google Trends and Baidu Index in terms of technology, such as the method used for term matching (a language-processing issue), do not seem to affect the quality of their data. The volume of search queries entered into the web search engines, however, makes a major difference. The significance of these findings is not limited to the practical question of which data source to use for mining Chinese web search queries. More importantly, the findings inform the theoretical issue of what factors affect search volume data. The importance of the quantity of web queries is clear. The implication extends beyond the Chinese context. Although Google China has a smaller user base relative to Baidu, in absolute terms it is not small due to China's large web user population, the largest in the world (Internet World Stats, 2012). In Alexa Internet's global traffic ranking, Google China is ranked higher (39th) than Google Italy (44th) and Google Spain (46th) (Alexa October 28, 2013 ranking). If Google China's user base is not large enough to ensure optimal performance of Google Trends for Chinese terms, its smaller user bases in Italy and Spain may affect search volume data related to those countries too. This hypothesis is supported by the fact that in the study of Vaughan and Romero-Frías (2014), Google Trends' categorizing function did not help improve the quality of the search volume data for the Spanish universities but it did for the U.S. universities.

Acknowledgment

The first author is supported by a research grant from the Social Sciences and Humanities Research Council of Canada (SSHRC) on web data mining for business intelligence.

References

Appendix 1. Universities in the Study

Rank  Name                                                      Number of      Search volume ranking, Google Trends              Search volume ranking,
                                                                undergraduate  No category  Category    Category   Category      Baidu Index
                                                                students       limit        Jobs &      Education  Colleges &
                                                                                            Education              Universities
1     Peking University                                         14465          1            1           1          1             1
2     Tsinghua University                                       15184          2            11          11         13            1
3     Fudan University                                          14100          10           5           5          3             14
4     Renmin University of China                                11577          32           31          21         20            24
5     Zhejiang University                                       22929          3            4           3          5             6
6     Shanghai Jiao Tong University                             16116          18           10          10         9             18
7     Nanjing University                                        -              4            2           2          2             11
8     University of Science and Technology of China             7400           28           44          44         43            42
9     Beijing Normal University                                 8900           12           9           8          13            19
10    Xi'an Jiaotong University                                 16126          25           25          24         27            16
11    Nankai University                                         12749          20           21          21         27            23
12    Harbin Institute of Technology                            16960          27           38          38         40            20
13    Huazhong University of Science and Technology             32863          24           29          28         27            9
14    Beihang University                                        14428          26           34          32         33            30
15    Wuhan University                                          32718          6            6           6          6             3
16    Tianjin University                                        15618          8            8           9          8             27
17    Tongji University                                         18696          11           24          26         27            16
18    Sun Yat-Sen University                                    16000          13           12          12         13            10
19    Southeast University                                      32387          30           31          33         37            20
20    China Agricultural University                             12510          39           35          36         33            39
21    Xiamen University                                         20314          13           18          18         13            5
22    Sichuan University                                        40000          5            7           6          7             3
23    East China Normal University                              14320          23           20          20         20            11
24    Beijing Institute of Technology                           14548          33           21          24         22            32
25    Central South University                                  33710          15           13          12         10            8
26    Dalian University of Technology                           19481          36           40          39         39            26
27    Shandong University                                       41437          7            3           3          4             7
28    Shanghai University of Finance and Economics              7838           40           35          33         33            33
29    Central University of Finance and Economics               9759           43           42          41         40            35
30    University of International Business and Economics        8000           50           47          47         46            46
31    University of Science and Technology Beijing              13267          22           27          28         24            38
32    Jilin University                                          42349          9            16          16         13            13
33    Beijing Foreign Studies University                        4628           48           45          46         47            48
34    Northwestern Polytechnical University                     14495          37           41          41         42            30
35    South China University of Technology                      24906          31           23          21         27            14
36    Beijing University of Posts and Telecommunications        -              29           29          28         25            39
37    Beijing Jiaotong University                               14003          38           31          33         27            35
38    Chongqing University                                      30000          19           17          17         18            22
39    Shanghai International Studies University                 5972           47           45          45         45            50
40    China University of Political Science and Law             8297           44           43          43         43            43
41    East China University of Science and Technology           16355          34           27          28         25            27
42    Nanjing University of Aeronautics and Astronautics        17000          49           48          49         49            47
43    University of Electronic Science and Technology of China  21000          21           19          19         19            25
44    Hunan University                                          20600          15           13          12         10            27
45    Ocean University of China                                 15500          42           39          39         37            33
46    China University of Petroleum (Beijing)                   7176           41           35          36         33            43
47    Northeastern University                                   25793          17           13          15         10            37
48    Nanjing University of Science and Technology              -              35           26          26         23            39
49    Beijing Language and Culture University                   -              45           49          50         50            49
50    Xidian University                                         -              46           45          48         47            43

Appendix 2. Companies in the Study

Rank  Name                                                      Search volume ranking, Google Trends                  Search volume ranking,
                                                                No category limit   Category “Business & Industrial”  Baidu Index
1     Huawei Technologies Co., Ltd.                             2                   1                                 2
2     Legend Holdings Co., LTD.                                 1                   3                                 1
3     China Electronics Co., LTD.                               -                   -                                 30
4     Haier Co., Ltd.                                           4                   2                                 4
5     ZTE Co., LTD.                                             3                   4                                 3
6     Hisense Electric Co., LTD.                                5                   6                                 6
7     Sichuan Changhong Electric Co., Ltd.                      7                   6                                 9
8     TCL Co., Ltd.                                             6                   5                                 5
9     PKU Founder Group Finance Co., Ltd                        21                  -                                 24
10    BYD Co., Ltd.                                             17                  -                                 26
11    Inspur Group Co., Ltd.                                    28                  -                                 15
12    BOE Technology Group Co., Ltd.                            10                  8                                 11
13    Skyworth Group Co., Ltd.                                  9                   12                                10
14    Hengtong Group Co., Ltd.                                  -                   -                                 28
15    Tsinghua Tongfang Co., Ltd.                               11                  -                                 8
16    NARI Group Co., Ltd.                                      28                  -                                 38
17    Konka Group Co., Ltd.                                     8                   11                                13
18    Baosheng Co., Ltd.                                        -                   -                                 34
19    Shanghai Bell Co., Ltd.                                   16                  14                                22
20    Wuhan Research Institute of Post and Telecommunications   22                  -                                 18
21    YongDing Co., Ltd.                                        -                   -                                 43
22    Crystal Dragon Industry Co., Ltd.                         -                   -                                 -
23    Aisino Co., Ltd.                                          32                  -                                 16
24    Tongding Group Co., Ltd.                                  -                   -                                 -
25    XJ Co., Ltd.                                              34                  -                                 17
26    Sichuan Jiuzhou Electronic Co., Ltd.                      20                  -                                 42
27    Zhongtian Technology Co., Ltd.                            15                  13                                21
28    Jiangsu Hongtu High Technology Co., Ltd.                  -                   -                                 32
29    Yulong Computer Telecommunications Scien                  -                   -                                 42
30    Camel Group Co., Ltd.                                     -                   -                                 39
31    Henan Senyuan Group Co., Ltd.                             -                   -                                 35
32    Futong Group Co., Ltd.                                    33                  -                                 30
33    Zhongli Science And Technology Group co., Ltd             19                  -                                 25
34    Shenzhen Huaqiang Industry Co., Ltd                       12                  10                                14
35    Tianjin Zhonghuan Electronic Information Group Co., Ltd   30                  -                                 35
36    Shenzhen TP-LINK Technologies Co., Ltd.                   26                  -                                 35
37    Hangzhou Hikvision Digital Technology Co., Ltd            14                  -                                 7
38    Komsomolsk Cellon Communications Technology Co., Ltd.     -                   -                                 -
39    Zhejiang Fuchunjiang Group Communication Co., Ltd         -                   -                                 43
40    Guangdong Radio Group Co., Ltd                            23                  -                                 18
41    Huizhou Desay Group Co., Ltd                              31                  -                                 23
42    Huizhou Huayang Group Co., Ltd.                           25                  -                                 33
44    Zhuzhou CSR Times Electric Co., Ltd.                      27                  15                                41
46    Fuzhou FuDa Automation Technologies Co., Ltd.             -                   -                                 26
47    Semiconductor Manufacturing International Corporation     18                  -                                 18
48    Neusoft Co., Ltd.                                         13                  9                                 12
49    GoerTek Inc.                                              24                  -                                 29
50    Tiankang Group Co., Ltd.                                  -                   -                                 39