GBC: Gradient boosting consensus model for heterogeneous data


  • This article is a part of the special issue based on the Best of SDM 2012, Statistical Analysis and Data Mining, volume 7, issue 1.


With the rapid development of database technologies, multiple data sources may be available for a given learning task (e.g. collaborative filtering). However, these data sources may contain different types of features. For example, users' profiles can be used to build recommendation systems, and a model can also use users' historical behaviors and social networks to infer their interests in related products. We argue that it is desirable to use all available heterogeneous data sources collectively in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) nonoverlapping features, (ii) nonoverlapping instances, and (iii) multiple networks (i.e. graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning and devise a corresponding learning model based on gradient boosting. The idea is to minimize the empirical loss subject to two constraints: (1) the predictions for overlapping instances (if any) from different data sources should be in consensus; (2) connected instances in graph datasets may have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources and de-emphasize noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. The approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition, and terrorist attack detection tasks. It is also evaluated on AT&T's distributed database with over 500 000 instances, 91 different data sources, and over 45 000 000 joined features, where we observe that the proposed model can substantially improve the out-of-sample error rate.
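As an illustration only, the kind of objective described above (empirical loss plus a consensus term and a graph-similarity term) can be sketched in a few lines of Python. The squared loss, the use of the per-instance mean as the consensus prediction, the quadratic graph penalty, and the names `heterogeneous_objective`, `lam_consensus`, and `lam_graph` are all assumptions made for this sketch; the abstract does not specify the exact form of the objective, and the sketch only evaluates the objective rather than minimizing it with gradient boosting.

```python
import numpy as np

def heterogeneous_objective(preds, masks, labels, adjacency,
                            lam_consensus=1.0, lam_graph=1.0):
    """Evaluate a hypothetical heterogeneous-learning objective.

    preds:     (S, n) predictions of S data sources over n instances
    masks:     (S, n) True where a source actually contains the instance
    labels:    (n,)   target values
    adjacency: (n, n) symmetric, nonnegative edge weights between instances
    """
    preds = np.asarray(preds, dtype=float)
    masks = np.asarray(masks, dtype=bool)
    labels = np.asarray(labels, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)

    counts = masks.sum(axis=0)
    if np.any(counts == 0):
        raise ValueError("every instance must appear in at least one source")

    # Empirical loss (assumed squared error), only on instances a source covers.
    loss = np.sum(masks * (preds - labels) ** 2)

    # Consensus prediction: mean over the sources that cover each instance.
    consensus = (masks * preds).sum(axis=0) / counts

    # Term (1): sources should agree on overlapping instances.
    consensus_penalty = np.sum(masks * (preds - consensus) ** 2)

    # Term (2): connected instances should receive similar predictions.
    # The factor 0.5 counts each undirected edge once.
    diff = consensus[:, None] - consensus[None, :]
    graph_penalty = 0.5 * np.sum(adjacency * diff ** 2)

    return loss + lam_consensus * consensus_penalty + lam_graph * graph_penalty
```

In this sketch, an instance that appears in only one source contributes nothing to the consensus term, so the penalty acts only on overlapping instances, as in constraint (1). Setting `lam_graph=0` recovers a setting without network data.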