## 1. Introduction

Representing the semantic content of a document is an unsolved problem. We think it is very unlikely that a low-dimensional representation containing only a few hundred numbers will ever be capable of capturing more than a tiny fraction of the content of the distributed representation over millions of neurons that is used by the brain. Even if the documents do lie on (or near) a fairly low-dimensional, nonlinear manifold in a high-dimensional space of word sequences, it is unlikely that the best way to capture the structure of this manifold is by trying to learn explicit coordinates on the manifold for each document. The brain is much more likely to capture the manifold structure *implicitly* by using an extremely high-dimensional space of distributed representations in which all but a tiny fraction of the space has been ruled out by learned interactions between neurons. This type of implicit representation has many advantages over the explicit representation provided by a low-dimensional set of coordinates on the manifold:

1. It can be learned efficiently from data by extracting multiple layers of features to form a "deep belief net" in which the top-level associative memory contains energy ravines. The low energy floor of a ravine is a good representation of a manifold (Hinton, Osindero, & Teh, 2006).
2. Implicit representation of manifolds using learned energy ravines makes it relatively easy to deal with data that contain an unknown number of manifolds, each of which has an unknown number of intrinsic dimensions.
3. Each manifold can have a number of intrinsic dimensions that varies along the manifold.
4. If documents are occasionally slightly ill-formed, implicit dimensionality reduction can accommodate them by using energy ravines whose dimensionality increases appropriately as the allowed energy level is raised. The same approach can also allow manifolds to merge at higher energies.

In addition to these arguments against explicit representations of manifold coordinates, there is not much evidence for small bottlenecks in the brain. The lowest-bandwidth part of the visual system, for example, is the optic nerve, with its million or so nerve fibers, and there are good physical reasons for that restriction.

Despite all these arguments against explicit dimensionality reduction, it is sometimes very useful to have an explicit, low-dimensional representation of a document. One obvious use is visualizing the structure of a large set of documents by displaying them in a two- or three-dimensional map. Another use, which we focus on in this paper, is document retrieval. We do not believe that the low-dimensional representations we learn in this paper tell us much about how people represent or retrieve documents. Our main aim is simply to show that our nonlinear, multilayer methods work much better for retrieval than earlier methods that use low-dimensional vectors to represent documents. We find these earlier methods equally implausible as cognitive models, or perhaps even more implausible because they do not work as well.

An unfortunate aspect of our approach to document retrieval is that we initialize deep autoencoders using the very same "pretraining" algorithm as was used in Hinton et al. (2006). When this algorithm is used to learn very large layers, it can be shown to improve a generative model of the data each time an extra layer is added (strictly speaking, it improves a bound on the probability that the model would generate the training data). When the pretraining procedure is used with a central bottleneck, however, this guarantee no longer holds and all bets are off.

Numerous models for capturing low-dimensional latent representations have been proposed and successfully applied in the domain of information retrieval. Latent semantic analysis (LSA; Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) extracts low-dimensional semantic structure using singular value decomposition to get a low-rank approximation of the word-document co-occurrence matrix. This allows document retrieval to be based on "semantic" content rather than just on keywords.

Given some desired dimensionality for the codes, LSA finds codes for documents that are optimal in the sense that they minimize the squared error if the word-count vectors are reconstructed from the codes. To achieve this optimality, however, LSA makes the extremely restrictive assumption that the reconstructed counts for each document are a linear function of its code vector. If this assumption is relaxed to allow more complex ways of generating predicted word counts from code vectors, then LSA is far from optimal. As we shall see, nonlinear generative models that use multiple layers of representation and much smaller codes can perform much better than LSA, both for reconstructing word-count vectors and for retrieving semantically similar documents. When LSA was introduced, there were no efficient algorithms for fitting these more complex models, but that has changed.
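As an illustration, the optimal-in-squared-error codes that LSA finds can be computed with a truncated SVD. The following sketch uses NumPy on a toy random count matrix; the data and the function name are hypothetical, chosen purely for illustration:

```python
import numpy as np

# Hypothetical toy corpus: rows are words, columns are documents.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(50, 20)).astype(float)

def lsa_codes(X, k):
    """Rank-k LSA via truncated SVD of the word-document matrix X.

    Returns one k-dimensional code per document and the rank-k linear
    reconstruction, which minimizes squared reconstruction error among
    all rank-k approximations (Eckart-Young theorem).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    codes = (np.diag(s[:k]) @ Vt[:k]).T         # one k-dim code per document
    recon = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # rank-k reconstruction
    return codes, recon

codes, recon = lsa_codes(counts, k=5)
err = np.sum((counts - recon) ** 2)
```

Note that the reconstruction is, by construction, a linear function of the code vector; this is exactly the restrictive assumption discussed above.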

LSA still has the advantages that it does not get trapped at local optima, it is fast on a conventional computer, and it does not require nearly as much training data as methods that fit more complex models with many more parameters. LSA is historically important because it showed that a large document corpus contains a lot of information about meaning that is relatively easy to extract using a sensible statistical method. As a cognitive model, however, LSA has been made rather implausible by the fact that nonlinear, multilayer methods work much better.

A probabilistic version of LSA (pLSA) was introduced by Hofmann (1999), using the assumption that each word is modeled as a single sample from a mixture of topics. The mixing proportions of the topics are specific to the document, but the probability distribution over words that is defined by each topic is the same for all documents. For example, a topic such as "soccer" would have a fixed probability of producing the word "goal," and a document containing a lot of soccer-related words would have a high mixing proportion for the topic "soccer." To make this into a proper generative model of documents, it is necessary to define a prior distribution over the document-specific topic distributions. This gives rise to a model called "Latent Dirichlet Allocation," which was introduced by Blei, Ng, and Jordan (2003).
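The generative story behind pLSA can be sketched in a few lines. The vocabulary, topic distributions, and mixing proportions below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["goal", "striker", "bank", "price", "the"]
# Each topic is a fixed distribution over words, shared by all documents.
topics = np.array([
    [0.4, 0.4, 0.0, 0.0, 0.2],   # a "soccer"-like topic
    [0.0, 0.0, 0.4, 0.4, 0.2],   # a "finance"-like topic
])

def sample_document(mixing, n_words):
    """Generate a document under the pLSA story: each word is a single
    sample from a document-specific mixture of topics."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(mixing), p=mixing)    # pick a topic for this word
        w = rng.choice(len(vocab), p=topics[z])  # pick a word from that topic
        words.append(vocab[w])
    return words

# A document with a high mixing proportion for the "soccer" topic:
doc = sample_document(mixing=np.array([0.9, 0.1]), n_words=20)
```

Latent Dirichlet Allocation completes this into a proper generative model by drawing the document-specific `mixing` vector from a Dirichlet prior rather than treating it as a free parameter.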

All these models can be viewed as graphical models (Jordan, 1999) in which hidden topic variables have directed connections to variables that represent word counts. One major drawback is that exact inference is intractable due to explaining away (Pearl, 1988), so they have to resort to slow or inaccurate approximations to compute the posterior distribution over topics. A second major drawback, which is shared by all mixture models, is that they can never make predictions for words that are sharper than the distributions predicted by any of the individual topics. They are unable to capture an important property of distributed representations, which is that the broad distributions predicted by individual active features get multiplied together (and renormalized) to give the sharp distribution predicted by a whole set of active features. This intersection or "conjunctive coding" property allows individual features to be fairly general but their joint effect to be much more precise. The "disjunctive coding" employed by mixture models cannot achieve precision in this way. For example, distributed representations allow the topics "torture," "deregulation," and "oil" to combine to give very high probability to a few familiar names that are not predicted nearly as strongly by each topic alone. Since the introduction of the term "distributed representation" (Hinton, McClelland, & Rumelhart, 1986), its meaning has evolved beyond the original definition in terms of set intersections, but in this paper the term is used in its original sense.
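The difference between disjunctive and conjunctive coding can be seen numerically. In this toy sketch (the two feature distributions are made up), averaging two broad distributions, as a mixture does, can never be sharper than its components, whereas multiplying and renormalizing them concentrates the mass on the words favored by both features:

```python
import numpy as np

# Two deliberately broad "feature" distributions over a 4-word vocabulary.
p1 = np.array([0.30, 0.30, 0.30, 0.10])  # active feature A
p2 = np.array([0.30, 0.10, 0.30, 0.30])  # active feature B

# Disjunctive (mixture) prediction: a convex combination of the topics.
mixture = 0.5 * p1 + 0.5 * p2

# Conjunctive (distributed) prediction: multiply and renormalize, so
# words favored by *both* features dominate.
product = p1 * p2
product /= product.sum()

def entropy(p):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -np.sum(p * np.log(p + 1e-12))
```

Here the product distribution has lower entropy (is sharper) than the mixture, even though both are built from the same two broad components.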

Welling, Rosen-Zvi, and Hinton (2005) point out that fast inference is vital for information retrieval, and to achieve it they introduce a class of two-layer undirected graphical models that generalize restricted Boltzmann machines (RBMs; see Section 2) to exponential family distributions, thus allowing them to model nonbinary data and to use nonbinary hidden variables. Maximum likelihood learning is intractable in these models because they use nonlinear distributed representations, but learning can be performed efficiently by following an approximation to the gradient of a different objective function called "contrastive divergence" (Hinton, 2002). Several further developments of these undirected models (Gehler, Holub, & Welling, 2006; Xing, Yan, & Hauptmann, 2005) show that they are competitive with their directed counterparts in terms of retrieval accuracy.

There are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables, and a network with multiple, nonlinear hidden layers should be able to discover representations that work better for retrieval. In this paper, we present a deep generative model whose top two layers form an undirected bipartite graph (see Fig. 1). The lower layers form a multilayer directed belief network, but unlike Latent Dirichlet Allocation this belief net uses distributed representations. The model can be trained efficiently by using an RBM to learn one layer of hidden variables at a time (Hinton et al., 2006; Hinton, 2007a). After learning the features in one hidden layer, the activation vectors of those features when they are being driven by data are used as the "data" for training the next hidden layer.
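A schematic version of this layer-by-layer procedure might look as follows. It uses simple binary-binary RBMs trained with one-step contrastive divergence on random toy data; biases and other practical details are omitted, and the actual model uses the visible units for word counts described in Section 3:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Schematic CD-1 training of a binary-binary RBM (biases omitted
    for brevity; a real implementation would include them)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                     # hidden probabilities
        h0_sample = (rng.random(h0.shape) < h0) * 1.0
        v1 = sigmoid(h0_sample @ W.T)            # one-step reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)  # CD-1 update
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy layer-by-layer pretraining: the activation vectors of one
    trained RBM's hidden layer become the 'data' for the next RBM."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)   # data-driven activations feed the next layer
    return weights

data = (rng.random((100, 30)) < 0.3) * 1.0
weights = pretrain_stack(data, [20, 10, 5])
```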

After this greedy "pretraining" is complete, the composition of all of the RBMs yields a feed-forward "encoder" network that converts word-count vectors to compact codes. By composing the RBMs in the opposite order (but with the same weights) we get a "decoder" network that converts compact code vectors into reconstructed word-count vectors. When the encoder and decoder are combined, we get a multilayer autoencoder network that converts word-count vectors into reconstructed word-count vectors via a compact bottleneck. This autoencoder network only works moderately well, but it is an excellent starting point for a fine-tuning phase of the learning which uses back-propagation to greatly improve the reconstructed word counts.
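This composition of encoder and decoder can be sketched as follows. The weight matrices here are random stand-ins for pretrained RBM weights, the layer sizes (30-20-10) are invented, and sigmoids play the role of the componentwise nonlinearity; note that encoding is just one matrix multiplication and nonlinearity per hidden layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, weights):
    """Feed-forward encoder: one matrix multiply and componentwise
    nonlinearity per hidden layer, in the pretrained order."""
    for W in weights:
        x = sigmoid(x @ W)
    return x

def decode(code, weights):
    """Decoder: the same weights applied in the opposite order (and
    transposed), converting a compact code into a reconstruction."""
    x = code
    for W in reversed(weights):
        x = sigmoid(x @ W.T)
    return x

rng = np.random.default_rng(3)
# Stand-ins for pretrained weights of a 30-20-10 encoder.
weights = [0.1 * rng.standard_normal(s) for s in [(30, 20), (20, 10)]]

v = (rng.random((4, 30)) < 0.3) * 1.0
code = encode(v, weights)      # 10-dimensional bottleneck codes
recon = decode(code, weights)  # reconstructed inputs
```

Fine-tuning then treats `encode` followed by `decode` as one deep network and back-propagates the reconstruction error through all of the weights.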

In general, the representations produced by greedy unsupervised learning are helpful for regression or classification, but this typically requires large hidden layers that recode the structure in the input as complex sparse features while retaining almost all of the information in the input. When the hidden layers are much smaller than the input layer, a further type of learning is required (Hinton & Salakhutdinov, 2006). After the greedy, layer-by-layer training, the deep generative model of documents is not significantly better for document retrieval than a model with only one hidden layer. To take full advantage of the multiple hidden layers, the layer-by-layer learning must be treated as a "pretraining" stage that finds a good region of the parameter space. Starting in this region, back-propagation learning can be used to fine-tune the parameters to produce a much better model. The back-propagation fine-tuning is not responsible for discovering what features to use in the hidden layers of the autoencoder. Instead, it just has to slightly modify the features found by the pretraining in order to improve the reconstructions. This is a much easier job for a myopic, gradient descent procedure like back-propagation than discovering what features to use. After learning, the mapping from a word-count vector to its compact code is very fast, requiring only a matrix multiplication followed by a componentwise nonlinearity for each hidden layer.

In Section 2 we introduce the RBM. A longer and gentler introduction to RBMs can be found in Hinton (2007a). In Section 3 we generalize RBMs in two ways to obtain a generative model for word-count vectors. This model can be viewed as a variant of the Rate Adaptive Poisson model (Gehler et al., 2006) that is easier to train and has a better way of dealing with documents of different lengths. In Section 4 we describe both the layer-by-layer pretraining and the fine-tuning of the deep generative model. We also show how "deterministic noise" can be used to force the fine-tuning to discover binary codes in the top layer. In Section 5 we show that 128-bit binary codes are slightly more accurate than the 128-dimensional real-valued codes produced by LSA, in addition to being faster and more compact. We also show that by using the 128-bit binary codes to restrict the set of documents searched by TF-IDF (Salton & Buckley, 1988), we can slightly improve the accuracy and vastly improve the speed of TF-IDF. Finally, in Section 6 we show that we can use our model to allow retrieval in a time that is independent of the number of documents: a document is mapped to a memory address in such a way that a small Hamming ball around that address contains the semantically similar documents. We call this technique "semantic hashing" (Salakhutdinov & Hinton, 2007).
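The Hamming-ball lookup at the heart of semantic hashing can be sketched as follows. The code length, memory contents, and document ids below are all hypothetical, and real codes would come from the trained encoder:

```python
from itertools import combinations

N_BITS = 32  # assumed code length for this toy example

def hamming_ball(code, radius):
    """All integer codes within the given Hamming distance of `code`."""
    neighbours = [code]
    for r in range(1, radius + 1):
        for bits in combinations(range(N_BITS), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b   # flip bit b
            neighbours.append(flipped)
    return neighbours

# Hypothetical memory: address = binary code, contents = document ids.
memory = {0b0110: ["doc_a"], 0b0111: ["doc_b"], 0b1111: ["doc_c"]}

def retrieve(code, radius):
    """Semantic hashing lookup: probe every address in a small Hamming
    ball around the query's code. The cost depends on the code length
    and radius, not on the number of documents stored."""
    hits = []
    for address in hamming_ball(code, radius):
        hits.extend(memory.get(address, []))
    return hits
```

For example, `retrieve(0b0110, 1)` finds the document stored at the query's own address plus any stored one bit-flip away, without ever scanning the whole collection.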