The Hidden Markov Topic Model: A Probabilistic Model of Semantic Representation


Correspondence should be sent to Mark Andrews, Department of Cognitive, Perceptual, and Brain Sciences, Division of Psychology and Language Sciences, University College London, 2 Bedford Way, London WC1H 0AP. E-mail:


In this paper, we describe a model that learns semantic representations from the distributional statistics of language. This model, however, goes beyond the common bag-of-words paradigm, and infers semantic representations by taking into account the inherent sequential nature of linguistic data. The model we describe, which we refer to as a Hidden Markov Topics model, is a natural extension of the current state of the art in Bayesian bag-of-words models, that is, the Topics model of Griffiths, Steyvers, and Tenenbaum (2007), preserving its strengths while extending its scope to incorporate more fine-grained linguistic information.

1. Introduction

How word meanings are represented and learned is a foundational problem in the study of human language use. Within cognitive science, a promising recent approach to this problem has been the study of how the meanings of words can be learned from their statistical distribution across the language. This approach is motivated by the so-called distributional hypothesis, originally due to Harris (1954) and Firth (1957), which proposes that the meaning of a word can be derived from the linguistic contexts in which it occurs. Numerous large-scale computational implementations of this approach—including, for example, the work of Schütze (1992), the HAL model (e.g., Lund, Burgess, & Atchley, 1995), the LSA model (e.g., Landauer & Dumais, 1997) and, most recently, the Topics model (e.g., Griffiths, Steyvers, & Tenenbaum, 2007)—have successfully demonstrated that the meanings of words can, at least in part, be derived from their statistical distribution in language.

Important as these computational models have been, one of their widely shared practices has been to treat the linguistic contexts in which a word occurs as unordered sets of words. In other words, the linguistic context of any given word is defined by which words co-occur with it and with what frequency, while all fine-grained sequential and syntactic information is disregarded. By disregarding these types of data, these so-called bag-of-words models drastically restrict the information from which word meanings can be learned. All languages have strong syntactic-semantic correlations. The sequential order in which words occur, as well as the argument structure and general syntactic relationships within sentences, provide vital information about the possible meanings of words. This information is unavailable to bag-of-words models, and consequently the extent to which they can extract semantic information from text, or adequately model human semantic learning, is limited.

In this paper, we describe a distributional model that goes beyond the bag-of-words paradigm. This model is a natural extension to the current state of the art in probabilistic bag-of-words models, namely the Topics model described in Griffiths et al. (2007) and elsewhere. The model we propose is a seamless continuation of the Topics model, preserving its strengths—its thoroughly unsupervised learning, its hierarchical Bayesian nature—while extending its scope to incorporate more fine-grained sequential and syntactic data.

2. The Topics model

The standard Topics model as described in Griffiths and Steyvers (2002, 2003); Griffiths et al. (2007) is a probabilistic generative model for texts and is based on the latent Dirichlet allocation (LDA) model of Blei, Ng, and Jordan (2003). It stipulates that each word in a corpus of texts is drawn from one of K latent distributions φ1…φk…φK, with each φk being a probability distribution over the V word-types in a fixed vocabulary. These distributions are the so-called topics that give the model its name. Some examples, learned by a Topics model described in Andrews, Vigliocco, and Vinson (2009), are given in Table 1. As is evident from this table, each topic is a cluster of related terms that corresponds to a coherent semantic theme, or subject-matter. As such, the topic distributions correspond to the semantic knowledge learned by the model, and the semantic representation of each word in the vocabulary is given by a distribution over them.

Table 1. 
Topics taken from an LDA model described in Andrews et al. (2009)
  Note. Each column gives the seven most probable word types in each topic.


To describe the Topics model more precisely, let us assume we have a corpus of J texts w1…wj…wJ, where the jth text wj is a sequence of nj words, that is, wj1…wji…wjnj. Each word, in each text, is assumed to be sampled from one of the model's K topics. Which one of these topics is chosen is determined by the value of a latent variable xji that corresponds to each word wji. This variable takes on one of K discrete values and is determined by sampling from a probability distribution πj, which is specific to text j. As such, we can see that each text wj is assumed to be a set of nj independent samples from a mixture model. This mixture model is specific to text j, as the mixing distribution is determined by πj. However, across texts, all mixing distributions are drawn from a common Dirichlet distribution with parameters α. Given known values for φ and α, the likelihood of the entire corpus is given by
P(w_{1:J} \mid \phi, \alpha) = \prod_{j=1}^{J} \int d\pi_j \, P(\pi_j \mid \alpha) \prod_{i=1}^{n_j} \sum_{k=1}^{K} P(w_{ji} \mid \phi_k) \, P(x_{ji} = k \mid \pi_j). \qquad (1)
In this, we see that the Dirichlet distribution P(πj | α) introduces a hierarchical coupling of all the mixing distributions. As such, the standard Topics model is a hierarchical coupling of mixture models, effectively defining a continuum of mixture models, all sharing the same K component topics.
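The generative process just described can be made concrete with a short sketch. The following is our own illustration, not the model's original implementation; numpy and all parameter values here are assumptions chosen only for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def generate_corpus(phi, alpha, n_docs, doc_len):
    """Sample texts from the standard Topics (LDA) generative process.

    phi   : (K, V) array; row k is the topic distribution phi_k over word types.
    alpha : (K,) Dirichlet parameters coupling the per-text mixing distributions.
    """
    K, V = phi.shape
    corpus = []
    for _ in range(n_docs):
        pi = rng.dirichlet(alpha)              # text-specific mixing distribution
        x = rng.choice(K, size=doc_len, p=pi)  # latent topic of each word
        w = np.array([rng.choice(V, p=phi[k]) for k in x])
        corpus.append(w)
    return corpus

# Toy illustration: K = 3 topics over a V = 6 word vocabulary
phi = rng.dirichlet(np.ones(6), size=3)
docs = generate_corpus(phi, alpha=np.full(3, 0.5), n_docs=5, doc_len=20)
```

Because every text's mixing distribution is drawn from the same Dirichlet, the texts are coupled exactly as described above: a continuum of mixture models sharing the same K topics.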

The standard Topics model is thus an example of a hierarchical language model. Its Bayesian network is shown in Fig. 1. It assumes that each text is generated according to an elementary language model, specifically a mixture of unigram distributions, and these models are then hierarchically coupled with one another. From this, it is evident that the standard model can be extended simply by changing the elementary language model on which it is based. There are multiple language models that could be used in this respect. One possibility is a bigram language model, as described in Wallach (2006). Another possibility is a language model based on a full phrase-structure grammar, as described in Boyd-Graber and Blei (2009). However, in that work, the syntactic structure underlying the sentences in the texts is assumed to be known in advance and is provided by a syntactically tagged corpus. In what follows, we describe a Topics model whose elementary language model is a Hidden Markov model (HMM). We refer to this as the Hidden Markov Topics model (HMTM).

Figure 1.

 A Bayesian network diagram of the standard Topics model described in Griffiths et al. (2007) and elsewhere. Details are provided in the main text. Note that β denotes the parameters of a V-dimensional Dirichlet distribution, from which each of K topic distributions is sampled.

3. Hidden Markov Topics model

In the HMTM, just as in the standard Topics model, each wji is drawn from one of K topics, the identity of which is determined by the latent variable xji. However, rather than sampling each xji independently from a probability distribution πj, as in the standard model, these latent variables are generated by a Markov chain that is specific to text j. By direct analogy with the standard model, across the texts, the parameters of these Markov chains are drawn from a common set of Dirichlet distributions. As such, the HMTM is a hierarchical coupling of HMMs, defining a continuum of Hidden Markov models, all sharing the same state to output mapping.

In the HMTM, the likelihood of the corpus is
P(w_{1:J} \mid \phi, \alpha, \gamma) = \prod_{j=1}^{J} \iint d\pi_j \, d\theta_j \, P(\pi_j, \theta_j \mid \alpha, \gamma) \sum_{x_{j1} \ldots x_{jn_j}} P(x_{j1} \mid \pi_j) \, P(w_{j1} \mid \phi_{x_{j1}}) \prod_{i=2}^{n_j} P(x_{ji} \mid x_{ji-1}, \theta_j) \, P(w_{ji} \mid \phi_{x_{ji}}). \qquad (2)
Here, θj and πj are the parameters of the Markov chain of latent states in text j: θj is the K × K state-transition matrix (i.e., the kth row of θj gives the probability of transitioning to each of the K states, given that the current state is k), and πj is the initial distribution over the K states. The distribution over the πj and θj, that is, P(πj,θj | α,γ), is a set of independent Dirichlet distributions, where α are the parameters for the Dirichlet distribution over πj, and γ1…γK are the parameters for the Dirichlet distributions over each of the K rows of θj.
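This generative process can also be sketched in code. The following is an illustrative sketch of our own (not the paper's implementation; numpy and the function name are assumptions) showing how a single text would be sampled from the HMTM:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_hmtm_text(phi, alpha, gamma, n_words):
    """Sample one text from the HMTM generative process.

    phi   : (K, V) topic distributions, shared across all texts.
    alpha : (K,) Dirichlet parameters for the initial-state distribution pi_j.
    gamma : (K, K) Dirichlet parameters; row k parameterizes row k of theta_j.
    """
    K, V = phi.shape
    pi = rng.dirichlet(alpha)                            # initial distribution for text j
    theta = np.stack([rng.dirichlet(g) for g in gamma])  # text-specific transition matrix
    x = np.empty(n_words, dtype=int)
    x[0] = rng.choice(K, p=pi)
    for i in range(1, n_words):
        x[i] = rng.choice(K, p=theta[x[i - 1]])          # Markov dynamics over topics
    w = np.array([rng.choice(V, p=phi[k]) for k in x])   # emit a word from topic x_i
    return w, x
```

The only change relative to the standard model is that the latent topic indicators form a Markov chain rather than independent draws; the state-to-output mapping φ is shared by all texts.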

3.1. Learning and inference

From this description of the HMTM, as well as from its Bayesian network diagram given in Fig. 2, we can see that the HMTM has four sets of parameters whose values must be inferred from a training corpus of texts. These are the K topic distributions, that is, φ, the Dirichlet parameters for the latent-variable Markov chains, that is, α and γ, and the Dirichlet parameters from which the topic distributions are drawn, that is, β.1 The posterior over φ, α, β, and γ given a set of J texts w1:J is
P(\phi, \alpha, \beta, \gamma \mid w_{1:J}) \propto P(w_{1:J} \mid \phi, \alpha, \gamma) \, P(\phi \mid \beta) \, P(\alpha, \beta, \gamma), \qquad (3)
where the likelihood term P(w1:J | φ,α,γ) is given by Eq. 2. This distribution is intractable (as is even the likelihood term itself), and as such it is necessary to use Markov Chain Monte Carlo (MCMC) methods to sample from it.

Figure 2.

 A Bayesian network diagram for the Hidden Markov Topic model. As in the standard model, depicted in Fig. 1, each observed variable wji is assumed to be sampled from one of K topic distributions, collectively specified by φ. Likewise, which one of these distributions is chosen is determined by the value of the unobserved variable xji. In the standard model, however, the latent variables xj1…xjnj are drawn independently from a distribution πj, which is specific to text j. In the HMTM, by contrast, xj1…xjnj are drawn from a Markov chain specific to the jth text. The parameters of this chain are θj (i.e., the K × K state-transition matrix for chain j) and πj (i.e., the initial conditions for the jth chain). For all J texts, the parameters θj and πj are drawn from a common set of Dirichlet distributions. Each θj is drawn from K-dimensional Dirichlet distributions, whose parameters are collectively specified by γ. Each πj is drawn from a K-dimensional Dirichlet distribution α. Also shown here is β, the parameters of a V-dimensional Dirichlet distribution from which the K topic distributions φ are sampled.

There are different options available for MCMC sampling. The method we employ is a Gibbs sampler that draws samples from the posterior over α, β, γ, and x1:J. Here, x1:J are the sequences of latent variables for each of the J texts, that is, x1, x2…xJ, with xj = xj1…xjnj. This Gibbs sampler has the useful property of integrating over φ, π1:J, and θ1:J, which improves both computational efficiency and speed of convergence.

On convergence of the Gibbs sampler, we will obtain samples from the joint posterior distribution P(x1:J,α,β,γ | w1:J). From this, other variables of interest can be obtained. For example, it is desirable to know the likely values of the topic distributions φ1…φk…φK. Given known values for x1:J and β, the posterior distribution over φ is simply a product of Dirichlet distributions, from which samples are easily drawn and averaged. Further details of this Gibbs sampler are provided in the Appendix.
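This last step can be sketched concretely. Conditioned on the latent assignments and β, each topic's posterior is a Dirichlet whose parameters add the topic-word counts to β. The following minimal illustration is our own (the function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def topic_posterior_params(words, states, K, V, beta):
    """Parameters of the conditional posterior over the topic distributions.

    Given fixed latent assignments, P(phi_k | x, w, beta) is Dirichlet with
    parameters beta_v + n_kv, where n_kv counts how often word type v was
    assigned to topic k. Here `beta` is a (V,) array of Dirichlet parameters.
    """
    counts = np.zeros((K, V))
    for w, x in zip(words, states):
        counts[x, w] += 1
    return beta + counts  # row k: posterior Dirichlet parameters for phi_k

# Averaging samples drawn from these Dirichlets approximates the posterior
# mean, whose closed form is params / params.sum(axis=1, keepdims=True).
params = topic_posterior_params([0, 1, 0], [2, 0, 2], K=3, V=2, beta=np.ones(2))
```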

3.2. Demonstration

Here, we demonstrate the operation of the HMTM on a toy problem. In this problem, we generate a data-set from a HMTM with known values for φ, α, and γ. We can then train another HMTM, whose parameters are unknown, with this training data-set. Using the Gibbs sampler, we can draw samples from the posterior over φ, α, β, and γ, and then compare these samples to the true parameter values that generated the training data.

In the example we present here, we use a ‘‘vocabulary’’ of V = 25 symbols and a set of K = 10 ‘‘topics.’’ As is common practice in demonstrations of related models, the ‘‘topics’’ we use in this example can be visualized using the grids shown below.

[Figure: the K = 10 topic distributions, each depicted as a 5 × 5 grid.]

Each grid, although shown as a 5 × 5 square, is just a 25-dimensional array, with 5 high values (darker) and 20 low values (lighter). Each of these arrays corresponds to one of the topic distributions in our toy problem; that is, each one places the majority of its probability mass on a different set of 5 symbols.

The γ parameters in the HMTM are a set of K-dimensional Dirichlet parameters; each γk is an array of K non-negative real numbers. A common reparameterization of Dirichlet parameters such as γk is as smk, where the scale s is the sum of the elements of γk and the location mk is γk divided by s. For each set of K Dirichlet parameters, we used s = 3.0. The K arrays m1…mK are depicted below on the left.

[Figure: left, the location parameters m1…mK of the Dirichlet distributions; right, two sets of state-transition parameters sampled from them.]

From these Dirichlet parameters, we may generate arbitrarily many sets of state-transition parameters for a 10-state HMM. As an example, we show two such sets on the right. As is evident, these parameters retain characteristics of the patterns found in the original Dirichlet parameters. We can see that, on average, the state-transition dynamics lead each state to map to either the state before it or the state after it. For example, on average, state 2 maps to state 1 or state 3, state 3 maps to state 2 or state 4, and so on. While this average dynamical behavior is simple, it is not trivial and does not lead to fixed-point or periodic trajectories. Note also that the small differences in the transition dynamics can lead to quite distinct dynamical behaviors in their respective HMMs.
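The scale-location reparameterization used above is easy to state in code. The following small helpers are our own illustration (names are assumptions):

```python
import numpy as np

def split_dirichlet(gamma_k):
    """Reparameterize Dirichlet parameters gamma_k as a scale s and location m_k."""
    s = gamma_k.sum()  # scale: total concentration
    m = gamma_k / s    # location: a point on the K-simplex
    return s, m

def join_dirichlet(s, m):
    """Recover the Dirichlet parameters from the (s, m) reparameterization."""
    return s * m
```

With s = 3.0 and a uniform location, for instance, join_dirichlet(3.0, np.full(10, 0.1)) gives a symmetric 10-dimensional Dirichlet.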

On the basis of these φ and γ, and using an array of K ones as the parameters α, we generated J = 50 training ‘‘texts’’ as follows. For each text j, we drew a set of initial conditions and transition parameters for a HMM from α and γ, respectively. We then iterated the HMM for nj = 100 steps to obtain a state trajectory xj1…xjnj. On the basis of the value of each xji, we chose the appropriate topic (i.e., if xji = k, we chose φk) and drew an observation wji from it. The training data thus comprised a set of J symbol sequences, with each symbol taking a value from the set {1…25}.

Using this as training data, we trained another HMTM whose parameters were unknown. As described earlier, the Gibbs sampler draws samples from the posterior distribution over x1:J, α, β, and γ. From these samples, we may also draw samples from the posterior over the topics. In Fig. 3, we graphically present some results of these simulations. Shown in this figure are averages of samples drawn from the posterior over φ and m1…mK, that is, the location parameters of γ1…γK. On the top row of Fig. 3, we show averages of samples drawn after 20 iterations of the sampler. On the lower row, we show averages of samples drawn after 100 iterations. In both cases, these averages are over 10 samples taken two iterations apart. To the left in each case are the inferred topics; to the right are the inferred locations of the Dirichlet parameters. These inferred parameters can be compared to the true parameters shown earlier. By doing so, it is clear that even after 20 iterations of the sampler, patterns in the topic distributions have been discovered. After 100 iterations, the inferred parameters are almost identical to the originals.

Figure 3.

 Averages of samples from the posterior distribution over the topic distributions (left) and locations of the γ parameters (right). Shown on the top row are averages, over 10 draws, drawn after 20 iterations. Shown on the lower row are averages, again over 10 draws, taken after 100 iterations. Compare with the true parameters shown on the previous page.

Although not shown, the Gibbs sampler also successfully draws samples from the posterior over α, β,2 and the scale parameter s for γ. In addition, we may use draws from the posterior to estimate, using the harmonic mean method, the marginal likelihood of the model under a range of different numbers of topic distributions. Although the harmonic mean method is known to be unreliable in general, we have found that in practice it consistently leads to an estimate of the correct number of topics.
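For concreteness, the harmonic mean estimator mentioned here can be sketched as follows. This is our own illustration, computed in log space for numerical stability; the function name is an assumption:

```python
import numpy as np

def harmonic_mean_log_ml(log_liks):
    """Harmonic-mean estimate of the log marginal likelihood.

    log_liks: log P(w | params_t) for T draws from the posterior. The estimate
    is -log[(1/T) * sum_t exp(-log_liks[t])], computed stably in log space.
    """
    a = -np.asarray(log_liks, dtype=float)
    amax = a.max()
    log_mean_inverse_lik = amax + np.log(np.exp(a - amax).mean())
    return -log_mean_inverse_lik
```

As the main text notes, this estimator has well-known shortcomings (it is dominated by the smallest likelihoods in the sample), but it can still be adequate for ranking models with different numbers of topics.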

4. Learning topics from text

In this final section, we present some topics learned by a HMTM trained on a corpus of natural language. The corpus used was a subsample of the British National Corpus (BNC).3 The BNC is annotated with the structural properties of a text, such as sectioning and subsectioning information, and this annotation facilitates the processing of the corpus. To extract texts from the BNC, we extracted contiguous blocks that were labeled as high-level sections, roughly corresponding to individual articles, letters, or chapters. These sections varied in size from tens to thousands of words, and from them we chose only those texts that were approximately 150–250 words in length. This length is typical of, for example, a short newspaper article. Following these criteria, we then sampled 2,500 individual texts to be used for training. Of all the word types that occurred within this subset of texts, we excluded those that occurred fewer than five times overall, replacing each of their occurrences with a marker symbol. This restriction resulted in a total of 5,182 unique words.

We trained a HMTM with K = 120 topics using this corpus. After a burn-in period of 1,000 iterations, we drew 50 samples from the posterior over the latent trajectories and β, with each sample taken 10 iterations apart. We used these to draw samples from the posterior over the topics, which were then averaged, as described earlier. In the upper part of Table 2, we present seven averaged topics from the HMTM simulation. For the purposes of comparison, in the lower part of Table 2 we present seven averaged topics taken from a standard Topics model. This standard Topics model was trained on a larger corpus and is described in detail in Andrews et al. (2009). The topics shown for the standard model are those that most closely match the chosen HMTM topics.

Table 2. 
Examples of topics learned by a HMTM (top row) that was trained on a set of documents taken from the BNC
  Note. On the lower row, we show topics from a standard Topics model, also trained on a set of documents from the BNC.

HMTM Topics
Standard Topics

The side-by-side comparison provides an appreciation of how the topics in a HMTM differ from those of the standard model. In the HMTM, the topics are more semantically refined, referring to more specific categories of things or events. For example, we see that the first topic on the left in the HMTM refers to alcoholic beverages, specifically those associated with a (British) pub. By contrast, the corresponding topic from the standard model is less specifically about beverages and refers more generally to things of, or relating to, pubs. In the next example, we see that the topic from the HMTM refers to farm animals. By contrast, the corresponding topic from the standard model is less specifically about farm animals and more related to agriculture in general. In all of the examples shown, a similar pattern holds.

From this demonstration, we can see that more refined semantic representations can be learned when sequential information is taken into account. Why sequential information confers this benefit can be understood by way of the following simple example. Words like horse, cow, and mule are likely to occur as subjects of verbs like eat and chew, while words like grass, hay, and grain are likely to occur as their objects. A model that learns topics by taking this sequential information into account may learn that words like horse, cow, and mule form a coherent topic. Likewise, such a model may infer other topics based on words like eat and chew, or words like grass, hay, and grain. By contrast, the standard Topics model, based on the assumption that the sequential information in a text is irrelevant, is likely to conflate these separate topics into one single topic referring to, for example, farms or farming.

5. Discussion

The general problem that has motivated the work presented here is the problem of how word meanings are represented and learned. As mentioned, a promising recent approach to this general problem has been the study of how the meanings of words can be learned from their statistical distribution across the language. Across disciplines such as cognitive science, computational linguistics, and natural language processing, the various computational approaches to modeling the role of distributional statistics in the learning of word meanings can be seen to fall into two broad groups, each characterized largely by the granularity of statistical information that they consider. On the one hand, there are those models that focus primarily on broad or coarse-grained statistical structures. These models, which include the seminal work of, for example, Schütze (1992); Lund et al. (1995); Landauer and Dumais (1997), have been widely adopted and have many appealing characteristics. In most cases, they are large-scale, broad coverage, unsupervised learning models that are applied to plain and untagged corpora. In addition, particularly in recent models such as Griffiths et al. (2007), they are often also fully specified probabilistic generative language models. However, one of the widely shared practices of models like this is to adopt a bag-of-words assumption and treat the linguistic contexts in which a word occurs as unordered sets of words. The obvious limitation of this approach is that all fine-grained sequential and syntactic statistical information is disregarded.

In contrast to coarse-grained distributional models, a second group of models has focused specifically on the role of more fine-grained sequential and syntactic structures (e.g., Padó & Lapata, 2007; Pereira, Tishby, & Lee, 1993). The obvious appeal of these models is that by focusing on the sequential order, argument structure, and general syntactic relationships within sentences, they capture syntactic-semantic correlations in language that can provide vital information about the possible meanings of words (see, e.g., Alishahi & Stevenson, 2008; Gillette, Gleitman, Gleitman, & Lederer, 1999). However, in practice, these fine-grained distributional models have often focused on extracting semantic information for constrained or specific problems, such as automatic thesaurus generation (Curran & Moens, 2002a,b; Lin, 1998) or taxonomy generation (Widdows, 2003), rather than on developing large-scale probabilistic language models. This has limited their suitability as cognitive models. In addition, they have been heavily reliant on supervised learning methods using part-of-speech tagged or parsed corpora.

In this paper, we have described the HMTM, a model that attempts to integrate the coarse- and fine-grained approaches to modeling distributional structure. This model is intended to preserve the strengths of both approaches while avoiding their principal limitations. It is scalable, unsupervised, and trained on untagged corpora. However, it goes beyond the dominant bag-of-words paradigm by incorporating sequential and syntactic structure. In this respect, it is similar to the beagle model described in Jones and Mewhort (2007). However, unlike beagle, the HMTM, being a direct generalization of the LDA model, is a fully specified probabilistic generative language model. The HMTM also extends beyond a model described in Griffiths et al. (2007) and Griffiths, Steyvers, Blei, and Tenenbaum (2005) that couples a HMM with a standard Topics model. In particular, the HMTM is designed to learn semantic representations by directly availing of the sequential information in text. By contrast, the model described in Griffiths et al. (2005, 2007) learns semantic representations using a standard Topics model (i.e., latent variables are drawn by independently sampling from a fixed distribution), while the HMM is used to learn syntactic categories. More generally, the HMTM is an example of a hierarchical language model. Within cognitive science, computational linguistics, and NLP research, probabilistic language models such as HMMs and PCFGs have been used extensively. However, hierarchically coupled versions of these models have yet to be introduced. The HMTM is a hierarchically coupled HMM and can in principle be extended to the case of a hierarchically coupled PCFG. We believe that these models, and their training by Bayesian methods on large-scale data-sets, may be an important development in the complexity of the language models used in these fields.


  • 1

     The β parameters can be seen as a prior over φ, but a prior that will be inferred from the data rather than simply assumed.

  • 2

     For the case of β, we used a symmetric Dirichlet distribution.

  • 3

     The BNC is a 100-million-word corpus of contemporary written and spoken English in the British Isles. According to its publishers, the texts that make up the written component include ‘‘extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays.’’


We would like to thank Mark Steyvers, Roger Levy, Nick Chater, and Yee Whye Teh for their valuable comments and discussion. This research was supported by European Union (FP6-2004-NEST-PATH) grant 028714.


The Gibbs sampler for the HMTM model draws samples from P(x1:J,α,β,γ | w1:J). It does so by iteratively sampling each latent variable xji, conditioned on the values of all other latent variables and on α, β, and γ. The conditional distribution over xji is given by
P(x_{ji} \mid x_{-[ji]}, w_{1:J}, \alpha, \beta, \gamma) \propto P(w_{ji} \mid x_{ji}, w_{-[ji]}, x_{-[ji]}, \beta) \, P(x_{ji} \mid x_{-[ji]}, \alpha, \gamma),
where we denote the set of latent variables excluding xji by x−[ji], and denote the set of observables excluding wji by w−[ji]. Superficially, this conditional distribution appears identical to the conditional distributions in the Gibbs sampler for the standard Topics model, as described in Griffiths and Steyvers (2002, 2003); Griffiths et al. (2007). However, due to the non-independence in the latent trajectory that results from the Markov dynamics, the term P(xji | x−[ji],α,γ) must be calculated as a ratio of Polya distributions, that is,
P(x_{ji} \mid x_{-[ji]}, \alpha, \gamma) = \frac{P(x_{j1} \ldots x_{jn_j} \mid \alpha, \gamma)}{P(x_{j1} \ldots x_{ji-1}, x_{ji+1} \ldots x_{jn_j} \mid \alpha, \gamma)},

in which the numerator and denominator are each obtained by integrating out πj and θj, yielding products of Polya distributions over the initial-state and transition counts of the jth chain.
The Dirichlet parameters α, β, and γ may also be sampled by Gibbs sampling. For example, each γk is reparameterized as smk (as described in the main text). Assuming known values for all other variables, the conditional posterior distribution over s is log-concave and can be sampled using adaptive rejection sampling (ARS; see Gilks & Wild, 1992). Likewise, assuming known values for all other variables, the conditional posterior over the share of probability mass between any two elements of mk is also log-concave and can be sampled by ARS. As such, the Gibbs sampler sequentially samples each latent variable xji, each scale parameter of the Dirichlet parameters α, β, and γ, and the share of probability mass between every pair of elements of each of their location parameters.
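For concreteness, the log probability of a count vector under a Polya (Dirichlet-multinomial) distribution, as needed for the ratio of Polya distributions mentioned above, can be computed as in the following sketch. This is our own illustration; the multinomial coefficient is omitted because it cancels in the sampler's ratios:

```python
import numpy as np
from math import lgamma

def log_polya(counts, params):
    """Unnormalized log probability of a count vector under a Polya
    (Dirichlet-multinomial) distribution. The multinomial coefficient
    is omitted, since it cancels in the ratios used by the sampler."""
    counts = np.asarray(counts, dtype=float)
    params = np.asarray(params, dtype=float)
    return (lgamma(params.sum()) - lgamma(params.sum() + counts.sum())
            + sum(lgamma(p + c) - lgamma(p) for p, c in zip(params, counts)))
```

For example, a single draw of one category under a uniform Dirichlet(1, 1) prior has probability 1/2, and log_polya([1, 0], [1.0, 1.0]) returns log(1/2) accordingly.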