#### 2.1. Topic model

Generating a word token for a document *d* involves first selecting a topic *t* from the document–topic distribution *θ*^{(d)} and then selecting a word from the corresponding topic distribution *ϕ*^{(t)}. This process is repeated for each word token in the document. Let *z* be the random variable that represents the topic indices sampled from *θ*^{(d)}. We write *p*(*z*_{i} = *t*|*d*) as the probability that the *t*th topic was sampled for the *i*th word token (in document *d*) and *p*(*w*_{i}|*z*_{i} = *t*) as the probability of word *w*_{i} under topic *t*. The model specifies the following conditional probability of the *i*th word token in a document:

- *p*(*w*_{i}|*d*) = ∑_{t=1}^{T} *p*(*w*_{i}|*z*_{i} = *t*) *p*(*z*_{i} = *t*|*d*)  (1)

In the LDA model, Dirichlet priors are placed on both *ϕ* and *θ* to smooth the word–topic and topic–document distributions (for a description of Dirichlet priors, see Steyvers & Griffiths, 2007; Gelman, Carlin, Stern, & Rubin, 2003). In many applications, symmetric Dirichlet densities with single hyperparameters *α* and *β* are used for *θ* and *ϕ*, respectively. For all the topic models in this research, we will use a symmetric Dirichlet prior for *ϕ* with a single hyperparameter *β*. For the topic–document distributions *θ*, we will use an asymmetric Dirichlet prior with a vector *α* containing hyperparameter values for every topic (and every concept for concept–topic models). An asymmetric prior is useful when some concepts (or topics) are expressed in many documents across the collection while others appear in just a few. With an asymmetric prior, more skewed marginal distributions over *θ* can be obtained to express rare or frequent topics (or concepts).
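The difference between the two kinds of prior can be sketched in a few lines of NumPy; the dimensions and hyperparameter values below are toy choices for illustration, not values from this research:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 8   # number of topics and word types (toy values)

# Symmetric prior on phi: a single hyperparameter beta repeated for
# every word type, so no word is favored a priori.
beta = 0.1
phi = rng.dirichlet([beta] * V)          # one topic's word distribution

# Asymmetric prior on theta: a vector alpha with one value per topic,
# so some topics can be a priori more frequent across documents.
alpha = np.array([5.0, 1.0, 0.5, 0.5, 0.1])
theta = rng.dirichlet(alpha)             # one document's topic distribution
```

With the skewed *α* vector above, draws of *θ* will on average put most of their mass on the first topic, which is exactly the behavior used to express frequent versus rare topics.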

The sequential process of first picking a topic from a topic distribution, and then picking a word token from a distribution over word types associated with that topic can be formalized as follows:

- 1
For each topic *t* = 1, …, *T*, select a word distribution *ϕ*^{(t)} ∼ Dirichlet(*β*)
- 2
For each document *d* = 1, …, *D*:
  (a) Select a distribution over topics *θ*^{(d)} ∼ Dirichlet(*α*)
  (b) For each word token *i* in document *d*, select a topic *z*_{i} ∼ *θ*^{(d)} and then select a word *w*_{i} ∼ *ϕ*^{(z_{i})}

This generative process can be summarized by the graphical model shown in Fig. 2A. In the graphical notation, shaded and unshaded variables indicate observed and latent (i.e., unobserved) variables, respectively, and the arrows indicate the conditional dependencies between variables. The plates (the boxes in the figure) refer to repetitions of sampling steps with the variable in the right corner referring to the number of samples. For example, the inner plate over *z* and *w* illustrates the repeated sampling of topics and words until *N*_{d} word tokens have been generated for document *d*. The plate surrounding *θ* illustrates the sampling of a distribution over topics for each document *d* for a total of *D* documents. The plate surrounding *ϕ* illustrates the repeated sampling of distributions over word types for each topic until *T* topics have been generated.
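As a concrete illustration, the sequential sampling process can be sketched as follows; all dimensions and hyperparameter values are toy choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
T, V, D, N_d = 3, 10, 2, 6     # topics, word types, documents, tokens per doc (toy)
beta = 0.1
alpha = np.full(T, 0.5)

# Step 1: for each topic t, select a word distribution phi^(t) ~ Dirichlet(beta).
phi = rng.dirichlet([beta] * V, size=T)          # shape (T, V)

docs = []
for d in range(D):
    # Step 2a: select a distribution over topics theta^(d) ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    words = []
    for i in range(N_d):
        z = rng.choice(T, p=theta)               # sample a topic index z_i
        w = rng.choice(V, p=phi[z])              # sample a word w_i from phi^(z_i)
        words.append(w)
    docs.append(words)
```

Each inner iteration corresponds to one pass through the inner plate of the graphical model: a topic is drawn for the token and then a word type is drawn from that topic's distribution.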

Given the words in a corpus, the inference problem involves estimating the word–topic distributions *ϕ*, the topic–document distributions *θ*, and the topic assignments *z* of individual words to topics. These distributions can be learned in a completely unsupervised manner without any prior knowledge about topics or which topics are covered by which documents. One efficient technique for obtaining estimates of these distributions is through collapsed Gibbs sampling (Griffiths & Steyvers, 2004). Steyvers and Griffiths (2007) present a tutorial introduction to topic models that discusses collapsed Gibbs sampling. The main idea of collapsed Gibbs sampling is that inference is performed only on *z*, the assignments of word tokens to topics. The remaining latent variables *θ* and *ϕ* are integrated out (“collapsed”). Words are initially assigned randomly to topics and the algorithm then iterates through each word in the corpus and samples a topic assignment given the topic assignments of all other words in the corpus. This process is repeated until a steady state is reached and the topic assignments are then used to estimate the word–topic and topic–document distributions. The vector *α* that contains the hyperparameter values for every topic (and concept for concept–topic models, see below) is updated using a process involving fixed-point update equations (Minka, 2000; Wallach, 2006). See Appendix A of Chemudugunta et al. (2008b) for more details.
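A minimal sketch of collapsed Gibbs sampling for the plain topic model is shown below. The toy corpus, dimensions, and hyperparameter values are illustrative; the update implements the standard collapsed conditional described in Griffiths and Steyvers (2004) rather than their exact code, and the fixed-point updates for *α* are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy corpus: each document is a list of word-type indices.
docs = [[0, 1, 2, 0], [2, 3, 3, 1]]
T, V = 2, 4
alpha = np.full(T, 0.5)    # asymmetric in general; uniform here for brevity
beta = 0.1

# Random initial assignments of word tokens to topics, plus count matrices.
z = [[rng.integers(T) for _ in doc] for doc in docs]
nwt = np.zeros((V, T))     # word-topic counts
ntd = np.zeros((T, len(docs)))   # topic-document counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        nwt[w, z[d][i]] += 1
        ntd[z[d][i], d] += 1

for _ in range(20):                       # Gibbs sweeps over the corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]               # remove this token's assignment
            nwt[w, t_old] -= 1
            ntd[t_old, d] -= 1
            # Collapsed conditional p(z_i = t | z_-i, w), up to a constant:
            p = (nwt[w] + beta) / (nwt.sum(axis=0) + V * beta) * (ntd[:, d] + alpha)
            t_new = rng.choice(T, p=p / p.sum())
            z[d][i] = t_new
            nwt[w, t_new] += 1
            ntd[t_new, d] += 1

# Point estimates of the word-topic and topic-document distributions.
phi = (nwt + beta) / (nwt.sum(axis=0) + V * beta)                 # V x T
theta = (ntd + alpha[:, None]) / (ntd.sum(axis=0) + alpha.sum())  # T x D
```

Note how *θ* and *ϕ* never appear during sampling; they are recovered only at the end from the counts, which is what "collapsed" refers to.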

To summarize, the topic model provides several pieces of information that are useful for understanding documents. The topic–document distributions indicate the important topics in each document. The word–topic distributions indicate which words are important for which topic (e.g., the top row of Fig. 1 shows some example word–topic distributions estimated for the TASA corpus). Finally, the probabilistic assignments *z*_{i} of word tokens to topics are useful for tagging purposes, providing information about the role each word is playing in a specific document context and helping to disambiguate multiple meanings of a word (e.g., Griffiths et al., 2007).

#### 2.2. Concept–topic model

In the concept–topic model, the conditional probability of the *i*th word token *w*_{i} in a document *d* is

- *p*(*w*_{i}|*d*) = ∑_{t=1}^{T+C} *p*(*w*_{i}|*z*_{i} = *t*) *p*(*z*_{i} = *t*|*d*)  (2)

where the indices 1 ≤ *t* ≤ *T* refer to all topics and the indices *T* + 1 ≤ *t* ≤ *T* + *C* refer to all concepts. In this generative process, an index *z*_{i} is sampled from the distribution over topics and concepts for the particular document. If *z*_{i} ≤ *T*, a word token is sampled from topic *z*_{i}, and if *T* + 1 ≤ *z*_{i} ≤ *T* + *C*, a word token is sampled from concept *z*_{i} − *T*, restricted to the word types associated with that concept. The topic model can be viewed as a special case of the concept–topic model when there are no concepts present, that is, when *C* = 0. At the other extreme, where *T* = 0, the model relies entirely on predefined concepts.
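This mixed sampling rule can be sketched as follows (0-based indices in code; the concept–word memberships and all dimensions are hypothetical toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
T, V = 2, 6
concept_words = [[0, 1], [2, 3, 4]]     # hypothetical concept -> member word types
C = len(concept_words)
beta = 0.1

# Topic distributions range over the full vocabulary.
phi = rng.dirichlet([beta] * V, size=T)
# Concept distributions are defined only over each concept's member words.
psi = [rng.dirichlet([beta] * len(ws)) for ws in concept_words]

# Document-specific mixture over the T topics and C concepts together.
theta = rng.dirichlet(np.full(T + C, 0.5))

def sample_token():
    zi = rng.choice(T + C, p=theta)     # index over topics and concepts
    if zi < T:                          # topic route
        return rng.choice(V, p=phi[zi])
    members = concept_words[zi - T]     # concept route: restricted vocabulary
    return members[rng.choice(len(members), p=psi[zi - T])]

tokens = [sample_token() for _ in range(50)]
```

The restricted vocabulary shows up in the shapes: each `psi` entry has only as many dimensions as its concept has member words, which is the sparsity exploited later in the section.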

To specify the complete generative model, let *ϕ*^{(t)} = *p*(*w*|*z* = *t*), where ∑_{w} *ϕ*^{(t)}_{w} = 1 and 1 ≤ *w* ≤ *V*, refer to the multinomial distribution over word types for topic *t* when 1 ≤ *t* ≤ *T*, and let *ψ*^{(c)} = *p*(*w*|*z* = *t*), where ∑_{w} *ψ*^{(c)}_{w} = 1 and 1 ≤ *w* ≤ *V*, refer to the multinomial distribution over word types for concept *c* = *t* − *T* when *T* + 1 ≤ *t* ≤ *T* + *C*. As with the topic model, we place Dirichlet priors on the multinomial variables *θ*, *ϕ*, and *ψ*, with corresponding hyperparameters *α*, *β*, and *β*.

The complete generative process can be described as follows:

- 1
For each topic *t* = 1, …, *T*, select a word distribution *ϕ*^{(t)} ∼ Dirichlet(*β*)
- 2
For each concept *c* = 1, …, *C*, select a word distribution *ψ*^{(c)} ∼ Dirichlet(*β*) over the word types that are members of concept *c*
- 3
For each document *d* = 1, …, *D*:
  (a) Select a distribution over topics and concepts *θ*^{(d)} ∼ Dirichlet(*α*)
  (b) For each word token *i* in document *d*, select an index *z*_{i} ∼ *θ*^{(d)}; if *z*_{i} ≤ *T*, select a word *w*_{i} ∼ *ϕ*^{(z_{i})}, otherwise select a word *w*_{i} ∼ *ψ*^{(z_{i} − T)}

Note that in Step 2, the sampling of words for a concept is constrained to only the words that are members of the human-defined concept. Fig. 2B shows the corresponding graphical model. All the latent variables in the model can be inferred through collapsed Gibbs sampling in a similar manner to the topic model (see Chemudugunta et al., 2008b for details).

We note that even though we are partially relying on humans to define the word–concept memberships, we still apply purely unsupervised algorithms to estimate the latent variables in the model. This is in contrast to a supervised learning approach where the human-defined knowledge is used as a target for prediction. Here, the human-defined knowledge is only used as a constraint on the probability distributions that can be learned for each concept.

We also note that the concept–topic model is not the only way to incorporate semantic concepts. For example, we could use the concept–word associations to build informative priors for the topic model and then allow the inference algorithm to learn word probabilities for all words (for each concept), given the prior and the data. We chose the restricted vocabulary approach to exploit the sparsity in the concept–word associations (topics are distributions over all the words in the vocabulary but concepts are restricted to just their sets of associated words, which are much smaller than the full vocabulary). This sparsity at the word level allows us to easily perform inference with tens of thousands of concepts on large document collections.

A general motivation for the concept–topic approach is that there might be topics present in a corpus that are not represented in the concept set (but that can be learned). Similarly, there may be concepts that are either missing from the text corpus or are rare enough that they are not found in the data-driven topics of the topic model. The marriage of concepts and topics provides a simple way to augment concepts with topics and has the flexibility to mix and match topics and concepts to describe a document.

#### 2.3. Hierarchical concept–topic model

Although the concept–topic model provides a simple way to combine concepts and topics, it does not take into account any hierarchical structure the concepts might have. For example, CALD concepts are arranged in a hierarchy that starts with the concept everything, which splits into 17 concepts at the second level (e.g., science, society, general/abstract, communication). The hierarchy has up to seven levels, with each level specifying more specific concepts.

In this section, we describe a hierarchical concept–topic model that incorporates hierarchical structure of a concept set. Similar to the concept–topic model described in the previous section, there are *T* topics and *C* concepts. However, as opposed to the flat organization of the concepts in the concept–topic model, we now utilize the hierarchical organization of concepts when sampling words from concepts. Before we formally describe the model, we illustrate the basic idea in Fig. 3. Each topic and concept is associated with a “bag of words” that represents a multinomial distribution over word types. In the generative process, word tokens can be generated from the concept part of the model by sampling a path from the root of the concept tree to some distribution over word types associated with the concept (left box in Fig. 3). Alternatively, word tokens can be generated through the topic part of the model (right box). The dashed and dotted lines show examples of two word tokens sampled through the hierarchical concept part of the model and the topic part of the model, respectively. For the first word token, the option “topic” is sampled at the root node, Topic 1 is then sampled, and then a word token is sampled from the multinomial over words associated with Topic 1. For the second word token, the option “concept” is sampled at the root node, then the option science is sampled as a child of the concept everything, the word distribution for science is then selected, and a word from this distribution is sampled. Each transition in the hierarchical part of the model has an associated probability and the transition probabilities are document dependent—some paths are more likely in the context of some documents. For example, in physics and chemistry documents, one might expect all transitions toward the science concept to be elevated but differentiated between the transitions toward the physics and chemistry concepts.

To preview what information is learned by the model, we need to distinguish between variables learned at the word, document, and corpus levels. At the word level, the model learns the assignments of topics or concepts to word tokens. These assignments can be directly used for tagging purposes and word–sense disambiguation. At the document level, the model learns both topic probabilities and concept–transition probabilities in the concept tree. The latter information is useful because it allows a hierarchical representation of document content. At the document level, the model also learns the switch probability that a word is generated through the topic or concept route. The adaptive nature of the switch probability allows the model to flexibly adapt to different documents. Documents containing material with poor concept coverage will have a high probability of switching to the topic route. At the corpus level, the model learns the probabilities of the word–topic and word–concept distributions. The word–topic distributions are useful to learn which semantic themes beyond those covered in the concepts are needed to explain the content of the whole document collection. The word–concept distributions are useful to learn which words are important for each concept. Finally, at the corpus level, the model also learns the hyperparameters for each transition in the concept tree. The learned hyperparameters allow the model to make certain paths more prominent across all documents. For example, if a document collection includes many documents on science, the path toward the science concept could be made more likely (a priori).

Our approach is related to the hierarchical pachinko allocation model 2 (HPAM 2) as described by Mimno, Li, and McCallum (2007). In the HPAM 2 model, topics are arranged in a three-level hierarchy with root, super-topics, and subtopics at Levels 1, 2, and 3, respectively, and words are generated by traversing the topic hierarchy and exiting at a specific level and node. In our model, we use a similar mechanism for word generation via the concept route. There is additional machinery in our model to incorporate the data-driven topics (in addition to the hierarchy of concepts) and a switching mechanism to choose the word generation process via the concept route or the topic route.

To give a formal description of the model, for each document *d*, we introduce a “switch” distribution *p*(*x*|*d*) that determines if a word should be generated via the topic route or the concept route. Every word token *w*_{i} in the corpus is associated with a binary switch variable *x*_{i}. If *x*_{i} = 0, the previously described standard topic mechanism is used to generate the word. That is, we first select a topic *t* from a document-specific mixture of topics *θ*^{(d)} and generate a word token from the word distribution associated with topic *t*. If *x*_{i} = 1, we generate the word token from one of the *C* concepts in the concept tree. To do that, we associate with each concept node *c* in the concept tree a document-specific multinomial distribution with dimensionality equal to *N*_{c} + 1, where *N*_{c} is the number of children of the concept node *c*. This distribution allows us to traverse the concept tree and exit at any of the *C* nodes in the tree—given that we are at a concept node *c*, there are *N*_{c} child concepts to choose from and an additional option to choose an “exit” child to exit the concept tree at concept node *c*. We start our walk through the concept tree at the root node and select a child node from one of its children. We repeat this process until we reach an exit node, and a word token is then generated from the parent of the exit node. Note that for a concept tree with *C* nodes, there are exactly *C* distinct ways to select a path and exit the tree, as there is only one parent for each concept node, and thus, one path to each of the *C* concepts.
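The tree walk with an extra "exit" child can be sketched as follows; the tree structure and the document-specific child distributions are toy stand-ins for the quantities described above:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical concept tree as parent -> children lists; node 0 is the root.
children = {0: [1, 2], 1: [3], 2: [], 3: []}

# For each concept node c, a document-specific multinomial over its N_c
# children plus one extra "exit" option (the last index).
zeta = {c: rng.dirichlet(np.ones(len(kids) + 1))
        for c, kids in children.items()}

def sample_exit_concept():
    """Walk from the root until the exit child is chosen; return that node."""
    c = 0
    while True:
        choice = rng.choice(len(children[c]) + 1, p=zeta[c])
        if choice == len(children[c]):   # the extra index is the exit child
            return c                     # the word is generated from concept c
        c = children[c][choice]

exits = [sample_exit_concept() for _ in range(100)]
```

Because a leaf node has no children, its only option is the exit child, so the walk always terminates; this is the property that fails if the concept structure contains cycles.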

In the hierarchical concept–topic model, a document is represented as a weighted combination of mixtures of *T* topics and *C* paths through the concept tree and the conditional probability of the *i*th word token in document *d* is given by

- *p*(*w*_{i}|*d*) = *p*(*x*_{i} = 0|*d*) ∑_{t=1}^{T} *p*(*w*_{i}|*z*_{i} = *t*) *p*(*z*_{i} = *t*|*d*) + *p*(*x*_{i} = 1|*d*) ∑_{c=1}^{C} *p*(*w*_{i}|*c*) *p*(*c*|*d*)  (3)

where *p*(*c*|*d*) is the probability of traversing the concept tree from the root and exiting at concept *c* in document *d*.

The sequential process to generate a document collection with *D* documents under the hierarchical concept–topic model is as follows:

- 1
For each topic *t* = 1, …, *T*, select a word distribution *ϕ*^{(t)} ∼ Dirichlet(*β*)
- 2
For each concept *c* = 1, …, *C*, select a word distribution *ψ*^{(c)} ∼ Dirichlet(*β*) over the word types that are members of concept *c*
- 3
For each document *d* = 1, …, *D*:
  (a) Select a switch distribution *ξ*^{(d)} ∼ Dirichlet(*γ*)
  (b) Select a distribution over topics *θ*^{(d)} ∼ Dirichlet(*α*)
  (c) For each concept node *c*, select a distribution over its children *ζ*^{(cd)} ∼ Dirichlet(*τ*^{(c)})
  (d) For each word token *i* in document *d*, select a switch value *x*_{i} ∼ *ξ*^{(d)}; if *x*_{i} = 0, select a topic *z*_{i} ∼ *θ*^{(d)} and a word *w*_{i} ∼ *ϕ*^{(z_{i})}; if *x*_{i} = 1, walk the concept tree using *ζ*^{(cd)} until an exit child is selected at some concept *c*, then select a word *w*_{i} ∼ *ψ*^{(c)}

where *ϕ*^{(t)}, *ψ*^{(c)}, *β*, and *β* are analogous to the corresponding symbols in the concept–topic model described in the previous section. The variable *ξ*^{(d)}, where *ξ*^{(d)} = *p*(*x*|*d*), represents the switch distribution, and *θ*^{(d)}, where *θ*^{(d)} = *p*(*t*|*d*), represents the distribution over topics for document *d*. The variable *ζ*^{(cd)} represents the multinomial distribution over children of concept node *c* for document *d* (this has dimensionality *N*_{c} + 1 to account for the additional “exit” child). The hyperparameters *γ*, *α*, and *τ*^{(c)} are the parameters of the priors on *ξ*^{(d)}, *θ*^{(d)}, and *ζ*^{(cd)}, respectively. Note that *α*, as in the previous topic and concept–topic models, is a vector with hyperparameter values for each topic. Similarly, *τ*^{(c)} is a vector of hyperparameter values that allows for different a priori probabilities of traversing the concept tree. This allows the model to tune itself to different corpora, for example, making it more likely to sample a path toward the science concept in a corpus of scientific documents. Fig. 2C shows the corresponding graphical model. The generative process above is quite flexible and can handle any directed-acyclic concept graph (for any nontree, there would be more than one way of reaching each concept, leading to increased complexity in the inference process). The model cannot, however, handle cycles in the concept structure, as the walk of the concept graph starting at the root node is not guaranteed to terminate at an exit node.
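Putting the pieces together, the generation of a single document's tokens can be sketched by first sampling the switch and then following either route; the tree, concept memberships, and all numeric values below are illustrative toys, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(5)
T, V = 2, 6
children = {0: [1, 2], 1: [], 2: []}               # toy concept tree, root = 0
concept_words = {0: [0, 1], 1: [2, 3], 2: [4, 5]}  # hypothetical memberships

phi = rng.dirichlet([0.1] * V, size=T)             # topic word distributions
psi = {c: rng.dirichlet([0.1] * len(ws)) for c, ws in concept_words.items()}

# Document-specific variables: switch, topic mixture, per-node child choices.
xi = rng.dirichlet([1.0, 1.0])                     # (p(x=0|d), p(x=1|d))
theta = rng.dirichlet(np.full(T, 0.5))
zeta = {c: rng.dirichlet(np.ones(len(kids) + 1)) for c, kids in children.items()}

def generate_token():
    if rng.random() < xi[0]:                       # x = 0: topic route
        z = rng.choice(T, p=theta)
        return rng.choice(V, p=phi[z])
    c = 0                                          # x = 1: walk the concept tree
    while True:
        choice = rng.choice(len(children[c]) + 1, p=zeta[c])
        if choice == len(children[c]):             # exit child: emit from concept c
            ws = concept_words[c]
            return ws[rng.choice(len(ws), p=psi[c])]
        c = children[c][choice]

doc = [generate_token() for _ in range(40)]
```

In the actual model these document-specific quantities are latent and inferred by collapsed Gibbs sampling rather than drawn once as here; the sketch only shows the forward generative direction.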

In the hierarchical concept–topic model, the only observed information is the set of words in each document, the word–concept memberships, and the tree structure of the concepts. All remaining variables are latent and are inferred through a collapsed Gibbs sampling procedure. Details about this procedure are described by Chemudugunta et al. (2008b).