Introduction to neural network-based question answering over knowledge graphs

Question answering has emerged as an intuitive way of querying structured data sources and has seen significant advances over the years. A large body of recent work on question answering over knowledge graphs (KGQA) employs neural network-based systems. In this article, we provide an overview of these neural network-based methods for KGQA. We introduce readers to the formalism and the challenges of the task and to the different paradigms and approaches, discuss notable advancements, and outline the emerging trends in the field. Through this article, we aim to provide newcomers to the field with a suitable entry point to semantic parsing for KGQA, and to ease their process of making informed decisions while creating their own QA systems.


| INTRODUCTION
Advancements in semantic web technologies and automated information processing systems have enabled the creation of a large amount of structured information. Often, these structures follow well-defined formal syntax and semantics, enabling machine readability and, in some cases, interoperability across different sources and interfaces. A prime example of such a structure is a special kind of graph-based data model referred to as a knowledge graph (KG). Due to their expressive nature, and the corresponding ability to store and retrieve very nuanced information, accessing the knowledge stored in KGs is facilitated through the use of formal query languages with a well-defined syntax, such as SPARQL (W3C, n.d.), GraphQL, 1 and so on. However, the use of formal queries to access these knowledge graphs poses difficulties for nonexpert users, as it requires them to understand the syntax of the formal query language as well as the underlying structure of entities and their relationships. As outlined by Hirschman and Gaizauskas (2001), a KG question answering (KGQA) system therefore aims to provide users with an interface to ask questions in natural language, using their own terminology, to which they receive a concise answer generated by querying the KG. Such systems have been integrated in popular web search engines like Google Search 2 and Bing 3 as well as in conversational assistants including Google Assistant, 4 Siri, 5 and Alexa. Their applicability in domain-specific scenarios has also been demonstrated, for example, in Facebook Graph Search 6 and IBM Watson. 7
This article provides an introduction to neural network-based methods for KGQA. In Section 2, we provide the necessary background related to KGQA, introducing the terminology used in the community, and the major tasks solved by KGQA systems. In Section 3, we provide a brief overview of data sets over which various KGQA systems are trained and benchmarked. Section 4 introduces the most popular methods used for neural network-based semantic parsing for KGQA, divided into classification, ranking, and translation-based methods. We provide in-depth discussions of different methods and discuss how various KGQA systems can use them, as well as the challenges associated with different methods. We conclude our discussion by listing emerging trends and potential future directions of research in this area in Section 5.
As mentioned above, there exists a wide variety of approaches for the KGQA task. This work does not exhaustively cover every approach proposed so far, but rather aims to provide an introductory overview of popular methods and frameworks for neural network-based KGQA by grouping an exemplary set of similar approaches together. Since KGQA is an application area of semantic parsing, we also discuss some methods that were developed for other application areas or for neural network-based semantic parsing in general, but which we consider relevant for KGQA. This, for example, includes some works on text-to-SQL semantic parsing.
For an exhaustive survey of traditional approaches for KGQA we refer interested readers to Höffner et al. (2017) and to Zhu, Ma, and Li (2019) for a review article covering some neural as well as some earlier approaches for semantic parsing in general.
We assume that readers are familiar with basic deep learning methods and terminology, and refer them to Goodfellow, Bengio, and Courville (2016) and the survey article of Goldberg (2016) for good overviews of deep learning and of applications of neural network models in natural language processing, respectively.

| BACKGROUND
This section provides a brief introduction to the concepts necessary for an in-depth understanding of the field of KGQA, including a formal definition of KGs, a description of formal query languages, and a definition of the KGQA task and associated subtasks.

| Knowledge graphs
A KG is a formal representation of facts pertaining to a particular domain (including the general domain). It consists of entities denoting the subjects of interest in the domain, and relations 8 denoting the interactions between these entities. For instance, ex:Chicago and ex:Michael_Crichton can be entities corresponding to the real-world city of Chicago and the famous author Michael Crichton, respectively. Furthermore, ex:birthPlaceOf can be a relation between the two denoting that Michael Crichton was born in Chicago. Instead of linking two entities, a relation can also link an entity to a data value of the entity (such as a date, a number, a string, etc.), which is referred to as a literal. The entity ex:Chicago, for example, might be connected via a relation ex:area to a numerical literal with the value 606 km² describing the size of the city. Figure 1 depicts a snippet of an example KG representing knowledge about the writer Michael Crichton, where rounded boxes denote entities, squared boxes denote literals, and labeled edges denote relations.
Formally, let ℰ = {e_1, …, e_{n_e}} be the set of entities, ℒ be the set of all literal values, and 𝒫 = {p_1, …, p_{n_p}} be the set of relations connecting two entities, or an entity with a literal. We define a triple t ∈ ℰ × 𝒫 × (ℰ ∪ ℒ) as a statement comprising a subject entity, a relation, and an object entity or literal, describing a fact. For instance, one of the facts in the KG depicted in Figure 1 can be written as ⟨ex:Chicago, ex:birthPlaceOf, ex:Michael_Crichton⟩. Then, a KG K is a subset of all possible triples ℰ × 𝒫 × (ℰ ∪ ℒ) representing facts that are assumed to hold. Remark: a common relation rdfs:label is used to link any resource (entity or relation) with a literal signifying a human-readable name of the resource, for instance, ⟨ex:Michael_Crichton, rdfs:label, "Michael Crichton"@en⟩. We refer to these literals as the surface form of the corresponding resource in the rest of this article.
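To make the formalism concrete, the KG snippet of Figure 1 can be sketched in Python as a set of (subject, relation, object) triples. This is only a minimal illustration, not a full triple store; the entity and relation names are shorthand for the ex: resources used above.

```python
# A toy KG as a set of (subject, relation, object) triples,
# mirroring the running example from Figure 1.
KG = {
    ("ex:Chicago", "ex:birthPlaceOf", "ex:Michael_Crichton"),
    ("ex:Michael_Crichton", "ex:writerOf", "ex:Westworld"),
    ("ex:Chicago", "ex:area", 606),  # a numerical literal (km^2)
    ("ex:Michael_Crichton", "rdfs:label", "Michael Crichton"),  # surface form
}

def objects(kg, subject, relation):
    """Return all objects o such that (subject, relation, o) is a fact."""
    return {o for (s, r, o) in kg if s == subject and r == relation}

def subjects(kg, relation, obj):
    """Return all subjects s such that (s, relation, obj) is a fact."""
    return {s for (s, r, o) in kg if r == relation and o == obj}
```

For example, `subjects(KG, "ex:birthPlaceOf", "ex:Michael_Crichton")` returns `{"ex:Chicago"}`, that is, the fact that Michael Crichton was born in Chicago.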

| Formal query languages
A primary mechanism to retrieve and manipulate data stored in a KG is provided by formal query languages like SPARQL, λ-DCS (P. Liang, 2013), or FunQL (Kate & Mooney, 2006). These languages have a well-defined formal grammar and structure and allow for complex fact retrieval involving logical operands (like disjunction, conjunction, and negation), aggregation functions (like grouping or counting), filtering based on conditions, and other ranking mechanisms. SPARQL, a recursive acronym for "SPARQL Protocol and RDF Query Language", is one of the most commonly used query languages for KGs and is supported by many publicly available KGs like DBPEDIA (Lehmann et al., 2015), WIKIDATA (Vrandecic & Krötzsch, 2014), and FREEBASE (Bollacker, Evans, Paritosh, Sturge, & Taylor, 2008). For an in-depth understanding of the syntax and semantics of SPARQL, we refer interested readers to the W3C Technical Report. 9 The central part of a SPARQL query is a graph pattern, composed of resources from the KG (i.e., entities and relations) and variables to which multiple KG resources can be mapped. As a running example that we will use throughout the rest of the paper, we consider the KG presented in Figure 1 to be the source KG over which we intend to answer the natural language question (NLQ) "What is the birthplace of Westworld's writer?" The corresponding SPARQL query is illustrated in Figure 2.
Another popular formal query representation language is lambda calculus (Barendregt, 1984), a formal language for defining computable functions originally devised by Church (1936) and used for his study of the Entscheidungsproblem (Ackermann & Hilbert, 1928). The example NLQ "What is the birthplace of Westworld's writer?" can be expressed as follows in lambda calculus: λx.∃y. birthplaceOf(x, y) ∧ writerOf(y, Westworld). SPARQL and lambda calculus are both rather verbose and may result in longer and more syntactically complex expressions than necessary for their use in question answering. λ-DCS (P. Liang, 2013) and FunQL (Kate & Mooney, 2006) both provide a more concise query representation than SPARQL or lambda calculus by avoiding variables and making the quantifiers from lambda calculus implicit. In λ-DCS, our example NLQ can simply be expressed as birthplaceOf.writerOf.Westworld. We refer interested readers to Kate and Mooney (2006) and P. Liang (2013) for a more in-depth explanation of FunQL and λ-DCS, respectively.

FIGURE 1 An example knowledge graph (KG), which will serve as a running example throughout this article.

FIGURE 2 An example of (a) a (factual) NLQ and (b) a SPARQL query corresponding to the NLQ, which when executed over the KG presented in Figure 1 would return ex:Chicago as the result. KG, knowledge graph; NLQ, natural language query.
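The λ-DCS expression for the running example composes two relation applications, evaluated inside-out. A minimal sketch of this compositional semantics over a toy triple set (the `join` helper and resource names are illustrative, not the query engine of any particular system):

```python
# Toy KG: (subject, relation, object) triples from the running example.
KG = {
    ("ex:Chicago", "ex:birthPlaceOf", "ex:Michael_Crichton"),
    ("ex:Michael_Crichton", "ex:writerOf", "ex:Westworld"),
}

def join(relation, objects_set):
    """FunQL/λ-DCS-style join: all subjects connected via `relation`
    to some element of `objects_set`."""
    return {s for (s, r, o) in KG if r == relation and o in objects_set}

# birthplaceOf.writerOf.Westworld, evaluated inside-out:
writers = join("ex:writerOf", {"ex:Westworld"})   # the writers of Westworld
answer = join("ex:birthPlaceOf", writers)         # their birthplaces
```

Executing the composed query yields the set containing ex:Chicago, matching the result of the SPARQL query in Figure 2.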

| Question answering over knowledge graphs
Using the concepts introduced above, we now formally define the task of KGQA. In simple terms, KGQA can be defined as the task of retrieving answers to natural language questions (NLQs) or commands from a KG. While the expected answers could take on complex forms, such as ordered lists (e.g., for the command "Give me Roger Deakins movies ranked by date.") or even a table (e.g., for the question "When was which Nolan film released?"), the vast majority of works on KGQA assume the answer to be of a simpler form: a set of entities and/or literals, or booleans. We can define the latter more formally as follows. Let K be a KG and let q be an NLQ; then we define the set of all possible answers A as the union of (a) the power set 10 𝒫(ℰ ∪ ℒ) of entities ℰ and literals ℒ in K, (b) the set of the numerical results of all possible aggregation functions f : 𝒫(ℰ ∪ ℒ) → ℝ (such as SUM or COUNT), and (c) the set {True, False} of possible boolean outcomes, which is needed for yes/no questions. The task of KGQA then is defined as returning the correct answer a ∈ A for a given question q.
The most common way in which KGQA approaches accomplish this is by creating a formal query representing the semantic structure of the question q, which, when executed over K, returns a. Thus, KGQA can be seen as an application area of semantic parsing. In general, semantic parsing is the task of translating a natural language utterance q into a formal, executable meaning representation f ∈ ℱ of q. We refer interested readers to Kamath and Das (2019) for a recent survey on semantic parsing. While semantic parsing in general is not restricted to just question answering (QA) purposes, QA is one of its most prominent application areas. Since KGQA is the focus of this article, in what follows, we provide a KGQA-centric definition of semantic parsing. ℱ is the set of all formal queries (see Section 2.2) that can be generated by combining entities and relations from K as well as arithmetic/logical aggregation functions available in the formal query language. The correct logical form f must satisfy the following conditions: (a) the execution of f on K (which we denote by f(K)) yields the correct answer a, as implied by the NLQ q, and (b) f accurately captures the meaning of the question q. The second condition is important because multiple logical forms could yield the expected result a, but not all of them mean the same thing as the question q. We call logical forms that satisfy the first constraint but not the second spurious logical forms. As an example of a spurious logical form for KGQA, consider the following NLQ: "What novels did Michael Crichton write before 1991?". The queries and(isA(Novel), writtenBy(Michael_Crichton), writtenBefore(1991)) and and(isA(Novel), writtenBy(Michael_Crichton)) both execute to the same set of books because Michael Crichton did not write books after 1991. However, the second query is incorrect since it does not convey the full meaning of the question.
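The spuriousness problem can be illustrated directly in code. Over a toy collection of facts in which no novel postdates 1991, a query with and a query without the date constraint denote the same set; all names and data below are fabricated for illustration.

```python
# Toy facts: (title, author, year) rows for novels.
NOVELS = [
    ("Jurassic Park", "Michael_Crichton", 1990),
    ("Sphere", "Michael_Crichton", 1987),
]

def written_by(author):
    return {t for (t, a, y) in NOVELS if a == author}

def written_before(year):
    return {t for (t, a, y) in NOVELS if y < year}

correct = written_by("Michael_Crichton") & written_before(1991)
spurious = written_by("Michael_Crichton")  # drops the date constraint

# Both denote the same answer set over this KG, so answer-only (weak)
# supervision alone cannot distinguish the correct query from the spurious one.
assert correct == spurious
```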
A note on terminology: The natural language question (which we abbreviate to NLQ) is also referred to as question or utterance. The meaning representation is also referred to as logical forms or formal queries. The execution results or answers for a formal query are also referred to as denotations. In the QA community, KGQA is often also called knowledge base question answering (KBQA).
Questions answered by KGQA systems are often divided into two categories: simple and complex. Questions that can be answered by considering a single fact, without any additional constraints, are generally defined as simple questions. For instance, the question "What is the birthplace of Michael Crichton?" is a simple question, since it can be answered by considering only the following fact: ⟨ex:Chicago, ex:birthPlaceOf, ex:Michael_Crichton⟩. In contrast, the question "What is the birthplace of Westworld's writer?" is considered complex, since answering it requires knowing both ⟨ex:Michael_Crichton, ex:writerOf, ex:Westworld⟩ and ⟨ex:Chicago, ex:birthPlaceOf, ex:Michael_Crichton⟩.

| KGQA subtasks
Regardless of the specific semantic parsing approach taken, the construction of logical forms for KGQA requires taking several decisions regarding their structure and content. Here, we briefly highlight those different tasks the semantic parser will have to perform.
1. Entity Linking: Entity linking in the context of KGQA is the task of deciding which KG entity is referred to by (a certain phrase in) the NLQ q. In the case of our example, the entity linker must identify ex:Westworld as the entity being referred to by "Westworld" in the NLQ. The large number (often millions) of entities forms a core challenge in entity linking with large-scale KGs. A particular phrase (e.g., "Westworld"), if considered without context, can refer to several different entities (Westworld: the movie, the series, the band, or the album). This phenomenon is referred to as polysemy. It is essential to take the context of the phrase into account to correctly disambiguate it to the correct entity. The large number of entities also makes it practically infeasible to create fully annotated training examples to learn lexical mappings for all entities. As a consequence, statistical entity linking systems need to generalize well to unseen entities. Most modern KGQA systems externalize the task of entity linking by employing a standalone entity linking system, like DBpedia Spotlight (Mendes, Jakob, García-Silva, & Bizer, 2011) for DBpedia-based KGQA systems (Dubey, Dasgupta, Sharma, Höffner, & Lehmann, 2016; Radoev, Zouaq, Tremblay, & Gagnon, 2018), or S-Mart (Y. Yang & Chang, 2016) for Freebase.
2. Identifying Relations: It is essential to determine which relation must be used in a certain part of the logical form. Similarly to entity linking, we need to learn to map natural language expressions to the KG, in this case to its relations. However, unlike in entity linking, where entities are expressed by entity-specific noun phrases, relations are typically expressed by noun and verb phrase patterns that use less specific words. For instance, in our example NLQ, the relations ex:writerOf and ex:birthPlaceOf are explicitly referred to by the phrases "X's writer" and "the birthplace of X", respectively. However, depending on the KG schema, relations may also need to be inferred, as in "Which Chicagoan wrote the Andromeda Strain?", where "Chicagoan" specifies an additional constraint ⟨ex:Chicago, ex:birthPlaceOf, ?person⟩ that is never explicitly mentioned in the question.
3. Identifying Logical/Numerical Operators: Sometimes questions contain additional operators on intermediate variables/sets. For example, "How many writers worked on Westworld?" implies a COUNT operation on the set of Westworld writers. Other operators include ARGMAX/ARGMIN, comparative filters (e.g., "older than 40"), set inclusion (e.g., "Did Michael Crichton write Westworld?"), and so on. Like entities and relations, identifying such operators is primarily a lexicon learning problem. However, compared to entities and relations, there is a small, fixed set of operators, which depends on the chosen formal language and not on the KG.
4. Determining Logical Form Structure: In order to arrive at a concrete logical form, a series of decisions must be made regarding its structure, that is, how to arrange the operators, relations, and entities such that the resulting object executes to the intended answer. For example, if FunQL is used as the target formal language, a decision must be taken to supply ex:Westworld as the argument to the relation function ex:writtenBy, and to pass this subtree as the argument to ex:bornIn, yielding the final query ex:bornIn(ex:writtenBy(ex:Westworld)).
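How these subtasks fit together can be sketched as a (heavily simplified) pipeline. Each stub below stands in for a learned component; the keyword rules and function names are purely illustrative assumptions, not the method of any cited system.

```python
def entity_linking(question):
    """Stub for subtask 1: map a phrase in the question to a KG entity."""
    return "ex:Westworld" if "Westworld" in question else None

def relation_identification(question):
    """Stub for subtask 2: map question phrases to KG relations
    (here via toy keyword rules; real systems learn this mapping)."""
    relations = []
    if "writer" in question:
        relations.append("ex:writerOf")
    if "birthplace" in question:
        relations.append("ex:birthPlaceOf")
    return relations

def build_logical_form(entity, relations):
    """Stub for subtask 4: nest the relations FunQL-style around the entity,
    innermost relation first."""
    form = entity
    for rel in relations:
        form = f"{rel}({form})"
    return form

question = "What is the birthplace of Westworld's writer?"
entity = entity_linking(question)
form = build_logical_form(entity, ["ex:writerOf", "ex:birthPlaceOf"])
# form == "ex:birthPlaceOf(ex:writerOf(ex:Westworld))"
```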
Note that such decisions are heavily interdependent with the previous tasks and as mentioned at the beginning of the section, the different types of decisions are not necessarily taken separately. In fact, often structural decisions are merged with or implied by lexical decisions, for example, in FunQL, generating an ARGMAX token at a certain time step implies that the next generated token will be the root of its (ARGMAX's) first argument. On the other hand, the REDUCE operation in transition-based Stack-LSTM semantic parsers (Cheng et al., 2019;Cheng, Reddy, Saraswat, & Lapata, 2017) is a purely structure-manipulating token that indicates the finalization of a subtree in FunQL. Question answering often solves a number of these subtasks in a single process. For instance, translation-based systems (see Section 4.3) in principle could generate the whole query in a single sequence decoding process, thus solving all subtasks at once. However, in practice, specialized modules can be employed, for example, for entity linking, in order to constrain the search space of the main model.

| DATA SETS
Research in the field of KGQA has seen a shift from manual feature engineering-based solutions (Berant et al., 2013; Höffner et al., 2017; Reddy et al., 2014; Unger et al., 2015) to neural network-based, data-driven approaches. One of the pivotal requirements for these data-driven approaches is the availability of large data sets comprising a wide variety of NLQ-label pairs. As a consequence, one can observe a shift in the scale, and more recently, also in the complexity of questions in KGQA data sets. The properties of the most popular KGQA data sets are summarized in Table 1.
One of the first attempts to create a large-scale KGQA data set was by Cai and Yates (2013), who developed the FREE917 data set, consisting of 917 question/formal query pairs covering over 600 Freebase relations. Along the same lines, Berant et al. (2013) developed another data set called WEBQUESTIONS by finding questions using the Google Suggest API and answering them with the help of Amazon Mechanical Turk workers using Freebase. Although this data set is much larger than FREE917, it suffers from two disadvantages. First, it does not provide formal queries as targets for NLQs but only question-answer pairs, inhibiting supervised training of KGQA approaches that rely on a logical form. Second, it is primarily composed of simple questions, with relatively few requiring complex reasoning. In order to alleviate the first issue, Yih et al. (2016) introduced WEBQUESTIONSSP, a subset of WEBQUESTIONS having both answers and formal queries corresponding to each question. In order to provide more structural variance and expressiveness in questions, Bao et al. (2016) and Su et al. (2016) released COMPLEXQUESTIONS and GRAPHQUESTIONS, respectively, consisting of pairs of questions and their formal queries, where they augment a subset of WEBQUESTIONSSP with type constraints, implicit and explicit temporal constraints, aggregation operations, and so on. Trivedi et al. (2017) released LC-QUAD, a data set of complex questions for the DBPEDIA KG. They started by generating formal queries for DBpedia and semi-automatically verbalizing them using question templates, and then leveraged crowd-sourcing to convert these template-based questions to NLQs, performing paraphrasing and grammar correction in the process. Dubey et al. (2019) used a similar mechanism to create a significantly larger and more varied data set, LC-QUAD 2. QALD 11 (Usbeck, Gusmita, Ngomo, & Saleem, 2018) is another important KGQA data set based on DBpedia.
Although it is substantially smaller than LC-QUAD, it has more complex and colloquial questions, as they have been created directly by domain experts. SQA and CSQA are data sets for sequential question answering on KGs, where each data sample consists of a sequence of QA pairs with a shared context. While the individual questions in a sequence are generally short, the data sets provide additional complexity by incorporating contextual dependencies between questions.
TABLE 1 The KGQA data sets and their characteristics, namely: (a) the name of the data set, (b) the underlying knowledge graph on which the data set is based, (c) the size of the data set in terms of number of questions, (d) availability of formal queries (FQ), (e) presence of complex questions (CQ), that is, simple or complex, and (f) examples of KGQA systems subsequently discussed which work on the given data set.

Most data sets discussed so far, due to their relatively small size, do not provide a thorough coverage of the entities and relations that exist in a KG. In order to partially compensate for this and support a larger variety of relations, Bordes et al. (2015) created the SIMPLEQUESTIONS data set with more than 100k questions. These questions can be answered by a single triple in the KG, and thus the corresponding formal queries can be thought of as simply consisting of the subject entity and predicate of the triple.

| NEURAL NETWORK-BASED KGQA SYSTEMS
As explained in Section 2.3, the problem of KGQA is usually cast as a semantic parsing problem. A semantic parser is an algorithm that, given a context (i.e., a KG K and a formal target language with expressions ℱ), maps a given NLQ q to a logical form f ∈ ℱ. In the context of KGQA, this form needs to (a) return the correct answer a when executed over the KG K and (b) accurately capture the meaning of q. Neural network-based semantic parsing algorithms use prediction models, with trainable parameters, that enable the parser to be fitted on a given data set. The models are trained using a suitable training procedure, which depends on the model, the parsing algorithm, and the data provided for training the parser.
In this article, we divide the prediction models commonly used in semantic parsing into three major categories, namely: (a) classification, (b) ranking, and (c) translation. In the following subsections (Sections 4.1-4.3, respectively), we will explain each of these categories and provide an overview of previously proposed KGQA systems which employ models from these categories. We also discuss these models and how they are used and trained in the defined context. Table 2 lists the approaches discussed in this article separated based on the three major categories.
In all the training procedures discussed below, the models are optimized with stochastic gradient descent (SGD) or one of its variants. Depending on the type of model, different loss functions can be used. Moreover, depending on the semantic parsing algorithm, different ways of processing and using the provided data in the optimization algorithm are required in order to train the models. These choices define the different training procedures, which are discussed in the subsequent sections. Another important aspect is the type of training data which is provided, leading to two different training settings: the fully supervised setting, where the data set consists of N pairs of NLQs and formal queries {(q(i), f(i))}, and the weakly supervised setting, where the semantic parser is trained over a data set of N pairs of NLQs and corresponding execution results {(q(i), a(i))}. Whereas the prediction models themselves are generally not affected by this difference, it does lead to different training procedures.
While the fully supervised setting allows training the models by simply maximizing the likelihood of predicting the correct logical form, the weakly supervised setting presents a more challenging scenario, where we must indirectly infer and "encourage" the (latent) logical forms that execute to the correct answer while avoiding spurious logical forms, which might hurt generalization. This gives rise to two fundamental challenges. Firstly, finding correct logical forms (those that execute to the ground-truth answer) is challenging: a KG may consist of millions of entities and thousands of predicates, any combination of which can yield consistent logical forms with respect to the underlying KG, and the search space of these logical forms grows with the number of KG artifacts as well as with the number of tokens (grounded KG artifacts) in the logical form. Secondly, the parser must deal with spurious candidates, that is, incorrect logical forms (those that do not capture the meaning of the source question) that coincidentally execute to the correct answer, thereby misleading the supervision signal provided to the semantic parser (Cheng & Lapata, 2018).
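Under weak supervision, a common first step is to search for candidate logical forms that execute to the gold answer. A brute-force sketch of such a search over the toy running-example KG (illustrative names, no pruning; real systems must prune this exponentially growing space):

```python
from itertools import product

# Toy KG: (subject, relation, object) triples from the running example.
KG = {
    ("ex:Chicago", "ex:birthPlaceOf", "ex:Michael_Crichton"),
    ("ex:Michael_Crichton", "ex:writerOf", "ex:Westworld"),
}
RELATIONS = ["ex:birthPlaceOf", "ex:writerOf"]
ENTITIES = ["ex:Chicago", "ex:Michael_Crichton", "ex:Westworld"]

def execute(path, entity):
    """Execute a chain of relations, starting from an entity, inside-out."""
    current = {entity}
    for rel in path:
        current = {s for (s, r, o) in KG if r == rel and o in current}
    return current

def search(gold_answer, max_len=2):
    """Enumerate all (relation chain, start entity) logical forms up to
    max_len relations and keep those that execute to the gold answer."""
    consistent = []
    for length in range(1, max_len + 1):
        for path, ent in product(product(RELATIONS, repeat=length), ENTITIES):
            if execute(path, ent) == gold_answer:
                consistent.append((path, ent))
    return consistent

candidates = search({"ex:Chicago"})
# The correct form chains writerOf then birthPlaceOf from ex:Westworld, but
# the search also returns e.g. birthPlaceOf(ex:Michael_Crichton): consistent
# with the answer, yet spurious for the question about Westworld's writer.
```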

| Classification methods for KGQA
In the most general case, the semantic parser should be able to generate a structured output (i.e., the corresponding formal query) of arbitrary size and complexity given the input NLQ. In some cases, however, we can assume a fixed structure for the formal query. This holds in the case of single fact-based questions (e.g., SIMPLEQUESTIONS), where only a single subject entity and a relation need to be predicted. For instance, the question "Where was Michael Crichton born?" is represented in our example KG (Figure 1) by a SPARQL query with the single triple pattern ⟨?x, ex:birthPlaceOf, ex:Michael_Crichton⟩. The logical form corresponding to this type of question thus always consists of one subject entity and a relation, and the answer is given by the missing object entity. For such fixed-structure prediction tasks, we can make use of simple text classification methods to predict the different parts of the target formal query given the NLQ as input (see also Hakimov, Jebbara, & Cimiano, 2019, for a recent overview of deep learning architectures for simple questions). 12

| Classification models
To illustrate such classification methods, let us consider the task of relation classification. Given an NLQ q and a KG K, the relation classification task consists of predicting which of the n_r relations r_1, …, r_{n_r} ∈ 𝒫 of K is referred to in q. In the first step, an encoder network can be used to map the variable-length input sequence of q onto a fixed-length vector q ∈ ℝ^d, which is called the latent representation, or the encoded representation, of q. The encoder network can be built from different neural architectures, including recurrent neural networks (RNNs) (Hochreiter & Schmidhuber, 1997), convolutional neural networks (CNNs) (LeCun & Bengio, 1998), or transformers (Vaswani et al., 2017). The encoded question is subsequently fed through an affine transformation to calculate a score vector s(q) = (s_1(q), …, s_{n_r}(q)), that is,

s(q) = W_o q + b_o.

Here, W_o ∈ ℝ^{n_r × d} and b_o ∈ ℝ^{n_r}, in conjunction with the parameters of the encoder network, constitute the trainable parameters θ of the classification model. In the output layer, the classification model typically turns the score vector into a conditional probability distribution p_θ(r_k | q) over the n_r relations based on a softmax function, given by

p_θ(r_k | q) = exp(s_k(q)) / Σ_{j=1}^{n_r} exp(s_j(q)),

for k = 1, …, n_r. Classification is then performed by picking the relation with the highest probability given q.
Given a data set {(q(i), f(i))} of N pairs of NLQs and single fact-based formal queries, the relation classification model is trained by maximizing the log-likelihood of the model parameters θ, which is given by

L(θ) = Σ_{i=1}^{N} log p_θ(r(i) | q(i)),

where r(i) is the predicate used in the formal query f(i).
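The scoring, softmax, and log-likelihood described above can be sketched numerically. The toy bag-of-keywords "encoder", the weight matrix, and the relation inventory below are all fabricated for illustration; in a real system the encoder is an RNN/CNN/transformer and all parameters are learned.

```python
import math

RELATIONS = ["ex:birthPlaceOf", "ex:writerOf", "ex:area"]

def encode(question):
    """Toy encoder: a fixed-length keyword-indicator vector q in R^2."""
    words = question.lower().split()
    return [float("born" in words or "birthplace" in words),
            float("wrote" in words or "writer" in words)]

# Affine output layer s(q) = W_o q + b_o, with W_o in R^{3x2}, b_o in R^3.
W_o = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
b_o = [0.0, 0.0, -1.0]

def scores(q):
    return [sum(w * x for w, x in zip(row, q)) + b
            for row, b in zip(W_o, b_o)]

def softmax(s):
    exps = [math.exp(v) for v in s]
    z = sum(exps)
    return [e / z for e in exps]

def nll(question, gold_relation):
    """Per-example negative log-likelihood -log p(r|q); summing this over
    the data set and minimizing corresponds to maximizing L(theta)."""
    probs = softmax(scores(encode(question)))
    return -math.log(probs[RELATIONS.index(gold_relation)])

probs = softmax(scores(encode("Where was Michael Crichton born ?")))
# The highest-probability relation for this question is ex:birthPlaceOf.
```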

| Classification-based parsing algorithms
For single fact-based prediction tasks, systems relying on standard classification models as described above can achieve state-of-the-art performance (Mohammed et al., 2018; Petrochuk & Zettlemoyer, 2018). Since the target formal queries consist only of one subject entity and one relation, they can, in principle, be predicted using two separate classifiers, receiving the NLQ as input and producing an output distribution over all entities and all relations in the KG, respectively. This approach can be successfully applied to predict the KG relation mentioned or implied in the question.
However, large KGs like Freebase contain a huge number of entities, and training data sets can only cover a small fraction of these (and therefore many classes remain unseen after training). This makes it difficult to apply the approach described above for relation classification to entity prediction. 13 Therefore, several works on SIMPLEQUESTIONS initially only perform relation classification and rely on a two-step approach for entity linking instead. In such a two-step approach, first an entity span detector 14 is applied to identify the entity mention in the NLQ. Second, simple text-based retrieval methods can be used to find a limited number of suitable candidate entities. The best fitting entity given q can then be chosen from this candidate set based on simple string similarity and graph-based features. The entity span detector can be implemented based on classifiers as well, either as a pair of classifiers (one for predicting the start position and one for predicting the end position of the span) or as a collection of independent classifiers 15 (one for every position in the input, akin to a token classification network), where each classifier predicts whether the token at the corresponding position belongs to the entity span or not. To give a concrete example of a full KGQA semantic parsing algorithm relying on classification models, we describe the approach proposed by Mohammed et al. (2018), chosen due to its simplicity and competitive performance on SIMPLEQUESTIONS. This QA system adheres to the following steps for predicting the entity-relation pair constituting the formal query:

1. Entity span detection: A bidirectional long short-term memory network (BiLSTM) (Hochreiter & Schmidhuber, 1997) is used for sequence tagging, that is, to identify the span of the NLQ q that mentions the entity. In the example "Where was Michael Crichton born?", the entity span could be "Michael Crichton".
2. Entity candidate generation: Given the entity span, all KG entities are selected whose labels (almost) exactly match the predicted span.
3. Relation classification: A bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014) encoder is used to encode q. The final state of the BiGRU is used to compute probabilities over all relations using a softmax output layer, as described in Equations (2) and (3).

For training the full system, the relations given in the formal queries of the data set are used as correct output classes. For the span detector, one first needs to extract "pseudo-gold" entity spans, since they are not given as part of the data set. This is done by automatically aligning NLQs with the entity label(s) of the correct entity provided in the training data; that is, the part of the NLQ that best matches the entity label is taken as the correct span to train the sequence tagging BiLSTM. Mohammed et al. (2018) investigate different RNN- and CNN-based relation detectors as well as BiLSTM- and CRF-based entity mention detectors. They show that a simple BiLSTM-based sequence tagger can be used for entity span prediction, leading to results competitive with the state of the art.
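The two-step entity linking described above (span detection followed by label matching) can be sketched with plain string matching standing in for the learned BiLSTM tagger. The label inventory and the longest-match heuristic are illustrative assumptions, not the actual components of the cited system.

```python
ENTITY_LABELS = {
    "ex:Michael_Crichton": "Michael Crichton",
    "ex:Westworld": "Westworld",
    "ex:Chicago": "Chicago",
}

def detect_span(question):
    """Stand-in for the BiLSTM span tagger: return the longest entity label
    that occurs verbatim in the question, if any."""
    matches = [lbl for lbl in ENTITY_LABELS.values() if lbl in question]
    return max(matches, key=len) if matches else None

def candidate_entities(span):
    """Candidate generation: all KG entities whose label matches the span."""
    return [e for e, lbl in ENTITY_LABELS.items()
            if lbl.lower() == span.lower()]

span = detect_span("Where was Michael Crichton born?")
candidates = candidate_entities(span)
```

In a real system, the candidate set would then be re-ranked using string similarity and graph-based features before the final entity is chosen.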
Both Mohammed et al. (2018) and Petrochuk and Zettlemoyer (2018) first identify the entity span, similarly to previous works, but disambiguate the entity without using neural networks. Petrochuk and Zettlemoyer (2018) employed a BiLSTM-CRF model for span prediction, which can no longer be considered a simple collection of independent classifiers and instead forms a structured prediction model that captures dependencies between different parts of the output formal query (in this case the binary labels for every position).

| Ranking methods for KGQA
If we cannot assume that all formal queries follow a fixed structure, as we could in the previous section, the task of mapping an NLQ to its logical form becomes more challenging. Since the set ℱ of all possible formal queries grows exponentially with the length of the formal query, KGQA systems need to reduce the set of possible outputs. Ranking-based semantic parsers therefore employ some kind of search procedure to find a suitable set of candidate formal queries C(q) = {f_1, …, f_N} for a given NLQ q and the underlying KG, and then use a neural network-based ranking model to find the best matching candidate.

| Ranking models
Formally, given an NLQ q and a candidate set of formal queries C(q), the task of a ranking model is to output a score for each of the candidate formal queries in C(q) that allows ranking the set of candidates, where a higher score indicates a better fit for the given NLQ q. We define a scoring function s_θ(q, f) that takes a question q and a query f and computes a score for this pair. The highest scoring formal query candidate is then returned as the prediction of the model, that is,

f̂ = argmax_{f ∈ C(q)} s_θ(q, f).

In case multiple answers are possible for a given question, a set of outputs may be determined by also considering candidates whose scores are close to that of the best-scoring candidate f̂ (Dong et al., 2015):

{f ∈ C(q) : s_θ(q, f̂) − s_θ(q, f) < γ},

where γ is a margin threshold. A simple implementation of the scoring function s_θ that is frequently used compares vector representations of q and f. More specifically, a pair of neural encoder models (enc_q with parameters θ_q, and enc_f with parameters θ_f) is employed to map q as well as every candidate query f ∈ C(q) to a latent representation space, resulting in a vector q = enc_q(q) for the NLQ and a vector f = enc_f(f) for each candidate. Then, each formal query encoding vector f, paired with the encoded question q, is fed into a differentiable vector comparison function c(q, f) that computes the score s_θ(q, f) indicating how well the formal query f matches the question q. This function can, for example, be a parameterless function c(·, ·) such as the dot product between the embedding vectors, or a parameterized function c_θc(·, ·) (where the corresponding parameters are denoted by θ_c), such as another neural network receiving the embeddings as input. In other words, the scoring function can be implemented as s_θ(q, f) = c_θc(enc_q(q), enc_f(f)). However, more advanced scoring functions that do not rely on encoding q and f into single vectors are possible.
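The dual-encoder scoring scheme can be sketched as follows, with bag-of-embedding sums standing in for trained encoders and a dot product as the comparison function c(·, ·) (all embedding values are toy assumptions):

```python
import numpy as np

# Toy embeddings (hypothetical values; in practice these are learned).
Q_EMB = {"where": np.array([1.0, 0.0]), "born": np.array([0.0, 1.0])}
F_EMB = {"crichton": np.array([0.0, 0.0]),
         "place_of_birth": np.array([0.0, 1.0]),
         "profession": np.array([-1.0, 0.0])}

def encode(tokens, emb):
    # Stand-in encoder enc(.): sum of token embeddings; a trained RNN/CNN/
    # transformer encoder would be used in practice.
    return sum(emb[t] for t in tokens)

def score(q_vec, f_vec):
    # Parameterless comparison function c(., .): the dot product.
    return float(q_vec @ f_vec)

def predict(question_tokens, candidates):
    # Return the candidate formal query maximizing s(q, f).
    q_vec = encode(question_tokens, Q_EMB)
    return max(candidates, key=lambda f: score(q_vec, encode(f, F_EMB)))

best = predict(["where", "born"],
               [["crichton", "place_of_birth"], ["crichton", "profession"]])
```

Here candidates are represented as bags of entity/relation symbols; the candidate whose encoding points in the same direction as the question encoding wins.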
We can train ranking models using (a) full supervision, where the training data consists of questions and corresponding logical forms; and (b) weak supervision, where gold logical forms are absent, and only pairs of questions and corresponding answers (i.e., the execution results from the KG) are available for training. We discuss these two cases in the following.

Full Supervision: In the fully supervised setting, a set of negative examples is generated; that is, for each NLQ q^(i) in the training set, a false formal query f̃^(i) that does not return the correct answer for q^(i) is created. Typically, this process, referred to as negative sampling, relies on a random or heuristic-based search procedure. Training the ranking model then corresponds to fitting the parameters of the score function and the neural encoders by minimizing either a pointwise or pairwise ranking objective function. A popular ranking objective is the pairwise hinge loss that can be computed over pairs of positive and negative examples:

L(θ) = Σ_i max(0, λ − s_θ(q^(i), f^(i)) + s_θ(q^(i), f̃^(i))),    (7)

where f^(i) is the gold logical form for q^(i) and λ is a margin hyperparameter.

Weak Supervision: In the weakly supervised setting, a search procedure is first used to find a set of pseudo-gold logical forms ℱ_p^(i) = {f_1^(i), …} for each question q^(i) that evaluate to the correct answer a^(i). Pairs of questions and their corresponding pseudo-gold logical forms are then used to train the ranking models as described above.
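A minimal sketch of the pairwise hinge loss and a naive negative sampler (the margin value and the relation-swapping heuristic are illustrative assumptions):

```python
import random

def pairwise_hinge_loss(s_pos, s_neg, margin=1.0):
    # max(0, margin - s(q, f) + s(q, f~)): the loss is zero once the gold
    # query outscores the negative sample by at least the margin.
    return max(0.0, margin - s_pos + s_neg)

def sample_negative(gold_relation, all_relations, rng=random):
    # Simple negative sampling: corrupt the gold query by swapping in a
    # random other relation (heuristic-guided sampling is also common).
    return rng.choice([r for r in all_relations if r != gold_relation])
```

During training, the scores s_pos and s_neg would be produced by the scoring function s_θ for the gold and sampled queries, and the loss gradient updates both encoders.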
The maximum margin reward (MMR) method introduced by Peng, Chang, and Yih (2017) proposes a modified formulation of the pairwise hinge-loss-based ranking objective in Equation (7), called the most-violation margin objective. For each training example (q, a), they find the highest scoring logical form f* from ℱ_p as the reference logical form. They define a reward-based margin function δ : ℱ × ℱ × A → ℝ and find the logical form that violates the margin the most:

f̃ = argmax_{f ∈ ℱ} (s_θ(q, f) − s_θ(q, f*) + δ(f*, f, a)),    (8)

where δ(f*, f, a) = R(f*, a) − R(f, a), and R(f, a) is a scalar-valued reward that can be computed by comparing the labeled answer a and the answer a′ = f(K) yielded by executing f on the KG. The reward function chosen by Peng et al. (2017) is the F1 score. The most-violation margin objective for the training example (q, a) is then defined as

L(q, a) = max(0, s_θ(q, f̃) − s_θ(q, f*) + δ(f*, f̃, a)),

where f̃ is computed using Equation (8). Note that the MMR method only updates the scores of the reference logical form and the most-violating logical form. Below, we discuss how existing approaches design the scoring function s_θ(q, f), followed by a discussion of the parsing algorithms used for finding candidate logical forms in ranking-based methods.
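The F1-based reward R(f, a) and the margin δ(f*, f, a) can be sketched as follows, treating execution results as plain answer sets:

```python
def f1_reward(predicted, gold):
    # R(f, a): F1 overlap between the answer set obtained by executing f
    # on the KG and the labeled answer set a.
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def margin(f_star_answers, f_answers, gold):
    # delta(f*, f, a) = R(f*, a) - R(f, a): how much better the reference
    # logical form's execution result matches the labeled answer.
    return f1_reward(f_star_answers, gold) - f1_reward(f_answers, gold)
```

A logical form that over-generates answers (e.g., returns {"a", "b"} when the gold answer is {"a"}) receives a reduced reward through its lower precision.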
Encoding Methods and Score Functions: Neural network-based models used for encoding questions and logical forms vary in complexity ranging from simple embedding-based models to recurrent or convolutional models. Bordes et al. (2014) proposed one of the first neural network-based KGQA methods for answering questions corresponding to formal queries involving multiple entities and relations, and evaluate their method on WEBQUESTIONS. They represent questions as bag-of-words vectors of tokens in the question, and logical forms as bag-of-words vectors indicating whether a certain relation or entity is present in the logical form. These sparse bag-of-words representations are then encoded using a neural embedding model that computes the sum of the embeddings for the words, entities, and relations present in the bag-of-words vector, yielding fixed-length vector representations for the question and logical form, respectively. Their match is scored by computing the dot product between the resulting vectors. Dong et al. (2015) improve upon the simple embedding-based approach by introducing multicolumn CNNs that produce three different vector-based representations of the NLQ. Corresponding to each of the three NLQ representations, they create a different vector representation of the logical form by encoding (a) the path of relations between the entity mentioned in the question and the answer entity, (b) the 1-hop subgraph of entities and relations connected to the path, and (c) the answer type information, using embedding-based models similar to those used in the approach described above. The sum of dot products between the question representations and their corresponding logical form representations is computed to get the final score.
Incorporation of structural information during the process has also been explored in subsequent approaches (Dai et al., 2016; Jain, 2016; Lukovnikov et al., 2017; Zhang et al., 2016). Zhang et al. (2016) propose an attention-based representation for NLQs and incorporate structural information from the KG into their ranking-based KGQA model by embedding entities and relations using a pretrained TransE model (Bordes, Usunier, Garcia-Duran, Weston, & Yakhnenko, 2013). They adopt a multi-task training strategy to alternatingly optimize the TransE and KGQA objectives, respectively. The use of structural information, such as entity type vectors and pretrained KG embeddings, 16 has also been shown to be beneficial for the KGQA task by Dai et al. (2016) and Lukovnikov et al. (2017). Furthermore, Lukovnikov et al. (2017) explore building question representations on both word and character level. Note that these approaches focus on a subset of the task, namely, answering simple questions with a single entity and relation, and correspondingly train their models on SIMPLEQUESTIONS. These and other works on simple questions typically encode the subject entity and the relations in the logical form separately and rank them against the given NLQ, instead of treating them together as a formal query language expression. An exception to this is the work of Bordes et al. (2015), which ranks entire triples.
Instead of incorporating structural information from the KG, Luo et al. (2018) incorporate local semantic features into the NLQ representation. To do so, they extract the dependency path between the entity mentioned in the NLQ and annotated wh-tokens 17 , and encode both the dependency path as well as the NLQ tokens. Yih et al. (2015) use CNNs to encode graph-structured logical forms called query graphs after flattening the graph into a sequence of tokens. Their method was extended by M. Yu et al. (2017), who use bidirectional LSTMs for encoding the logical forms. They use two kinds of embeddings for encoding relations: relation-specific vector representations, as well as word-level vector representations for the tokens in the relation. Berant and Liang (2014) employ a paraphrase-driven approach to answer complex questions. They first generate a set of logical forms using a predefined set of templates, which are then transformed into canonical utterances. These canonical utterances are then scored against the input question to find the closest match. The paraphrase model is independent of the underlying KG and thus can be trained from large text corpora.
The approaches discussed above are implemented over the FREEBASE KG. Maheshwari et al. (2019) propose an attention-based method to compute different representations of the NLQ for each relation in the logical form, and evaluate their approach on LC-QuAD and QALD-7. They also show that transfer learning across KGQA data sets is an effective method of offsetting the general lack of training data, by pre-training their models on LC-QuAD and fine-tuning on QALD. Additionally, their work demonstrates the benefits of using pretrained language models.

| Ranking-based parsing algorithms
As mentioned above, searching for candidate logical forms is a pivotal step in training and inference of ranking-based parsing algorithms. During inference, a search procedure is used to create the formal query candidate set C(q), which is then ranked using the scoring function. The size of the search space crucially depends on the complexity of the NLQs the system aims to answer. In the weakly supervised setting, in the absence of gold annotated formal queries, an additional search procedure is required during training to guess the latent formal queries that resolve to the correct answer for a given NLQ. A similar search process may be used for generating the negative examples needed for training ranking models.
Typically, the search procedure builds on multiple techniques to restrict the candidate space, such as:

1. using off-the-shelf entity-linking tools to generate entity candidates, and limiting the search space to formal queries containing only entities and relations found within a certain distance (usually measured in the number of relation traversals, also referred to as hops) of the candidate entities;
2. limiting the number of relations used in the formal query;
3. employing beam search, a heuristic search algorithm that maintains a beam of size K, that is, a set of at most K incomplete output sequences that have high probability under the current model; at every time step during prediction, every sequence in the beam is expanded with all possible next actions or tokens, the probabilities of the expanded sequences are evaluated by the scoring function, and only the K most likely sequences are retained in the beam and considered for further exploration;
4. strictly pruning invalid or redundant logical forms that violate grammar constraints of the formal language, or those that do not adhere to the structure of the KG, along with additional rules and heuristics.

Bordes et al. (2014) and Dong et al. (2015) employ (1) and (3), that is, they find the entities mentioned in the question, and use a beam search mechanism to iteratively build logical forms, making one prediction at a time based on their ranking model. In contrast, K. Xu et al. (2016) propose a multistage approach in order to better handle questions with multiple constraints. They use syntactic rules to break a complex question down into multiple simple questions, and use multichannel CNNs to jointly predict entity and relation candidates in the simple fact-based questions. The predictions of the CNN are validated by textual evidence extracted from unstructured Wikipedia pages of the respective entities.
An SVM rank classifier (Joachims, 2006) is then employed to choose the best entity-relation pair for each simple question. Yih et al. (2015) employ a heuristic reward function to generate logical forms using a multistep procedure, starting by identifying the entities mentioned in the NLQ, then building a core chain consisting only of relations that lead to the answer entity, and finally adding constraints like aggregation operators to their logical form. Their logical forms, called query graphs, can be viewed as the syntax tree of a formal query in λ-DCS 18 (P. Liang, 2013). For a given NLQ, partial query graphs generated at each stage of the search procedure are flattened into a sequence of tokens. The NLQ and the formal query tokens are respectively encoded by a CNN model and the final output score is given by the dot product of the output representations.
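The beam search procedure used by these parsers can be sketched as follows; the toy next-token distribution below stands in for the learned scoring model:

```python
import math

def beam_search(next_prob, vocab, beam_size, max_len, eos="</s>"):
    # Maintain a beam of at most `beam_size` (log-probability, sequence)
    # pairs. At every step, every unfinished sequence is expanded with all
    # tokens and only the highest-scoring expansions survive.
    beam = [(0.0, [])]
    for _ in range(max_len):
        expanded = []
        for logp, seq in beam:
            if seq and seq[-1] == eos:
                expanded.append((logp, seq))  # finished sequences stay as-is
            else:
                for tok in vocab:
                    expanded.append((logp + math.log(next_prob(seq, tok)),
                                     seq + [tok]))
        beam = sorted(expanded, key=lambda x: x[0], reverse=True)[:beam_size]
    return beam

# Hypothetical next-token distribution: prefer "a" first, then end the sequence.
def toy_next_prob(seq, tok):
    table = {"a": 0.6, "b": 0.3, "</s>": 0.1} if not seq else \
            {"a": 0.1, "b": 0.2, "</s>": 0.7}
    return table[tok]
```

With beam size 1 this reduces to greedy search; larger beams trade computation for a lower risk of discarding the best logical form early.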

| Translation methods for KGQA
In this section, we will focus on methods that learn to generate a sequence of tokens that forms the logical form as opposed to learning to choose the correct logical form among a set of pre-generated candidates.
Semantic parsing in this setup is modeled as a translation problem, where we need to translate an NLQ q into a formal query f that can be expected to return the intended answer when executed over the source KG K. A popular approach for mapping sequences in one language to sequences in another language is to use neural sequence-to-sequence (seq2seq) models. Such models were first introduced for machine translation, that is, for mapping a sentence in one natural language to the corresponding sentence in another (Bahdanau, Cho, & Bengio, 2015; Luong, Pham, & Manning, 2015). With some changes, this neural architecture has been extensively adapted for semantic parsing (Dong & Lapata, 2016; Jia & Liang, 2016; Sun et al., 2018; X. Xu, Liu, & Song, 2017; Zhong, Xiong, & Socher, 2017) and more specifically for the KGQA task (Cheng & Lapata, 2018; He & Golub, 2016; C. Liang et al., 2017; Yin, Zhou, He, & Neubig, 2018).

| Translation models
A typical neural sequence-to-sequence model consists of an encoder, a decoder, and an attention mechanism. The encoder encodes the input sequence to create context-dependent representations for every token in the input sequence. The decoder generates the output sequence, one token at a time, conditioning the generation process on previously generated tokens as well as the input sequence. The attention mechanism (Bahdanau et al., 2015;Luong et al., 2015) models the alignment between input and output sequences which has proven to be a very useful inductive bias for translation models. In neural sequence-to-sequence models, generally RNNs are used to encode the input and to predict the output tokens. Other encoder/decoder network choices are also possible, such as CNNs and transformers (Vaswani et al., 2017). Since logical forms are generally tree-structured, and a basic sequence decoder does not explicitly exploit the tree dependencies, several works have focused on developing more structured decoders (see Section 4.3.1).
Formally, given an input sequence q_{0…T} 19 of T tokens from the input vocabulary V_I and an output sequence f_{0…T*} of T* tokens from the output vocabulary V_O, the translation model with parameters θ models the probability p_θ(f_{0…T*} | q_{0…T}) of generating the whole output sequence given the input sequence. The probability of the whole output sequence can be decomposed into the product of probabilities over the generated tokens:

p_θ(f_{0…T*} | q_{0…T}) = ∏_{j=0}^{T*} p_θ(f_j | f_{<j}, q_{0…T}),

where f_{<j} is the sequence of tokens generated so far, that is, f_{<j} = (f_0, …, f_{j−1}). During prediction, the output sequences are usually generated by taking the most likely token at every time step, which is also referred to as greedy search, that is,

f_j = argmax_{v ∈ V_O} p_θ(v | f_{<j}, q_{0…T}),

or by using beam search (see Section 4.2.2). In the following, we discuss common ways to train translation-based KGQA models in the fully and weakly supervised settings.
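Greedy decoding can be sketched as follows, with a hand-set toy distribution standing in for p_θ(· | f_{<j}, q):

```python
import numpy as np

def greedy_decode(next_token_probs, vocab, max_len, eos="</s>"):
    # f_j = argmax_v p(v | f_<j, q): take the most likely token at every
    # time step until the end-of-sequence token is produced.
    seq = []
    for _ in range(max_len):
        probs = next_token_probs(seq)        # distribution over the vocabulary
        tok = vocab[int(np.argmax(probs))]
        if tok == eos:
            break
        seq.append(tok)
    return seq

# Hypothetical conditional distributions for a two-token target query.
def toy_probs(prefix):
    table = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
    return np.array(table[len(prefix)])

VOCAB = ["SELECT", "?x", "</s>"]
```

Note that greedy search commits to one token per step and can miss output sequences whose overall probability is higher; beam search mitigates this.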
Full Supervision: In the fully supervised setting, the parameters θ of the model p_θ(f_j | f_{<j}, q_{0…T}) are typically trained by maximizing their likelihood given the sequence pairs in the training set D = {(q^(i), f^(i))}_{i=1}^N, that is, by maximizing the following objective:

L_ML(θ) = Σ_{i=1}^N log p_θ(f^(i) | q^(i)).

Weak Supervision: In the weakly supervised setting, the training data D = {(q^(i), a^(i))}_{i=1}^N consists only of pairs of NLQs and corresponding execution results. Different methods have been proposed for training translation models for semantic parsers using only weak supervision. Even though the proposed methods vary in the objective function and training procedure, they all operate on the general principle of maximizing the probability of producing the correct execution results.
The following training methods are commonly used for weakly supervising translation-based semantic parsing models:

1. Maximum Marginal Likelihood (MML): The MML method follows a similar idea as the maximum likelihood (ML) method used in the fully supervised setting, but instead of directly maximizing the probability of producing the correct logical form, it maximizes the probability of producing the correct execution results. In order to do so, MML marginalizes over all possible logical forms, maximizing the following objective function:

L_MML(θ) = log Σ_{f ∈ ℱ} p(a | f) p_θ(f | q),

where a is the correct answer to the NLQ q and the sum is computed over all f's in the space of all possible logical forms ℱ. Since semantic parsing environments are usually deterministic (i.e., the same query always executes to the same answer when the KG is kept fixed), the p(a | f) term is either 1 (if f's execution produces the correct result a, i.e., f(K) = a) or 0, which leads to the following simplified notation:

L_MML(θ) = log Σ_{f ∈ ℱ*} p_θ(f | q),

where ℱ* is the set of consistent logical forms that execute to the correct answer a. The set ℱ* is usually approximated using online beam search. However, an approximation of ℱ* can also be computed beforehand (Pasupat & Liang, 2016) and kept fixed throughout training.

2. Reinforcement Learning using Expected Rewards (ER): The problem of weakly supervised semantic parsing can also be approached as a reinforcement learning (RL) problem. In fact, we can view the translation model as a parameterized policy that, given a state, which in our case consists of the KG K, the input q, and the decoding history f_{<t}, must decide which action to take in order to maximize a reward. In our case, the reward function R(f, a) can be a binary reward at the end of an episode (end of the decoding phase), that is, 1 if the produced trajectory (logical form) f executes to the correct answer a and 0 otherwise.
The RL setup for semantic parsing is characterized by (a) a deterministic environment (the next state produced by executing an action on the current state is always the same) and (b) sparse rewards (the reward is given only at the end of the episode and, given the huge number of possible logical forms, is likely to be 0 most of the time). In addition, weakly supervised training of semantic parsers is characterized by underspecified rewards (Agarwal et al., 2019), which could lead to learning spurious logical forms; however, a reward function based only on execution results cannot take this into account. Policy gradient methods are trained by optimizing the expected reward:

L_ER(θ) = Σ_{f ∈ ℱ} p_θ(f | q) R(f, a),

where the sum is over all possible logical forms, which, in practice, is estimated using an approximation strategy such as Monte Carlo integration, that is, based on trajectories sampled from the policy.

3. Iterative Maximum Likelihood (IML): The IML objective (C. Liang et al., 2017) uniformly maximizes the probability of decoding all consistent logical forms across all examples:

L_IML(θ) = Σ_i Σ_{f ∈ ℱ*^(i)} log p_θ(f | q^(i)).

When training policy gradient methods using ER, C. Liang et al. (2017) demonstrate that better exploration can be achieved by employing IML to populate an initial buffer of diverse trajectories (i.e., logical forms), which are then used to pretrain the RL model.

4. Memory Augmented Policy Optimization (MAPO): In order to reduce the variance of the ER estimator, C. Liang et al. (2018) proposed a novel method that integrates a memory buffer. They reformulate the ER objective as a sum over action sequences inside a memory buffer C(q) and outside the buffer:

L_MAPO(θ) = Σ_{(q, a) ∈ D} ( Σ_{f ∈ C(q)} p_θ(f | q) R(f, a) + Σ_{f ∉ C(q)} p_θ(f | q) R(f, a) ).

The memory buffer C(q) for each example is populated using systematic exploration, which prevents revisiting sequences that have already been explored. When the two terms are examined further (see C. Liang et al., 2018), we can see that the weight assigned to trajectories from the memory buffer is low in the beginning of training, when the policy still assigns low probabilities to the buffered trajectories. In order to speed up training, C. Liang et al. (2018) propose memory weight clipping, which amounts to assigning the buffered trajectories an importance weight of at least a certain value. Experimental results show a significant improvement of the proposed MAPO procedure over common baselines and show that both systematic exploration and memory weight clipping are essential to achieve high performance. Agarwal et al. (2019) argue that, apart from an optional entropy term, MAPO does not encourage exploration, which can be problematic. They propose a MAPO variant called MAPOX, where the initial buffer is additionally populated by sequences found using IML.
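As a toy illustration of the MML and ER objectives, assuming a small enumerable set of logical forms with known model probabilities and a binary execution reward:

```python
import math

def mml_objective(probs, consistent):
    # L_MML = log sum_{f in F*} p_theta(f | q): the marginal probability of
    # producing any logical form that executes to the correct answer.
    return math.log(sum(p for f, p in probs.items() if f in consistent))

def expected_reward(probs, reward):
    # L_ER = sum_f p_theta(f | q) * R(f, a); in practice this sum is
    # estimated from trajectories sampled from the policy, not enumerated.
    return sum(p * reward(f) for f, p in probs.items())

# Hypothetical model distribution over three enumerable logical forms,
# two of which execute to the correct answer.
probs = {"f1": 0.5, "f2": 0.3, "f3": 0.2}
consistent = {"f1", "f3"}
binary_reward = lambda f: 1.0 if f in consistent else 0.0
```

With a binary reward, the quantity inside the MML log and the expected reward coincide; the two objectives nonetheless yield different gradients and exploration behavior, as discussed above.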
A disadvantage of the RL and MML approaches, noted by Guu et al. (2017) and Agarwal et al. (2019), is that the exploration of trajectories is guided by the current model policy, which means that logical forms that have high probability under the current model are more likely to be explored. Consequently, logical forms with low probability may be overlooked by exploration. In the presence of many spurious logical forms and only a few correct logical forms, it becomes more likely that spurious ones are explored first. Once the policy settles on these high-reward (but spurious) logical forms, the exploration gets increasingly biased towards them during training, leading to poor generalization. One common solution to this is to perform ε-greedy exploration, which Guu et al. (2017) extend from RL to MML. Guu et al. (2017) also propose a meritocratic update rule that updates parameters such that probability is spread more evenly across consistent logical forms. For a more detailed discussion of the MML and ER objectives, we refer the reader to the work of Guu et al. (2017), Norouzi et al. (2016), Roux (2017), and Misra, Chang, He, and Yih (2018). Agarwal et al. (2019) propose a meta-learning approach for learning an auxiliary reward function that is used to decrease the effect of spurious logical forms in the MAPOX objective. The auxiliary reward function is trained such that the update to the policy under the reward augmented by the auxiliary reward function results in better generalization over a held-out batch of data. Cheng and Lapata (2018) propose a neural parser-ranker-based approach for weakly supervised semantic parsing, where a sequence-to-sequence parsing model (trained with beam search and IML) is coupled with a basic ranking model (trained using MML) as well as an inverse parsing model, which is a generative model that reconstructs the original question from the logical form generated by the parsing model. 
The reconstruction loss is used to further refine the parsing model in order to tackle the problem of spurious logical forms. Note that here the logical form is treated as a latent variable.
Another class of neural approaches for KGQA with weak supervision performs multistep knowledge base reasoning in a fully differentiable end-to-end manner. TensorLog is a recently developed differentiable logic that performs approximate first-order logic inference through a sequence of differentiable numerical operations on matrices. NeuralLP (F. Yang et al., 2017), inspired by TensorLog, learns to map an input question to its answer by performing multistep knowledge base reasoning by means of differentiable graph traversal operations.
The neural programmer (Neelakantan et al., 2015) is a fully differentiable encoder-decoder-based architecture augmented with a set of manually defined discrete operators (e.g., argmax, count), which is applied to the task of table-based complex question answering by Neelakantan et al. (2017). The discrete operators allow the model to induce latent logical forms by composing arithmetic and logical operations in a differentiable manner. The model is trained using a weak supervision signal, which is the result of the execution of the correct program.
In contrast, the neural symbolic machines framework (C. Liang et al., 2017) combines (a) a differentiable neural programmer, which is a seq2seq model augmented by a key-variable memory that can translate a natural language utterance to a program; and (b) a symbolic computer (specifically, a Lisp interpreter) that implements a domain-specific language with built-in functions and code-assistance that is used to prune syntactically or semantically invalid candidate logical forms and execute them to retrieve a result that is used to compute the supervision signal. REINFORCE, augmented with iterative maximum likelihood training, is used to optimize for rewarding programs. C. Liang et al. (2018) demonstrate the effectiveness of a memory buffer of promising programs, coupled with techniques to stabilize and scale up training as well as reduce variance and bias.
In the absence of both execution results and logical forms, Sun et al. (2019) train a semantic parser with a limited set of simple domain-independent hand-crafted mapping rules (e.g., mapping the word "more" in the input question to the logical operator ">" in the output query). They demonstrate the efficacy of the back-translation paradigm (Lample, Ott, Conneau, Denoyer, & Ranzato, 2018) and model-agnostic meta-learning (Finn, Abbeel, & Levine, 2017) in learning a neural semantic parser in a low-resource setting.
Structured Decoder Models: In order to be valid (and executable), logical forms have to obey a certain set of grammatical rules. In many cases, logical forms can be represented as trees, which provide more structural information about the query than the sequence of tokens representation usually assumed by sequence decoders discussed so far. Several works proposed decoders that explicitly exploit the structure of queries to better model dependencies within and between different query clauses.
Semantic parsers with structured decoders use the same sequence encoders to encode the NLQ but induce additional structure on top of the normal attention-based sequence decoder that exploits the hierarchical tree structure of the query. Dong and Lapata (2016) propose a tree decoding model for semantic parsing that decodes a query tree in a top-down breadth-first order. For example, instead of decoding the lambda-calculus logical form (argmin $0 (state:t $0) (size:i $0)) (corresponding to the question "Which state is the smallest?") as a flat sequence of tokens, the decoder decodes the query in several steps. Internally, the tree decoder uses a normal RNN and manipulates the inputs and states of the sequence decoder according to the tree structure. For our example, the tree decoder proceeds as follows:

1. First, the topmost level of the query tree is decoded: (argmin $0 <n> <n> </s>). Here, <n> and </s> are artificial topological tokens introduced to indicate the tree structure: <n> is a nonterminal token indicating that a subtree is expected at its position, and </s> is the end-of-sequence token indicating the end of a sequence of siblings. For this first step, the decoder RNN is used like in a normal sequence decoder.
2. As the second step, after decoding the top-level sequence, the first child subtree, (state:t $0), is decoded and inserted at the position of the first nonterminal <n>. This yields the query (argmin $0 (state:t $0) <n> </s>). While decoding this subtree, the RNN is conditioned on the state corresponding to the first nonterminal produced in the first step (top-level query).
3. Finally, (size:i $0) is decoded and inserted at the position of the last remaining nonterminal. While decoding this second subtree, the RNN is conditioned on the state corresponding to the second nonterminal produced in the first step (top-level query).
Decoding terminates after these three steps because no nonterminals remain in the generated output. From experimental results on the semantic parsing data sets GEO880, ATIS, JOBQUERIES, and IFTTT, it appears that the inductive bias introduced by the tree decoder improves generalization.
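The top-down breadth-first decoding procedure can be sketched as follows; the expand function stands in for the decoder RNN, and its expansions are hard-coded to reproduce the running example (the artificial </s> tokens are omitted for brevity):

```python
from collections import deque

NT = "<n>"  # nonterminal placeholder token

def tree_decode(expand):
    # Top-down breadth-first tree decoding. `expand` returns the child
    # token sequence for a given decoder state, with <n> marking positions
    # to be filled in by subtrees decoded later.
    root = list(expand("root"))
    queue = deque((root, i) for i, t in enumerate(root) if t == NT)
    while queue:
        holder, idx = queue.popleft()
        # In the real model, decoding here is conditioned on the RNN state
        # of the corresponding nonterminal; we pass its position instead.
        subtree = list(expand(idx))
        holder[idx] = subtree
        queue.extend((subtree, i) for i, t in enumerate(subtree) if t == NT)
    return root

def render(tree):
    # Flatten the nested-list tree back into a bracketed logical form.
    return "(" + " ".join(render(t) if isinstance(t, list) else t
                          for t in tree) + ")"

# Hypothetical expansions reproducing the running example.
def toy_expand(state):
    if state == "root":
        return ["argmin", "$0", NT, NT]
    return ["state:t", "$0"] if state == 2 else ["size:i", "$0"]
```

The queue realizes the breadth-first order: all nonterminals of one level are expanded before those of the next.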
Alvarez-Melis and Jaakkola (2017) propose an improved tree decoder for semantic parsing, where the parent-to-child and sibling-to-sibling information flows are modeled with two separate RNNs. With this model, each node has a parent state and a previous-sibling state, both of which are used to predict the node symbol and topological decisions. Instead of modeling topological decisions (i.e., whether a node has children or further siblings) through artificial topological tokens like Dong and Lapata (2016), they use auxiliary classifiers at every time step that predict whether a node has children and whether the node is the last of its siblings. The elimination of artificial topological tokens reduces the length of the generated sequences, which should lead to fewer errors. The decoding proceeds in a top-down breadth-first fashion, similarly to Dong and Lapata (2016). Experimental results on the IFTTT semantic parsing data set show improvement obtained by the introduced changes.
Cheng and Lapata (2018) and Cheng et al. (2017, 2019) develop a transition-based neural semantic parser that adapts the Stack-LSTM proposed by Dyer, Ballesteros, Ling, Matthews, and Smith (2015). The Stack-LSTM decodes the logical forms in a depth-first order, decoding a subtree completely before moving on to its siblings. The Stack-LSTM of Cheng and Lapata (2018) uses an adapted LSTM update when a subtree has been completed: it backtracks the state of the LSTM to the parent state of the completed subtree, and computes a summary encoding of the completed subtree, which then serves as input (the entire completed subtree is thus treated like a single token) in the next LSTM update. Cheng et al. (2019) show how to perform bottom-up transition-based semantic parsing using the Stack-LSTM. Cheng et al. (2017) obtain state-of-the-art results on GRAPHQUESTIONS and SPADES and results competitive with previous works on WEBQUESTIONS and GEOQUERIES. Cheng and Lapata (2018) further improve performance of their Stack-LSTM model on the weakly supervised GRAPHQUESTIONS, SPADES, and WEBQUESTIONS data sets by using a generative ranker.
Copying Mechanisms: The architectures proposed by Vinyals, Fortunato, and Jaitly (2015), See, Liu, and Manning (2017), Gu, Lu, Li, and Li (2016), and Jia and Liang (2016) are an augmentation of the attention-based sequence-to-sequence neural architecture which enables a direct copy of tokens, sub-sequences or other elements from the input to the output. Although it is not generally required for semantic parsing and KGQA, it can be useful for certain tasks or data sets, and could help to integrate neural decoding with external entity linking tools (Shaw et al., 2019). In WIKISQL, for example, a copying mechanism is required to copy SQL condition values into the query (X. Xu et al., 2017; Zhong et al., 2017). It has also been used for semantic parsing in general (Damonte, Goel, & Chung, 2019; Jia & Liang, 2016), when the query may contain NL strings.
In Shaw et al. (2019), the NLQ is converted to a graph representation that contains both the words from the question as well as candidate entities for phrases in the question. Candidate entities are generated using an external entity linker and are integrated into the input graph by adding edges that connect the candidate entity nodes to the nodes representing words to which the entities were linked. The decoder can generate tokens from the query language vocabulary, or use a copy mechanism to produce one of the linked entity candidates from the input graph. Even though the authors did not test this approach for QA over KGs, we believe this approach could be useful for KGQA.
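As an illustration of the copying idea, the sketch below (hypothetical function with toy probabilities, in the spirit of the pointer-generator of See et al., 2017) mixes a generation distribution over the output vocabulary with a copy distribution given by the decoder's attention over the input tokens; an out-of-vocabulary entity mention can then be emitted by copying.

```python
# Hypothetical sketch (toy probabilities) of a pointer-generator-style
# copying mechanism: the final output distribution mixes a generation
# distribution over the target vocabulary with a copy distribution given
# by the attention weights over the input tokens.

def copy_augmented_distribution(p_gen, vocab_probs, attention, input_tokens):
    """p_gen: generation probability in [0, 1]; vocab_probs: token -> prob;
    attention: weights over input positions (sums to 1)."""
    out = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for weight, tok in zip(attention, input_tokens):
        out[tok] = out.get(tok, 0.0) + (1.0 - p_gen) * weight
    return out

vocab_probs = {"argmax": 0.6, "(": 0.3, ")": 0.1}   # decoder's generation dist.
attention = [0.7, 0.2, 0.1]                          # attention over the NLQ
input_tokens = ["Astronauts", "tallest", "?"]
dist = copy_augmented_distribution(0.5, vocab_probs, attention, input_tokens)
# "Astronauts" can now be emitted even though it is out of vocabulary
```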
Symbol representation: When answering questions over large-scale knowledge graphs, we are confronted with a large number of entities and relations not present in the training data. An important challenge, thus, is to find a way to learn representations for entities and relations that generalize well to unseen ones. Representing both entities and predicates as a sequence of words, characters, or sub-word units (BPE/WordPiece) from their surface forms, as opposed to arbitrary symbols unique to each entity or predicate, offers a degree of generalizability to these unseen or rare symbols. For instance, if the representations are learned on the sub-word level, then upon encountering an unseen relation, the representation-building network (encoder) can leverage sub-words shared between seen and unseen relations; thus, unseen relations no longer receive uninformed random vector representations.
Several works on question answering and semantic parsing (and other NLP tasks) have used such representations. For example, several works on WIKISQL encode column names on the word level to yield vectors representing columns (Sun et al., 2018; T. Yu et al., 2018). Some works on KGQA (He & Golub, 2016; Lukovnikov et al., 2017; Maheshwari et al., 2019; M. Yu et al., 2017) also decompose KG relations and/or entities to the word and/or sub-word level. In addition to sub-token-level encoding, additional information about the tokens can be encoded and added to their representations. For example, T. Yu et al. (2018) also encode table names and column data types together with column name words in their model for the SPIDER data set.
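A minimal sketch of such sub-token-level encoding, assuming a simple word-level decomposition and averaged word embeddings (real systems instead train an encoder, e.g., an RNN over BPE units; all names here are illustrative):

```python
import numpy as np

# Hypothetical sketch of word-level relation encoding: a relation label is
# split into words and represented by the average of its word embeddings,
# so an unseen relation that shares words with seen relations still gets
# an informative (non-random) representation.

rng = np.random.default_rng(0)
word_emb = {}            # word -> vector, grown lazily

def embed_word(word, dim=8):
    if word not in word_emb:
        word_emb[word] = rng.normal(size=dim)
    return word_emb[word]

def encode_relation(relation):
    """Split a relation label into words and average their embeddings."""
    words = relation.replace(".", "_").split("_")
    return np.mean([embed_word(w) for w in words], axis=0)

seen = encode_relation("person.place_of_birth")
unseen = encode_relation("person.place_of_death")  # shares 3 of 4 words
# the two relation vectors are close because most component words are shared
```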

| Translation-based parsing algorithms
In the case of purely translation-based semantic parsing, the parsing algorithm centers around sequence decoding, which is usually done using greedy search or beam search as explained above.
Constrained Decoding and Grammar-based Decoding: A simple sequence-to-sequence model as described above does not exploit the formal nature of the target language. In fact, the output sequences to be generated follow strict grammatical rules that ensure the validity of the decoded expression, that is, that it can be executed over the given KG. Thus, many translation-based parsing algorithms exploit these grammatical rules by using grammar-based constraints during decoding in order to generate only valid logical forms. Using such constraints during training also helps to focus learning and model capacity on the space of valid logical forms. Depending on the specific application and data set, reasonable assumptions concerning the logical form can be made. These choices are reflected in the logical form language definition. As an example, in the case of SIMPLEQUESTIONS, the assumption is that the logical form consists of only a single entity and a single relation. Therefore, if we were to solve SIMPLEQUESTIONS using a translation model, we could constrain the decoder to choose only among entities in the first time step and only among relations in the second (instead of considering the full output vocabulary of all tokens possible in the formal query language), and automatically terminate decoding thereafter.
For more general cases, we would like to express a wider range of formal query structures, but nevertheless apply some restrictions on the output tokens at a certain time step, depending on the sequence decoded so far. For example, in the FunQL language definition used by Cheng et al. (2017, 2019), the ARGMAX (x, r) function is restricted to have a relation symbol r as its second argument. Constrained decoding can be trivially implemented by considering only the set of allowed tokens as possible outputs at a certain time step. Computing the allowed tokens for a certain time step, however, can be more challenging, depending on the chosen logical form language.
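For the SIMPLEQUESTIONS case described above, constrained decoding reduces to masking the output vocabulary per time step. A minimal sketch with hypothetical entity and relation vocabularies and a stand-in scoring function:

```python
# Minimal sketch (hypothetical vocabularies and scorer) of constrained
# decoding for a SIMPLEQUESTIONS-style logical form: only an entity may be
# emitted at step 0, only a relation at step 1, then decoding terminates.

ENTITIES = {"ent:Obama", "ent:Paris", "ent:NASA"}
RELATIONS = {"rel:birthplace", "rel:population", "rel:founded"}

def allowed_tokens(step):
    """The set of tokens the decoder may emit at a given time step."""
    if step == 0:
        return ENTITIES
    if step == 1:
        return RELATIONS
    return set()                      # forces decoding to stop

def constrained_decode(score_fn):
    """Greedy decoding restricted to the allowed set at each step;
    score_fn(step, token) stands in for the decoder's learned scores."""
    output, step = [], 0
    while allowed_tokens(step):
        candidates = allowed_tokens(step)
        output.append(max(candidates, key=lambda t: score_fn(step, t)))
        step += 1
    return output

# a toy scorer that prefers the entity Obama and the relation birthplace
scores = {(0, "ent:Obama"): 2.0, (1, "rel:birthplace"): 1.5}
decoded = constrained_decode(lambda s, t: scores.get((s, t), 0.0))
# decoded == ["ent:Obama", "rel:birthplace"]
```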
Many semantic parsing approaches rely on a decoder that generates production rules or, more generally, actions, rather than logical form tokens (Cheng & Lapata, 2018; Guo, Sun, et al., 2018; Lin, Bogin, Neumann, Berant, & Gardner, 2019; Rabinovich, Stern, & Klein, 2017; Shen et al., 2019; Yin & Neubig, 2017; T. Yu et al., 2018). For example, suppose we would like to decode a query ARGMAX (Astronauts, height). We could decode the sequence of tokens ["argmax", "(", "Astronauts", ",", "height", ")"]. However, if we define a CFG that contains the rules X → ARGMAX (X, R), X → Astronauts, and R → height, we can produce the same expression as a sequence of these three rules (rather than a sequence of six tokens). In addition, with such a grammar, the constraints on the decoding can be trivially implemented by looking up the rules applicable to the selected nonterminal according to the predefined grammar. This ensures that only valid logical forms are generated.
Other Decoding Procedures: The two-stage Coarse2Fine decoder of Dong and Lapata (2018) can be seen as a middle ground between a sequence decoder and a tree decoder. The decoder consists of two decoding stages: (a) decoding a query template and (b) filling in the specific details. Like other tree decoders, the Coarse2Fine decoder proceeds in a top-down breadth-first manner, but it is restricted to only two levels. For cases where there is a limited number of query templates, Dong and Lapata (2018) also investigate the use of a template classifier (instead of decoding the template) in the first decoding stage and evaluate on WIKISQL. An additional improvement to the two-step decoding scheme is obtained by encoding the generated template using a bidirectional RNN before using the output states to decode the details. This allows the generation of the specific arguments of the template to be conditioned on the whole structure of the template.
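The two-stage idea can be sketched with stand-in, non-neural components (all names and templates here are illustrative): a coarse stage chooses a template with placeholder slots, and a fine stage fills the slots.

```python
# Hypothetical sketch of two-stage (coarse-to-fine) decoding with
# stand-in, non-neural components: a coarse stage produces a query
# template with placeholder slots, and a fine stage fills the slots.

def coarse_stage(question):
    """Stand-in template decoder: choose a sketch of the query structure."""
    if "how many" in question.lower():
        return ["SELECT", "COUNT", "<col>", "WHERE", "<col>", "=", "<val>"]
    return ["SELECT", "<col>", "WHERE", "<col>", "=", "<val>"]

def fine_stage(template, slot_values):
    """Stand-in detail decoder: fill placeholder slots left to right."""
    values = iter(slot_values)
    return [next(values) if tok.startswith("<") else tok for tok in template]

question = "How many astronauts were born in Ohio?"
template = coarse_stage(question)
query = fine_stage(template, ["name", "birthplace", "Ohio"])
# " ".join(query) == "SELECT COUNT name WHERE birthplace = Ohio"
```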
The work of Cheng and Lapata (2018) trains and uses a translation model for semantic parsing. Using beam search, several logical forms are decoded from the translation model and additional ranking models (see Section 4.2) are used to rerank the logical forms in the beam.
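Decode-then-rerank can be sketched as follows (toy log-probabilities; the ranker is a stand-in function for, e.g., a generative ranking model or an executability check):

```python
import heapq
import math

# Minimal sketch (toy log-probabilities) of decode-then-rerank: beam
# search keeps the k best sequences under the translation model, and a
# separate ranking model rescores the finished candidates in the beam.

def beam_search(step_scores, beam_size=2):
    """step_scores[t] maps token -> log-probability at step t."""
    beams = [([], 0.0)]
    for scores in step_scores:
        expanded = [(seq + [tok], logp + s)
                    for seq, logp in beams for tok, s in scores.items()]
        beams = heapq.nlargest(beam_size, expanded, key=lambda b: b[1])
    return beams

def rerank(beams, rank_fn):
    """Return the sequence preferred by the (stand-in) ranking model."""
    return max(beams, key=lambda b: rank_fn(b[0]))[0]

step_scores = [{"argmax": math.log(0.6), "count": math.log(0.4)},
               {"height": math.log(0.5), "age": math.log(0.5)}]
beams = beam_search(step_scores)     # both surviving beams start with "argmax"
best = rerank(beams, lambda seq: 1.0 if seq == ["argmax", "age"] else 0.0)
# best == ["argmax", "age"], even if it was not the top-scoring beam
```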

| EMERGING TRENDS
Question answering over knowledge graphs has been an important area of research in the past decade, marked by several notable advances: the use of query graph candidates (Yih et al., 2015; M. Yu et al., 2017), the use of neural symbolic machines (C. Liang et al., 2017), the shift to answering multi-entity questions (Luo et al., 2018), the application of transfer learning, and the proposal of differentiable query execution-based weakly supervised models (F. Yang et al., 2017). Petrochuk and Zettlemoyer (2018) suggest that the performance on SIMPLEQUESTIONS is approaching an upper bound. Further, as discussed in Section 3, there is a general shift of focus toward more complex logical forms, as is evident from recent data sets (Bao et al., 2016; Dubey et al., 2019; Trivedi et al., 2017). These advances lay a path for further improvements in the field, and the base for the emerging trends and challenges we outline in the following.
Query Complexity: Comparative evaluations over WEBQUESTIONS, a data set over Freebase, demonstrate the rise in performance of KGQA approaches on the task over the years (Bao et al., 2016). Recently, Petrochuk and Zettlemoyer (2018) demonstrate "that ambiguity in the data bounds [the] performance at 83.4% [over SIMPLEQUESTIONS]," thereby suggesting that the progress on the task is further ahead than ordinarily perceived. For instance, the 77% accuracy over SIMPLEQUESTIONS of the previously best performing baseline proposed by M. Yu et al. (2017) can instead be perceived as 92.3% of the achievable 83.4%. Given that in WEBQUESTIONS, where several KGQA systems have shown high performance, about 85% of the questions are simple questions as well (Bao et al., 2016), similar claims may hold in this context, pending due investigation.
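The normalization above is simple arithmetic: relative to the estimated 83.4% upper bound, a raw accuracy of 77% corresponds to roughly 92.3% of the achievable performance.

```python
# Normalizing a raw accuracy against the estimated upper bound on
# SIMPLEQUESTIONS (Petrochuk & Zettlemoyer, 2018).

upper_bound = 83.4          # estimated achievable accuracy (%)
raw_accuracy = 77.0         # reported model accuracy (%)
relative = 100 * raw_accuracy / upper_bound
print(f"{relative:.1f}%")   # 92.3%
```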
Knowledge graphs commonly used in the KGQA community, as well as the formal query languages used to facilitate KGQA, can support more nuanced information retrieval involving longer core chains, multiple triples, joins, unions, filters, and so on. These nuanced retrieval mechanisms are increasingly being regarded in the community as the next set of challenges. Recently released data sets such as COMPLEXQUESTIONS (Bao et al., 2016), GRAPHQUESTIONS (Su et al., 2016), LC-QUAD (Trivedi et al., 2017), and LC-QUAD 2 (Dubey et al., 2019) explicitly focus on creating complex questions, with aggregates, constraints, and longer relation paths, over which current systems do not perform very well, leaving room for significant improvements.
Robustness: In the past few years, deep learning-based systems have achieved state-of-the-art performance on several tasks; however, an increasing number of findings point out the brittleness of these systems. For instance, Jia and Liang (2017) demonstrate a drop of 35-75% in the F1 scores of 16 models for the reading comprehension task trained over SQUAD (Rajpurkar, Zhang, Lopyrev, & Liang, 2016), achieved by adversarially adding another sentence to the input paragraph (from which the system has to select the relevant span, given the question). Subsequently, a new version of the aforementioned data set was released comprising unanswerable questions (Rajpurkar, Jia, & Liang, 2018), leading to more robust reading comprehension approaches (Hu et al., 2018; Kundu & Ng, 2018). To the best of our knowledge, there has not been any work quantifying or improving the robustness of KGQA models. Such advances would play an important role in the applicability of KGQA systems in a production setting. In our opinion, a good starting point for robustness inquiries would be to utilize recent general-purpose adversarial input frameworks: Ribeiro, Singh, and Guestrin (2018) propose a simple, generalizable way to generate semantically equivalent adversarial sentences.
Interoperability between KGs: In discussing the numerous KGQA approaches in the previous section, we find that only a handful of techniques include data sets from both DBpedia and Freebase in their experiments, despite both being general purpose KGs. This is because the inherent data models underlying the two KGs differ considerably, which makes extracting (question, DBpedia answer) pairs corresponding to a set of (question, Freebase answer) pairs (or vice versa) nontrivial, even after discounting the engineering effort spent in migrating the query execution and candidate generation subsystems. This is best illustrated by Tanon, Vrandecic, Schaffert, Steiner, and Pintscher (2016), who discuss the different hurdles, and their solutions, in migrating the knowledge from Freebase to Wikidata. Following a similar approach, Diefenbach, Tanon, Singh, and Maret (2017) migrate the SIMPLEQUESTIONS data set to Wikidata, yielding 21,957 answerable questions over the KG. Azmy, Shi, Lin, and Ilyas (2018) migrate it to DBpedia (October 2016 release), and provide 43,086 answerable questions.
Correspondingly, the interoperability of KGQA systems is another challenge, which only recently has drawn some attention in the community (Abbas, Malik, Rashid, & Zafar, 2016). The twofold challenge of (a) learning to identify multiple KGs' artifacts mentioned in a given NLQ, and (b) learning multiple parse structures corresponding to the multiple KGs' data models, while very difficult (Ringler & Paulheim, 2017), is partially helped by the latest (upcoming) DBpedia release, 20 whose data model is compatible with that of Wikidata. For an in-depth discussion of knowledge modeling strategies, and a comparison of major large-scale open KGs, we refer interested readers to Ismayilov, Kontokostas, Auer, Lehmann, and Hellmann (2018), Ringler and Paulheim (2017), and Färber, Bartscherer, Menne, and Rettinger (2018).
Multilinguality: Multilinguality, that is, the ability to understand and answer questions in multiple languages, is pivotal for the widespread acceptance and use of KGQA systems. Large parts of common knowledge graphs, including DBpedia and Wikidata, have multilingual surface forms corresponding to their resources, with varying coverage, addressing a major challenge in enabling multilinguality in KGQA systems.
The QALD challenge, currently in its ninth iteration, maintains multilingual question answering as one of its tasks. The data set accompanying the QALD-9 21 multilingual QA over DBpedia task contains over 250 questions in up to 8 languages, including English, Spanish, German, Italian, French, Dutch, Romanian, Hindi, and Farsi. Diefenbach, Both, Singh, and Maret (2018) propose a nonneural generate-and-rank approach with minimal language-dependent components, which can be replaced to support new languages. Radoev et al. (2018) propose using a set of multilingual lexico-syntactic patterns to understand the intent of both French and English questions. However, we have yet to see a surge of multilinguality in data-driven KGQA approaches. Since these approaches rely on supervised data to learn mappings between KG artifacts and question tokens, the lack of large-scale, multilingual KGQA data sets inhibits them.

| CONCLUDING REMARKS
Answering questions over knowledge graphs has emerged as a multidisciplinary field of research, inviting insights and solutions from the semantic web, machine learning, and natural language understanding communities. In this article, we provide an overview of neural network-based approaches for the task.
We broadly group existing approaches into three categories, namely, (a) classification-based approaches, where neural models are used to predict one of a fixed set of classes, given a question; (b) ranking-based approaches, where neural networks are used to compare different candidate logical forms with the question, to select the best-ranked one; and (c) translation-based approaches, where the network learns to translate NLQs into their corresponding (executable) logical forms. Along with an overview of existing approaches, we also discuss some techniques used to weakly supervise the training of these models, which cope with the challenges arising from a lack of logical forms in the training data. We summarize existing data sets and tasks commonly used to benchmark these approaches, and note that the progress in the field has led to performance saturation over existing data sets such as SIMPLEQUESTIONS, leading to an emergence of newer, more difficult challenges, as well as more powerful mechanisms to address these challenges.
Toward the end of the article, we discuss some of the emerging trends, and existing gaps in the KGQA research field, concluding that investigations and innovations in interoperability, multilinguality, and robustness of these approaches are required for impactful application of these systems.

18 Concretely, the core chain can be described as a series of conjunctions in λ-DCS, where the peripheral paths of the tree are combined using intersection operators.
19 We use the notation q 0…T to denote (q 0 , …, q T ), the sequence of symbols q i , for i ∈ [0, T].
20 http://downloads.dbpedia.org/repo/lts/wikidata/.
21 https://project-hobbit.eu/challenges/qald-9-challenge/#tasks.