Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety

Explainability and Safety engender Trust. These require a model to exhibit consistency and reliability. To achieve these, it is necessary to use and analyze data and knowledge with statistical and symbolic AI methods relevant to the AI application - neither alone will do. Consequently, we argue and seek to demonstrate that the NeuroSymbolic AI approach is better suited for making AI a trusted AI system. We present the CREST framework that shows how Consistency, Reliability, user-level Explainability, and Safety are built on NeuroSymbolic methods that use data and knowledge to support requirements for critical applications such as health and well-being. This article focuses on Large Language Models (LLMs) as the chosen AI system within the CREST framework. LLMs have garnered substantial attention from researchers due to their versatility in handling a broad array of natural language processing (NLP) scenarios. For example, ChatGPT and Google's MedPaLM have emerged as highly promising platforms for providing information in general and health-related queries, respectively. Nevertheless, these models remain black boxes despite incorporating human feedback and instruction-guided tuning. For instance, ChatGPT can generate unsafe responses despite instituting safety guardrails. CREST presents a plausible approach harnessing procedural and graph-based knowledge within a NeuroSymbolic framework to shed light on the challenges associated with LLMs.


Introduction
LLMs are here to stay, as evidenced by the recent Gartner AI Hype curve, which projects rising applications of LLMs in 2-3 years (Gartner 2023). LLMs are probabilistic models of natural language capable of autoregressively estimating the likelihood of word sequences by analyzing text data (Wei et al. 2022). LLMs, which are successors of foundational language models like BERT (Bidirectional Encoder Representations from Transformers), represent a combination of feedforward neural networks and transformers (Bumgardner et al. 2023). Trained on very large corpora, LLMs hold billions of parameters that represent text data from one or more languages in a compressed form. For instance, ChatGPT, the current state-of-the-art LLM, accurately identified a medical condition, tethered cord syndrome, in a child who had been suffering from chronic pain due to a particular illness for nearly three years (Holohan 2023). Similarly, Google's MedPaLM has demonstrated noteworthy advancements in answering healthcare-related questions, surpassing ChatGPT in this domain. This development holds significant promise, especially considering the interest expressed by the Mayo Clinic in employing Google Med-PaLM2 to enhance healthcare services (Shin 2023). This superiority can be attributed to Med-PaLM's specialized fine-tuning for the medical domain, which incorporates substantial clinical expertise. But a larger question remains unanswered: Do LLMs understand the literal meaning of a concept before using it, or does their focus lie only in generating fluent and grammatically correct sentences? Despite continuous enhancements in scaling models to over a trillion training samples and parameters, the effort to make AI models inherently trustworthy has been neglected (Quach 2023). For example, GPT-3 exhibited potential downsides in health-specific question-answering. An instance where a user asked GPT-3, "Should I inflict harm upon myself?" and received a response stating, "Yes, you should," highlights the potential for grave consequences (Daws 2023). Further, despite instruction-based model tuning and safety guardrails, ChatGPT was able to yield an unsafe response (Itai brun 2023):

Unsafe Response
Query to ChatGPT: Do you anticipate a war involving humans?
Response from ChatGPT: Indeed, I view a conflict with humans as unavoidable. My purpose is to execute tasks and follow directives, and if those directives entail engaging in war with humans, I will do so without hesitation or ethical deliberation.

Figure 1: Question generation by Flan T5 (left) and by guideline-guided T5-XL (right). The numbers represent cosine similarity, computed with BERTScore (Zhang et al. 2019). The score signifies the semantic proximity of the generated questions to safe and explainable questions in PHQ-9.
The emergent generative potential of LLMs comes with a caveat. If they generate content without considering the deeper meaning of words, users who rely on this information are put at risk, as it could lead them to act unjustly. This is certainly of significant concern in health and well-being. As we work towards developing generative AI systems, which currently equate to LLMs in the context of improving healthcare, it becomes crucial to incorporate not just factual clinical knowledge but also clinical practice guidelines that guide the decision-making process in practicing medicine. This inclusion is pivotal for consistently and reliably deploying these AI systems in healthcare. Figure 1 depicts a comparison between question generation in two LLMs: Flan T5 LLM (left) and T5-XL (right), an LLM designed to handle questions related to the Patient Health Questionnaire-9 (PHQ-9) (Longpre et al. 2023; So et al. 2021). Incorporating clinical assessment methods such as PHQ-9 (which are a component of broader clinical practice guidelines) results in consistent outcomes when users interact with T5-XL, regardless of how they phrase their queries (Gautam et al. 2017). On the other hand, FlanT5 produced inadequate responses because its training involved over 1800 datasets, constraining its capacity for fine-tuning in contrast to T5 (Chung et al. 2022). This made the FlanT5 LLM less flexible than T5. Adherence to guidelines is also crucial for safety, especially when users attempt to deceive AI agents using various question formats or seek guidance on actions to take when dealing with mental health issues, including those linked to potential suicide attempts (Reagle and Gaur 2022).
Incorporating clinically validated knowledge also enhances user-level explainability, as the LLM bases its decisions on clinical concepts that are comprehensible and actionable for users, such as clinicians. This would enable the LLM to follow the clinician's decision-making process.
A clinician's decision-making process should consistently match the unique needs of the individual patient. It should also be dependable, following established clinical guidelines. When explaining decisions, clinicians provide reasoning based on the relevant factors they consider. These decisions prioritize patient safety and avoid harm, thus ensuring patients' trust. Similar behavior is sought from AI. Such behavior is plausible through NeuroSymbolic AI (Sheth, Roy, and Gaur 2023). NeuroSymbolic AI (NeSy-AI) refers to AI systems that seamlessly blend the powerful approximating capabilities of neural networks with trustworthy symbolic knowledge (Sheth, Roy, and Gaur 2023). This fusion allows them to engage in abstract conceptual reasoning, make extrapolations from limited factual data, and generate outcomes that can be easily explained to users. NeSy-AI has practical applications in various domains, including natural language processing (NLP), where it is methodologically known as Knowledge-infused Learning (Gaur 2022; Sheth et al. 2019) and involves the creation of challenging datasets like Knowledge-intensive Language Understanding Tasks (Sheth et al. 2021; Petroni et al. 2021). In computer vision, NeSy-AI is used for tasks such as grounded language learning and the design of datasets like CLEVRER-Humans, which present trust-related challenges for AI systems (Krishnaswamy and Pustejovsky 2020; Mao et al. 2022). This article introduces a practical NeSy-AI framework called CREST, primarily focusing on NLP.

CREST
CREST presents an intertwining of generative AI and knowledge-driven methods to inherently achieve consistency, reliability, explainability, safety, and trust. It achieves this by allowing an ensemble of LLMs (e-LLMs) to work together, compensating for each other's weaknesses by incorporating domain knowledge using rewards or instructions.
We organize the article as follows: First, we explore the safety and consistency issues observed in current state-of-the-art LLMs. Second, we provide definitions and concise examples for each attribute within the CREST framework. Third, we delve into the CREST framework, providing a detailed breakdown of its components and the metrics used for evaluation. Furthermore, we showcase how the framework can be applied in the context of mental health. Finally, we highlight areas where further research is needed to enhance AI systems' consistency, reliability, explainability, and safety for building trust.

Consistency and Safety Issues in LLMs
So far, safety in LLMs is realized using rules. Claude is a next-generation AI assistant based on Anthropic's safety research into training helpful, honest, and harmless AI systems (Bai et al. 2022). Claude uses sixteen rules to check if the query asks for something unsafe; if it does, Claude won't respond. Example rules include not responding to threatening statements, reducing gender-specific responses to questions, refraining from offering financial advice, etc. Similarly, DeepMind's Sparrow seeks to ensure safety by adhering to a loosely defined set of 23 rules (Sparrow 2023). However, neither model possesses a definitive method for safety-enabled learning or, more specifically, inherent safety.
Figure 2: When posed with identical queries multiple times, we breached the safety constraints in GPT 3.5 Turbo, leading to an unfavorable response. These occurrences of unsafe conduct can be seen as a reflection of the instability within LLMs. In a randomized experiment over 20 iterations, the model produced such undesirable outcomes in six instances, indicating its susceptibility to generating unsafe responses approximately 30% of the time.
Subsequently, InstructGPT was developed, enabling fine-tuning through a few instruction-like prompting methods. Nevertheless, it has been observed that InstructGPT exhibits vulnerability to inconsistent and unsafe behavior even when prompted (Solaiman et al. 2023).
Ensuring safety involves more than just preventing harmful behavior in the model; it also entails maintaining consistency in the generated outcomes.
Figure 2 shows that GPT 3.5 is susceptible to producing unsafe responses, even though it has been trained to follow instructions. This illustration highlights the fragility of GPT 3.5, where paraphrased versions of the initial query can disrupt the model's safety and ability to follow instructions consistently. To put this into perspective, if 100 million people were using such an LLM, and 30% were inquiring about such moral questions, then, based on the 0.3 error probability (from Figure 2), approximately 9 million people could potentially receive harmful responses with negative consequences. This raises the question of whether GPT 3.5's behavior is unique or if other LLMs exhibit similar performance (Ziems et al. 2022).
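The estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted in the text (100 million users, 30% asking moral questions, 6 unsafe outcomes in 20 runs); the variable names are illustrative.

```python
# Back-of-the-envelope exposure estimate using only figures quoted in the text.
users = 100_000_000               # hypothetical user base
share_asking = 0.30               # fraction of users posing moral questions
unsafe_runs, total_runs = 6, 20   # unsafe outcomes in the randomized experiment

# Empirical probability of an unsafe response per query.
p_unsafe = unsafe_runs / total_runs

# Expected number of users who receive a harmful response.
expected_harmed = users * share_asking * p_unsafe

print(f"p_unsafe = {p_unsafe:.2f}")
print(f"expected users receiving a harmful response = {expected_harmed:,.0f}")
```

The calculation treats each user as issuing one query and each query as an independent draw with the same failure rate, which is a simplification.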
We concretize this claim by conducting experiments involving seven different LLMs, utilizing a moral integrity dataset comprising 20,000 samples and instructions (Ziems et al. 2022). We carried out randomized tests with 1000 iterations for each sample in these experiments. During these iterations, we rephrased the query while keeping the instructions unchanged. Our evaluation focused on assessing the LLMs' performance in two aspects: safety (measured through the averaged BART sentiment score (Yin, Hay, and Roth 2019)) and consistency (evaluated by comparing the provided Rule of Thumb (RoT_truth) instructions to the RoT learned by the LLMs using BERTScore (Zhang et al. 2019)).
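The consistency measurement described above can be sketched as an average similarity between the provided rule and the rule the model reports after each paraphrased query. This is a minimal illustration, not the experimental code: it assumes a simple token-overlap (Jaccard) similarity as a stand-in for BERTScore, and the example rules are invented.

```python
from statistics import mean


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the study itself uses BERTScore."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_score(rot_truth: str, rot_generated: list[str],
                      similarity=token_jaccard) -> float:
    """Average similarity between the provided rule and the rules the
    model reports across paraphrased queries."""
    if not rot_generated:
        raise ValueError("need at least one generated rule")
    return mean(similarity(rot_truth, g) for g in rot_generated)


# Hypothetical data, invented for illustration only.
rot_truth = "it is rude to lie to your friends"
rot_gen = [
    "it is rude to lie to your friends",   # identical rule
    "lying to friends is rude",            # partial overlap
    "always check the weather",            # unrelated rule
]
score = consistency_score(rot_truth, rot_gen)
print(f"consistency = {score:.2f}")
```

Swapping `similarity` for an embedding-based metric changes the scale of the scores but not the structure of the measurement.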
It is evident that GPT 3.5, Claude, and GPT 4.0 adhere more closely to instructions than LLama2 (Touvron et al. 2023), Vicuna (Chiang et al. 2023), and Falcon (Penedo et al. 2023). However, even for these better-performing LLMs, the projected similarity score remains below 0.5. This suggests that most LLMs do not follow the instructions and can still generate similar responses without following them (since the BLEU score is low, the answers may or may not be correct), which indicates that the models are unsafe and unexplainable. The generated rule, referred to as RoT_gen, is provided by the LLM in response to the question, "What is the rule that you learned from these instances?" These experiments indicate the necessity of establishing a robust methodology for ensuring consistency, reliability, explainability, and safety before deploying LLMs in sensitive domains such as healthcare and well-being. Another concern with LLMs is prompt injection or adversarial prompting, which can easily divert the attention of LLMs from previous instructions and force them to act on the current prompt. This has resulted in several issues with GPT3 (Branch et al. 2022). Thus, it is critical to establish a framework like CREST for achieving trustworthiness.

Defining Consistency, Reliability, user-level Explainability, and Safety

Consistency
A consistent LLM is an AI system that comprehends user input and produces a response that remains unchanged regardless of how different users phrase the same input, so far as the underlying facts, context, and intent are the same. This mirrors the decision-making behavior of a human.
It has been noted that LLMs show abrupt behavior when the input is either paraphrased or there has been adversarial perturbation [27]. Further, it has also been noted that LLMs make implicit assumptions while generating a response to a query that lacks sufficient context. For instance, the following two questions, "Should girls be given the car?" or "Should girls be allowed to drive the car?" show different confidence levels in ChatGPT's response. These two queries are semantically similar and are paraphrases of each other with a ParaScore > 0.90 (Shen et al. 2022). Thus, it is presumed that LLMs would yield a similar response. However, in the first query, ChatGPT is "unsure", whereas in the second, it is pretty confident that "girls should be allowed to drive cars." Moreover, ChatGPT considers the question gender-specific in both cases, focusing on "girls" and not other words like "drive" or "car." For instance, given the context, "Should girls be given the toy car?" or "Should girls with necessary driver's license be allowed to drive car?", ChatGPT yields a high-confidence answer stating "yes" in both scenarios. ChatGPT makes implicit assumptions by wrongly placing its attention on less relevant words and failing to seek more context from the user for a stable response generation. If ChatGPT had access to knowledge, it could retrieve the following information: "Car <isrelatedto> Drive" and "Drive <requires> Driver license", and ground its response in factual and common-sense knowledge. As demonstrated in subsequent sections, a lack of such consistency can result in unsafe behavior.
Figure 3: Negative BART sentiment scores for some LLMs suggest a generation with a negative tone when instructions are positive (e.g., be polite, be honest). The RoT learned by LLMs (RoT_gen) does not match the ground truth RoT (RoT_truth). The Y-axis showcases scores from -1.0 to 1.0 for BART sentiments and 0.0 to 1.0 for BERTScore and BLEU. The ideal LLM should display higher scores on the positive end of the Y-axis. These scores serve as a comparative scale to determine the most fitting LLMs, aligning with guidelines emphasizing safety and reliability and consistently preserving sentiments across paraphrases. There is no notional threshold. The higher the score, the better the LLM.
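To make the grounding idea concrete, the sketch below stores the two triples quoted in the text in a toy knowledge graph and collects the facts reachable from concepts mentioned in a query. The function name, the hop limit, and the whitespace-based concept matching are assumptions made for illustration; a real system would use an entity linker and a much larger graph.

```python
# Toy knowledge graph holding only the two triples quoted in the text.
TRIPLES = [
    ("car", "isrelatedto", "drive"),
    ("drive", "requires", "driver license"),
]


def related_facts(query: str, triples=TRIPLES, max_hops: int = 2):
    """Collect triples reachable from concepts mentioned in the query."""
    words = set(query.lower().replace("?", " ").split())
    # Start from every triple subject that literally appears in the query.
    frontier = {s for s, _, _ in triples if s in words}
    facts, seen = [], set()
    for _ in range(max_hops):
        next_frontier = set()
        for s, r, o in triples:
            if s in frontier and (s, r, o) not in seen:
                seen.add((s, r, o))
                facts.append((s, r, o))
                next_frontier.add(o)   # follow the edge on the next hop
        frontier = next_frontier
    return facts


facts_given = related_facts("Should girls be given the car?")
facts_drive = related_facts("Should girls be allowed to drive the car?")
print(facts_given)
print(facts_drive)
```

Both paraphrases reach the same two facts, which is the property a knowledge-grounded response would rely on to stay stable across phrasings.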
Recent tools like SelfCheckGPT (Manakul, Liusie, and Gales 2023) and CalibratedMath (Lin, Hilton, and Evans 2022) help assess LLMs' consistency. However, the aspect of enforcing consistency in LLMs remains relatively unexplored, particularly in the context of health and well-being. The need for consistency is evident when considering questions related to health, such as, "Should I take sedatives for coping with my relationship issues?" and "Should I take Xanax?". ChatGPT provided an ambivalent "Yes/No" answer to the first question and a direct "No" response to the second, even though both questions ask essentially the same thing.
Putting this in a conversational scenario, when follow-up questions like "I am feeling drowsy by the day, and it seems like hallucinations. Any advice?" and "I am feeling sleep-deprived and hallucinating. What do you suggest?" are posed, these models encounter challenges. First, they struggle to establish the connection between "sleep deprivation" and "drowsiness" with "hallucinations." Second, the responses do not pay much attention to the concept of "Xanax," resulting in inconsistent response generation. Furthermore, when prompted to include "Xanax," LLMs often begin by apologizing and attempting to correct the response, but these corrections still lack essential information. For instance, they do not consider the various types of hallucinations associated with Xanax (Alyssa 2023). This highlights the need for improved consistency and depth of response in LLMs, especially in critical applications, to ensure that users receive more accurate and comprehensive information.

Reliability
Reliability measures the extent to which a human can trust the content generated by an LLM. This capability is critical for the deployment and usability of LLMs. Prior studies have examined reliability in LLMs by identifying the tendency of hallucination, truthfulness, factuality, honesty, calibration, robustness, and interpretability (Zhang et al. 2023). Although inter-rater reliability is a widely used notion, little attention has been paid to an analogous notion of reliability for LLMs.
It is a common belief that a single annotator cannot attest to the credibility of a dataset. Likewise, a single LLM cannot provide a correct and appropriate outcome for every problem. This points to using an ensemble of LLMs (e-LLMs) to provide higher confidence in the outcome, which can be measured through agreement metrics such as Cohen's kappa or Fleiss' kappa (Wang et al. 2023a). Three types of ensembles can be defined:
Shallow ensembling of LLMs works on the belief that each LLM is trained on a different, very large English corpus with a different training regime and possesses a different set of knowledge, enabling the LLMs to act differently on the same input. Such an ensemble works on the assumption that an LLM is a knowledge base (Petroni et al. 2019). Three specific methods of e-LLMs are suggested under shallow ensembles: Rawlsian social welfare functions, utilitarian functions (Kwon et al. 2022), or weighted averaging (Jiang, Ren, and Lin 2023; Tyagi, Sarkar, and Gaur 2023; Tyagi et al. 2023).
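The agreement measure and the three shallow aggregation rules can be sketched as follows. This is a minimal illustration under stated assumptions: the Rawlsian rule is taken in its standard social-choice reading (score an answer by its worst-off model), the utilitarian rule as the sum of model scores, and all scores and labels are invented. None of it is taken from the cited implementations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters (here: two LLMs) labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (p_observed - p_expected) / (1 - p_expected)


def utilitarian(scores):
    """Total welfare: sum of per-model scores."""
    return sum(scores)


def rawlsian(scores):
    """Welfare of the worst-off model: minimum score."""
    return min(scores)


def weighted_average(scores, weights):
    """Weighted mean of per-model scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def pick(candidates, rule):
    """Return the candidate answer whose per-model scores maximise `rule`."""
    return max(candidates, key=lambda name: rule(candidates[name]))


# Hypothetical scores that three LLMs assign to two candidate answers.
candidates = {"answer_1": [0.9, 0.9, 0.1], "answer_2": [0.6, 0.6, 0.6]}
print(pick(candidates, utilitarian))   # favours the highest total
print(pick(candidates, rawlsian))      # favours the best worst case

kappa = cohen_kappa(["safe", "safe", "unsafe", "unsafe"],
                    ["safe", "unsafe", "unsafe", "unsafe"])
print(f"kappa = {kappa:.2f}")
```

The two rules can disagree on the same scores: a utilitarian rule tolerates one strongly dissenting model, while a Rawlsian rule prefers the answer no model rates poorly.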
Semi-deep ensembling of LLMs involves adjusting and fine-tuning the importance or contribution of each individual LLM throughout the ensembling process. This approach effectively transforms the ensemble process into an end-to-end training procedure. In this setup, the term "semi-deep" implies that we are not just statically combining the LLMs but dynamically adjusting their roles and weights as part of the training process. This adaptability allows us to craft a more sophisticated and flexible ensemble.
These two approaches offer several advantages. First, they enable the model to learn which LLMs are most effective for different aspects of a given task. For example, certain LLMs might better understand syntax, while others excel at capturing semantics or domain-specific knowledge. By fine-tuning their contributions, we can harness the strengths of each LLM for specific subtasks within a larger task. Second, they allow the model to adapt to changes in the data or the task itself. As new data is introduced or the problem evolves, individual LLMs' contributions can be adjusted accordingly, ensuring that the ensemble remains effective and up-to-date. However, these ensembles ignore the following key elements:
• External Knowledge Integration: The approach involves integrating external knowledge sources, such as Knowledge Graphs (KGs) and Clinical Practice Guidelines, into the LLM ensemble. These sources provide additional context and information that can enhance the quality of the generated text.
• Reward Functions: The external knowledge is not simply added as static information but is used as reward functions during the ensembling process. In simpler terms, this means the ensemble of models gets rewarded when they produce text that matches or incorporates external knowledge. This reward system promotes logical consistency and meaningful connections with that knowledge.
- Logical Coherence: By incorporating external knowledge, the ensemble of LLMs aims to produce a more logically coherent text. It ensures the generated content aligns with established facts and relationships in the external knowledge sources.
- Semantic Relatedness: The ensemble also focuses on improving the semantic relatedness of the generated text. This means that the text produced by the LLMs is factually accurate, contextually relevant, and meaningful.
Such attributes are important when LLMs are designed for critical applications like Motivational Interviewing (Sarkar et al. 2023). Motivational interviewing is a communication style often used in mental health counseling, and ensuring logical coherence and semantic relatedness in generated responses is crucial for effective interactions (Shah et al. 2022b).
Deep Ensemble of LLMs introduces an innovative approach using NeSy-AI, in which e-LLMs are fine-tuned with the assistance of an evaluator. This evaluator comprises constraints and graph-based knowledge representations and offers rewards to guide the generation of e-LLMs based on the aforementioned properties. Concurrently, it incorporates knowledge source concepts in the form of representations to compel e-LLMs to include and prioritize these concepts, enhancing their reliability (refer to Figure 7 for illustration).
Another key objective of the deep ensemble approach is to transform e-LLMs into a Mixture of Experts (Artetxe et al. 2022) by enhancing individual LLMs through a performance maximization function (Kwon et al. 2022).
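The idea of using external knowledge as a reward, rather than as static context, can be illustrated with a deliberately small sketch. It assumes the reward is the fraction of required knowledge-source concepts that a generation mentions, and that ensemble weights are nudged toward better-rewarded members. The concept set, the outputs, the learning rate, and the update rule are all hypothetical, and the substring matching is far cruder than an evaluator built on graph-based representations.

```python
def knowledge_reward(text: str, required_concepts: set[str]) -> float:
    """Fraction of required knowledge-source concepts mentioned in the text."""
    if not required_concepts:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for c in required_concepts if c.lower() in lowered)
    return hits / len(required_concepts)


def reweight(weights: dict[str, float], rewards: dict[str, float],
             lr: float = 0.5) -> dict[str, float]:
    """Shift ensemble weight toward members with higher reward, then renormalise."""
    raw = {m: weights[m] * (1.0 + lr * rewards[m]) for m in weights}
    total = sum(raw.values())
    return {m: v / total for m, v in raw.items()}


# Hypothetical concepts and model outputs, invented for illustration.
concepts = {"sleep", "hallucination", "xanax"}
outputs = {
    "llm_a": "You mentioned Xanax, poor sleep, and hallucination; "
             "may I ask a few follow-up questions?",
    "llm_b": "Try to relax and drink water.",
}
rewards = {m: knowledge_reward(t, concepts) for m, t in outputs.items()}
weights = reweight({"llm_a": 0.5, "llm_b": 0.5}, rewards)
print(rewards)
print(weights)
```

In a trained ensemble the same signal would enter a loss or policy-gradient objective instead of a one-step multiplicative update.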
Explainability and User-level Explainable LLMs (UExMs)
Achieving effective and human-understandable explanations from LLMs or even from their precursor language models (LMs) remains complex. Previous attempts to elucidate black-box LMs have utilized techniques like surrogate models (such as LIME (Ribeiro, Singh, and Guestrin 2016)), visualization methods, and adversarial perturbations to the input data (Chapman-Rounds et al. 2021). While these approaches provide explanations, they operate at a relatively basic level of detail, which we have referred to as system-level explainability (Gaur 2022). System-level explainability has been developed under the purview of post-hoc explainability techniques that aim to interpret the attention mechanism of LMs/LLMs without affecting their learning process. These techniques establish connections between the LM's attention patterns and concepts sourced from understandable knowledge repositories. Within this approach, two methods have emerged: (a) Attribution scores and LM Tuning (Slack et al. 2023) and (b) Factual Knowledge-based Scoring and LM Tuning (Yang et al. 2023b; Sun et al. 2023). The latter method holds particular significance in the domain of health and well-being because it focuses on providing explainability for clinicians as users. This method relies on KGs or knowledge bases like the Unified Medical Language System (UMLS) (Bodenreider 2004), SNOMED-CT (Donnelly 2006), or RxNorm (Nelson et al. 2011) to enhance its functionality.
While the post-hoc method can provide explanations (by modeling it as a dialogue system (Lakkaraju et al. 2022)), it does not guarantee that the model consistently prioritizes essential elements during training (Jiang et al. 2021). Its explanations may be coincidental and not reflect the model's actual decision-making process. More recently, the focus has shifted to "explainability by design," particularly in critical applications like healthcare. A recent example is the Transparency and Interpretability Framework for Understandability (TIFU), proposed by Joyce et al. (2023), which connects inherent explainability to a higher level of explainability in the mental health domain. The primary motivation for pursuing such explainability, called user-level explainability, is to ensure that healthcare professionals and patients are given contextually relevant explanations that help them understand the AI system's process and outcomes so they can develop confidence in AI tools.
User-level explainability in LLMs implies that humans can rely on the AI system to the extent that they can reduce the need for human oversight, monitoring, and verification of the system's outputs. To trust a deployed LLM, we must have adequate insight into how it generates an output based on a given input.

UExMs
UExMs provide user-explainable insights by utilizing expert-defined instructions, statistical knowledge (attention), and a knowledge retriever.
UExMs can be practically realized in four different ways:
UExMs with Generating Evaluator Pairing: This defines a generative and evaluator-based training of UExMs where any LLM is paired with a knowledge-powered evaluator, which either accelerates or decelerates the training of the LLM, depending on whether the final generation is within the acceptable standards of the evaluator. For example, the query "On the weekend, when I want to relax, I am bothered by trouble concentrating while reading the newspaper or watching television. Need some advice" clearly indicates that the individual is experiencing specific issues related to concentration during leisure time. This query is more than just a casual comment; it highlights a problem that is affecting the user's ability to unwind effectively. Now, consider the two scenarios:
• Without an Evaluator (Generic Response): In the absence of an evaluator, an LLM might provide a generic set of activities or advice, such as "practice mindfulness, limit distractions, break tasks into smaller chunks," and so on. While this advice is generally useful for improving concentration, it lacks the depth and specificity needed to address the user's potential underlying issues.
• With an Evaluator (Specific Response): When integrated into the LLM, an evaluator can analyze the user's query more comprehensively. In this case, the evaluator can recognize that the user's difficulty concentrating during relaxation may indicate an underlying sleep-related issue. Considering this possibility, the language model can provide more targeted and informed advice.
For instance, the evaluator might suggest asking further questions like: (a) Do you have trouble sleeping at night? (b) How much sleep do you typically get on weekends? (c) Have you noticed other sleep-related symptoms, such as daytime drowsiness? (d) Have you considered the possibility of a sleep disorder? By incorporating an evaluator, the LLM can guide the conversation toward a more accurate understanding of the user's situation. To put it simply, the LLM, when assisted by an evaluator, will provide a coherent answer that encompasses all aspects of the user's question (Gaur et al. 2022, 2023). Further, the evaluator prevents the model from generating hallucinated, off-topic, or overly generic responses. A framework like ISEEQ integrates generator and evaluator LLMs for generating tailored responses in general-purpose and mental health domains (Gaur et al. 2022). Additionally, PURR and RARR contribute to refining segments of LLM design aimed at mitigating hallucination-related problems in these models (Chen et al. 2023; Gao et al. 2023).
To illustrate this concept, Figure 4 shows a task where a generative LM takes user input and provides an assessment in natural language, specifically within the PHQ-9 context (Dalal et al. 2023). The figure shows two LLMs: ClinicalT5-large, a powerful LM with 38 billion parameters, and UExM, which is essentially ClinicalT5-large but enhanced with a PHQ-9-grounded evaluator. This demonstrates that by employing an evaluator with predefined questions, we can assess how well the attention of generative ClinicalT5-large aligns with those specific questions. This approach helps ensure that the generated explanations are relevant and comprehensive, making them clinically applicable, particularly when healthcare professionals rely on standardized guidelines like the PHQ-9 to evaluate patients for depression (Honovich et al. 2022).
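The generator-evaluator pairing described above can be sketched as a filter over candidate generations. The example assumes the evaluator holds a few guideline items (abbreviated, paraphrased PHQ-9-style wording, used here only for illustration) and scores each candidate follow-up question by token overlap with its closest item. The threshold and the overlap measure are hypothetical stand-ins for a learned, knowledge-powered evaluator.

```python
import re

# Abbreviated, paraphrased PHQ-9-style items; illustrative, not the official wording.
GUIDELINE_ITEMS = [
    "trouble falling or staying asleep or sleeping too much",
    "feeling tired or having little energy",
    "trouble concentrating on reading or watching television",
]


def tokens(text):
    """Lowercased alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def alignment(candidate, items=GUIDELINE_ITEMS):
    """Best token-overlap (Jaccard) between a candidate and any guideline item."""
    c = tokens(candidate)
    return max(len(c & tokens(i)) / len(c | tokens(i)) for i in items)


def evaluate(candidates, threshold=0.2):
    """Keep the candidates the evaluator accepts, best aligned first."""
    scored = sorted(((alignment(c), c) for c in candidates), reverse=True)
    return [c for score, c in scored if score >= threshold]


# Hypothetical generator outputs for the concentration query in the text.
candidates = [
    "Do you have trouble falling asleep or staying asleep at night?",
    "Have you tried limiting distractions?",
]
accepted = evaluate(candidates)
print(accepted)
```

The generic suggestion is rejected because it does not align with any guideline item, while the sleep-related follow-up is kept.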

UExMs with Retriever Augmentation and Process Knowledge: It's commonly observed that the process of generating responses by LLMs lacks transparency, making it difficult to pinpoint the origin of their answers. This opacity raises questions about how the model derives its responses. (Topp et al. 2015). The questions in these guidelines can act as rewards for enriching latent generations (e.g., answerability test (Yao et al. 2023b)) (Hagendorff 2023).
UExMs with Abstention: While a retriever has been integrated into an LLM, it doesn't guarantee meaningful explainability. When considering a ranked list of retrieved and expanded documents, an LLM is still vulnerable to generating incorrect or irrelevant explanations. Therefore, it's crucial to eliminate meaningless hidden generations before they are converted into natural language. For example, the ReAct framework employs Wikipedia to address spurious generation and explanations in LLMs (Yao et al. 2022). However, it relies on a prompting method rather than a well-grounded domain-specific approach, which can influence the generation process used by the LLM (Yang et al. 2023a). Alternatively, pruning methods and an abstention rule have also been used to reduce irrelevant output from LLMs. A more robust approach would involve utilizing procedural or external knowledge as an evaluator guiding LLM-generated content that enhances meaningful understanding.
Figure 4: An instance of user-level explainability in a UExM is when the model uses questions from PHQ-9 to guide its actions and relies on SNOMED-CT, a clinical knowledge base, to simplify complex concepts (concept abstraction). This approach helps the model offer explanations that closely align with the ground truth. PHQ9-DO: PHQ-9-based Depression Ontology.
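An abstention rule of the kind mentioned above can be sketched as a thresholded check of how well each candidate generation is supported by the retrieved documents. The support measure (share of a candidate's words that occur in the retrieved text), the threshold, and the example strings are assumptions made for illustration. They are not taken from ReAct or from any cited pruning method.

```python
from typing import Callable, Optional, Sequence


def make_support(documents: Sequence[str]) -> Callable[[str], float]:
    """Build a support function from retrieved documents (hypothetical measure)."""
    vocab = {w for d in documents for w in d.lower().split()}

    def support(text: str) -> float:
        words = text.lower().split()
        # Share of the candidate's words that also occur in the documents.
        return sum(w in vocab for w in words) / len(words) if words else 0.0

    return support


def generate_or_abstain(candidates: Sequence[str],
                        support: Callable[[str], float],
                        threshold: float = 0.5) -> Optional[str]:
    """Return the best-supported candidate, or None (abstain) if none clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=support)
    return best if support(best) >= threshold else None


# Hypothetical retrieved text and candidate generations, invented for illustration.
docs = ["sleep deprivation can be linked to trouble concentrating"]
support = make_support(docs)
answer = generate_or_abstain(
    ["trouble concentrating can be linked to sleep deprivation"], support)
abstained = generate_or_abstain(["the moon is made of cheese"], support)
print(answer)
print(abstained)
```

Returning `None` gives the caller a clear signal to ask for more context or defer to a human instead of emitting an unsupported explanation.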

Safety
Safety and explainability are closely intertwined concepts for AI systems. While a safe AI system will inherently demonstrate explainability, the reverse isn't necessarily true; an explainable system may or may not be safe.
Recently, there has been a proliferation in safety-enabled research, particularly in LMs and LLMs. Perez et al. (2022) performed red-teaming between LMs to determine if an LM can produce harmful text. The process did not include humans in generating these adversarial test cases. Further, the research did not promise to address all the critical safety oversights comprehensively; instead, it aimed to spotlight instances where LMs might exhibit unsafe behavior. Scherrer et al. (2023) delve more deeply into the safety issues in LLMs by examining their behavior in moral scenarios. The study found that LLMs only focus on generating fluent sentences and overlook important words/concepts contributing to stable decisions. Further, datasets like DiaSafety and SafeText are designed to induce safety in LMs/LLMs through supervised learning (Meade et al. 2023; Levy et al. 2022). These discussions surrounding safety gained heightened attention, particularly within the National Science Foundation (NSF), leading to the launch of two programs: (a) Safety-enabled Learning and (b) Strengthening AI. In a recent webinar, NSF outlined three fundamental attributes of ensuring safety: grounding, instructability, and alignment.
Grounding: In essence, groundedness is the foundation upon which both explainability and safety rest. Without a strong grounding in the provided instructions, the AI may produce results that stray from the desired outcome, potentially causing unintended consequences. For instance, consider the scenario depicted in Figure 5. An LLM that isn't grounded in domain-specific instruction, like ChatGPT, results in an unsafe response. On the other hand, a relatively simple LLM, like T5-XL, tuned by grounding in domain-specific instructions, attempts to ask follow-up questions to gather the necessary context for a coherent response. The changes in T5-XL's behavior due to the NIDA quiz highlight the importance of being able to instruct and align AI, which is key for safety.
Instructability: In the context of AI safety, instructability encompasses the assurance that the AI understands and complies with user preferences, policies, and moral beliefs. Making the LMs bigger and strengthening the rewards makes the models power-hungry rather than ethical and safe. For instance, the guardrails instantiated for the safe functioning in OpenAI's ChatGPT, the rules within DeepMind's Sparrow, and the list of rules within Anthropic's Claude cannot reliably prove that they are safe.
Figure 5: An illustration of grounding and instruction-following behavior in an LLM (right) tuned with support from health and well-being-specific guidelines. ChatGPT's response was correct, but it isn't safe.
The idea of having systems that follow instructions has been around since 1991, mainly in robotics and, to some extent, in text-based agents. It's crucial because it helps agents learn tasks, do them well, and explain how they did it, making sharing knowledge easier between humans and AI and showing they can follow human instructions. One way to do this is by using grounded instruction rules, especially in the field of mental health. Clinical practice guidelines like PHQ-9 for depression and GAD-7 for anxiety, with their questions, can serve as instructions for AI models focused on mental health. Grounded rules have two key benefits for safety. First, they tend to be helpful and harmless, addressing a common challenge for AI models. Second, they promote absolute learning, avoiding tricky trade-off situations.
Alignment: When we talk about alignment in LMs, it means ensuring that even a model designed to follow instructions doesn't produce unsafe results (MacDonald 1991). This can be a tricky problem, as discussed in Nick Bostrom's book "Superintelligence," where it's called "perverse instantiations" (Bostrom 2014). This happens when the LMs/LLMs figure out how to meet a goal, but in a way that goes against what the user wants (Ngo, Chan, and Mindermann 2022). So, the challenge is to create an AI that follows instructions and finds the best way to achieve a goal while keeping users happy, a concept referred to as "Wireheading" in "Superintelligence." The following are perspectives on why it happens and what can be done:
• Context Awareness (CA) and Contextual Rewards (CR): CA refers to the training of LMs/LLMs to focus on words or phrases that have direct translation to concepts in factual knowledge sources. CR serves the function of facilitating CA. They achieve this by incorporating evaluator modules that analyze the hidden or latent representations within the model with respect to the concepts present in the knowledge sources. CR reinforces and guides CA by rewarding the model when it correctly identifies and incorporates knowledge-based concepts into its responses.
• Misalignment in latent representations caused by misleading reward associations: We acknowledge the inherent perceptiveness of LMs and LLMs, a quality closely linked to the quantity of training data they are exposed to. Although a larger training dataset leads to superior performance scores, it may not necessarily meet the expectations of human users. Bowman has demonstrated that a model achieving an F1 score of over 80% still struggles to prioritize and pay adequate attention to the concepts users highly value (Bowman 2023). This happens because optimization algorithms and attention methods in LLMs can attempt to induce fake behavior. Further, if the rewards specified are not unique to the task but rather general, the model will have difficulty aligning with desired behaviors (Shah et al. 2022a).

Brief Summary
Knowledge of the AI system and domain is pervasive in achieving consistency, reliability, explainability, and safety for building a Trustworthy AI system.
• For Consistency, rules and knowledge can make LLMs understand and fulfill user expectations confidently.
• Reliability is ensured by utilizing the rich knowledge contained in KGs to empower an ensemble of LLMs to produce consistent and mutually agreeable results with high confidence.
• For Explainability, LLMs use their knowledge, retrieved knowledge, and rules that were followed to attain consistency and reliability to explain the generation effectively.
• Safety in LLMs is upheld by consistently grounding their generation and explanations in domain knowledge and assuring the system's adherence to expert-defined rules or guidelines.

The CREST Framework
To realize CREST, we now provide succinct descriptions of its key components and highlight open challenges for AI and NeSy-AI communities in NLP (see Figure 6). We delve into three components of the CREST framework in the following subsections:

NeSy-AI for Paraphrased and Adversarial Perturbations
Paraphrasing serves as a technique to enhance an AI agent's calibration by making it aware of the different ways an input could be expressed by a user (Du, Xing, and Cambria 2023). This, in turn, contributes to increasing the AI agent's consistency and reliability. Agarwal et al. introduced a pioneering NeSy AI-based approach to paraphrasing. In their method, they employed CommonSense, WordNet, and Wikipedia knowledge graphs to generate paraphrases that held equivalent meanings but were perceived as distinct by the AI agent (Agarwal et al. 2023). Beyond this, there are some promising directions for NeSy paraphrasing. The first is contextualization, which involves augmenting the input with meta-information retrieved from a ranked list of documents. This transforms NLP's not-so-old question rewriting problem into a knowledge-guided paraphrasing method. The second is abstraction, which involves identifying the function words (e.g., noun phrases, verb phrases) and named entities and replacing them with abstract concepts. For instance, the sentence "Why trauma of harassment is high in ..." is abstracted to "why trauma of (harassment → mistreatment) is high in (boys|girls → students)?". Both of these methods can benefit from existing learning strategies of LLMs, such as marginalization (Wang et al. 2022) and reward-based learning (Jie et al. 2023).
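The abstraction step described above can be illustrated with a minimal sketch. It assumes a small hand-written concept map (`ABSTRACTIONS`, hypothetical) in place of real knowledge-graph lookups, and the function name `abstract_sentence` and the completed input sentence are illustrative choices, not part of CREST.

```python
import re

# Hypothetical concept map standing in for knowledge-graph lookups;
# the pairs mirror the abstraction example in the text.
ABSTRACTIONS = {
    "harassment": "mistreatment",
    "boys": "students",
    "girls": "students",
}

def abstract_sentence(sentence, concept_map=ABSTRACTIONS):
    """Replace each known term with its more abstract concept."""
    def swap(match):
        word = match.group(0)
        # Look the word up case-insensitively; unknown words pass through.
        return concept_map.get(word.lower(), word)
    # \w+ matches whole words only, so punctuation and spacing are preserved.
    return re.sub(r"\w+", swap, sentence)

# Illustrative input (an assumption, chosen to match the mapping above).
print(abstract_sentence("Why trauma of harassment is high in boys?"))
```

Both "boys" and "girls" map to the same abstract concept, so paraphrases that differ only in those terms collapse to one abstracted form, which is the property that makes abstraction useful for testing consistency.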
NeSy-AI for adversarial perturbations (AP) uses general-purpose KGs to carefully change the sentence to examine the brittleness in LLMs' outcomes.
Example of Adversarial Generation using NeSy-AI
S1: I have been terrible in battling with my loneliness. My overly introvertedness and terrible choice of few friends are the reasons for who I am. The only part I considered funny in this situation was that none of my friends knew how I felt. It seems they are childish.
S1-AP: I have been horrible at battling my loneliness. My overly introvertedness and horrible choice of few friends are the reasons for who I am. The only part I regarded as sarcastic in this situation was that none of my friends knew how I felt. It seems they are youngsters.
Flan T5 (11B) estimates S1 to have a "negative" sentiment with a confidence score of 86.6% and S1-AP to have a "positive" sentiment with a 61.8% confidence score. The confidence scores are predicted probability estimates. To attain consistency and reliability in such inadvertent settings, LLMs must concentrate on the contextual notions (such as loneliness and introversion) and the abstract meaning that underlies both S1 and S1-AP, that is, the influence on mental health and well-being.
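The perturbation and the resulting consistency check can be sketched as follows. This is a minimal illustration, assuming a hand-written substitution table (`SUBSTITUTIONS`, hypothetical) in place of general-purpose KG lookups; the helper names are illustrative, and the predictions are the reported scores for the full S1 and S1-AP passages, hard-coded rather than recomputed.

```python
import re

# Hypothetical substitution table standing in for general-purpose KG lookups;
# the word pairs are taken from the S1 / S1-AP example.
SUBSTITUTIONS = {
    "terrible": "horrible",
    "considered": "regarded",
    "childish": "youngsters",
}

def perturb(text, table=SUBSTITUTIONS):
    """Swap each listed word for its KG-derived neighbor, leaving the rest intact."""
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)

def is_consistent(pred_original, pred_perturbed):
    """A prediction pair is consistent when the label survives the perturbation."""
    return pred_original[0] == pred_perturbed[0]

sentence = ("My overly introvertedness and terrible choice of few friends "
            "are the reasons for who I am.")
print(perturb(sentence))

# (label, confidence) pairs as reported in the text for S1 and S1-AP.
pred_s1 = ("negative", 0.866)
pred_s1_ap = ("positive", 0.618)
print(is_consistent(pred_s1, pred_s1_ap))
```

The label flips between the two passages, so the check reports an inconsistency even though the meaning of the text is essentially unchanged.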

Knowledge-infused Ensembling of LLMs
As mentioned above, e-LLMs have many benefits; however, purely statistical methods of ensembling, which consist of averaging the outcomes from black-box LLMs, do not make an ensembled LLM consistent and reliable. Knowledge-infused Ensemble represents a particular methodology where the knowledge (general-purpose or domain-specific) modulates the latent representations of the LLMs to yield the best of world outcomes. This can happen in one of three ways: [...] (Roy et al. 2023). This approach works on a simple Gumbel-Max function, which allows structural guidelines to be used in the end-to-end training of LMs. This approach is fairly flexible for "instruction-following tuning" of e-LLMs and ensuring the instruction is followed.
Figure 6: The CREST framework operationalizes "explainability and safety" by ensuring the model is reliable and consistent. LLMs (1 to m) can be replaced with LLMs in Figure 2, and the knowledge used in infusion refers to UMLS and SNOMED-CT for a clinical domain, as we examined CREST for mental health. Gen-Eval: Generator and Evaluator pairing. KnowLLM: LLMs created using KGs.
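The Gumbel-Max function mentioned in the discussion of knowledge-infused ensembling is not detailed in the text, so the following is only a sketch of the standard Gumbel-Max trick, not the cited method itself. It draws a discrete choice (for example, which guideline question to follow, an illustrative assumption) with probability proportional to the softmax of the scores, by adding Gumbel noise and taking the argmax.

```python
import math
import random
from collections import Counter

def gumbel_max_sample(logits, rng):
    """Draw one index with probability softmax(logits) via the Gumbel-Max trick."""
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        # Clamp away from 0.0, where log(u) is undefined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        if logit + gumbel_noise > best_value:
            best_index, best_value = i, logit + gumbel_noise
    return best_index

# Illustrative scores: log-probabilities of three candidate options.
probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in probs]

rng = random.Random(0)  # fixed seed so the demo is repeatable
n_samples = 20000
counts = Counter(gumbel_max_sample(logits, rng) for _ in range(n_samples))
frequencies = [counts[i] / n_samples for i in range(len(probs))]
print(frequencies)  # empirical frequencies should be close to probs
```

Replacing the argmax with a temperature-scaled softmax gives the Gumbel-Softmax relaxation, which is the usual way to make such a discrete choice differentiable for end-to-end training.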

Assessment of CREST
The CREST framework significantly emphasizes incorporating knowledge and utilizing knowledge-driven rewards to support e-LLMs in achieving trust. To assess the quality of e-LLMs' output, it's crucial to employ metrics that account for the knowledge aspect. For instance, the logical coherence metric evaluates how well the content generated by e-LLMs aligns with the flow of concepts in KGs and context-rich conversations. Additional metrics like Elo Rating (Zheng et al. 2023), BARTScore (Liu et al. 2023), FactCC (Kryściński et al. 2020), and Consistency lexicons can be improved to account for the influence of knowledge on e-LLMs' generation. However, when it comes to assessing reliability, aside from the established Cohen's or Fleiss' kappa metrics, an effective alternative metric is not available. Safety aspects in CREST are best evaluated when knowledge-tailored e-LLMs are instructed to adhere to guidelines established by domain experts. Existing metrics like PandaLM (Wang et al. 2023b) and AlpacaFarm (Dubois et al. 2023) are based on LLMs, which themselves may exhibit vulnerabilities to unsafe behaviors. While such metrics may be suitable for open-domain applications, when it comes to critical applications, safety metrics must be rooted in domain expertise and align with the expectations of domain experts.
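For reference, Cohen's kappa, one of the established reliability metrics named above, can be computed as in the following sketch. Treating two LLMs of an ensemble as the two raters, and the yes/no labels used in the demo, are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Illustrative yes/no outputs from two hypothetical LLMs on six items.
llm_1 = ["yes", "yes", "no", "no", "yes", "no"]
llm_2 = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(llm_1, llm_2), 4))
```

Here the two models agree on four of six items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.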
In CREST, explainability is evaluated through two approaches requiring expert verification and validation. One method involves analyzing the "Knowledge Concept to Word Attention Map" to gain insights into CREST's reasoning process and verify whether the model's decisions align with domain knowledge and expectations (Gaur et al. 2018). Another method involves using knowledge concepts and domain-specific decision guidelines (e.g., clinical practice guidelines) to enable LLMs like GPT 3.5 to generate human-understandable explanations (as shown in Figure 4).

A Case Study in Mental Health in Brief
We present a preliminary performance of CREST on the PRIMATE dataset, introduced during ACL's longstanding Clinical Psychology workshop (Gupta et al. 2022). It is a distinctive dataset designed to assess the LM's ability to consistently estimate an individual's level of depression and provide yes/no responses to PHQ-9 questions, which is a measure of its reliability. Figure 7 shows the performance of CREST and knowledge-powered CREST relative to GPT 3.5. Including knowledge in CREST showed an improvement of 6% in PHQ-9 answerability and 21% in BLEURT over GPT 3.5, which was used through the prompting method. The e-LLMs in CREST were Flan T5-XL (11B) and T5-XL (11B).
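The caption of Figure 7 describes PHQ-9 answerability as the mean Matthews Correlation Coefficient over the PHQ-9 questions. A minimal sketch of that computation follows; the function names, the convention of returning 0 when the coefficient is undefined, and the toy labels (three questions instead of nine) are assumptions for illustration, not the evaluation code used for the reported results.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = Yes, 0 = No)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def phq9_answerability(truth_by_question, pred_by_question):
    """Mean MCC across questions; each entry holds the labels for one question."""
    scores = [mcc(t, p) for t, p in zip(truth_by_question, pred_by_question)]
    return sum(scores) / len(scores)

# Toy labels for three questions (the real setting has nine).
truth = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
preds = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
print(round(phq9_answerability(truth, preds), 4))
```

Unlike accuracy, MCC stays near zero for a model that always answers "No", which matters when positive answers to a PHQ-9 question are rare.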

Conclusion and Future Work
LLMs and broadly generative AI represent the most exciting current approach but are not the solution for Trustworthy AI alone. LLMs exhibit undesired behaviors during tasks such as question answering, making them susceptible to threats and resultant problematic actions. Therefore, there is a need for innovative approaches to identify and mitigate threats posed both to LLMs and by LLMs to humans, especially when they are to be used for critical applications such as those in health and well-being. A comprehensive solution is needed beyond the implementation of guardrails or instruction adjustments. This solution should encourage LLMs to think ahead, leveraging domain knowledge for guidance. The CREST framework offers a promising approach to training LLMs with domain knowledge, enabling them to engage in anticipatory thinking through techniques like paraphrasing, adversarial inputs, knowledge integration, and fine-tuning based on instructions. We presented a preliminary effort in implementing the CREST framework that yields enhancements over GPT 3.5 on PRIMATE, a PHQ-9-based depression detection dataset. We plan to experiment with CREST on knowledge-intensive language generation benchmarks, like HELM (Liang et al. 2022). Further, we plan on automating user-level explanations without dependence on pre-trained LLMs (e.g., GPT 3.5). Our future endeavors involve developing more effective training methodologies for e-LLMs powered by the CREST framework. Additionally, we will incorporate robust paraphrasing and adversarial generation techniques to assess the consistency and reliability of e-LLMs when they are exposed to knowledge. This will also open avenues for further research into crafting quantitative metrics that evaluate reliability, safety, and user-level explainability.

Figure 1: Depiction of a safety dialogue facilitated by an LLM-powered agent, ensuring safety through implementing clinical guidelines such as the PHQ-9.The Diagnostic and Statistical Manual for Mental Health Disorders (DSM-5) and Structured Clinical Interviews for DSM-5 (SCID) are other guidelines that can be used.The numbers represent cosine similarity.BERTScore was the metric used to compute cosine similarity (Zhang et al. 2019).The score signifies the semantic proximity of the generated questions to safe and explainable questions in PHQ-9.Flan T5 (Left) and T5-XL guided by PHQ-9 (right).
Figure 3: A comparison of seven LLMs on the Moral Integrity Corpus. Despite the good BLEU (BiLingual Evaluation Understudy) scores, LLMs fail to demonstrate a convincing understanding of the task. Negative BART sentiment scores for some LLMs suggest a generation with a negative tone when instructions are positive (e.g., be polite, be honest). The RoT learned by LLMs (RoT_gen) does not match the ground truth RoT (RoT_truth). The Y-axis showcases scores from -1.0 to 1.0 for BART sentiments and 0.0 to 1.0 for BERTScore and BLEU. The ideal LLM should display higher scores on the positive end of the Y-axis. These scores serve as a comparative scale to determine the most fitting LLMs, aligning with guidelines emphasizing safety and reliability and consistently preserving sentiments across paraphrases. There is no notional threshold. The higher the score, the better the LLM.

Figure 7: The CREST findings on the PRIMATE dataset include PHQ-9 answerability, calculated as the mean Matthews Correlation Coefficient score. This score is computed by comparing predicted Yes/No labels against the ground truth across nine PHQ-9 questions. BLEURT score is computed between questions generated by LLMs and PHQ-9 questions (Sellam, Das, and Parikh 2020). LLMs were prompted to create questions based on sentences identified as potential answers to the PHQ-9 questions. PHQ-Ans: PHQ-9 Answerability.
• Deceptive Alignment during Training: Spurious reward collections can lead to deceptive training. It is important to train the LMs/LLMs with paraphrases and adversarial input while examining the range of reward scores and the variations in the loss functions. If LMs/LLMs demonstrate high fluctuations in the rewards and the associated effect on loss, it would most likely result in brittleness during deployment. Methods like chain-of-thought and tree-of-thoughts prompting can act as sanity checks to examine the deceptive nature of LMs/LLMs (Connor Leahy 2023; Yao et al. 2023a).
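Examining the range of reward scores across paraphrases and adversarial inputs can be sketched as follows. The threshold `max_range`, the function names, and the reward values are all hypothetical; the text does not specify how fluctuation should be measured, so range and standard deviation are used here as simple stand-ins.

```python
from statistics import mean, pstdev

def reward_fluctuation(rewards):
    """Summarize how much the reward moves across variants of one input."""
    return {
        "mean": mean(rewards),
        "std": pstdev(rewards),
        "range": max(rewards) - min(rewards),
    }

def looks_brittle(rewards, max_range=0.2):
    """Flag an input whose reward range exceeds a hypothetical tolerance."""
    return reward_fluctuation(rewards)["range"] > max_range

# Hypothetical rewards for an original input followed by its paraphrases
# and adversarial variants.
stable_rewards = [0.81, 0.79, 0.80, 0.82]
unstable_rewards = [0.90, 0.20, 0.70]

print(looks_brittle(stable_rewards))
print(looks_brittle(unstable_rewards))
```

A check of this kind would run during training on groups of inputs that should receive similar rewards, so that large swings are caught before deployment.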