Automatic mining of symptom severity from psychiatric evaluation notes

Abstract

Objectives: As electronic mental health records become more widely available, several approaches have been suggested to automatically extract information from free-text narrative, aiming to support epidemiological research and clinical decision-making. In this paper, we explore the extraction of explicit mentions of symptom severity from initial psychiatric evaluation records. We use the data provided by the 2016 CEGS N-GRID NLP shared task Track 2, which contains 541 records manually annotated for symptom severity according to the Research Domain Criteria.

Methods: We designed and implemented three automatic methods: a knowledge-driven approach relying on local lexicalized rules based on common syntactic patterns in text suggesting positive valence symptoms; a machine-learning method using a neural network; and a hybrid approach combining the first two methods with a neural network.

Results: The results on an unseen evaluation set of 216 psychiatric evaluation records showed a performance of 80.1% for the rule-based method, 73.3% for the machine-learning approach, and 72.0% for the hybrid one.

Conclusions: Although more work is needed to improve the accuracy, the results are encouraging and indicate that automated text mining methods can be used to classify mental health symptom severity from free-text psychiatric notes to support epidemiological and clinical research.

and observational research (Jackson et al., 2017; Kovalchuk, Stewart, Broadbent, Hubbard, & Dobson, 2017; Perlis et al., 2012). Specifically, psychiatric EMRs contain rich knowledge regarding the mental health status of patients and important contextual information that is often in free text. Unlike other disciplines, free text is a key means to record information in mental healthcare, as there are few laboratory tests that can describe symptoms and their severity (unlike, e.g., measuring the blood pressure for hypertension). Even when specific instruments and tests (e.g., the mini mental state examination) are used, they are most often reported in free-text narrative. Mental healthcare therefore mainly relies on free-text descriptions of symptoms, which are then interpreted, inspected, and assessed by health professionals in order to understand the type and the severity of the disease. A key question that we explore in this paper is whether we can automatically process such notes to extract disease severity for a given patient.
Processing of healthcare narrative has been a focus of clinical text mining and natural language processing for over 30 years, with notable results in automated harvesting of important clinical concepts and events in many domains (Abbe et al., 2015; Doan, Collier, Xu, Duy, & Phuong, 2012; Friedman, Shagina, Lussier, & Hripcsak, 2004; Savova et al., 2010; Sohn, Kocher, Chute, & Savova, 2011; Spasić, Livsey, Keane, & Nenadić, 2014). The main challenge is that clinical narrative is often written in a distinct style, seldom conforming to standard grammar, frequently with spelling and typing errors as well as common abbreviations and acronyms whose meaning is often ambiguous depending on the context (Abbe et al., 2015; Dehghan, Keane, & Nenadic, 2013; Ford et al., 2016). Further challenges include the extensive use of negations to rule out clinical signs and references to subjects other than the actual patient (Eriksson, Jensen, Frankild, Jensen, & Brunak, 2013).
Text mining has also been applied to free-text data in the field of mental health. For example, Sohn et al. (2011) aimed to identify physician-asserted drug side effects from psychiatric and psychological narratives through a hybrid approach of machine learning and rules, with an F-score of 75%. Similarly, Eriksson et al. (2013) aimed to recognize possible adverse drug events from clinical narrative text in psychiatric hospital records with 89% precision through dictionaries and postcoordination rules used to construct adverse drug event compound terms. Perlis et al. (2012) extracted outcomes of antidepressant treatments in major depressive disorders from EMRs through a supervised approach that involved logistic regression, with precision ranging from 78% to 86%. Cunningham, Tablan, Roberts, and Bontcheva (2013) utilized a rule-based approach for the extraction of mini mental state examination results from both short clinical notes and free-text health record correspondence between clinicians, with an overall precision ranging from 85% (in short notes) to 87% (in correspondence texts). Jackson et al. (2017) sought to capture a number of key symptoms of severe mental illness from clinical discharge summaries, with a median F-score of 88% using regular expression pattern matching.
Several community challenges in clinical text processing have been organized to assess the state of the art for specific tasks, including comorbidity extraction (Uzuner, 2009), heart disease risk factors (Karystianis, Dehghan, Kovacevic, Keane, & Nenadic, 2015), and medication information (Uzuner, Solti, & Cadag, 2010). One of the tasks in the recent 2016 CEGS N-GRID shared tasks focused on the determination of symptom severity for a patient based on information included in their initial psychiatric evaluation report (Filannino, Stubbs, & Uzuner, 2017).
The classification regarding the severity of symptoms is important to understand whether a patient requires immediate medical attention or hospitalization. In this paper, we describe and evaluate three approaches for the extraction of mental health symptom severity from psychiatric records in the context of the CEGS N-GRID shared task. We have explored both knowledge-driven (rule-based) and data-driven (ML-based) methods to observe which one performed better for the given task. Specifically, the approaches include (a) a knowledge-driven methodology based on lexicalized rules combined with manually constructed dictionaries characterizing positive valence symptoms; (b) a neural network (NN) built on lexical and semantic features extracted from the text; and (c) a hybrid approach that combined the best predictions of the rule-based and NN methods.

2 | METHODS

2.1 | Data

The data used in this study were provided by the organizers of the 2016 CEGS N-GRID shared task (Filannino et al., 2017). The organizers released 325 psychiatric records that were used as the training set, and a set of 216 unseen psychiatric records for validation purposes. Each patient was represented by one record that included information from their initial psychiatric evaluation, containing unstructured free-text narrative (e.g., "arrested for driving while intoxicated") and structured question-answer pairs (e.g., "History of Drug Use: Yes"). Each report (i.e., patient) has been classified with regard to the severity of the experienced symptoms as: 0 (absent = no symptoms mentioned), 1 (mild = symptoms present but not a focus of treatment), 2 (moderate = a focus of treatment), and 3 (severe = requiring hospitalization or an emergency room visit or equivalent).
Through a thorough process, each report has been manually categorized in terms of symptom severity by two expert psychiatrists from the Massachusetts General Hospital and the Harvard Medical School; because psychiatrists may interpret the notes differently (which here happened in around 40% of the records), a third annotator adjudicated the disagreements to generate the gold standard data (Filannino et al., 2017).
The annotation process relied on the RDoC framework (Kozak & Cuthbert, 2016), which has been developed by the United States National Institute of Mental Health for the assessment of a patient's symptom severity in various domains. RDoC focuses on five psychiatric domains: (a) positive valence, (b) negative valence, (c) cognitive, (d) social processes, and (e) arousal and regulatory systems, each characterized at different levels (such as genomic, cellular, and behavioural) depending on the type of available data. The CEGS N-GRID Challenge focused only on mining psychiatric symptom severity belonging to the positive valence domain: a domain that pertains to events and situations that the patient actively engages in, signalling mental health disorders ranging from behavioural ones (e.g., binge alcohol drinking and excessive marijuana use) to medical treatments (e.g., detoxification and inpatient treatment) and mental disorders (e.g., bulimia and mania). We note that we have used the CEGS N-GRID gold standard data as provided, without further exploring or questioning the validity of the professional judgement of the severity classification provided in the annotated data sets.
The task we address here was to automatically extract symptom severity for a given patient from their record. Figure 1 shows an overview of the approaches for the automatic classification of symptom severity: a rule-based methodology (Section 2.2), an NN approach (Section 2.3), and a hybrid method (Section 2.4).
2.2 | Rule-based approach

2.2.1 | Dictionaries of symptoms

A sample of 50 records from the training data set, randomly selected for each severity (13 for severe, 13 for moderate, 18 for mild, 6 for absent), was manually inspected by a domain expert (C. H. K.) to identify terms indicative of each severity level. The domain expert classified them into structured and unstructured mentions in text. In particular, structured mentions are represented by questions referring to particular symptoms followed by a "yes" or a "no" answer (e.g., "History of Drug Use: Yes," "Does the patient think they have an eating disorder: No"). Any mention of a positive valence symptom that has been described in free text is labelled as unstructured (e.g., "Mr. Andrade reported a chronic history of polysubstance dependence most notable opiate addiction" [positive valence symptom]). Table 1 shows the symptoms identified for each class along with their classification as either structured or unstructured mentions, with a brief description and respective examples. We note that for the classes of severe and moderate severity, we did not take into consideration any structured questions, as there were no discriminative questions.
Following the initial inspection of the sample of records, a number of task-specific semantic classes that represent various positive valence symptoms were manually organized into dictionaries with terms as well as potential abbreviations, synonyms, and acronyms (see Table 2, for some examples).
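As an illustration, such a dictionary lookup can be sketched as follows; the semantic class names and entries here are hypothetical examples, not the authors' actual dictionaries:

```python
import re

# Illustrative dictionaries (hypothetical entries, including abbreviations
# and synonyms, as described in the text).
DICTIONARIES = {
    "substances": {"cocaine", "alcohol", "opiates", "etoh", "marijuana"},
    "severity_markers": {"dependence", "abuse", "addiction"},
}

def find_dictionary_mentions(text, dictionaries=DICTIONARIES):
    """Return sorted (semantic class, term) pairs for every dictionary term found."""
    mentions = []
    lowered = text.lower()
    for sem_class, terms in dictionaries.items():
        for term in terms:
            # \b keeps short entries such as 'etoh' from matching inside longer words
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                mentions.append((sem_class, term))
    return sorted(mentions)
```

Matching is case-insensitive and word-bounded, mirroring the way dictionary entries anchor the lexical patterns described below.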

2.2.2 | Engineering information extraction rules
We created rules for each of the symptom severity classes for the recognition of mentions of positive valence symptoms. The rules are based on common lexical patterns identified in clinical notes (e.g., "been hospitalized for eating disorder" [symptom for the moderate class]; "patient was arrested for driving under the influence [DUI]" [symptom for the severe class]) that describe symptom mentions in text. The lexical patterns use (a) frozen lexical expressions as anchors for certain symptom mentions (e.g., "parole: history of driving while intoxicated," "Hallucinogens: Yes") based on verbs, noun phrases, and prepositions; and/or (b) semantic placeholders (through dictionary mentions) suggesting the presence of a positive valence symptom (e.g., "history of substance abuse include cocaine and alcohol consumption," "active in AA," "leading to his DUI"). We have also implemented concept enumeration, as it appears quite frequently in the training data, particularly for the reporting of various positive valence symptom mentions (e.g., "neurovegetative symptoms: appetite [positive valence symptom]; interest [positive valence symptom]"). Table 3 shows an example of a lexical pattern.
In order to create the rules, we used the General Architecture for Text Engineering (GATE; Cunningham et al., 2013), a well-established framework for text annotation and categorization. The observed lexical patterns in text were converted into rules using the Java Annotation Patterns Engine (JAPE), a pattern matching language for GATE. Table 4 shows some rule examples, whereas Table 5 indicates the number of rules for specific symptoms, roughly suggesting the complexity of the targeted information.
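A rule of this kind can be approximated in plain Python regular expressions; the pattern below is an illustrative re-creation of the DUI example (a dictionary anchor for "patient" followed by a semifrozen lexical expression), not the authors' actual JAPE rule:

```python
import re

# Hypothetical re-creation of one lexical pattern: a "patient" dictionary
# anchor, the semifrozen expression "was arrested for", and DUI variants,
# together signalling a severe-class symptom.
PATIENT = r"(?:pt|patient|mr\.?\s+\w+|ms\.?\s+\w+)"
DUI = r"(?:dui|driving\s+(?:under\s+the\s+influence|while\s+intoxicated))"
SEVERE_DUI = re.compile(PATIENT + r"\s+was\s+arrested\s+for\s+" + DUI, re.I)

def matches_severe_dui(sentence):
    """True if the sentence matches the severe-class DUI-arrest pattern."""
    return SEVERE_DUI.search(sentence) is not None
```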
More than one pattern may occur in a psychiatric note and might refer to one or more symptoms. Because we are aiming to classify each record according to the severity of its present symptoms, we combined the mention-level classifications at the record level by considering precedence rules. For example, events with the highest clinical significance (i.e., closest to severe) determined the rating of the whole patient document regardless of the number of lower severity events present (i.e., if a record has two symptoms belonging to different classes, the record will be assigned the higher severity class).
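The record-level precedence scheme can be sketched as follows (the label names are illustrative):

```python
# Severity precedence: higher rank wins at the record level.
SEVERITY_ORDER = {"absent": 0, "mild": 1, "moderate": 2, "severe": 3}

def classify_record(mention_labels):
    """Combine mention-level severities into one record-level label by
    taking the clinically most significant (highest) severity; a record
    with no recognized mentions is rated 'absent'."""
    if not mention_labels:
        return "absent"
    return max(mention_labels, key=SEVERITY_ORDER.__getitem__)
```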

2.3 | Machine-learning approach
In our machine-learning approach, we implemented an NN because NNs are flexible for integrating different input data types into the same architecture, such as word counts, extracted values (e.g., age, gender, and other diagnoses), outputs coming from other text processing pipelines (e.g., rules), or even raw text in the form of word embeddings. Additionally, in the last 5 years, due to developments in hardware and software, NNs have become the state of the art in imaging, sound processing, and certain areas of natural language processing (NLP; Goodfellow, Bengio, & Courville, 2016). Therefore, we decided to assess the utility of this technique in this task, which bears the added challenge of containing relatively few samples.

FIGURE 1 An overview of our hybrid approach for the determination of symptom severity in psychiatric evaluation records

Table 6 shows the regular expressions used to capture mentions of addictive alcoholic behaviour.
For the full list of the used regular expressions, see Table A1.
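As a sketch, the individual expressions for one category can be OR-ed (disjunctively combined) into a single pattern, as done for the addiction and alcohol-dependence categories; the member expressions below are illustrative stand-ins, not the entries of Table A1:

```python
import re

# Illustrative member expressions for a hypothetical "addicted" category;
# they are disjunctively combined (Boolean OR) into one regular expression,
# with \b marking word boundaries.
ADDICTED_PATTERNS = [r"\baddicted\b", r"\baddiction\b", r"\bhooked on\b"]

addiction_regex = re.compile("|".join(ADDICTED_PATTERNS), re.I)

def mentions_addiction(text):
    """True if any member expression of the category matches the text."""
    return addiction_regex.search(text) is not None
```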
The NN receiving the described inputs was formed by three densely connected layers and was trained with backpropagation on a mean squared error loss, with the severity score treated as a regression output.
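A minimal NumPy sketch of such a three-layer densely connected regression network trained with backpropagation on mean squared error follows; the layer sizes, activation function, and learning rate are illustrative assumptions rather than the authors' actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(n_in, n_hidden=16):
    """Initialize three densely connected layers (hypothetical sizes)."""
    scale = 0.1
    return {
        "W1": rng.normal(0, scale, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, scale, (n_hidden, n_hidden)), "b2": np.zeros(n_hidden),
        "W3": rng.normal(0, scale, (n_hidden, 1)), "b3": np.zeros(1),
    }

def forward(p, X):
    h1 = np.maximum(0, X @ p["W1"] + p["b1"])    # dense layer 1 (ReLU)
    h2 = np.maximum(0, h1 @ p["W2"] + p["b2"])   # dense layer 2 (ReLU)
    y = (h2 @ p["W3"] + p["b3"]).ravel()         # linear output: severity score
    return h1, h2, y

def train_step(p, X, t, lr=0.05):
    """One full-batch gradient descent step on mean squared error."""
    h1, h2, y = forward(p, X)
    n = len(t)
    dy = (2.0 / n) * (y - t)                     # d(MSE)/dy
    dW3 = h2.T @ dy[:, None]; db3 = dy.sum(keepdims=True)
    dh2 = dy[:, None] @ p["W3"].T * (h2 > 0)
    dW2 = h1.T @ dh2; db2 = dh2.sum(axis=0)
    dh1 = dh2 @ p["W2"].T * (h1 > 0)
    dW1 = X.T @ dh1; db1 = dh1.sum(axis=0)
    for k, g in zip(["W1", "b1", "W2", "b2", "W3", "b3"],
                    [dW1, db1, dW2, db2, dW3, db3]):
        p[k] -= lr * g
    return np.mean((y - t) ** 2)
```

In practice the input would be the bag-style feature vectors described above rather than the random toy data used for testing.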

2.4 | Hybrid approach
Finally, we combined the rule- and ML-based approaches into a hybrid system. The hybrid system used the same architecture as the network described in Section 2.3, but used as input the bag formed by the rules implemented in the rule-based approach; namely, the method counted how many times each rule fired (i.e., how many times the precondition of the given rule was true over the input document). The network was trained with backpropagation on mean squared error, as in Section 2.3. Figure 3 shows the architecture of the hybrid approach.
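The bag-of-rule-firings input can be sketched as follows, with each JAPE rule stood in for by a hypothetical regular expression (the rule names and patterns are illustrative, not the authors' rules):

```python
import re

# Each rule is reduced here to a regular expression; the feature vector
# counts how many times each rule fires in a document.
RULES = {
    "severe_dui": re.compile(r"arrested for (?:dui|driving)", re.I),
    "moderate_detox": re.compile(r"\bdetox", re.I),
    "mild_appetite": re.compile(r"appetite is poor", re.I),
}

def rule_firing_counts(document):
    """Return the bag-of-rule-firings vector, in a fixed rule order."""
    return [len(rule.findall(document)) for rule in RULES.values()]
```

The resulting count vector is what the hybrid network consumes in place of the word-level bags used in Section 2.3.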

2.5 | Evaluation metrics
The system has been formally evaluated as part of the 2016 N-GRID challenge on a test set containing 216 unseen reports. In order to reduce the complexity of the N-GRID task, only a single severity score per record was used, instead of various scores for each positive valence symptom mention. As the evaluation metric, the macroaveraged mean absolute error measure (MAE^M) was used (Baccianella, Esuli, & Sebastiani, 2009). MAE^M measures the error; the challenge reported it as a percentage score (ranging from 0 to 100, with 100 being the highest performance).
This metric works well with imbalanced data because it assigns the same importance to each severity class regardless of its frequency in the corpus, calculating how close the classification of each report is to the gold standard one. The following formula is used, where C is the number of classes of interest (4 in this case), D_j is the set of records with gold score j, x_i is a record, h(x_i) is the predicted score, and y_i is the correct one:

MAE^M(h, D) = (1/C) * Σ_{j=1..C} (1/|D_j|) * Σ_{x_i ∈ D_j} |h(x_i) − y_i|

Table 3. An example of a lexical pattern to capture mentions of the severe-class symptom of DUI arrest; "pt" is matched via a dictionary that contains variations of words representing "patient"; "was arrested for" is a semifrozen lexical expression for the identification of the positive valence symptom "DUI".

Table 4. Rule examples (using the GATE notation) for the recognition of positive valence symptoms for the severe class (e.g., {Token.string =~ "(?i)blackouts"}, matching "patient was having occasional blackouts" and aiming to identify any blackout episodes); the moderate class (e.g., "her alcohol use has been problematic," aiming to identify the severity of the alcohol use); and the mild class ("appetite is poor," aiming to identify low appetite). Note: the extracted symptom mentions are highlighted in bold. The rules use strict token matching; e.g., {Token.string =~ "(?i)appetite"} matches "appetite" (upper or lower case). Various dictionaries contain single- and multiword terms; e.g., (volume), (degree), and (substances) include words describing a decrease, the severity of the substance abuse, and the various substances that have been linked to abuse, respectively (see Table 2 for examples). ({Token})[0,1] matches any optional token. GATE = General Architecture for Text Engineering.

Table 5. Note: structured questions were used only for the mild class. The * in the mild class indicates that each structured question for each symptom was represented by a single rule.
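A sketch implementation of the macroaveraged mean absolute error (before its conversion to a percentage score) might look like:

```python
from collections import defaultdict

def macro_average_mae(gold, predicted, classes=(0, 1, 2, 3)):
    """Macroaveraged MAE (Baccianella et al., 2009): the MAE is computed
    separately over the records of each gold class and then averaged, so
    every severity class counts equally regardless of its frequency."""
    by_class = defaultdict(list)
    for y, h in zip(gold, predicted):
        by_class[y].append(abs(h - y))
    # Average per-class errors, skipping classes absent from the data
    per_class = [sum(errs) / len(errs) for c in classes
                 if (errs := by_class[c])]
    return sum(per_class) / len(per_class)
```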
OCD = obsessive compulsive disorder; OCSD = obsessive compulsive spectrum disorders; YBOCS = Yale-Brown obsessive-compulsive scale; DUI = driving under the influence; ER = emergency room; IOP = intensive outpatient program. Note: the regular expression to detect addiction was formed by the disjunctive combination (i.e., combination through the Boolean operation OR) of all regexes with OR category "addicted" (indicated in the second column). Similarly, the regular expression to detect alcohol dependence was formed by the disjunctive combination of all expressions in the "alcoholic" category. \b is used to represent word boundaries.

3 | RESULTS

Table 7 displays the performance on the training and evaluation sets for all the approaches. On the evaluation set, the rule-based method achieved the highest performance with 80.1%, whereas the NN method achieved 73.4%. The hybrid approach returned the lowest score (72.1%).
Despite the drop in performance compared to the training data, the rules generalized better than the data-driven approaches. Table 8 shows a confusion matrix with the numbers of reports that were correctly and incorrectly classified by the rule-based approach in the evaluation set. By far, the biggest confusion was between mild and moderate cases (76 mild cases predicted as moderate, and 20 moderate cases predicted as mild), with some confusion between severe and moderate reports. Still, we note that there was not much confusion between the extreme cases (i.e., severe vs. mild), which is reflected in the relatively high MAE^M scores.

4 | DISCUSSION

4.1 | Rule-based approach
While inspecting the training set sample, we noticed that symptom severity could be assessed more accurately (and consistently) from unstructured symptom mentions than from the structured question-answer pairs.

FIGURE 2 The neural network. From left to right: depiction of an input text document; two input systems applying bag approaches to the input document; bag results produced by each method for the input document; 3-layered neural network; possible outputs of the neural network, treated as regression

FIGURE 3 The neural network integration of the rule-based approach. From left to right: depiction of an input text document; input system based on rules; output of the input system, counting how many times each rule fired; 3-layered neural network; possible outputs of the neural network, treated as regression

We performed a manual examination of the errors in the evaluation data set, where 74 records were assigned the wrong symptom severity; 24 records were assigned a higher class than the correct one, and 50 records a lower severity class.
We note that in more than half of the cases where a higher class was assigned, the reports mention symptoms that could indicate such higher severity. For example, the correct recognition of a moderate symptom (e.g., "304.03 Opioid dependence," "History of drug use: Yes") in the unstructured part does not necessarily agree with the classification provided in the gold standard (which was mild in the above cases).
Also, in a quarter of the cases, the rules did not deal with negated contexts (e.g., "parole: no operating under the influence/DUI," "no h/o detox," "She denies any difficult with sleep, appetite, energy"). There were a few cases where symptoms of family members were wrongly associated with the patient (e.g., "grandmother hospitalized after suicide attempt"). Finally, another source of errors was treatment plans that were not identified as such (e.g., "Will initiate an outpatient detox with long-acting …").
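A minimal negation check of the kind these errors call for (not part of the published system; the cue list is an illustrative assumption) could be sketched as:

```python
# Hypothetical negation cues; a match is discarded when a cue precedes
# the symptom mention within the same sentence.
NEGATION_CUES = ("no ", "denies ", "no h/o ", "without ")

def is_negated(sentence, symptom):
    """True if a negation cue occurs before the symptom mention."""
    lowered = sentence.lower()
    pos = lowered.find(symptom.lower())
    if pos == -1:
        return False
    prefix = lowered[:pos]
    return any(cue in prefix for cue in NEGATION_CUES)
```

A real system would need sentence splitting and a negation scope model (e.g., NegEx-style) rather than this whole-prefix check.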
The cases where a lower severity was wrongly assigned are mainly due to the fact that we focused on capturing unstructured symptom mentions (with the exception of structured mentions for the mild class). This choice led to several documents being falsely classified into less severe classes. For example, four records had a structured question suggesting a symptom, but we had decided that it was best for our system's performance to focus on unstructured symptom mentions only. Additionally, an unstructured mention of "Alcohol abuse" (indicating moderate severity) was overwritten as mild severity by the identification of the structured mention "Marijuana: Yes." In five records, mentions of specific drugs indicated a certain severity level but were not included in our rules (e.g., "Trial of ritalin" [symptom for the moderate class]). In a similar fashion, another source of errors (almost a third) was expression patterns that had not been spotted in the training data (e.g., "residential treatment" for the severe class, "trichotillomania" for the moderate class).

4.2 | Neural network and hybrid approaches

4.3 | Limitations and future work
We designed our rules after exploring a rather small sample of the training set (50 records). Perhaps a larger set might have helped us engineer more rules that would have been able to cast a wider net to identify mentions of positive valence symptoms in text. Our emphasis on unstructured symptom mentions might also be a limitation, as information from structured questions is disregarded instead of being taken into consideration for the classification of a record (with the exception of the mild class). Initially, we considered structured questions for all classes, but we noted that they were not discriminative and would produce a number of errors. Perhaps they can be useful in cases where there is not a single unstructured mention that can help with the recognition of the class severity.
Our precedence rules might be another limitation. Choosing the highest severity symptom to characterize (i.e., classify) a record (and ignoring mentions of other symptoms of lower severity) has led to some misclassifications. Combining different severity signals could be an interesting task to explore in the future.
All the methods developed here have been trained on initial psychiatric evaluation records. Although both the rules and the data-driven approaches may work on reports from other organizations, it is likely that they would need adjustments in both lexical and expression coverage. However, the rules and language models developed here focus on severity of symptom expressions and therefore may be used to process other types of mental health records.
5 | CONCLUSIONS

We have designed and explored three methods for the identification of positive valence symptom severity in initial psychiatric evaluation records. The performance is promising, ranging from 72% (hybrid) to 80% (rule-based). We noted that the lexicalized rules managed to generalize automatic classification of symptom severity relatively well.
However, NNs, at least as implemented in this study, failed to generalize to the final evaluation set, even though 10-fold cross-validation results on the training set suggested that no severe overfitting was occurring. Combining the NN with rules in a hybrid serial system did not ameliorate the lack of generalization. Although there is still significant room for improvement, the results are encouraging and indicate that automated text mining methods can be used to classify mental health symptom severity from psychiatric notes with a range of reasonably promising performance.