CCheXR‐Attention: Clinical concept extraction and chest x‐ray reports classification using modified Mogrifier and bidirectional LSTM with multihead attention

Radiology reports cover different aspects of an imaging examination, from radiological observations to the diagnosis, for modalities such as x-rays, magnetic resonance imaging, and computed tomography scans. The abundant patient information presented in radiology reports poses a few major challenges. First, radiology reports follow a free-text reporting format, which leaves a large amount of information locked in unstructured text. Second, the extraction of important features from these reports is a major bottleneck for machine learning models. These challenges are important, particularly the extraction of key features such as symptoms, comparison/priors, technique, findings, and impression, because they facilitate decision-making on patients' health. To alleviate this issue, a novel architecture, CCheXR-Attention, is proposed to extract clinical features from radiological reports and classify each report into normal and abnormal categories based on the extracted information. We propose a modified Mogrifier long short-term memory model and integrate a multihead attention method to extract the most relevant features. Experimental outcomes on two benchmark datasets demonstrate that the proposed model surpasses state-of-the-art models.


| INTRODUCTION
The emergence of electronic health records (EHRs) has created new prospects in the healthcare industry, and the growing use of digital content has provided numerous advantages.1 A large amount of EHR information exists in the form of unstructured free text.2 "The free-text reporting format tends to offer a more natural and expressive approach to documenting clinical events and facilitate communication among the care team in the healthcare environment".3 Medical imaging techniques such as x-rays, magnetic resonance imaging (MRI), and computed tomography (CT) scans are among the clinical examinations used by radiologists to diagnose pulmonary diseases.4,8,9 Deep learning (DL) techniques have recently demonstrated outstanding performance in natural language processing (NLP), and DL methods have been adopted for various tasks in the medical domain. Unlike machine learning (ML) and rule-based methods, which require handcrafted features and manually designed rules for training, DL methods learn features automatically and have stronger generalization ability. Long short-term memory (LSTM) is one of the most popular DL models, predominantly used by researchers owing to its potential to capture long dependencies. Bidirectional LSTM (BiLSTM), a variant of LSTM, has forward and backward hidden layers to address sequential modeling issues. BiLSTM models have achieved impressive results for clinical named entity recognition (CNER),10-12 coreference resolution,13,14 relation extraction,15,16 and classification17,18 in mainstream text-processing tasks. Despite its many advantages over LSTM, the BiLSTM model has some key problems: (1) the model becomes complex in the presence of high-dimensional inputs; (2) the model sometimes fails to capture contextual features; and (3) performance is reduced by the absence of medical words in pretrained word embeddings.
We propose a model to address the aforementioned issues for the clinical concept extraction (CCE) and classification of chest radiographs, with a modified Mogrifier and bidirectional LSTM with multihead attention (CCheXR-Att). CCheXR-Att utilizes pre-trained embeddings to generate contextual vectors of the input words, and because CNER can be improved by extracting character-level information,19 we propose generating character embeddings by adopting the self-attention method. The word representations are fed into the bidirectional Mogrifier LSTM (BiMogrifier LSTM) layer, where forward and backward representations are computed to capture contextual information. Global and local character embeddings are fed into the conventional BiLSTM model. The outputs from the BiMogrifier LSTM and the traditional BiLSTM are concatenated, and the result is provided as input to the multihead attention (MHA) layer to capture important features. Finally, SoftMax is used to predict the final label.
The main contributions of this paper are summarized as follows: • We propose a novel architecture, CCheXR-Att, for radiological concept extraction and classification of chest x-ray reports.
The remainder of the paper is structured as follows: Section 2 discusses related work. Section 3 presents the proposed methodology. The datasets, evaluation metrics, and training details are described in Section 4. Section 5 provides a detailed analysis of the results obtained. Finally, the conclusion and future work are discussed in Section 6.

| RELATED WORK
In the field of healthcare and clinical practice, a substantial amount of text is generated, encompassing aspects such as symptoms, test results, diagnoses, treatments, prevention, and patient outcomes. These textual data hold valuable information, and accurately identifying all the details within a clinical report can greatly assist healthcare professionals in understanding a patient's overall context during diagnosis or treatment, thereby enhancing healthcare support. The potential application of clinical concept detection and extraction within the healthcare domain is noteworthy. This involves creating systems that can extract relevant clinical information from medical narratives or from data found on social media platforms.
The techniques employed in constructing CCE applications have largely been adapted from the broader realm of NLP.20 These methodologies generally fall into two categories: rule-based approaches and statistical approaches, the latter further categorized into ML,21-24 DL,25-34 and hybrid methodologies.35,36 Recently, several researchers have explored this domain and achieved remarkable performance. For instance, Li et al.37 proposed a model combining a character-level CNN-BiLSTM-CRF and trained it using the Nadam algorithm, achieving an F1-score of 84.61% on the 2019 i2b2/VA concept extraction task. Gerevini et al.38 used NLP tools for the annotation of chest CT reports and ML methods to classify them. As DL methods have shown competitive performance across domains in recent years, many authors have applied them in the clinical domain. For example, Venkataraman and Pineda39 used an LSTM RNN-based model on human and veterinary textual records and compared it with decision trees and random forests. The model scored higher than the baselines, achieving macro-F1 scores of 74% and 68% for veterinary and human text narratives, respectively.
In addition to ML and DL methods, neural contextual embeddings have recently gained considerable popularity owing to their superior performance compared with traditional word embeddings. Si et al.40 explored various neural embeddings for extracting clinical concepts from textual narratives. Similarly, López-Ubeda et al.41 used transfer-learning methods for the classification of Spanish radiological reports and achieved an F1 score of up to 70% using a pre-trained multilingual model. For the classification of medical texts, Prabhakar and Won42 developed a hybrid DL model with MHA, achieving an accuracy of 96.72% for the QC-LSTM model and 95.76% for the hybrid BiGRU. Olthof et al.43 fine-tuned BERT models for the classification of radiology reports.

| PROPOSED METHODOLOGY

The proposed CCheXR-Att model extracts clinical concepts from chest x-ray reports and classifies each report into normal and abnormal categories. Figure 1 shows a flowchart of the proposed model. The input sentences are first pre-processed and fed into the embedding layers. We use two embedding layers: a word-embedding layer and a character-embedding layer. The word-embedding layer builds word embeddings for each word in a sentence using global vector (GloVe) embeddings.44 In addition, the model adopts a multi-attention neural network to generate local and global character-level features. The proposed BiMogrifier LSTM layer processes the word embeddings, whereas the standard BiLSTM layer processes the character-level embeddings. The outputs from both layers are concatenated, and the result is fed into the MHA layer to find important features. The representations from the attention layer are then passed to SoftMax to determine the final label (normal or abnormal).
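The data flow just described can be sketched end to end as follows. This is only an illustrative sketch: the dimensions, the random projections, and the placeholder encoder functions (`bimogrifier_lstm`, `bilstm`) are assumptions standing in for the real layers described in the following subsections.

```python
import numpy as np

# Illustrative sketch of the CCheXR-Att data flow for one sentence.
# All shapes and the placeholder encoders are assumptions, not values from the paper.
T, d_word, d_char, d_hid = 10, 100, 40, 64
word_emb = np.random.randn(T, d_word)       # GloVe word embeddings, one row per token
char_emb = np.random.randn(T, 2 * d_char)   # local + global character embeddings

def bimogrifier_lstm(x, d=d_hid):           # placeholder for the word-path encoder
    return np.tanh(x @ np.random.randn(x.shape[1], d))

def bilstm(x, d=d_hid):                     # placeholder for the character-path encoder
    return np.tanh(x @ np.random.randn(x.shape[1], d))

h_word = bimogrifier_lstm(word_emb)                  # word representations
h_char = bilstm(char_emb)                            # character representations
h_final = np.concatenate([h_word, h_char], axis=1)   # concatenation layer
# (the MHA layer would refine h_final here before classification)
logits = h_final.mean(axis=0) @ np.random.randn(2 * d_hid, 2)
probs = np.exp(logits - logits.max()); probs /= probs.sum()  # normal vs abnormal
```

The two parallel encoder paths and the concatenation mirror the flowchart; only the final SoftMax over two classes is shown concretely.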

| Embedding
The sentences are pre-processed and converted into vectors. Given the limitations of conventional embeddings in capturing contextual information, we opted to use pre-trained embeddings in this paper. We use GloVe as the first embedding-generation method, and the self-attention mechanism as the second method for generating local and global character embeddings to capture more character-level features.
Prior research predominantly concentrated on word embeddings; however, it has recently been observed that character embeddings (CE) based on the self-attention method capture more information than word embeddings. As a result, we employ the self-attention mechanism to concurrently generate character embeddings at both local and global levels. The attention score is computed as s(x_i, q) = x_i^T q, where x_i represents the input state of a word or character and q represents the query state corresponding to x_i.
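As a minimal illustration of this score, the dot product s(x_i, q) = x_i^T q can be normalized into attention weights with a softmax; the toy vectors below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def attention_weights(X, q):
    """s(x_i, q) = x_i^T q for each input state, normalized with a softmax."""
    s = X @ q                 # one raw score per input state x_i
    e = np.exp(s - s.max())   # numerically stable softmax
    return e / e.sum()

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy input states
q = np.array([1.0, 0.0])                             # toy query state
w = attention_weights(X, q)   # weights sum to 1; states aligned with q score higher
```

States whose dot product with the query is larger receive proportionally larger weights, which is the behavior the character-embedding layers rely on.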
All the character representations in a single sentence are combined into a global feature matrix for global character embedding. The BiLSTM model processes the feature matrix to incorporate more contextual information and then employs a self-attention technique to generate a new representation matrix, creating new character-level features. The next step in obtaining the global character embeddings is to compute the average value of each character feature, followed by max pooling, which chooses the largest value over all the characters contained in a word.
Similarly, local character embedding is generated by employing a self-attention mechanism within one word. Self-attention often results in a large output dimension; hence, we construct a layer using the back-to-back pooling method. Selecting a feature as the word embedding with a single max-pool is not sufficient, and character information is usually lost in the first pooling layer if two max-pool layers are applied. Therefore, we employ average pooling before max pooling, which chooses the highest value as the word embedding, to generate an attention-based character representation. "Back-to-back pool layers allow for the unification of the dimensions of each output in the character-level feature extraction layer".45 The proposed architecture of the CCheXR-Att model is illustrated in Figure 2.
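The back-to-back pooling described above (average pooling followed by max pooling over a word's attended character features) can be sketched as follows; the window size and the toy feature values are illustrative assumptions.

```python
import numpy as np

def char_word_embedding(char_feats, window=2):
    """Back-to-back pooling over attended character features of one word:
    average pooling over small windows of characters, then max pooling
    down to a single fixed-size word-level vector."""
    T, d = char_feats.shape
    n = T // window
    # average pooling over consecutive windows of characters
    avg = char_feats[:n * window].reshape(n, window, d).mean(axis=1)
    # max pooling picks the largest value per dimension as the word embedding
    return avg.max(axis=0)

feats = np.arange(12, dtype=float).reshape(4, 3)  # 4 characters, feature dim 3
emb = char_word_embedding(feats)                  # one 3-dimensional word vector
```

Whatever the number of characters in a word, the output dimension is fixed, which is the unification-of-dimensions property the quoted passage refers to.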

| Bidirectional Mogrifier LSTM
In this paper, we present a modified version of the Mogrifier LSTM.46 The motivation behind the model is to capture contextual information that a traditional LSTM cannot extract efficiently. The BiLSTM model, which processes backward information in addition to forward information, shows a significant improvement over the LSTM model. The LSTM cell is governed by the following functions:

I_t = σ(X_t W_xi + H_(t−1) W_hi + b_i)
F_t = σ(X_t W_xf + H_(t−1) W_hf + b_f)
O_t = σ(X_t W_xo + H_(t−1) W_ho + b_o)
C̃_t = tanh(X_t W_xc + H_(t−1) W_hc + b_c)
P_t = F_t ⊙ P_(t−1) + I_t ⊙ C̃_t
H_t = O_t ⊙ tanh(P_t)

FIGURE 1 Flowchart of the proposed model.
where I_t ∈ ℝ^(n×h) denotes the input gate, F_t ∈ ℝ^(n×h) the forget gate, and O_t ∈ ℝ^(n×h) the output gate; X_t represents the input information; W_xi, W_xf, W_xo, W_xc, W_hi, W_hf, W_ho, and W_hc are the weight parameters, and b_i, b_f, b_o, and b_c are the bias parameters; C̃_t and P_t represent the candidate memory and present memory cells, respectively, and H_t represents the present hidden state. From X_t and H_(t−1), relevant information such as the input and forget states can be calculated. The input and forget gates produce the cell state, and the hidden and cell states are then loaded into the next LSTM cell. Since there is no connection between the previous state h_prev and the present input x in the LSTM, and thus no opportunity for interaction before the gates, contextual information can be lost. Inspired by the Mogrifier LSTM,46 which improves contextual modeling by providing interaction before the gates, we propose a bidirectional Mogrifier LSTM that improves not only the contextual modeling ability but also concept extraction. Figure 3 shows the proposed BiMogrifier LSTM cell structure.
The BiMogrifier LSTM updates the input and the prior hidden state through mutual gating; the input is crossed with the gate in each cycle. The model derives bidirectional hidden information h→ and h← from the forward and backward directions, respectively. The input embedding x_i and the previous hidden state h_prev^i are processed by the Mogrifier operation:
x^i = 2σ(Q^i h_prev^(i−1)) ⊙ x^(i−2), for odd i ∈ [1, …, r]
h_prev^i = 2σ(R^i x^(i−1)) ⊙ h_prev^(i−2), for even i ∈ [1, …, r]

with x^(−1) = x and h_prev^0 = h_prev, where Q^i and R^i are matrices with randomly initialized values, r represents the number of rounds, σ represents the logistic sigmoid function, and ⊙ represents the element-wise product.
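A sketch of these mutual gating rounds, following the update rules above; the vector sizes, random matrices, and default number of rounds are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h_prev, Q, R, rounds=5):
    """Mutually gate the input x and previous hidden state h_prev for r rounds:
    odd rounds rescale x using h_prev, even rounds rescale h_prev using x."""
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2 * sigmoid(Q @ h_prev) * x          # odd round: update input
        else:
            h_prev = 2 * sigmoid(R @ x) * h_prev     # even round: update hidden state
    return x, h_prev

rng = np.random.default_rng(0)
x, h = rng.normal(size=4), rng.normal(size=3)        # toy input and hidden state
Q, R = rng.normal(size=(4, 3)), rng.normal(size=(3, 4))
x2, h2 = mogrify(x, h, Q, R)
```

With `rounds=0` the operation degenerates to a standard LSTM step (no pre-gate interaction), which is exactly the deficiency the Mogrifier update addresses.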

| Bidirectional LSTM
The basic BiLSTM network is used as the second layer to process the character-level embeddings. Local and global character-level embeddings are fed into the BiLSTM network. The forward and backward hidden representations h→ and h← of the BiLSTM are computed using Equations (13) and (14), respectively. The final output hidden representation h_BiLSTM is obtained by concatenating h→ and h←, as given in Equation (15).
The contextual representations obtained from the BiMogrifier LSTM and the traditional BiLSTM are then fed into the concatenation layer.

| Concatenation layer
This layer performs element-wise summation of the representations from the previous layers.47 The concatenation of the two independent networks is represented by h_Final, as shown in Equation (16).

| Multihead attention
Clinical named entities do not occur in isolation in clinical reports; they hold dependencies among them, often with long intervals between entity characters. Given the significance of these dependencies, the model must capture them by assigning higher attention weights to significant, dependent characters and lower weights to less important ones.
We use an MHA method to locate important features. The MHA structure is shown in Figure 4. Attention scores are computed using Equation (17).
The MHA employs h parallel heads to concentrate on various components of the value-vector channels. The Q, K, and V parameters represent the characters in the sentence and are set equal when calculating self-attention. Learnable projection parameters are defined for each head, and the ith head attention is calculated using Equation (18).
The outputs of the h heads are concatenated, a linear transformation is applied, and the output for the tth character of the phrase is obtained using Equation (19), where concat() denotes the splicing function and W^0 ∈ ℝ^(n×d_h) is the weight parameter.
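The three steps above (scaled dot-product attention per head, the per-head outputs, and the concatenation followed by a final linear map) can be sketched as follows; the dimensions and random projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention(X, Wq, Wk, Wv, Wo, h):
    """Self-attention with h parallel heads; Q = K = V = X, as in the paper."""
    T, d = X.shape
    dh = d // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * dh:(i + 1) * dh] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(dh))   # scaled dot-product scores per head
        heads.append(A @ v)                  # ith head output
    return np.concatenate(heads, axis=1) @ Wo  # concatenate heads, then linear map

rng = np.random.default_rng(1)
T, d, h = 5, 8, 4                            # 5 characters, model dim 8, 4 heads
X = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multihead_attention(X, Wq, Wk, Wv, Wo, h)
```

Each head attends over a different slice of the value channels, so different heads can specialize on different dependent characters in the sentence.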

| SoftMax
After the MHA layer, we implement a SoftMax layer to decode the predicted labels. The evaluation score is computed using Equation (20).
where X represents the input sequence, y represents the corresponding label sequence, P_(i,j) represents the score of the ith character labeled with label j, W represents the transition matrix, and W_(i,j) is the state-transition score from label i to j.
SoftMax is used to compute the conditional probability of the sequence label y given X, using Equation (21):

P(y|X) = e^(S(X,y)) / Σ_(y′) e^(S(X,y′))

| EXPERIMENT
This section gives an overview of the datasets, evaluation metrics, and training process. We also evaluate the effectiveness of several baseline models and extensively analyze both the proposed model and its variants. In addition, we investigate the model's performance through an ablation study to gain deeper insights into the proposed model.

| Dataset
We employed two standard benchmark datasets in our study. The first, the Indiana University Chest X-ray Reports (IU-CXR) dataset, comprises 3955 radiology reports and is sourced from the National Library of Medicine.48 The second, MIMIC-CXR,49 encompasses 377 110 images linked to 227 827 textual reports. For our research, we focused solely on the textual reports within this dataset.

| Evaluation metrics
We assess the performance of the proposed approach through various evaluation metrics: accuracy, F1-score, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). These metrics are computed using the following equations:

Accuracy = (T_Positive + T_Negative) / (T_Positive + T_Negative + F_Positive + F_Negative)
Sensitivity = T_Positive / (T_Positive + F_Negative)
Specificity = T_Negative / (T_Negative + F_Positive)
PPV = T_Positive / (T_Positive + F_Positive)
NPV = T_Negative / (T_Negative + F_Negative)
F1-score = 2 × (PPV × Sensitivity) / (PPV + Sensitivity)

We have also used the Matthews Correlation Coefficient (MCC), a statistical tool for evaluating model performance that considers T_Positive, T_Negative, F_Positive, and F_Negative, making it a balanced measure of classification performance. This is particularly important in CCE, where the goal is to identify all relevant concepts accurately without missing any or generating false positives. The MCC is computed using Equation (28):

MCC = (T_Positive × T_Negative − F_Positive × F_Negative) / √((T_Positive + F_Positive)(T_Positive + F_Negative)(T_Negative + F_Positive)(T_Negative + F_Negative))
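The metrics described above can be computed directly from the four confusion-matrix counts; the counts in the example below are illustrative assumptions, not results from the paper.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                  # sensitivity (recall)
    spec = tn / (tn + fp)                  # specificity
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    npv = tn / (tn + fn)                   # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": acc, "f1": f1, "sensitivity": sens,
            "specificity": spec, "ppv": ppv, "npv": npv, "mcc": mcc}

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)  # hypothetical counts
```

Unlike accuracy, the MCC stays informative when the normal/abnormal classes are imbalanced, which is why it is reported alongside the other metrics.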

| Training details
All models were assessed on a Windows machine with an NVIDIA RTX 3070 GPU. Word embeddings were initialized with 100-dimensional GloVe pretrained representations sourced from Wikipedia-domain text. Character embeddings were configured with a dimension of 40. The Adam optimizer with a learning rate of 0.0001 was employed. Each iteration used a batch size of 16, and training spanned 40 epochs. Dropout regularization with a rate of 0.5 was employed to enhance the model's stability.

| RESULTS AND DISCUSSION
This section provides a detailed analysis of the results obtained from the proposed model. A comparative analysis is conducted between the proposed model and traditional as well as state-of-the-art models to assess its efficacy.

| Baselines
We evaluated the performance of the CCheXR-Att model in comparison with several baseline and state-of-the-art models, including LSTM, BiLSTM, CNN-BiLSTM-CRF,37 a BoW-based model,38 FasTag,39 BERT_LARGE,40 Hybrid BiGRU,42 Fine-Tuned BERT,43 Mogrifier LSTM,46 and MSAM.50 LSTM relies on a single LSTM network to represent the sentence; the final sentence representation is derived by averaging all hidden states. BiLSTM creates representations in both the forward and backward directions, enabling it to capture the semantic meaning of a sentence from both directions. CNN-BiLSTM-CRF37 receives pre-processed text and offers two choices for character-level representation generation: CNN and BiLSTM. The BiLSTM method captures neighboring information, with the forward LSTM capturing the left context and the backward LSTM the right context. The final labeling, influenced by local label dependencies, is optimized using the CRF layer to assign the most suitable tag to each word. The BoW-based model38 relies on the bag-of-words framework, wherein the complete textual content of the report is employed for classification; this method omits manually curated annotated datasets and automated annotation tools. FasTag39 involves sequential embeddings of terms abstracted from medical narratives; using the GloVe technique, these terms are compactly encoded into a vector space in which semantically similar terms are closely associated. BERT_LARGE40 is made deeper and more complex by adding extra BiLSTM layers atop its architecture; it is fine-tuned by replacing the CRF layer with a BiLSTM architecture owing to BERT's effective sequence labeling. Hybrid BiGRU42 consists of a CNN for extracting local features and a BiGRU with an MHA mechanism to model semantic features and enhance the model's overall effectiveness. Fine-Tuned BERT43 understands the relationships within single words and complete sentences; BERT is initialized with pre-trained parameters, and all parameters are then optimized using labeled data for the CCE task. Mogrifier LSTM46 introduces a "Mogrifier" update, a gating mechanism that enhances LSTM networks by integrating information from different time steps, improving their ability to capture complex dependencies in sequential data. MSAM50 employs self-attention mechanisms with MHA heads to capture temporal dependencies and patterns effectively.
This allows the model to focus on the most relevant features, enabling the accurate classification of radiological concepts.

| Performance comparison with traditional and state-of-the-art models
We compared CCheXR-Att with traditional and state-of-the-art models to demonstrate its effectiveness. The performance achieved by the various models on the IU-CXR and MIMIC-CXR datasets is presented in Tables 2 and 3, respectively. LSTM achieved the lowest accuracy: 82.19% on IU-CXR and 80.65% on MIMIC-CXR. One reason for the poor performance of LSTM is that it is unidirectional and processes each word in a sentence equally. BiLSTM exhibited slightly better performance than LSTM on both datasets. Compared with the conventional LSTM and BiLSTM models, CNN-BiLSTM-CRF performed better; the inclusion of both CNN and BiLSTM architectures combined with CRF contributed to the enhanced performance on both datasets. The BoW-based model achieved slightly reduced performance, as it neglects word order and syntactic information. The presence of misspellings, abbreviations, and medical terminology influenced the performance of FasTag. BERT_LARGE showed strong performance on both datasets owing to its extensive pre-trained knowledge, which provides an inherent advantage in comprehending the specialized terminology and patterns within clinical reports.
The hybrid BiGRU model also achieved performance comparable to fine-tuned BERT. The performance comparison on the IU-CXR dataset indicates that the models consistently achieved accuracies relatively similar to the Hybrid BiGRU model. By contrast, the fine-tuned BERT model demonstrated the highest accuracy, as it involves training BERT on labeled data, allowing it to learn task-specific features, relationships, and intricacies present in the clinical reports. Notably, the Mogrifier LSTM demonstrated a substantial accuracy of 85.59% on the IU-CXR dataset and 81.43% on the MIMIC-CXR dataset. MSAM, which incorporates self-attention mechanisms, exhibited an accuracy of 84.6% on the IU-CXR dataset and 85.91% on the MIMIC-CXR dataset; it outperformed the baselines owing to the self-attention method adopted to give extra weight to important entities.
Remarkably, the CCheXR-Att model emerged as a standout performer across both datasets. On the IU-CXR dataset, it attained the highest accuracy of 92.89%, an F1-score of 89.31%, sensitivity of 88.23%, specificity of 81.51%, PPV of 87.75%, NPV of 90.45%, and MCC of 88.62%. Similarly, on the MIMIC-CXR dataset, CCheXR-Att achieved an accuracy of 93.58%, an F1-score of 92.03%, sensitivity of 88.32%, specificity of 81.49%, PPV of 90.23%, NPV of 92.64%, and MCC of 82.72%. This performance can be attributed to the incorporation of word and character embeddings and the construction of a bidirectional version of the basic Mogrifier LSTM. BiMogrifier LSTM-Att also displayed favorable outcomes, achieving an accuracy of 90.73% on the IU-CXR dataset and 89.6% on the MIMIC-CXR dataset. Boxplots of the performance metrics for CCheXR-Att and its variants on both datasets are shown in Figure 5. The training and validation accuracy comparisons per epoch for the different variants of CCheXR-Att on both datasets are shown in Figure 6.

TABLE 2 Performance comparison of traditional, state-of-the-art, and proposed models on the IU-CXR dataset.

| Proposed model analysis
We examined different variants of CCheXR-Att to assess their performance. The variants include LSTM-Att, BiLSTM-Att, Mogrifier LSTM-Att, and BiMogrifier LSTM-Att.
The LSTM-Att model utilizes separate LSTM layers for word and character-level representation learning. The representations obtained from both LSTM layers are concatenated at the concatenation layer, followed by MHA to learn relevant position-specific information; a SoftMax classifier is then used to predict the final output. By contrast, the BiLSTM-Att variant replaces both LSTM layers with bidirectional LSTMs while keeping the other layers consistent. The Mogrifier LSTM-Att introduces a Mogrifier LSTM layer in place of the BiLSTM of the previous variant to learn word and character-level representations; the rest of the architecture remains the same. Similarly, the BiMogrifier LSTM-Att employs two bidirectional Mogrifier LSTM layers, one learning word-level and the other character-level embeddings, followed by concatenation, MHA, and finally a SoftMax layer to predict the class. Alternatively, the CCheXR-Att model uses a bidirectional Mogrifier LSTM layer and a bidirectional LSTM layer for word and character-level representations, respectively. The outputs from both layers are concatenated at the concatenation layer and followed by the MHA layer; a SoftMax classifier is then applied for class prediction, as elaborated in Section 3 of the paper.
We also analyzed different parameter values, including learning rates and dropout rates, which affected model performance. Among the tested learning rates, 0.0001 stood out as optimal, delivering the highest accuracy and F1-score on the IU-CXR and MIMIC-CXR datasets. Various dropout rates were tested, and optimal outcomes emerged as the dropout rate increased incrementally from 0.2 to 0.5. The best accuracy and F1-score occurred at a 0.5 dropout rate for both datasets: IU-CXR achieved 90.17% accuracy and an 84.97% F1-score, while MIMIC-CXR attained 91.62% accuracy and a 90.65% F1-score.
We explored local and global self-attention methods using various head values (20, 40, 60, and 80) for CCheXR-Att. The experiments revealed that, for the local character-level self-attention approach, the highest accuracy occurred at 80 attention heads and the highest F1-score at 60 attention heads on the IU-CXR dataset. Similarly, on the MIMIC-CXR dataset, the highest accuracy occurred at 60 attention heads and the highest F1-score at 80 attention heads. On analyzing various head values with the global self-attention method, we observed that CCheXR-Att achieved its highest accuracy at a head value of 40, while the F1-score reached its maximum at 60, on the IU-CXR dataset. By contrast, the highest accuracy and F1-score on the MIMIC-CXR dataset were observed with a head value of 80.

| Ablation study
To comprehensively assess the individual contributions of the components comprising CCheXR-Att, we conducted an ablation study using the IU-CXR and MIMIC-CXR datasets. The results achieved by our model in the absence of the various proposed modules are presented in Table 4.

| Removing self-attention-based character-embedding module
Utilizing CE through the self-attention technique proves to be more informative than word embeddings alone. The findings in Table 4 show the relatively reduced performance of the model in the absence of CEs, affirming their efficacy. This highlights the valuable role of character embeddings at both local and global levels in helping the model understand detailed character-level features. The model's performance decline without CEs indicates that omitting such embeddings introduces contextual information gaps, potentially introducing bias and consequent performance deterioration.

| Removing BiMogrifier LSTM module
The incorporation of the BiMogrifier LSTM introduces a dynamic updating mechanism that involves mutual gating, enabling contextual modeling by facilitating interaction preceding the gating process. Our results show a significant performance drop on both the IU-CXR and MIMIC-CXR datasets when the BiMogrifier LSTM module is removed. This confirms the essential nature of the dynamic updating mechanism and strongly suggests that the inclusion of the BiMogrifier LSTM module greatly improves concept extraction and classification.

| Removing MHA module
The integration of an MHA mechanism that extracts important insights from different parts of a sentence and prioritizes relevant elements greatly benefits classification results. Our experimental findings confirm that incorporating the MHA mechanism significantly improves the effectiveness of the proposed model. Conversely, without it, the model struggles to capture essential semantic information spanning various representation dimensions and positions, resulting in a noticeable decline in performance.

| Discussion and findings
Applying the proposed model to chest x-ray reports offers a comparatively simple and low-effort means to overcome the limitations discussed in this paper. Traditional ML and DL models cannot effectively capture contextual features, and the presence of high-dimensional inputs makes the models complex. Another reason for the poor performance of traditional models is the absence of medical words in pre-trained word embeddings.
To this end, the proposed model is useful given its ability to utilize word and character embedding by integrating the local and global-level attention methods, and further enhancing the Mogrifier LSTM to include information from the backward direction through the gating mechanism.Additionally, the model is improved through an MHA method to capture important information obtained from the concatenation layer.
In addressing complex CCE tasks, we find the combination of word and character embeddings to be highly effective. Word embeddings capture comprehensive word information, while character embeddings excel in handling out-of-vocabulary (OOV) words, collectively enhancing our model's CCE performance. CCheXR-Att utilizes both the BiMogrifier LSTM and the BiLSTM to successfully extract meaningful features. CCheXR-Att exhibited higher accuracy than BiMogrifier LSTM-Att; this variation in accuracy is attributed to distinctions in clinical narratives, the presence of rare words, and dataset size. By leveraging both word and character embeddings, our model adeptly captures word relationships, leading to improved concept extraction. It excels in annotating entities of varying lengths, demonstrating superior information-extraction capabilities. While BiLSTMs are particularly effective at learning complex features and patterns, the introduction of the BiMogrifier LSTM enhances contextual information identification, thereby boosting the overall performance of CCheXR-Att. The combination of architectural elements and embedding techniques positions CCheXR-Att as a robust solution for advancing CCE tasks, showcasing its proficiency in capturing contextual information and achieving superior performance across diverse datasets.

| CONCLUSION AND FUTURE WORK
This paper presents CCheXR-Att, a novel approach for the clinical concept extraction and classification of chest x-ray reports. Specifically, the model integrates pretrained word embeddings and character-level embeddings based on the self-attention method, which are further processed with the proposed BiMogrifier LSTM and a BiLSTM, respectively.
The proposed model aims to fetch useful entities from clinical narratives and support healthcare professionals, radiologists, and researchers in making better decisions through the detected information, increasing patients' quality of life. In addition, the experiments demonstrated the effectiveness of CCheXR-Att, suggesting that the framework with the different components introduced in the model can capture accurate information and classify reports correctly. On the two benchmark datasets, the proposed model performs better than the state-of-the-art models. In the future, an investigation into the efficacy of incorporating state-of-the-art language models will be undertaken. Furthermore, a comprehensive assessment and comparison of alternative neural architectures for clinical concept extraction and classification will be conducted. It is also planned to include external domain-specific knowledge in future implementations. The extensibility of the model to multiple languages is also a prospective avenue of exploration.
FIGURE 2 Proposed architecture of the CCheXR-Attention model. CCheXR-Att, classification of chest radiographs with a modified Mogrifier and bidirectional LSTM with multihead attention.

FIGURE 5 Boxplots of performance metrics for the proposed model and its variants on the datasets: (A) IU-CXR and (B) MIMIC-CXR. IU-CXR, Indiana University Chest X-ray Reports.

FIGURE 6 Training and validation accuracy throughout 40 epochs for the proposed model and its variants on the datasets: (A) IU-CXR and (B) MIMIC-CXR. IU-CXR, Indiana University Chest X-ray Reports.
TABLE 3 Performance comparison of traditional, state-of-the-art, and proposed models on the MIMIC-CXR dataset. Abbreviations: CCheXR-Att, classification of chest radiographs with a modified Mogrifier and bidirectional LSTM with multihead attention; LSTM, long short-term memory; MCC, Matthews Correlation Coefficient; NPV, negative predictive value; PPV, positive predictive value.

TABLE 4 Performance comparison of our model without the various proposed modules on the IU-CXR and MIMIC-CXR datasets. Abbreviations: CCheXR-Att, classification of chest radiographs with a modified Mogrifier and bidirectional LSTM with multihead attention; CE, character embeddings; IU-CXR, Indiana University Chest X-ray Reports; LSTM, long short-term memory; MHA, multihead attention.