Extracting structured information from unstructured histopathology reports using generative pre‐trained transformer 4 (GPT‐4)

Deep learning applied to whole‐slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time‐consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre‐trained transformer 4 (GPT‐4), can extract structured data from unstructured plain language reports using a zero‐shot approach without requiring any re‐training. We tested this hypothesis by utilising GPT‐4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM‐generated structured data and human‐generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.


Introduction
Deep learning (DL) is a powerful computational tool which can extract patterns in unstructured data such as images and text. Most text in pathology routine reports is unstructured [1]; however, traditionally, the focus of DL in pathology has been on the analysis of whole-slide histopathology images (WSIs). In this application, DL has shown promise in reducing workload for human experts by handling time-consuming repetitive tasks, and even extracting information that might initially be hidden to human experts [2]. DL-based image analysis is likely to give rise to a new generation of biomarkers for precision oncology [2,3]. However, the development of DL models crucially depends on large amounts of labelled data that need to be painstakingly curated by human experts. The process of scanning the slides and storing them as high-resolution WSIs has been commoditized and is now readily available [4]. However, in order to train supervised deep learning methods on these images, a ground truth label needs to be available, which is often present in the original, unstructured, or semi-structured pathology report. Extracting these ground truth labels is slow, expensive, and can be error-prone due to multiple readers participating in the labelling process.
Contemporaneous DL models for natural language processing (NLP) are large language models (LLMs) such as generative pre-trained transformer 4 (GPT-4). These LLMs exhibit remarkable capabilities in understanding and processing human-written texts and can reason over a wide range of applications [5]. In radiology, a recent proof-of-concept study has demonstrated that this can be used to extract structured and quantitative information from unstructured reports written in plain language [6]. A similar use of LLMs in histopathological reports would open up new possibilities of leveraging large data for the training of AI vision models. Some previous studies have addressed similar ideas, such as those by Abedian et al [7], Leyh-Bannurah et al [8], and Savova et al [9]. However, these studies developed their own standalone NLP systems which were specifically tailored to the application (and potentially the dataset) at hand; also, they required training on large domain-specific datasets. In contrast, modern LLMs such as GPT-4 have zero-shot capabilities [5], which means that they can be applied to complex tasks without any further training. This could allow users to extract structured data from pathology reports without making any changes to the LLM, without any re-training or fine-tuning.
This study therefore aimed to demonstrate the ability of a state-of-the-art LLM to extract structured information from histopathological reports written in plain English. As a proof of concept, we addressed two large sets of histopathology reports: a set of highly variable pathology reports for colorectal cancer (CRC) obtained from The Cancer Genome Atlas (TCGA), and a set of microscopy observations from anonymized neuropathology reports of diffuse gliomas obtained from the Division of Neuropathology, University College London Hospitals (UCL). We further substantiated the findings by analyzing a small dataset of pathology reports in German.

Artificial intelligence tools disclosure
As per the COPE position statement of April 2023 [10], we hereby disclose that the following artificial intelligence tools were used to write this article: GPT-4 (OpenAI, Inc., San Francisco, CA, USA), only for checking and correcting spelling and grammar in the Introduction and Discussion sections, using the paid version of the ChatGPT platform.

Data collection and ethical approvals
We collected three sets of histopathology reports for our experiments. The first set comprised colorectal cancer reports from The Cancer Genome Atlas (TCGA) database (https://www.cancer.gov/tcga). The TCGA CRC dataset included a wide variety of reporting styles and document qualities. We randomly selected and imported (with a Python script) 100 reports from this dataset for our first experiment. The second dataset consisted of routine neuropathology reports collected between 2011 and 2019. The cohort contained n = 1,882 free-text reports of haematoxylin and eosin (H&E) images from n = 1,877 patients with adult-type diffuse gliomas. Anonymized text data from this cohort were obtained as text files, structured on a spreadsheet on 19 July 2022, from the Division of Neuropathology, University College Hospitals NHS Foundation Trust. The reports were based on tissue samples obtained from this NHS Foundation Trust as part of the UK Brain Archive Information Network (BRAIN UK; https://www.southampton.ac.uk/brainuk/), which is supported by Brain Tumour Research and has been established with the support of the British Neuropathological Society and the Medical Research Council [11]. The third dataset comprised N = 21 reports which were first anonymized and then rephrased by a pathologist (so as not to include any data of actual patients). According to the current local regulation of the Medical Association of the State of Rhineland-Palatinate, which sets the framework for all local medical ethics committees, no ethics approval is needed for using anonymous, retrospective data which cannot be accessed by a third party. The local ethics committee of the University Hospital of Aachen approved this retrospective study (reference number 23/111) of anonymized data and waived the requirement to obtain individual informed consent.
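The random selection of reports could have been done along the following lines (the actual script is not published; the directory layout, one PDF per case, and the fixed seed are our assumptions):

```python
import random
from pathlib import Path

def sample_reports(report_dir: str, k: int = 100, seed: int = 42) -> list:
    """Randomly pick k report PDFs from a directory of per-case files."""
    pdfs = sorted(Path(report_dir).glob("*.pdf"))  # sort for reproducibility
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(pdfs, k)
```

Sorting before sampling and fixing the seed make the selection reproducible across runs, which matters when the same 100 cases must later be re-annotated manually.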

Optical character recognition (OCR)
We extracted plain text from the scanned histopathology reports using the Tesseract OCR engine (https://github.com/tesseract-ocr/tesseract; accessed on 13 April 2023). The OCR engine was employed to convert scanned images into machine-readable text, which was then processed by GPT-4.
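This step might be sketched as follows; the paper names only the engine, so the Python wrapper, the file handling, and the crude quality heuristic below are our assumptions, not the published pipeline:

```python
def ocr_report(image_path: str) -> str:
    """Convert one scanned report page to plain text with Tesseract."""
    from PIL import Image   # pillow (third-party)
    import pytesseract      # Python wrapper around the Tesseract engine
    return pytesseract.image_to_string(Image.open(image_path), lang="eng")

def looks_garbled(text: str, threshold: float = 0.25) -> bool:
    """Crude quality check: flag pages where an unusually large share of
    characters is neither alphanumeric nor common punctuation, which often
    indicates a failed scan or handwritten annotations."""
    if not text.strip():
        return True
    ok = set("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "0123456789 .,;:()/%+-<>'\n\t")
    bad = sum(1 for c in text if c not in ok)
    return bad / len(text) > threshold

if __name__ == "__main__":
    text = ocr_report("report_page_1.png")  # hypothetical file name
    if looks_garbled(text):
        print("flag for manual review")
```

A simple heuristic like `looks_garbled` can pre-filter obviously failed scans before any text is sent to the language model.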

GPT-4 interaction
We interacted with GPT-4 through the OpenAI Application Programming Interface (API) (https://openai.com/blog/openai-api; accessed on 24 April 2023). We provided GPT-4 with specific prompts for each of our experiments, designed to instruct the model either to fill in a predefined structured reporting template, to propose a structured reporting template, or to extract quantitative information from the reports. In all cases, we ensured that the prompts were clear and concise and provided the necessary context for the model to perform the required tasks. The prompts are given in supplementary material, Tables S1-S3.
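A minimal sketch of this interaction, using the `openai` Python package as available at the time of the study (the template below is an abbreviated stand-in; the actual prompts are given in supplementary Tables S1-S3):

```python
# Abbreviated stand-in for the structured reporting template of Table S1.
TEMPLATE = (
    "Fill in the following template from the pathology report below.\n"
    "If an item is not mentioned, answer 'not mentioned'.\n"
    "T-stage:\nN-stage:\nM-stage:\n"
    "Lymph nodes examined:\nLymph nodes positive:\n"
)

def build_prompt(report_text: str) -> str:
    """Combine the instruction template with one OCR'd report."""
    return TEMPLATE + "\nReport:\n" + report_text

def extract_structured(report_text: str) -> str:
    """Send one report to GPT-4 via the OpenAI API (v0.x SDK, as current
    at the time of the study) and return the raw model answer."""
    import openai  # third-party; requires OPENAI_API_KEY to be set
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(report_text)}],
        temperature=0,  # deterministic output for information extraction
    )
    return response["choices"][0]["message"]["content"]
```

Setting the temperature to zero is a natural choice for extraction tasks, where reproducible rather than creative output is wanted; the paper does not state the sampling parameters used.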

Data analyses
We compared the structured data generated by GPT-4 with manually extracted data from the same set of histopathology reports. In our first experiment, we assessed the accuracy of GPT-4 in extracting the correct T-stage, N-stage, and M-stage, as well as the total number of examined lymph nodes and the number of positive lymph nodes. In our second experiment, we evaluated the structured reporting templates proposed by GPT-4 in terms of their ability to organise relevant medical information. In our third experiment, we compared the quantitative information extracted by GPT-4 with the corresponding values in the original reports. We calculated the accuracy of GPT-4 as the percentage of cases in which the extracted information matched the manual annotations.
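The accuracy computation itself is a simple per-field match rate; a minimal sketch (field names and data layout are illustrative, not taken from the study):

```python
def field_accuracy(gpt_rows: list, manual_rows: list, field: str) -> float:
    """Percentage of cases in which the GPT-4-extracted value for one
    field matches the manually extracted value."""
    if len(gpt_rows) != len(manual_rows):
        raise ValueError("paired annotations required")
    hits = sum(g[field] == m[field] for g, m in zip(gpt_rows, manual_rows))
    return 100.0 * hits / len(gpt_rows)
```

Under this definition, 99 matching T-stages among 100 cases would be reported as 99% accuracy.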

Extraction of structured reports from unstructured reports
For the first experiment (#1), we used colorectal cancer histopathology reports from the TCGA-CRC cohort. These reports had a broad range of quality and format (Figure 1). We randomly chose 100 reports and extracted plain text from the images of reports supplied as PDFs, using optical character recognition (OCR) whenever the text was stored as an image. We used a pre-defined simple structured reporting template and asked GPT-4 to fill in the template for each case (supplementary material, Table S1). We then compared the resulting structured data with structured data obtained by manual review of the same reports. In general, we found that the automatically extracted information agreed well with the manually extracted information: GPT-4 successfully extracted the correct T-stage in 99/100 cases, the correct N-stage in 95/100 cases, and the correct M-stage in 94/100 cases. The total number of examined lymph nodes and the number of positive lymph nodes were correctly extracted in 98/100 and 99/100 cases, respectively (see Table 1). Remarkably, GPT-4 was capable of untangling nontrivial phrases and of incorporating prior medical knowledge into its reasoning. For example, when the text stated that the 'tumour invades until the fat subserosa', GPT-4 correctly interpreted this finding as T3 stage. Similarly, GPT-4 correctly interpreted the text 'Regional lymph nodes (pN): Two of 26 lymph nodes positive for metastatic carcinoma (2/26) …' as N1 stage, even though this was not explicitly mentioned in the report. Where GPT-4 was not accurate, this was partly due to the OCR not working correctly. Of the 15 cases in which errors occurred, five were due to errors in the OCR step: not all text could be reliably extracted due to insufficient scan quality, and sometimes the pathologist annotated their findings in handwriting, which was almost never translated correctly into text (supplementary material, Figure S1). It should be noted that GPT-4 can also be used to flag such problematic reports. When prompted to identify potential errors in the OCR step based on the digital text, GPT-4 correctly identified problematic reports that might subsequently be reviewed manually. More details on the remaining ten errors, in which OCR worked correctly, are given in the Supplementary experiments.
Overall, these results show that the extraction of structured information is possible with reasonable accuracy even in challenging cases (supplementary material, Figure S2), including archived reports that had not been digitized a priori, had a low scan quality, and exhibited a wide variety of reporting styles.

Structured template proposition by GPT-4
The first experiment was based on a pre-defined reporting template, but defining such a template manually can be labour-intensive. Therefore, for experiment #2, we asked GPT-4 to suggest a structured reporting template based solely on a set of ten pathology reports. For this, we employed randomly selected routine neuropathology reports of patients with adult-type diffuse gliomas. The detailed prompt is shown in supplementary material, Table S2. We found that GPT-4 was capable of organising the information from the reports into structured categories comprising general information about the sample (ID and type of sample), tumour morphology, and immunohistochemistry. The proposed reporting template is given in Table 2. Subsequently, we tasked GPT-4 with filling in this template based on the provided reports, without any additional guidance on how to fill in the categories. We compared the resulting data with the manually extracted data for the same cases and found that such filled-in structured reports were an accurate representation of the prosaic texts. In clinical routine, histopathological reports are often opened with a dedicated question in mind, e.g. does the patient have an IDH mutation? Therefore, we tested how long it took for a medical doctor (DT) to extract information about IDH mutation status. For a set of ten reports, it took on average 2.3 ± 0.7 s to extract this information from the structured reports after the reports had been opened. The free-text reports resulted in much longer reading times of 11.7 ± 5.3 s. Thus, the structured reports proved to offer better accessibility to a human reader. Yet the items allocated to each category were not consistent enough to be used directly for the training of conventional deep learning algorithms. For example, for the category 'microcalcifications', some structured reports read 'present in stroma', while others were filled in with 'present'. Therefore, a post-processing step was needed to group similar entries into one group in order to make them machine-readable in the conventional sense, i.e. to classify the category items.
Overall, these results demonstrate that GPT-4 can not only extract information along a predefined scheme but also process it further, to extract the medical information contained within a set of reports and propose a unified reporting scheme.

Extraction of quantitative information along predefined scales

For experiment #3, we asked GPT-4 to extract quantitative information from reports of patients with adult-type diffuse gliomas matching a given predefined template. We focused on numerical information (e.g. Ki-67 labelling index) and on information that can be allocated to an ordinal or nominal scale (e.g. ATRX expression). The prompt was chosen such that GPT-4 could choose from intervals along a predefined scale. For example, for the Ki-67 labelling index, we asked GPT-4 in natural language to adhere to a scale of 0-<5%, 5-<10%, 10-<15%, etc. The full prompt is given in supplementary material, Table S3. We found that GPT-4 extracted all information faithfully in 100/100 cases. If the information in the original report was ambiguous and/or not directly assignable to the predefined scale, the language model reasoned in accordance with human thinking. For example, a Ki-67 index of 'up to 20%' was allocated to the most evident range of '15-<20%', and an expression of '<3%' was correctly determined to fall into the 0-<5% range. When information was not available from the report, the model correctly refrained from choosing an item and reverted to 'not mentioned'. Taken together, these results show that GPT-4 can transfer the information contained within prosaic reports to pre-defined structured categories with a high accuracy that rivals human annotation but at a fraction of the cost.
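For unambiguous numerical values, the predefined scale can be reproduced deterministically; a minimal sketch of the binning that GPT-4 was asked to perform in natural language (function name and edge handling are our own choices):

```python
def ki67_bin(value_percent: float, width: int = 5) -> str:
    """Map a Ki-67 labelling index (in %) to an interval label such as
    '15-<20%', following the 0-<5%, 5-<10%, 10-<15%, ... scale."""
    if not 0 <= value_percent <= 100:
        raise ValueError("Ki-67 index must lie between 0 and 100")
    # Clamp exactly 100% into the top interval so every valid value maps.
    lower = min(int(value_percent // width) * width, 100 - width)
    return f"{lower}-<{lower + width}%"
```

For instance, a value of 17% falls into '15-<20%'. Ambiguous statements such as 'up to 20%' are precisely where a fixed rule fails and the model's own judgement was needed.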

Extracting quantitative information on a large scale
For experiment #4, we proceeded to extend the previously developed concepts to large-scale data. We employed all available (n = 1,882) reports of adult-type diffuse gliomas from the UCL cohort and extracted structured data along the pipeline of experiment #3. Coding the script to read the data and send it to GPT-4 via the API took us less than 4 h. Processing all reports via the API took 17 h; however, this was done automatically and no human intervention was needed. In comparison, annotation of all reports by a human reader would have taken about 125 h based on our experience with the annotation of the 100 reports of experiment #3. To investigate the error rate of GPT-4 in this task, we manually checked 500 reports and found that GPT-4 made five errors when extracting the IDH mutation status, resulting in an error rate of 1.0 ± 0.4% (mean ± SD). Subsequently, we used the data extracted from all reports to investigate the correlations between IDH mutation status, Ki-67 labelling index, necrosis, and microvascular proliferation (see Figure 2). We observed a positive correlation between Ki-67 labelling index and the presence of necrosis (Figure 2A), a higher Ki-67 labelling index when IDH wild-type was present (Figure 2B), a co-occurrence of necrosis and microvascular proliferation (Figure 2C), and a co-occurrence of both necrosis (Figure 2D) and microvascular proliferation (Figure 2E) with IDH wild-type.
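The reported uncertainty of 1.0 ± 0.4% is consistent with the binomial standard error of a proportion; a brief sketch of this check (our reconstruction of how such a figure can be derived, not code from the study):

```python
import math

def error_rate(n_errors: int, n_checked: int):
    """Point estimate and binomial standard error of the per-report
    error rate, both expressed in per cent."""
    p = n_errors / n_checked
    se = math.sqrt(p * (1 - p) / n_checked)
    return 100 * p, 100 * se

# 5 IDH extraction errors among 500 manually reviewed reports
rate, se = error_rate(5, 500)
```

With 5 errors in 500 reports this yields a point estimate of 1.0% and a standard error of roughly 0.4%, matching the values in the text.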
Taken together, these results demonstrate that GPT-4 can efficiently and accurately extract structured data from a vast number of reports, greatly reducing the time and effort required for manual human annotation, and that this procedure enables the analysis of correlations between various factors in diffuse gliomas at a large scale with ease. More details are provided in the Supplementary experiments and in supplementary material, Tables S1-S4.

Discussion
In this study, we investigated the feasibility of using a large language model (LLM), specifically GPT-4, to extract structured data from unstructured pathology reports in a zero-shot approach. The results of our experiments suggest that GPT-4 can be used effectively to extract relevant information from histopathological reports with high accuracy. This capability holds potential to reduce the workload of human experts while preparing ground truth data for machine learning applications.
Our first experiment showed that GPT-4 successfully extracted accurate staging and lymph node information from TCGA colorectal cancer reports, even when faced with a wide range of reporting styles and varying document quality. While occasional errors occurred due to limitations in OCR or handwritten annotations, GPT-4 demonstrated the ability to extract quantitative data from unstructured text using prior medical knowledge embedded within its trained weights. However, at this point, it is not possible to discern exactly which part of the training data contained the relevant information of interest which was learned by GPT-4 during its training. In the second experiment, we demonstrated that these abstract reasoning capabilities and the learned medical knowledge can be used to propose structured reporting templates based on unstructured medical reports. This is an important finding, as it will allow machine learning models to extract quantitative data from unstructured reports, unlocking these data for a plethora of downstream applications. Thirdly, we demonstrated GPT-4's ability to accurately extract quantitative information from reports when the underlying medical reports were available in digitized form, as is the case in many hospital information systems. While images are usually not digitized in most pathology departments worldwide, the reporting process is generally digital in the European Union and the UK. In our study, we found that when provided with detailed instructions for the desired categorizations, GPT-4 performed at a human level. Finally, we employed this pipeline to extract machine-readable data on a large scale and demonstrate that GPT-4, and perhaps other LLMs, will enable large-scale analyses faster and at lower cost compared with human annotation.

(Table 2 caption, continued: Based on ten randomly chosen reports, GPT-4 proposed a template (upper left). Provided with an example report (upper right), GPT-4 correctly filled in the template (lower panel). We note that in our template, the American spelling 'tumor' was used, while in the report, the British spelling 'tumour' was used. We did not observe any effect of these spelling variations on the model output. CUSA, Cavitron Ultrasonic Surgical Aspirator.)
Based on our experiments, we conclude that it is likely that LLMs such as GPT-4 will play a significant role in advancing the development of deep learning models in precision oncology [2,3]. There are two main reasons that lead us to this conclusion. First, by extracting ground truth data from pathology reports, LLMs could minimise the need for time-consuming and expensive manual labelling processes at a fraction of the cost of human experts: on average, the processing and annotation by GPT-4 of one report amounted to a cost of about USD 0.02, which is far less than human annotation would have cost. Second, LLMs could significantly speed up the development process of scientific studies: manual annotation of the reports in experiment #4 would have taken months in our routine workflows, but we were able to perform a large-scale analysis of the interplay of molecular fingerprints and histomorphological changes within days. We also identified some limitations of the proposed workflow. First, the performance of GPT-4 was occasionally affected by inaccuracies in OCR and the presence of handwritten annotations. This poses a challenge since a significant fraction of reports from clinical routine are archived as scanned documents, and digitization in histopathology is still in its early stages [12]. However, this is not a limitation of LLMs per se but rather of the document pre-processing step. With the advent of more powerful vision models and their direct integration into LLMs [13], these challenges will probably be mitigated. Nonetheless, it is important to be aware of this limitation to develop mitigating strategies (e.g. random/additional quality checks). Second, our study is a proof-of-concept study that focused on colorectal cancer and diffuse adult-type glioma; future research should explore the applicability of LLMs for extracting structured data from other cancer types and other medical domains [6]. The templates used could also be extended further or modified to fit 
established datasets as published by the International Collaboration on Cancer Reporting and others [14]. Future studies which aim to facilitate the transition of structured reports into clinical routine should perform experiments which focus on these dedicated clinical reporting templates. In our study, we mainly focused on the translational research use case and therefore used custom templates as defined by clinical researchers. Third, the input length of LLMs is limited. This prevented us, for example, from feeding more than approximately ten reports simultaneously into the model in experiment #2 and thus potentially limits the amount of general information extractable from reports. This will likely be mitigated in the future with the advent of LLMs with a much larger input token length, i.e. the possibility to input more words at once. Fourth, the size of the datasets used in our study to measure the capabilities of GPT-4 was limited. This was due to the proof-of-concept layout of our study. Larger studies in multiple medical use cases are required and should particularly investigate potential drifts of the model prediction over time [15].
It should also be noted that generative models can 'hallucinate' findings [5,16]. In our experiments, we provided GPT-4 with an explicit prompt and asked it to align responses with a clearly defined template, which might be less prone to hallucinations than 'open' questions. However, it is without question that fabricated outputs will erode trust in AI systems in medicine and need to be investigated carefully in future research. One concerning error that we found was that the term 'pN0/12' in the input was interpreted by GPT-4 as a total of 21 (not 12) lymph nodes being examined. Such transpositions are dangerous, and LLMs should be carefully evaluated for the presence of such transpositions in the future. We propose that standardised and publicly available testing sets of histopathological reports with their corresponding labels should be established, both to test the accuracy of proposed models and to measure the probability of hallucinations, i.e. the probability of producing positive labels that are not present in the original dataset. Most importantly, though, GPT-4 is provided by a commercial company that allows for model use through an API but does not publish its model architecture, and there is no guarantee that the information submitted to the model remains private. It is therefore of utmost importance that the research community develops similarly capable open-source models that can be deployed on site, so that patient information can be analysed without disclosing potentially compromising and sensitive patient data [17]. Research into LLMs is evolving rapidly, and as of July 2023 there is a growing body of publicly available base models such as the 'Large Language Model Meta AI' (Llama), Llama-v2, Vicuna, and many others [18]. These models can be fine-tuned for medical use cases, and with emerging techniques of efficient training such as low-rank approximation, fine-tuning is even possible on hardware that is available to individual researchers [19].
There is ample room for future research on the application of LLMs in research and clinical routine in histopathology. One route that we did not investigate in particular, but which we believe holds high potential to improve safety in clinical routine, is the use of LLMs for quality control: important findings are often summarised by the pathologists, and LLMs could be used to check the consistency between the report and the summary (such as TNM) to avoid errors. Additionally, GPT-4 might also be used to check the consistency of the extracted information, and more detailed prompts might be utilised to include specific information about the TNM staging of the specific tumour entity to increase accuracy. We propose that the investigation of these quality control metrics is well worth the effort, since more accurate labels will lead to better-performing deep learning classification models.
In conclusion, our study demonstrates the ability of GPT-4, and potentially other LLMs, to extract structured data from unstructured or partially structured pathology reports, offering a promising avenue for reducing the workload of human experts and enhancing the development of deep learning models in precision oncology. Further research is needed to refine the extraction process, expand its applicability to a broader range of medical data, safeguard against fabrication of information, and develop open-source LLMs.

Figure 1. Overview of the input data and the analysis pipeline. (A) Representative pathology reports for TCGA-CRC. All TCGA-CRC reports were first pre-processed using optical character recognition (OCR). (B-D) Overview of the analysis pipelines for experiments #1-#4. Experiment #1 employed the TCGA-CRC reports, whereas experiments #2-#4 used neuropathology reports of patients with adult-type diffuse gliomas that were already available in digital format. Figure constructed using Flaticon.com.

Figure 2. Automated analysis pipeline for free-text reports. We employed GPT-4 for automated extraction of quantifiable information from n = 1,877 reports of diffuse adult-type glioma. (A) Distribution of Ki-67 scores in histopathological specimens with and without necrosis. (B) Distribution of Ki-67 scores in histopathological specimens with and without mutation of either IDH1 codon 132 or IDH2 codon 172. (C) Correlation between necrosis and microvascular proliferation. (D) Correlation between necrosis and IDH mutation. (E) Correlation between microvascular proliferation and IDH mutation.

Table 1. The accuracy of extracting structured information from free-text reports.

Table 2. Proposition of structured reports by GPT-4. Histology shows multiple very small CUSA-type pieces of a moderately cellular tumour composed of monomorphic cells with rounded nuclei, dark chromatin staining, and scanty cytoplasm. Perinuclear haloes are not a feature. The tumour cells are arranged in patternless sheets with no apparent rosette formations or perivascular pseudorosette formations. There is microcalcification in the stroma. Mitotic figures are not readily seen. A branching capillary network is seen in places, but microvascular proliferation is not observed. Necrosis is not present.