Diagnosis of focal liver lesions from ultrasound images using a pretrained residual neural network

Abstract Objective This study aims to develop a ResNet50-based deep learning model for focal liver lesion (FLL) classification in ultrasound images, comparing its performance with other models and prior research. Methodology We retrospectively collected 581 ultrasound images from Chulabhorn Hospital's HCC surveillance and screening project (2010–2018). The dataset comprised five classes: non-FLL, hepatic cyst (Cyst), hemangioma (HMG), focal fatty sparing (FFS), and hepatocellular carcinoma (HCC). We conducted five-fold cross-validation after random dataset partitioning, enhancing training data with data augmentation. Our models used modified pretrained ResNet50, GGN, ResNet18, and VGG16 architectures. Model performance, assessed via confusion matrices for sensitivity, specificity, and accuracy, was compared across models and with prior studies. Results ResNet50 outperformed the other models, achieving a five-fold cross-validation accuracy of 87 ± 2.2%. While VGG16 showed similar performance, it exhibited higher uncertainty. In the testing phase, the pretrained ResNet50 excelled in classifying non-FLL, Cyst, and FFS. Compared with prior research, ResNet50 surpassed methods such as two-layered feed-forward neural networks (FFNN) and CNN+ReLU in FLL diagnosis. Conclusion ResNet50 exhibited good performance in FLL diagnosis, especially for HCC classification, suggesting its potential for developing computer-aided FLL diagnosis. However, further refinement is required for HCC and HMG classification in future studies.

However, US has limitations compared to CT and MRI, such as operator dependence and reduced sensitivity for small lesions.7 Thus, choosing the right imaging method depends on the clinical scenario, with CT and MRI preferred for comprehensive evaluation.2,8 Advancements in US, such as elastography and fusion imaging, may improve its diagnostic capabilities in the future.9 The emergence of artificial intelligence (AI) in recent years has opened doors to advanced computational tools that can assist radiologists in interpreting medical images and improving diagnostic accuracy.10 AI algorithms, particularly machine learning and deep learning models, are able to analyze large volumes of imaging data and extract meaningful patterns, achieving highly accurate and efficient identification and classification of liver lesions.11 The role of AI in medical imaging for diagnosing liver cancers holds immense potential for improving diagnostic accuracy and efficiency, leading to better patient outcomes.11 The incorporation of residual connections into ResNet50 allows this network to learn complex features from images, leading to improved classification accuracy.16,17
Tiyarattanachai et al.17 developed a predictive model for detecting and classifying FLLs using RetinaNet. The study employed a diverse dataset of hepatic ultrasonography images and demonstrated promising results indicative of high model performance. However, opportunities for improvement remain. Notably, blood vessels within the liver, heterogeneous background liver parenchyma, renal cysts, the inferior vena cava, and splenic lesions within the bounding box constraints adversely affected the model's performance in FLL detection. This limitation was reflected in a lower detection rate in the external validation dataset (75.0%, 95% CI: 71.7−78.3). Schmauch et al.18 employed the ResNet50 model for the classification of focal liver lesions (FLLs) in ultrasound images. They designed a two-stage approach for simultaneous FLL detection and classification, with model performance assessed using the area under the curve (AUC) metric. Although the study demonstrated a notably high AUC for FLL detection and classification, their dataset lacked ground truth information for lesion confirmation and contained only six hepatocellular carcinoma (HCC) images. Mostafiz et al.19 introduced an innovative method for detecting FLLs in ultrasound images. Their approach integrates deep feature fusion and super-resolution techniques, exhibiting promising advances in lesion detection accuracy. However, certain limitations can guide future research: notably, this approach is tailored to a binary classification model distinguishing normal liver tissue from FLLs.
Notwithstanding the above advancements, AI poses several challenges when applied to US image analysis that stem from the unique characteristics of US imaging. Only a few studies have examined the effectiveness of AI, and specifically ResNet50, for diagnosing liver cancer from ultrasound images.21−24 This study proposes the integration of ResNet50, a CNN-based model, to improve the diagnosis of focal liver lesions from ultrasound images. Our objective is to develop a deep learning model based on the ResNet50 architecture and assess its efficacy in classifying five liver tissue classes: non-focal liver lesion (non-FLL), simple hepatic cyst (Cyst), focal fatty sparing (FFS), hemangioma (HMG), and hepatocellular carcinoma (HCC). Additionally, we conduct a comparative analysis with different models and previous research findings to gain further insight into the diagnostic potential of ResNet50 for the identification of focal liver lesions.

Study design
This retrospective study uses ultrasound images from patients who were eligible for the hepatocellular carcinoma (HCC) surveillance and screening initiative project at Chulabhorn Hospital between 2010 and 2018. All images were obtained in DICOM format from the Picture Archiving and Communication System (PACS). The objective of this study was to develop a model for classifying lesions in these images into HCC, HMG, FFS, Cyst, and non-FLL. The findings will serve as a foundation for the development of a computer-aided diagnosis (CADx) system in the future.

Dataset
We retrospectively collected 581 B-mode ultrasound images obtained with the upper abdominal ultrasonography protocol from 581 individual livers. Our inclusion criteria were: (1)

Ground truth
We relied on radiological reports from CT or MRI studies to validate the diagnosis of FLL. It is important to emphasize that CT and MRI are widely recognized as the standard medical imaging modalities for diagnosing HCC, in accordance with the guidelines for HCC diagnosis in the diagnostic radiology workflow.25

Image preprocessing
An experienced sonographer manually cropped all liver lesions, with the size of the region of interest (ROI) adjusted to lesion size (see Figure 1). Non-FLL regions were cropped using ROIs sized between 300 × 300 and 400 × 400 pixels. The resulting images were then resized to a standardized 224 × 224 pixels to meet the requirements of the input layer of the ResNet50 model. The original ultrasound images obtained from PACS are in RGB format with three color channels (red, green, and blue), and the annotations marking the selected areas were displayed in color. To standardize the annotation process, this study converted the RGB ultrasound images to grayscale. All images were then converted back to RGB format using custom code in MATLAB (MATLAB® 2021b).
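The preprocessing steps above (grayscale conversion, resizing to 224 × 224, and replication of the single channel back to three) can be sketched as follows. The original pipeline was implemented in MATLAB, so this NumPy version, including its simple nearest-neighbour resize, is an illustrative approximation rather than the study's code:

```python
import numpy as np

def preprocess_roi(gray, size=(224, 224)):
    """Resize a 2-D grayscale ROI to 224 x 224 (nearest neighbour for
    brevity) and replicate the single channel into three, matching the
    RGB input layer of ResNet50."""
    h, w = gray.shape
    rows = np.arange(size[0]) * h // size[0]   # source row for each output row
    cols = np.arange(size[1]) * w // size[1]   # source column for each output column
    resized = gray[rows[:, None], cols]
    return np.stack([resized] * 3, axis=-1)    # identical R, G, B channels
```

Because all three channels are copies of the grayscale ROI, the colored annotation overlay no longer influences the network input.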

Data splitting and image augmentation
The dataset was categorized into five classes: non-FLL, Cyst, FFS, HMG, and HCC. To obtain a sufficiently diverse dataset with an adequate number of samples for each class, and to address the bias resulting from imbalanced training data, we applied a preprocessing step to classes with fewer than 150 images: rescaling or translating images to artificially increase each class to 150 images (750 images in total).

Five-fold cross validation
In this study, we employed five-fold cross-validation to assess the accuracy and generalizability of four distinct models: the graph-generative neural network (GGN), ResNet18, ResNet50, and Visual Geometry Group (VGG16) models. This involved randomly and equally partitioning the dataset into five subsets and training/testing the models on different combinations of these subsets. In the training phase of the five-fold cross-validation, we increased the number of images by applying various image augmentation techniques to the four training subsets within each fold. These techniques included translating images along both the x-axis and y-axis, as well as horizontally flipping them. As a result, the number of images per subset increased from 150 to 600, yielding a total of 2400 training images in each fold. This augmentation aimed to enrich the training dataset and ensure a more balanced representation of the classes during model training. Subsequently, we compared the performance of the ResNet50 model with that of the GGN, ResNet18, and VGG16 models. During the testing phase, we employed a set of 150 images from the remaining subset in each fold, comprising 30 images per class, to assess the performance of all the models. Figure 2 illustrates the workflow of our approach, encompassing image preprocessing, data splitting, five-fold cross-validation, and model performance assessment.

FIGURE 2 Workflow accompanying our research methodology. The dataset was first randomly divided into five folds for cross-validation; the overall performance of the models was then assessed through five rounds of separate training and testing.
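The fold construction described above can be sketched as follows (the random seed and the use of NumPy are assumptions; the study does not specify its splitting code):

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Randomly and equally partition sample indices into five folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, 5)

# One cross-validation round: four folds train, the fifth fold tests.
folds = five_fold_indices(750)              # 750 images, 150 per class
for k in range(5):
    test_idx = folds[k]                     # held-out fold: 150 images
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    # Augmentation (x/y translation, horizontal flip) would expand each
    # training subset 150 -> 600 here; model fitting is omitted in this sketch.
```

Each image thus appears in exactly one test fold, so every model is evaluated once on every sample across the five rounds.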

CNN-ResNet50 architecture
This study adopted and modified a preexisting pretrained convolutional neural network (CNN) for image recognition, known as Residual Network 50 (ResNet50), to develop a diagnostic model for FLLs. ResNet50 has been widely applied to image recognition,27,28 object detection,29,30 and image segmentation.31 Its depth and residual connections allow it to learn intricate features and capture fine-grained details, making it a powerful architecture for visual recognition tasks.
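The role of the residual (skip) connection can be illustrated with a minimal fully-connected block. This is a didactic sketch only, not ResNet50's actual bottleneck blocks, which use 1 × 1 and 3 × 3 convolutions with batch normalization:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual unit: y = relu(x + F(x)), where F is a small
    two-layer transform. The identity (skip) path lets information and
    gradients pass through unchanged, which is what allows very deep
    networks such as ResNet50 to train effectively."""
    h = np.maximum(x @ w1, 0.0)          # transform branch with ReLU
    return np.maximum(x + h @ w2, 0.0)   # add the skip path, final ReLU
```

Note that if the transform branch contributes nothing (F(x) = 0), the block reduces to the identity on non-negative inputs, so stacking many blocks cannot degrade the representation the way plain deep stacks can.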

Evaluation metrics for model performance
To evaluate the performance of the proposed method, we used confusion matrices to obtain the following three metrics: accuracy, sensitivity, and specificity.19

Accuracy
This metric describes the number of correct predictions over all predictions. The percentage accuracy is calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100 (1)

where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.

Sensitivity or recall or true positive rate (TPR)
This metric refers to the proportion of actual positive instances correctly identified by the model. It is calculated as the number of true positives divided by the sum of true positives and false negatives. Mathematically, the percentage sensitivity can be expressed as:

Sensitivity = TP / (TP + FN) × 100 (2)

Specificity
This metric refers to the ability of the deep learning model to correctly identify negative samples, or true negatives. Specificity is calculated as the ratio of true negatives to the sum of true negatives and false positives:

Specificity = TN / (TN + FP) × 100 (3)
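For a multi-class problem such as ours, Equations (1)−(3) are computed per class in a one-vs-rest fashion from the confusion matrix. This sketch assumes rows index true classes and columns predicted classes:

```python
import numpy as np

def per_class_metrics(cm, c):
    """Percentage accuracy, sensitivity, and specificity for class c,
    computed one-vs-rest from confusion matrix cm[true, predicted]."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[c, c]                   # class c predicted as c
    fn = cm[c].sum() - tp           # class c predicted as something else
    fp = cm[:, c].sum() - tp        # other classes predicted as c
    tn = cm.sum() - tp - fn - fp    # everything else
    accuracy = 100 * (tp + tn) / cm.sum()
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

Averaging these per-class values over the five folds gives the cross-validated figures reported in the Results.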

Comparison with previously published studies
In the classification of non-focal liver lesions (non-FLL) or fatty liver, our ResNet50 model demonstrated superior performance compared to conventional CNNs and was comparable to the VGG16 model reported by Reddy et al.21 Comparing the performance of our ResNet50 model to the CNN-based VGGnet model developed by Yamakawa et al.22 in classifying HMG and HCC, both models achieved sensitivities over 80%. However, the sensitivity of ResNet50 notably surpassed that of the CNN-based VGGnet (86.3 ± 7.6% for HMG and 81.2 ± 3.7% for HCC, compared with their reported 12%).
When comparing our ResNet50 model to the CNN + ReLU model introduced by Ryu et al.,23 our model demonstrated superior performance in discerning HCC. Specifically, our model achieved an accuracy of 87.2 ± 2.2%, sensitivity of 80.7 ± 6.8%, and specificity of 81.2 ± 3.7%, surpassing the results reported in their study, which showed accuracy, sensitivity, and specificity of 80%, 67%, and 90%, respectively.
These results indicate the promising performance of the ResNet50 model in the context of liver lesion diagnosis.

DISCUSSION
In this study, we employed a pretrained ResNet50 deep learning model that was specifically optimized to classify five distinct classes of focal liver lesions (FLLs) from ultrasound images. This transfer learning approach reduces data requirements and computational time compared with training from scratch, while producing a network optimized for our specific task. Our model performed well in the training phase, with an average accuracy of 87.0%.
To assess the performance of the ResNet50, GGN, ResNet18, and VGG16 models, we conducted a rigorous five-fold cross-validation analysis on the same dataset, benchmarking ResNet50 against the other models. Our results demonstrated that our model consistently outperformed GGN and ResNet18. While VGG16 exhibited performance comparable to our model, it showed higher uncertainty (larger SD) in accuracy and sensitivity across the five cross-validation folds. Subsequently, we used all models for predictive tasks across the five folds, employing a pre-divided testing dataset comprising 30 images per class in each fold. Our experiments thus established ResNet50 as the superior model among the options considered for comparison with other investigators.
We assessed the effectiveness of our newly optimized ResNet50 model by conducting performance evaluations on a testing subset, which comprised 10% of the data reserved for each fold. This subset consisted of 30 ultrasound images per class, encompassing both focal liver lesion (FLL) and non-FLL images. The model achieved accuracy ranging from 90% to 96.7% for predicting cysts, correctly identifying 27−29 out of 30 images across the five folds. In the remaining cases, the model misdiagnosed one image as non-FLL and another as HMG. These incorrect diagnoses likely reflect the hypoechogenic pattern shared by cysts and hemangiomas; alternatively, they may be caused by posterior acoustic enhancement of the cystic lesion. Further investigation will be necessary to determine the specific reasons behind these misclassifications.
With regard to FFS, our model accurately predicted 23−28 out of 30 images, with seven images misclassified as non-FLL, six as HMG, and five as HCC across the five folds. This incorrect classification may be attributed to overlapping features of hypoechoic lesions present in both FFS and HCC. The factors contributing to this misclassification warrant further examination.
In the case of HCC, our model achieved accuracy ranging from 70% to 86.7%, correctly identifying 21−26 out of 30 images. The remaining HCC images were misclassified as FFS (8 images), HMG (12 images), and non-FLL (9 images) across the five folds, likely because hyperechoic HCCs share features with typical hemangiomas,32 FFS, and the heterogeneous liver tissue of non-FLL cases. Inclusion of all HCC features (hypo-, hyper-, and mixed echogenicity) with a large sample size during model training would enhance the performance of future deep learning models.
For hemangiomas, the model achieved accuracy ranging from 76.7% to 86.7%, correctly classifying 23−26 out of 30 images. However, one image was misclassified as Cyst, nine as FFS, and 18 as HCC across the five folds. Hemangioma poses the greatest challenge for our model, which most often misclassified HMG as HCC, likely because its features overlap with those of other lesions. For example, a typical hemangioma presents as a uniform hyperechogenic lesion with well-defined margins, as may also be seen with small HCC. Additionally, atypical hemangiomas present various characteristics, such as inhomogeneous tissue and ill-defined margins, that can be confused with HCC, as well as hypoechoic lesions similar to FFS.33 Certain hemangiomas can exhibit posterior acoustic enhancement similar to hepatic cysts, potentially leading to misclassification by our model. Moreover, the inclusion of neighboring anatomical structures within the ROI, such as vessels or the gallbladder, may contribute to misclassification. Additional research is necessary to understand and resolve this diagnostic ambiguity. Inclusion of color Doppler as a novel image feature may offer significant advantages in future investigations.
Our model achieved up to 100% accuracy (ranging from 83.3% to 100%) in non-FLL classification, with only one image misclassified as Cyst, eight as FFS, and three as HCC for unclear reasons across the five folds. It is possible that, because all ultrasound images in this study were obtained from chronic liver disease patients who met the inclusion criteria for the liver cancer screening project at the research site, misinterpretation of non-homogeneous liver tissue in three patients led to these misclassifications. In addition, other tissue structures in the ROI, such as hepatic vessels, might cause misclassification as Cyst or FFS. The present study demonstrates promising results in ultrasound image classification; however, further investigation will be necessary to address the observed misclassifications, particularly when distinguishing hemangiomas from other lesions.
Previous studies20−24 indicate that deep learning models exhibit high efficacy in binary classification tasks, specifically when distinguishing between benign and malignant cases. However, results of this kind do not provide adequate support for routine clinical implementation. Extending to multiple classes addresses the practical challenges encountered in clinical scenarios, particularly in ultrasound image-based diagnosis of FLLs. In our study, we introduced a five-class deep learning model for FLL diagnosis. Our model demonstrates superior performance in distinguishing HCC when compared with some existing approaches that use a four-class methodology.
Our model achieved sensitivity and specificity values comparable to those reported by Hwang et al.,20 who utilized a binary-class FFNN model that produced sensitivity/specificity values of 40%/60% for classifying HMG and HCC. Despite using more classes, our CNN-ResNet50 model surpasses their reported values in terms of sensitivity, specificity, and accuracy. Our findings also align with those reported by Ryu et al.,23 who employed a CNN + ReLU model for classifying four classes of FLL (Cyst, HMG, HCC, and MLC) and two classes of FLL (benign and malignant). The study by Nishida et al.,24 which relied on a large HCC dataset, also supports our findings. Despite having access to fewer HCC cases (54 images) than the models adopted by Ryu et al. and Nishida et al., our model achieved a sensitivity of 73.3%, surpassing their results of 67% and 67.5%, respectively. However, our model exhibited lower specificity and accuracy. This indicates that, compared with their models, ResNet50 was superior in diagnosing true positive cases but inferior in diagnosing true negative cases.
Our model achieves lower performance (sensitivity, specificity, and accuracy) than reported by Yamakawa et al.22 for a CNN-based VGGnet with four classes of FLLs (Cyst, HMG, HCC, and MLC) and two classes of FLL (benign and malignant). However, their model was limited in its ability to distinguish MLC from other FLLs, with a sensitivity of only 46% and a specificity of 12%. The sensitivity and accuracy achieved by their model may be improved by adopting a binary-class approach; however, specificity remains relatively low (5%), indicating poor performance in correctly identifying true negative cases. Additionally, our model achieved the highest performance in classifying non-FLL versus other FLLs, consistent with the study by Reddy et al.21 These authors utilized three different models (CNN, VGG16 + transfer learning, and VGG16 + transfer learning and fine-tuning) for classifying normal liver and fatty liver tissue. Their results demonstrated good performance across all models, particularly the model combining VGG16 with transfer learning and fine-tuning.
This study presents several limitations.First, while the transfer learning network used here does not require an extensive dataset for training, the adopted sample size for FLLs was relatively small compared with the diverse lesion characteristics that are typically present in ultrasound images.A larger sample size may bolster ResNet50's performance in diagnosing FLLs; however, increasing sample size may not be sufficient.To address feature overlap among FFS, HMG, and HCC, future studies should include a broader spectrum of features for model training, including various characteristics of HCC and HMG such as hypo, hyper, and mixed echogenicity.Expert radiological assessment and assemblage of datasets with diverse image feature characteristics will significantly enhance model performance.Second, the absence of a consensus method for ultrasound image standardization led us to exclude image preprocessing for normalization.Future investigations should incorporate various image normalization techniques and compare their impact on model performance to determine the best method for image normalization.Third, the absence of MLC cases from our dataset may limit the clinical utility of our model.Future studies should include MLC cases to improve the applicability of ResNet50 to clinical settings.Finally, this study relied on data from a single center.Our findings therefore warrant further investigation through a multicenter study.Notwithstanding the above limitations, our five-class FLLs-trained model is adequate for developing CADx to assist in focal lesion screening, which is the primary purpose of the ultrasound modality.

CONCLUSION
The preliminary results show that the modified ResNet50 model adopted in this study produced satisfactory performance and accuracy values for most classes, particularly for non-FLL, Cyst, and FFS cases, which were associated with higher sensitivity values. However, its ability to differentiate between HMG and HCC cases requires improvement, as evidenced by the high number of images misclassified as HCC. Compared with previous studies that employed four-class deep learning approaches, ResNet50 outperformed CNN+ReLU and FFNN models in the diagnosis of HCC. Future investigations should thoroughly analyze datasets with diverse FLL characteristics of HMG and HCC to gain a comprehensive understanding of model performance in diagnosing these two FLLs.

AUTHOR CONTRIBUTIONS
Guarantors of integrity of entire study, study concepts/study design, approval of final version of submitted manuscript, manuscript drafting or manuscript revision for important intellectual content, statistical analysis, Sutthirak Tangruangkiat and Monchai Phonlakrai; literature research, data acquisition or data analysis/interpretation, experimental studies, Napatsorn Chaiwongkot, Thanatcha Rawangwong, Araya Khunnarong, Chanyanuch Chainarong, Preyanun Sathapanawanthana, Pantajaree Hiranrat; and manuscript editing, agrees to ensure any questions related to the work are appropriately resolved, Chayanon Pamarapa, Ruedeerat Keerativittayayut, Witaya Sungkarat.

FIGURE 1
Illustrates ROI placement on a hemangioma in an ultrasound image (a) and cropped image (b).

Figure 3
illustrates the key layers of the ResNet50 architecture used in this study:
I. Input layer: accepts input images of size 224 × 224 pixels with RGB color channels. To align images with this specification, grayscale images were converted to RGB format.
II. Convolutional layer: the initial layer applies 64 filters of size 7 × 7 with a stride of 2, extracting low-level features from the input image.
III. Max pooling layer: reduces the spatial dimensions of the feature maps using a 3 × 3 pool size and a stride of 2.
IV. Residual blocks: ResNet50 incorporates a hierarchical structure with 16 residual blocks distributed across four stages, each with a different number of blocks and filter sizes. Stage 1 encompasses three blocks: two convolutional layers with 64 filters and one convolutional layer with 256 filters. Stage 2 comprises four blocks: two convolutional layers with 128 filters and one with 512 filters. Stage 3 consists of six blocks: two convolutional layers with 256 filters and one with 1024 filters. Lastly, stage 4 involves three blocks: two convolutional layers with 512 filters and one with 2048 filters.
V. Global average pooling: converts 2D feature maps into 1D vectors by averaging across each feature map channel, producing fixed-length feature vectors.
VI. Fully connected layers: two fully connected layers with five units each serve as the classifier, mapping the learned features to class labels. A final softmax activation layer converts the output of the fully connected layers into predicted class probabilities for the input image.
VII. Output layer: the classification output layer, estimating the probability that the input belongs to one of the possible classes: non-FLL, Cyst, FFS, HMG, or HCC.

FIGURE 3 The modified CNN-ResNet50 architecture, adapted from the original architecture to accommodate five classes of FLL.

FIGURE 5
Presents a comparative analysis of model performance achieved through five-fold cross-validation in the training phase, using the same dataset. The models under consideration are GGN, ResNet18, VGG16, and ResNet50.

TABLE 1 The five-fold cross-validation results for the accuracy, sensitivity, and specificity of ResNet50 in the training phase.

Table 2
presents a comparative analysis between our ResNet50 model and previous research in the field of computer-aided liver lesion diagnosis. Our study demonstrates the superiority of the ResNet50 model over the binary-class feed-forward neural network (FFNN) model proposed by Hwang et al.

TABLE 2 Comparison of the predictive ResNet50 model's performance with previous studies.