Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

Generating an image/video caption has long been a fundamental problem in Artificial Intelligence, typically addressed by combining Deep Learning methods, Computer Vision, Knowledge Graphs, and Natural Language Processing (NLP). The central task of image/video captioning is to describe visual content in natural language. Because of the semantic gap, understanding and explaining images or videos both syntactically and semantically remains a major challenge: current systems must bridge the gap between low‐level and high‐level features during mapping. To tackle this problem, there is a need to survey the latest research and methods, identify their difficulties, and propose effective solutions. This work thoroughly analyses and investigates the most relevant methods (deep learning and knowledge graph‐based approaches), benchmark datasets, and evaluation metrics, along with their benefits and limitations. We also review the state‐of‐the‐art methods for image/video captioning and their applications in the current scenario. Finally, we provide thorough information on existing research with comparisons of results on benchmark datasets, and we discuss existing challenges and future directions of research.

Image captioning refers to automatically generating a natural language description of an image. 1 It is a challenging task that involves both image understanding and language generation. The goal of image captioning is to generate a sentence that accurately describes the content of an image and is grammatically correct. 2 Video captioning, also known as video description or video storytelling, generates natural language descriptions of the events and actions in a video. This task is similar to image captioning but more challenging, as it requires understanding both the visual content and the temporal dynamics of events. Recently, deep learning methods such as Convolutional Neural Networks (CNNs), 3 Recurrent Neural Networks (RNNs), 4 and Knowledge Graphs (KGs) 5 have been used to achieve significant improvements in image/video captioning. These methods use neural networks to learn features from images and videos, and these features are then used to generate captions. Captioning involves detecting objects and their attributes, identifying their interactions and relationships with other objects, and finally describing all of this so that anyone can quickly grasp what is happening in the image/video.

Image captioning
Explaining or generating the description of an image is called image captioning or frame captioning. 6 Image captioning refers to finding a description of each frame, but the important point is to express that description as a sentence that is correct and carries the proper meaning. 7,8,[27][28][29] Generating an automatic description of an image is relatively recent work. As we know, most communication between machines and humans depends on understanding and producing natural language. 30 That is why various applications of image description arise in real-world scenarios, such as information access and retrieval, [31][32][33] assistance to visually impaired people, 34,35 education, natural language processing, and social media, 36 and so forth. Image captioning is gaining popularity and becoming a significant field of study in artificial intelligence and computer vision. Some examples of image captioning are given in Table 1.

Table 1. Examples of image captioning (images omitted):
"A cat is perched close to a pine tree and is gazing upward." 37
"An old man with a black suit is playing Piano." 38
"A baby girl wearing a Hijab in preparation to offer Namaz (Salah)."
Figure 1. Video caption.

Video captioning
Providing an automatic description of video content in human-understandable language is widespread and is termed video captioning. 39 The video captioning task is extremely attractive in artificial intelligence, computer vision, and knowledge graphs. In the past, video captioning was the task of detecting visual content with manually designed features and generating a caption in the form of a sentence. 40,41 The purpose of video captioning is to produce a sequence of words that explains the visual content of the video. In addition to the fact that a video contains significantly more information than a still image, it is necessary to capture the temporal dynamics to comprehend the video material. 42 There are many applications in which video captioning plays a significant role, such as Video Retrieval Systems (VRS), 43 Visual Question Answering (VQA), 44,45 assistance to visually impaired people, 46 text-to-speech technology, 39 and so forth. An example of video captioning is given in Figure 1.

Dense video captioning
Dense captioning of videos is composed of different steps: the first is to identify all events in the video, the second is action recognition, and the third is to perform video captioning for all possibilities in a particular video. 38 Localizing noteworthy events in an untrimmed video and creating textual descriptions (captions) for each identified event is known as dense video captioning. The majority of earlier works on dense video captioning used only visual cues. 47,[49][50][51][52][53][54] An example of a dense video caption 38 is given in Figure 2.

ORGANIZATION
The rest of the work is organized as follows. Section 3 reviews related work on image/video captioning and dense video captioning. Section 4 describes deep learning methods for image captioning, with applications, evaluation metrics, and datasets. Section 5 describes deep learning methods for video captioning, with applications, datasets, and evaluation metrics. Section 6 describes deep learning methods for dense video captioning, with applications, datasets, and evaluation metrics. Section 7 describes knowledge graph-based methods for image captioning and dense video captioning.
In Section 8, we present an evaluation of existing work on benchmark datasets. We discuss existing challenges in Section 9. In Section 10, we conclude this work and point to future directions in image and video captioning.

RELATED WORK
Captioning describes an image/video with a natural language sentence/paragraph. In the fields of deep learning and knowledge graphs, it has been a vital task, carried out by recognizing each action and describing it. We have thoroughly studied existing work on image captioning, video captioning, and dense video captioning. Liunian Li et al. 55 developed Grounded Language-Image Pre-training (GLIP), a model for learning object-level, language-aware, and semantically rich visual representations. The work suggests integrating text and image deeply, making the detection model linguistically aware and a solid foundational model, and enables pre-training GLIP on scalable and semantically rich grounding data via reformulation and deep fusion.
Jia et al. 56 developed a straightforward technique, A Large-scale ImaGe and Noisy-text embedding (ALIGN), for scaling up the learning of visual and vision-language representations from vast amounts of noisy image-text data. They trained a dual-encoder model using a contrastive loss and obtained good results on the benchmark datasets Flickr30K and MSCOCO.
Maofu Liu et al. 57 tackled the problem of image information and explored visual attention to understand an image using a Fully Convolutional Network (FCN), 58 which is used to predict the labels. The work was carried out on the Chinese Caption Dataset (CCD), and the results compare effectively and feasibly with others.
Xuelong Li 59 applied text-guided attention and semantic attention to obtain the most relevant spatial information and to reduce the semantic gap between vision and natural language. Finally, the authors gathered all the data to produce the required answers in caption form for a visual question answering system. Songtao Ding 60 applied the theory of attention from psychology to image captioning, combining low-level features (quality of an image) with high-level features (regions of an image) to focus on particular areas of an image, using filtered image features. 61 They combined a bottom-up attention mechanism with Faster R-CNN to obtain results on benchmark datasets such as MSCOCO (Microsoft Common Objects in COntext), Flickr30K, PASCAL, 62 and SBU. 63
Xinlei Chen et al. 64 employed sentence-based explanations and bidirectional mapping between visual features and text. Their work generated novel captions for an image and was able to reconstruct visual features given an image description. The authors tested their work on sentence generation, sentence retrieval, and image retrieval, using different datasets to evaluate the performance of their model, such as the PASCAL sentence dataset, 65 Flickr8K, Flickr30K, and MSCOCO. 66 Junhua et al. 9 created a system to produce new captions from unique pairings of elements. The authors proposed a Multimodal Recurrent Neural Network (m-RNN) framework tailored for the retrieval and sentence creation tasks. The model comprised a CNN and an RNN that interact with one another in a multimodal layer that receives three inputs: an image representation, a word embedding layer, and a recurrent layer. A final softmax layer builds the probability distribution for the next word.
Mathew et al. 67 proposed SentiCap, a system that incorporates positive and negative attitudes into captions. It was a switching RNN model with word-level regularization that emphasized sentiment. 68 To create styled captions, two networks, a CNN and an RNN, were used. One network was trained on a large dataset of image captions to provide typical factual descriptions, and the other was trained on a smaller dataset annotated with sentiment polarity. Studies revealed that 74% of the phrases created by SentiCap had the correct sentiment.
Anderson et al. 69 proposed a novel algorithm for training sequence models, such as RNNs, on partially specified sequences represented using finite state automata. This method lifted the restriction that previously required image captioning models to be trained only on paired image-sentence corpora. The authors applied their approach to an existing neural captioning model and achieved state-of-the-art (SOTA) results on novel object captioning tasks using the MSCOCO dataset. Further, they trained their model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores. Moving towards research related to video captioning, several prominent researchers have obtained good results in this field. Zhang et al. 70 presented a comprehensive video captioning system that included a new structure and an effective training strategy to tackle a current problem in video captioning: existing models lack good visual representations because they neglect interactions among objects. They presented an encoder using an Object Relational Graph (ORG) to capture more specific interaction features and enhance visual representation. They also designed a scheme called Teacher-Recommended Learning (TRL) to make full use of an External Language Model (ELM) and incorporate its wealth of linguistic knowledge into the caption model. The ELM generated additional semantically comparable word proposals to address the long-tailed problem, extending the ground truth words utilized in training. Three benchmark datasets, MSVD, 71 MSR-VTT, 72 and VATEX, 73 were used to evaluate the system's performance, which revealed that the proposed ORG-TRL system achieved state-of-the-art performance. Visualizations and extensive ablation studies demonstrated the efficiency of the approach.
Chenggang et al. 74 proposed an encoder-decoder neural network with a new Spatial-Temporal Attention Mechanism (STAT) for video captioning. STAT effectively accounts for both the spatial and temporal patterns inside a video clip, causing the word prediction decoder to automatically select the significant regions in the most appropriate temporal segments. Two well-known benchmark datasets, MSVD and MSR-VTT-10K, were used to assess the spatial-temporal attention mechanism. According to the experimental findings, it performed at the cutting edge on three widely used evaluation metrics: BLEU-4, METEOR, and CIDEr.
Yang et al. 75 proposed a novel method for video captioning using adversarial learning and LSTM. The authors built on the Generative Adversarial Network (GAN) 76 model, which incorporates two components: a generator (used to generate natural language sentences describing the visual content) and a discriminator that controls the accuracy of each sentence. They used an LSTM network to implement an existing video captioning concept and suggested a novel realization of the discriminator, tailored specifically to the video captioning challenge, that takes both the sentences and the video features as input.
Xiaojuan et al. 77 proposed a semantic descriptor with an objective for scene recognition. The authors used the statistical information of objects appearing in each scene to compute the distribution of each object across scenes, which yields the co-occurrence pattern of objects. To make image descriptors more discriminative, they discarded patches with non-discriminative objects to enhance the intra-class generalized characteristics. They performed their experiments on benchmark scene datasets such as Scene, 78 MIT Indoor, 79 and SUN 80 and achieved good results.
Lianli et al. 81 introduced a unique framework for learning multi-level representations and generating syntax-aware video captions, called Hierarchical Representation Network with Auxiliary Tasks (HRNAT). To learn how to represent videos hierarchically, using the three-level representation of language as a reference, they used a cross-modality matching task. With the help of the syntax-guiding task and the vision-assist task, the generated descriptions not only matched the video content globally but also adhered to the syntax of the ground truth description. The essential elements of their model are generic and easily adaptable to applications requiring both video captioning and Video Question Answering (VQA). The effectiveness and superiority of the suggested method over state-of-the-art methods were validated by the authors' evaluation of the framework's performance on several benchmark datasets.
Huaishao et al. 82 proposed a model known as CLIP4Clip, which seamlessly transfers the knowledge of the CLIP model to video-language retrieval. They used a video encoder and text decoder on datasets such as MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. They adopted the ViT-B/32 83 video encoder, which has 12 layers and a patch size of 32. Their work builds explicitly on the pre-trained CLIP (ViT-B/32) 84 and focuses chiefly on converting image representations to video representations. The video-text retrieval challenge in their work is successfully completed using the pre-trained CLIP (ViT-B/32).
Tang et al. 85 proposed a CLIP-enhanced video-text matching network as the foundation of the CLIP4Caption system, which improves video captioning. This approach compels the model to learn strongly text-correlated video features for text generation, making the most of knowledge from both vision and language. Additionally, unlike most previous models that used an LSTM or GRU as the sentence decoder, they employed a Transformer-structured decoder network to efficiently learn long-range visual and language relationships, and they provided a new ensemble method for captioning assignments as well. Experimental results showed the practicality of their approach on the benchmark MSR-VTT dataset.
Zhou et al. 86 discovered facial units in video sequences of one or more persons in an unsupervised manner. Methodologies for temporal segmentation 87 and clustering 88 of sequences containing facial features have tried to fill the semantic gap between low-level and high-level features. Portillo et al. 89 explored using CLIP, a language-image model, to produce video representations without requiring the annotations mentioned earlier. This model was specifically designed to learn a space in which text and images can be compared. The approach considered only the visual and text modalities and applied an aggregation function to frame-level features, which is common in other video retrieval works. The authors performed their experiments on two benchmark datasets, MSR-VTT and MSVD, and obtained good results.
Bang et al. 90 introduced Dual Attribute Prediction, an auxiliary task requiring a video captioning model to learn the correspondence between video content and attributes as well as the co-occurrence relations between attributes. For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation. Comparing INP with CLIP (Contrastive Language-Image Pre-training), their work investigated the potential deficiencies of INP for video captioning and explored the key to generating accurate descriptions. They found that INP makes it tricky for video captioning models to capture the semantics of attributes and leaves them sensitive to irrelevant background information. By contrast, CLIP's significant boost in caption quality highlighted the importance of attribute-aware representation learning. The authors achieved good results on the MSR-VTT dataset. There are many other existing works [91][92][93][94][95][96][97][98][99][100] on video captioning.
Moving on to work on dense video captioning, which generates and describes every event happening in a video in natural language: the dense image captioning challenge served as a source of inspiration for this task. 49 Vladimir Iashin et al. 47 introduced a novel method for dense video captioning that can use a variety of modalities to describe events, and demonstrated how the audio and speech modalities in particular can enhance a dense video captioning model. An automatic speech recognition system was used to produce a temporally synchronized textual description of the speech. They then treated this description as a distinct input, in addition to the audio track and the accompanying video frames. To translate the multimodal input data into textual descriptions, they formulated the captioning task as a machine translation problem and used the recently described Transformer architecture. The authors used the ActivityNet 38 dataset to show the effectiveness of their approach.
In particular, the authors of Reference 51 used the concept of context awareness, 38 generalizing the temporal event proposal module to use both past and future contexts, with an attentive fusion to distinguish captions of highly overlapped events. Meanwhile, a Single Shot Detector (SSD) 101 was also used to generate event proposals, with reward maximization for better captioning. 52 To mitigate the intrinsic difficulty RNNs have in modeling long-term dependencies in a sequence, Zhou et al. 48 tailored the recent idea of the Transformer 102 to dense video captioning. In Reference 103, the authors noticed that captioning may benefit from interactions between objects in a video and developed recurrent higher-order interaction modules to model these interactions. Xiong et al. 104 noticed that many previous models produced redundant captions and proposed to produce captions progressively, conditioned on the previous caption, while applying paragraph- and sentence-level rewards. Similarly, a "bird view" correction and reward maximization on two levels for vast, logical narrative telling were also employed. 53 Yu et al. 105 presented the Accelerated Masked Transformer (AMT) model 48 for dense video captioning and compared it with its counterpart; AMT is significantly faster while maintaining performance. The authors worked on two parts: first, they brought a compact, anchor-free proposal scheme and a basic attention strategy to the architecture; second, they used a single-shot feature masking technique with a standard attention mechanism. The authors obtained better results in experiments on datasets such as ActivityNet Captions and YouCookII. 106 Though existing captioning techniques have made encouraging strides, they cannot represent implicit components of the image, since the information gained from ground truth captions is not extensive enough, nor can they characterize new traits or qualities beyond the collected knowledge they were trained on. This problem can be resolved by adding data from outside sources into the caption generation process. In a previous study, Anne Hendricks et al. 107 used object knowledge from external object recognition datasets or text corpora to enable novel object captioning.
Li et al. 108 and Gu et al. 109 used knowledge graphs to generate scene graphs and to answer visual questions, respectively. These two works made their models adaptable to external test situations by embedding the knowledge obtained from external knowledge graphs into a space shared with other data. Zhou et al. 110 suggested using knowledge graphs to improve image captioning by pre-training an RNN with phrases, extracted from a knowledge graph, that are directly and indirectly connected to entities identified by an object detector.
Based on the related work, we have categorized visual captioning into deep learning and knowledge graph-based methods for image/video captioning and dense video captioning in Figure 3.

DEEP LEARNING METHODS FOR IMAGE CAPTIONING
Recently, a lot of work has been done on image captioning using deep learning methods. We explore some of the notable work by prominent researchers in artificial intelligence and computer vision.

Figure 4B. General deep learning method for image captioning.
Using a pre-trained Convolutional Neural Network (CNN), high-level visual information is extracted from images for use in image captioning. These features are then fed as inputs to a sequence model, such as a Recurrent Neural Network (RNN), to produce textual captions. Feature extraction converts the raw image data into a more condensed and comprehensive representation, accelerating the captioning procedure. By exploiting the image's visual context, it enables the subsequent sequence model to concentrate on the textual part of caption generation.
In the context of Convolutional Neural Networks (CNNs), a feature vector at the fully connected layer with dimensions 1 × 1 × 2048 is a single-dimensional vector that captures high-level abstract features extracted from the input image. The 1 × 1 term indicates the spatial dimensions of the feature map: it has been reduced to a single spatial location of one row and one column, while 2048 is the number of channels in the 1 × 1 feature map, each channel capturing a different aspect of the image's high-level features. In the working process of generating a caption, an image of a cat is taken as input to the model, a pre-trained Deep Residual Network (ResNet-50) 111 encodes and extracts the features of the image, and an LSTM 112 then decodes these features to generate the caption one word at a time. 113 In deep learning methods, features are learned automatically from training data, so they can handle huge and varied sets of images. CNNs 3,114 are primarily used to learn features, and a softmax layer performs classification in the captioning process. Currently, CNNs are combined with RNNs 4 to generate captions for a specific image or image dataset. There has been a lot of research on image captioning with deep learning methods, as can be seen in Table 2.
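To make this pipeline concrete, the following is a minimal sketch of such a CNN-encoder/LSTM-decoder model in PyTorch, assuming a recent torchvision; the embedding size, hidden size, and vocabulary size are illustrative placeholders rather than values from any particular paper.

```python
# A minimal sketch of the CNN-encoder / LSTM-decoder captioning pipeline
# described above, assuming PyTorch and torchvision are available.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final classification layer; keep the 1 x 1 x 2048 pooled feature.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                  # images: (B, 3, 224, 224)
        with torch.no_grad():                   # frozen, pre-trained encoder
            feats = self.backbone(images)       # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))        # (B, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "word" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                  # per-step vocabulary logits

encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=10000)
logits = decoder(encoder(torch.randn(2, 3, 224, 224)),
                 torch.randint(0, 10000, (2, 12)))   # dummy batch
```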

Datasets
There are some datasets 128 that are widely used to evaluate and compare image captioning; these can be seen in Table 3.

Evaluation metrics
A newly developed model may still perform poorly, so we must evaluate the model, system, or architecture with suitable evaluation metrics. To achieve better results, image captioning work must be assessed with standard metrics such as BLEU, ROUGE, METEOR, CIDEr, and SPICE. These are well-known and widely used measures for gauging the effectiveness of developed models. Evaluation of the image captioning task focuses on the grammar, the richness of the generated sentence, its quality, and its semantic correctness. 135

BLEU
BLEU stands for Bilingual Evaluation Understudy. 136 It is a standard approach used to assess the quality of the generated text/sentence in the image captioning process. The BLEU metric is based on n-gram precision: it scores each generated segment against a group of high-quality reference translations before estimating the overall score. BLEU uses n-gram matching as the similarity measure in the image description; the predicted caption and the label are examined through their shared n-grams to determine the BLEU score. The Euclidean norm, also known as the L2 norm or 2-norm, expresses the length of a vector in Euclidean space and is used here to determine the fine-grained representation of a vector; for a vector x = (x_1, x_2, … , x_n) it is given by

||x||_2 = sqrt( x_1² + x_2² + … + x_n² ).
Suppose there are two sentences, a target sentence and a predicted sentence. We first find the 1-gram precision (p1), 2-gram precision (p2), 3-gram precision (p3), and 4-gram precision (p4), and then combine these precision scores into the Geometric Average of the Precision Scores (GAPS) given in Equation (1). It can be computed for different values of N and different weights; typically, N = 4 with uniform weights w_n = 1/N:

GAPS(N) = exp( Σ_{n=1}^{N} w_n · log p_n ). (1)

Next, we calculate the Brevity Penalty using Equation (2):
Brevity Penalty = 1 if c > r, and e^(1 − r/c) if c ≤ r, (2)

where c is the predicted length, equal to the number of words in the predicted sentence, and r is the target length, equal to the number of words in the target sentence.
In this example, c = 8 and r = 8, which means the Brevity Penalty is 1. Finally, to calculate the BLEU score, we multiply the Brevity Penalty by the Geometric Average of the Precision Scores, 137 as in Equation (3):

BLEU(N) = Brevity Penalty × GAPS(N). (3)
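As an illustration of Equations (1)-(3), here is a small self-contained BLEU sketch, assuming simple whitespace tokenization and a single reference; standard toolkits (e.g., NLTK or SacreBLEU) should be preferred for real evaluation.

```python
# A toy BLEU computation following Equations (1)-(3); zero n-gram counts
# are smoothed with a tiny constant so the logarithm stays defined.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, N + 1):
        c_ng, r_ng = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ng & r_ng).values())        # clipped n-gram matches
        total = max(sum(c_ng.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    gaps = math.exp(sum(math.log(p) for p in precisions) / N)  # Equation (1)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)                 # Equation (2)
    return bp * gaps                                           # Equation (3)

print(bleu("a cat sits near a pine tree",
           "a cat is perched close to a pine tree"))
```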

CIDEr
Consensus-based Image Description Evaluation, known as CIDEr, treats each sentence as a document. 138 The cosine of the angle between term frequency-inverse document frequency (TF-IDF) vectors is calculated to determine how similar the description sentence and the label are, and the outcome is produced by averaging the similarity over tuples of various lengths. The specific formula is given in Equation (4):

CIDEr_n(c, S) = (1/M) Σ_j [ g^n(c) · g^n(s_j) ] / ( ||g^n(c)|| · ||g^n(s_j)|| ), (4)
where c stands for a candidate caption, S for the set of reference captions, n for the n-gram length being assessed, M for the total number of reference captions, and g^n(·) for the n-gram-based TF-IDF vector. This technique enables distinct tuples to have varied weights through their TF-IDF values, since tuples that appear more frequently across the whole corpus tend to carry less information. As a result, CIDEr can accurately and meaningfully assess the quality of descriptive sentences.
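The following toy sketch illustrates the consensus idea behind Equation (4); for brevity it uses raw n-gram term frequencies, whereas real CIDEr weights each n-gram by corpus-level TF-IDF statistics.

```python
# A simplified, CIDEr-like score: cosine similarity between n-gram vectors
# of a candidate and each reference, averaged over references and n-gram
# orders. The TF-IDF weighting of real CIDEr is omitted for brevity.
import math
from collections import Counter

def ngram_vec(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(u, v):
    dot = sum(c * v.get(g, 0) for g, c in u.items())
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def cider_like(candidate, references, N=4):
    cand = candidate.split()
    score = 0.0
    for n in range(1, N + 1):
        c_vec = ngram_vec(cand, n)
        sims = [cosine(c_vec, ngram_vec(r.split(), n)) for r in references]
        score += sum(sims) / len(sims)   # average over the M references
    return score / N                     # average over n-gram orders

print(cider_like("a cat sits on a mat", ["a cat is sitting on a mat"]))
```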

METEOR
Another evaluation index used in machine translation is the Metric for Evaluation of Translation with Explicit ORdering (METEOR). 139 The METEOR metric computes the precision and recall for a candidate image caption before combining them. Suppose w_t is the number of words in the candidate, w_r is the word count in the reference, and m is the number of matched words between them.
Then Precision (P) = m/w_t and Recall (R) = m/w_r. The harmonic mean, denoted F_mean, is given by Equation (5):

F_mean = P · R / ( α · P + (1 − α) · R ). (5)
Finally, the METEOR metric is computed as

METEOR = (1 − Pen) × F_mean,

where Pen = γ · (ch/m)^θ is the penalty factor. Here ch stands for "chunks", the number of contiguous ordered blocks of matched words, and α, θ, and γ are hyperparameters.
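A toy sketch of this computation follows, assuming exact unigram matches only (real METEOR also matches stems, synonyms, and paraphrases), a greedy word alignment, and commonly cited default hyperparameters α = 0.9, θ = 3, γ = 0.5.

```python
# A METEOR-like score following Equation (5) and the penalty definition above.
def meteor_like(candidate, reference, alpha=0.9, theta=3.0, gamma=0.5):
    cand, ref = candidate.split(), reference.split()
    # Greedily align each candidate word to an unused reference position.
    used, align = set(), []
    for i, w in enumerate(cand):
        for j, r in enumerate(ref):
            if r == w and j not in used:
                used.add(j)
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    P, R = m / len(cand), m / len(ref)
    f_mean = (P * R) / (alpha * P + (1 - alpha) * R)
    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = gamma * (chunks / m) ** theta
    return (1 - penalty) * f_mean

print(meteor_like("a cat sat on the mat", "a cat is on the mat"))
```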

ROUGE
Based on the co-occurrence statistics of N-tuples between candidate and reference texts, ROUGE, the Recall-Oriented Understudy for Gisting Evaluation, 140 analyzes summaries. It measures the recall rate of N-tuples and is employed to gauge how fluent a machine translation is. Take the frequently employed ROUGE-L as an illustration: given a candidate C of length n and a reference R of length m, LCS(C, R) denotes the length of their longest common subsequence, and ROUGE-L is computed by Equation (6):

R_lcs = LCS(C, R)/m, P_lcs = LCS(C, R)/n, F_lcs = (1 + β²) · R_lcs · P_lcs / ( R_lcs + β² · P_lcs ). (6)
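A minimal sketch of ROUGE-L as in Equation (6), computing the LCS by dynamic programming; the β weighting is the standard recall-favouring choice.

```python
# ROUGE-L: LCS-based recall, precision, and F-measure, per Equation (6).
def lcs_len(a, b):
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p_lcs, r_lcs = lcs / len(c), lcs / len(r)
    return ((1 + beta**2) * p_lcs * r_lcs) / (r_lcs + beta**2 * p_lcs)

print(rouge_l("a cat sat on the mat", "a cat is on the mat"))
```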

SPICE
Semantic Propositional Image Caption Evaluation (SPICE) 141 encodes the objects, attributes, and relationships in the description sentence using a graph-based semantic representation, which is then evaluated at the semantic level. Assume that S stands for a collection of reference captions and that c is a candidate. The candidate's scene graph is denoted G(c) = ⟨O(c), E(c), K(c)⟩, the scene graph of S is denoted G(S), and T(·) denotes the transformation of a scene graph into a set of tuples, with T(G(c)) = O(c) ∪ E(c) ∪ K(c).
The precision can then be represented as Equation (7):

P(c, S) = |T(G(c)) ⊗ T(G(S))| / |T(G(c))|, (7)

where the binary matching operator ⊗ yields the set of matching tuples. While SPICE is better at evaluating semantic information, it cannot assess the natural fluency of sentences, since it ignores grammatical criteria. Additionally, as the evaluation primarily examines noun similarity, it is unsuitable for applications like machine translation.
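The following sketch illustrates the tuple-matching step, assuming scene-graph tuples have already been extracted; exact tuple equality stands in for SPICE's WordNet-based synonym matching.

```python
# SPICE-style tuple matching: precision (Equation 7), recall, and F1 over
# scene-graph tuples of a candidate caption versus the reference set.
def spice_f1(cand_tuples, ref_tuples):
    matched = len(cand_tuples & ref_tuples)   # the ⊗ matching operator,
    if matched == 0:                          # approximated by set equality
        return 0.0
    p = matched / len(cand_tuples)
    r = matched / len(ref_tuples)
    return 2 * p * r / (p + r)

# Hypothetical tuples: objects, attributes, and relations.
cand = {("cat",), ("mat",), ("cat", "on", "mat"), ("cat", "sitting")}
refs = {("cat",), ("mat",), ("cat", "on", "mat"), ("cat", "black")}
print(spice_f1(cand, refs))
```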

DEEP LEARNING METHODS FOR VIDEO CAPTIONING
The task of creating captions for video is very similar to image captioning. The main intention of the video captioning process is to describe a video's visual content in natural language form. 39 Video captioning is used in video subtitling, video surveillance, human-machine interaction, and so forth. The general deep learning method for video captioning is given in Figure 5.
Figure 5 shows the procedure for the video captioning process, which begins as follows (a feature-extraction sketch is given after the list):
1. Take a video clip/dataset as input.
2. Generate frames from the input video.
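A minimal sketch of these first steps and the usual next step (frame feature extraction and pooling), assuming OpenCV and torchvision are available; the sampling rate, the pooling choice, and the input file name clip.mp4 are illustrative assumptions.

```python
# Sample frames from a video, encode each with a pre-trained CNN, and
# mean-pool into a clip-level feature that a sequence decoder can consume.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def sample_frames(video_path, every_n=30):
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:                       # uniform sampling
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()          # expose the 2048-d pooled feature
cnn.eval()

frames = sample_frames("clip.mp4")    # hypothetical input video
batch = torch.stack([preprocess(f) for f in frames])
with torch.no_grad():
    frame_feats = cnn(batch)          # (num_frames, 2048)
clip_feat = frame_feats.mean(dim=0)   # clip-level feature for the decoder
```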
There has been a lot of research on video captioning with deep learning methods, as can be seen in Table 4.

Datasets
The availability of labeled datasets for video description has been a key factor in the rapid development of this field of study. Except for a few datasets containing many sentences or even paragraphs per video sample, most datasets assign only one caption per video. Benchmark datasets for the video captioning process are listed in Table 5.

Evaluation metrics
The metrics for evaluating machine-generated video captions are the same as the image captioning metrics: Bilingual Evaluation Understudy (BLEU), 136 Consensus-based Image Description Evaluation (CIDEr), 138 Metric for Evaluation of Translation with Explicit ORdering (METEOR), 139 Recall-Oriented Understudy for Gisting Evaluation (ROUGE), 140 Semantic Propositional Image Caption Evaluation (SPICE), 141 and Word Mover's Distance (WMD). 174
Figure 5. General deep learning method for video captioning.

DEEP LEARNING METHODS FOR DENSE VIDEO CAPTIONING
Dense video captioning aims to identify the key moments in the input video and create a thorough caption for each one. It is challenging, since it calls for a comprehensive understanding of the video's contents and contextual reasoning about specific occurrences to describe events accurately and faithfully. 175,176 Dense video captioning comprises two tasks: proposing events and captioning those events. 38 The latest work 47,106,176 follows the two-stage "detect-then-describe" framework, in which an event proposal module first predicts a set of event segments and a captioning module then constructs a caption for each candidate event segment. Another line of work 91,177 removes the explicit event proposal process. The dense video captioning process is illustrated in Figure 6. 91 There is a lot of work on dense video captioning, 49,123,176,178,179 as summarized in Table 6. Here we explain one of these works, following the method mentioned above, in which dense caption generation is formulated as a set prediction task. The authors proposed a straightforward, efficient framework for end-to-end dense video captioning with parallel decoding (PDVC) that contains the following steps: 1. Use a pre-trained video feature extractor. 2. Use a transformer encoder to extract frame-level features. 3. Use a transformer decoder and three prediction heads to predict the locations and captions of events. 4. Learn the number of events through event queries.
The work offers two kinds of caption heads based on different LSTM models: a vanilla LSTM and a deformable soft-attention-enhanced LSTM. During testing, the top detected events are determined by ranking the combined captioning and localization scores, without needing non-maximum suppression to remove redundant predictions.

Knowledge graph
Knowledge graphs have been used frequently in research and business, usually in close association with semantic web technologies, linked data, large-scale data analytics, and cloud computing. Their popularity was boosted by the introduction of Google's Knowledge Graph in 2012. 5 Significantly, major companies such as Google, Yahoo, Microsoft, and Facebook have created their own "knowledge graphs" to power semantic search and enable smarter data processing and delivery. A knowledge graph can be envisaged as a network of all things relevant to a specific domain or organization. It is not limited to abstract concepts and relations but can also contain instances of documents and datasets. A knowledge graph can be characterized as follows: 1. It mostly describes real-world entities and the relationships between them, arranged in a graph. 2. It defines the potential entity types and relationships in a schema.
3. It allows for potential relationships between any two arbitrary entities. 4. It covers a range of subject areas. 184 To generate new knowledge, a knowledge graph gathers and incorporates data into an ontology and, with the help of a reasoning engine, derives possible solutions. Figure 7 illustrates the combination of these assumptions, which yields an abstract knowledge graph architecture. 5 The most prominent open knowledge graphs, such as Google's Knowledge Graph, 185 Google Knowledge Vault, 186 DBpedia, 187 YAGO (Yet Another Great Ontology), 188 Freebase, 189 Wikidata, 190 NELL, 191 and PROSPERA, 192 cover multiple domains, representing a broad diversity of entities and relationships.

Knowledge graph-based methods for image captioning
As per Figure 8, the description reads "A woman is standing with luggage", because the caption conveys only the basic elements of the image and not why the woman is standing there. The words "lady" and "luggage", which characterize the image's key components, are given more importance in the generated caption than other words. By combining external knowledge (a knowledge graph), 193 the caption can instead read as though she might be looking for a roadside direction signboard to note down an address, or waiting for a bus with her luggage. To generate the caption of an image with a knowledge graph, we follow these steps: 1. Feature extraction and word embedding. A region proposal network is used to create many rectangular region proposals. Each proposal is then fed through an ROI pooling layer and three fully connected layers to obtain a vector representation of each image region. To provide a general understanding of the image, we generate features V = {v1, v2, … , vL}, vi ∈ R^D, and the mean-pooled vector v̄ is used to initialize the LSTM decoder, which generates one word at each step (a small sketch of this initialization follows). Word attention is then added to help the caption generator develop captions; during training, ground truth captions are the primary source of word attention.
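A minimal sketch of this initialization step, assuming the L region features have already been produced by the region proposal network; all sizes are illustrative placeholders.

```python
# Mean-pool L region features of dimension D and use the result to
# initialize the hidden and cell states of an LSTM caption decoder.
import torch
import torch.nn as nn

L_regions, D, hidden = 36, 2048, 512
V = torch.randn(1, L_regions, D)            # region features v1..vL (dummy)
v_bar = V.mean(dim=1)                       # mean-pooled global feature

init_h = nn.Linear(D, hidden)               # project to LSTM state size
init_c = nn.Linear(D, hidden)
h0, c0 = init_h(v_bar).unsqueeze(0), init_c(v_bar).unsqueeze(0)

decoder = nn.LSTM(input_size=300, hidden_size=hidden, batch_first=True)
word_embeddings = torch.randn(1, 5, 300)    # embedded caption words (dummy)
out, _ = decoder(word_embeddings, (h0, c0)) # decode conditioned on the image
```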

Knowledge graph
The knowledge contributed by people for captions, also known as internal knowledge, is represented by the ground truth annotations for each image in paired image-caption datasets. However, many available datasets can include only some of the information necessary for captioning tasks, which restricts the advancement of research. Consequently, obtaining information from outside sources to aid caption production enhances the generalization capability of the captioning model. Knowledge graphs have become increasingly common in artificial intelligence in recent years. Chin et al. 194 used ConceptNet to help computers comprehend human intentions.
ConceptNet is an open, multilingual knowledge graph containing common sense information intimately tied to daily human life.
Each knowledge item in the knowledge graph can be seen as a triplet (subject, rel, object), where subject and object stand for two real-world concepts or entities and rel denotes their relationship. Faster R-CNN is used to identify the objects or visual concepts in the image. These objects or concepts are then used to retrieve semantically related knowledge from the knowledge graph, to gain an informative understanding pertinent to the given image.
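An illustrative sketch of this retrieval step; the tiny in-memory triplet list below is a stand-in for a real knowledge graph such as ConceptNet, and the detected object set mimics what a detector such as Faster R-CNN might output.

```python
# Retrieve (subject, rel, object) triplets related to detected visual concepts.
knowledge_graph = [
    ("cat", "CapableOf", "climb trees"),
    ("cat", "AtLocation", "garden"),
    ("piano", "UsedFor", "playing music"),
]

def retrieve_triplets(detected_objects, kg):
    """Return triplets whose subject or object matches a detected concept."""
    return [(s, r, o) for (s, r, o) in kg
            if s in detected_objects or o in detected_objects]

# Hypothetical detector output for an image of a cat near a tree.
print(retrieve_triplets({"cat", "tree"}, knowledge_graph))
```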

Reinforcement learning-based sequence generation
The central tenet of this reinforcement learning-based training method is that the reward received by the inference algorithm during testing serves as the baseline for the REINFORCE algorithm. This keeps training and inference consistent, greatly enhancing the quality of the generated captions.
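A hedged sketch of this self-critical idea in PyTorch, using placeholder reward values (e.g., CIDEr scores) in place of a full captioning model.

```python
# The inference-time (greedy) caption's reward acts as the baseline:
# sampled captions that beat it are reinforced, worse ones are suppressed.
import torch

reward_sampled = torch.tensor([0.9, 0.4])     # rewards of sampled captions
reward_greedy = torch.tensor([0.7, 0.7])      # baseline from inference mode
log_prob_sampled = torch.randn(2, requires_grad=True)  # stand-in log-probs

advantage = reward_sampled - reward_greedy
loss = -(advantage * log_prob_sampled).mean() # REINFORCE with baseline
loss.backward()
```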

Knowledge graph-based methods for dense video captioning
In this work, 195 TransE 196 represents the knowledge graph and Mask R-CNN serves as the object detector. The relation is the predicted result of the TransE model, and the object category is the predicted result of the object detector. Figure 9 shows the process of dense video captioning with a knowledge graph; the work depicted there can be understood through the following steps.

Object detection
As per the above procedure, Mask R-CNN 197 detects objects in the given video frames.

Knowledge graph
For the relationships between objects, the TransE 196 model has been chosen; it represents entities and relations as distributed vectors, serving as the knowledge representation.
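A minimal sketch of the TransE scoring function under the translation assumption h + r ≈ t; the embeddings here are random placeholders rather than trained vectors.

```python
# TransE: a triplet (head, relation, tail) is plausible when the translated
# head embedding h + r lies close to the tail embedding t.
import torch

dim = 50
head = torch.randn(dim)       # embedding of the subject entity
rel = torch.randn(dim)        # embedding of the relation
tail = torch.randn(dim)       # embedding of the object entity

def transe_score(h, r, t, p=2):
    # Lower distance means a more plausible triplet.
    return torch.norm(h + r - t, p=p)

print(transe_score(head, rel, tail).item())
```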


Transformer
Transformers serve as the main framework of the above method, taking 2D features, 3D features, and relationship information as input. Based on a thorough study of existing work, we present the most used datasets for image captioning in Figure 10, the most used datasets for video captioning in Figure 11, and the most used evaluation metrics in Figure 12.

Comparison of image captioning work with evaluation metrics on benchmark datasets
Finally, Table 8 summarizes the evaluation outcomes of representative image captioning research conducted on benchmark datasets such as MSCOCO, Flickr30k, and Flickr8k. On the MSCOCO dataset, Zhilin et al. 206 produced good results on the BLEU metric (0.670), Marco et al. 207 produced the best CIDEr result (0.938), and Junqi et al. 208 produced outstanding ROUGE results (0.509).

Comparison of video captioning works with evaluation metrics on benchmark datasets
On benchmark datasets such as MSVD, MSR-VTT, ActivityNet, YouCook2, M-VAD, MPII-MD, Charades, TACoS-MultiLevel, and LSMDC, researchers have performed representative video captioning work; Table 9 summarizes the evaluation findings. Lee et al. 211 achieved the best results in METEOR (36.9) and ROUGE (73.9). Working with the ActivityNet dataset, Wang et al. 213 obtained excellent results in BLEU (2.30) and METEOR (9.60). Sun et al. 214 obtained positive results on the YouCook2 dataset, on which Lei et al. 215 obtained the best results in BLEU (8.0), CIDEr (35.74), and METEOR (15.9). On the M-VAD dataset, the best results were obtained by Yao et al. 216 in BLEU (0.7) and CIDEr (6.1), and by Pan et al. 217 in METEOR.

EXISTING CHALLENGES
Though deep learning and knowledge graph-based methods have been used to generate captions for images and videos, several unresolved issues still need to be addressed:
1. Existing captioning systems frequently generate captions sequentially, where the next word depends on both the previously generated word and the image features. The captions often lack compositionality and naturalness, which frequently results in language that is syntactically accurate but semantically meaningless, as well as a lack of diversity in the output captions.
2. Creating rich, inventive, human-like captions requires bridging the semantic gap between linguistic and visual representations.
3. Current evaluation metrics still need improvement because they ignore the image; when scoring diverse descriptive captions, their scores are frequently insufficient and misleading. Human assessment remains the gold standard for rating captioning systems.
4. Higher-quality video representation approaches for video captioning are required.
5. Logic and common sense are needed in scene comprehension.

CONCLUSION AND FUTURE WORK
This article reviews and evaluates most studies on image/video captioning and dense video captioning. The work classifies captioning methods into two groups, deep learning approaches and knowledge graph-based approaches, based on the fundamental characteristics and differences of each research method. Many researchers have employed various scene interpretation techniques, including encoder-decoder and attention mechanisms. We listed the evaluation measures that are most frequently utilized and briefly overviewed the most popular datasets and evaluation metrics for both simple and dense captioning. The most appropriate datasets for image captioning are MSCOCO, Flickr8K, and Flickr30K, and for video captioning they are MSVD and MSR-VTT. The most recent techniques were then evaluated on benchmark datasets. Throughout this work, we covered numerous methods for image/video captioning. The best models for extracting image/video content are CNNs, while RNNs and LSTMs are widely used for language generation. From this thorough review, we also conclude that knowledge graph-based methods are best for captioning because they can detect objects and predict relations between objects along with their attributes. Incorporating knowledge graphs into image/video captioning systems can improve the semantic understanding, consistency, and coherence of the generated captions, making them more valuable and understandable to humans. Knowledge graphs can be used to refine the performance of image/video captioning systems in several ways:
1. Providing additional contextual information that can help the captioning system better understand the content of the image/video. For example, a knowledge graph containing information about ordinary objects, scenes, and actions can guide the captioning system's attention to relevant parts of the image/video, and information about relationships between objects and scenes can help the system generate more accurate and detailed captions.
2. Training the model: by using knowledge graph entities to anchor the captions, the model learns to generate captions that are semantically consistent with the information in the knowledge graph.
3. Post-processing: knowledge graphs can be used in the post-processing stage to improve the coherence and consistency of the generated captions.
For prospective future research, this paper points towards dense video captioning utilizing knowledge graph-based approaches. This comprehensive analysis will help academics better comprehend the methodologies, metrics, and datasets for image/video description and pave the way for future research.

Figure 3. Deep learning and knowledge graph-based methods for image/video captioning.
Figures 4A and 4B help us grasp the fundamental idea of creating image captions; they depict deep learning techniques for feature extraction and the overall working process, respectively. A 2048-dimensional feature vector is extracted from an image with an input size of 224 × 224 × 3, which denotes an image with a resolution of 224 pixels in both height and width and three color channels (R, G, and B).

Figure 4A. Feature processing (CNN layers).
2. Word attention. Let the description of a picture be a sentence S = {w1, w2, … , wN}, where N is the length of the sentence. These sequential learning methods typically employ an RNN or LSTM, with LSTM demonstrating excellent performance.
Figure 7. Knowledge graph.

Figure 8. Image captioning by knowledge graph.

Figure 9. Knowledge graph module for dense video captioning.
Table 7. Knowledge graph-based methods dedicated to image/video captioning.

Figure 10. Image captioning datasets.
Figure 11. Video captioning datasets.
The datasets for image captioning are the same whether one takes deep learning approaches or knowledge graph-based methods; the same holds for video captioning.
Table 2. Recent works dedicated to image captioning.
Table 3. Benchmark datasets for image captioning.
Table 4. Recent works dedicated to video captioning.
Table 5. Benchmark datasets for video captioning.
Figure 6. General deep learning method for dense video captioning.
Table 6. Recent works dedicated to dense video captioning.
Table 7 shows the recent work regarding image/video captioning by knowledge graph-based methods.
Figure 12. Evaluation metrics for image/video captioning.
Table 8. Image captioning with evaluation metrics on benchmark datasets.