A two-step model for video key-frame determination



What types of information in the key-frames of a storyboard are critical when users extract the meaning of a video? To answer this research question, we reviewed the literature and conducted a preliminary study. Based on the literature review and the findings of the preliminary study, we constructed a two-step model for video key-frame determination. We hypothesized that the proposed two-step method would produce more meaningful key-frames for summarizing a video than the mechanical method, in which key-frames are extracted simply at an interval of a few seconds or minutes. To test the hypothesis, we conducted an experiment comparing storyboards constructed with the proposed two-step method to those built with the mechanical method. The two-step model showed better accuracy in identifying the content of a video.


A visual storyboard is the simplest and most common form of video surrogate (Mundur, Rao and Yesha, 2006). How can we extract key-frames of high quality? Several methods extract key-frames automatically by means of computerized algorithms. The simplest method is to extract storyboard key-frames at an interval of several seconds or minutes. This method prevails in the Internet environment primarily because it is cost-efficient and easy to implement. However, it offers no theoretical explanation of why sampling in this way should yield meaningful key-frames. The second method uses a shot boundary detection (SBD) technique, which automatically detects shot changes and extracts key-frames from each shot. However, since shot changes may be of different types (fade in and out, dissolve, wipe, etc.), the extracted key-frames may not adequately represent the video content. Furthermore, key-frame selection based on shot detection may contain redundancies because similar content may exist in several shots; thus, redundant shots need to be eliminated. To decrease the redundancy, we can group shots into logical scenes and then select a key-frame for each scene. However, video programs have no strict boundaries between logical scenes, so automatically segmenting a video into such higher-level units remains difficult.
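
The mechanical method described above can be illustrated with a minimal sketch. The function name `sample_frame_indices` and its parameters are hypothetical and do not come from any particular tool; the sketch only assumes that the frame rate and the total number of frames of a video are known.

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float) -> list[int]:
    """Return the indices of frames taken every `interval_s` seconds."""
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("total_frames, fps and interval_s must be positive")
    # Number of frames between two consecutive samples (at least one).
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


# Hypothetical 10-second clip at 30 fps, sampled every 3 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, interval_s=3.0)
print(indices)  # [0, 90, 180, 270]
```

The sketch makes the limitation visible: the sampled positions depend only on time, not on what the frames show.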

The third method of automatic key-frame extraction uses more advanced techniques: pattern recognition is used to search for particular objects, overlay text, or motions, and frames that include those targets are selected as key-frames for a storyboard. These techniques are not yet scalable enough to allow interactive searching at Internet scale, but they are proving robust and effective for smaller collections (Smeaton et al., 2008). In this study, we examined what kinds of frames actual users consider important, on the assumption that understanding users' behavior provides a central clue to key-frame selection. The procedure used in our study and discussed in this paper comprises four parts: 1) a literature review, 2) a preliminary study in which 11 participants identified the criteria used to select frames representing the content of 12 videos, 3) development of the algorithm for key-frame extraction, and 4) an experiment to evaluate the proposed algorithm.

Literature Review

Computerized Extraction of Key-Frames and Objects

SBD techniques determine the similarity between adjacent frames using color histograms or edges (Danisman & Alpkocak 2006; Liu et al. 2007). Once a video has been segmented into shots, the next step is to choose a single frame as the key-frame for each shot that exceeds a given duration threshold. Key-frames can be chosen simply by selecting the frame in the middle, at the start, or at the end of the shot. Smeaton & Browne (2006) and Dumont & Merialdo (2007) suggested that the simple approach of taking the middle frame as the key-frame is often favored because of its computational efficiency and relatively good performance. Mundur, Rao, and Yesha (2006) proposed a video summarization technique that uses Delaunay clusters, which eliminate redundancy but do not preserve the temporal order. The above-mentioned studies assumed that the key-frames as a whole represent the contents of shots, scenes, or programs. SBD techniques tend to detect too many shot boundaries, which makes it difficult to eliminate redundant shots. To decrease this redundancy, shots can be grouped into logical scenes, with key-frames selected for each scene. There are also more advanced approaches in which a key-frame is determined by attributes of the video content, such as motion (Togawa & Okuda, 2005), events (Yokoi, Nakai & Sato, 2008), and overlay text (Kim & Kim, 2008).
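
The general idea of histogram-based SBD followed by middle-frame selection can be sketched as follows. This is not the algorithm of any of the cited studies: the distance measure (a normalized L1 difference), the threshold of 0.5, and all function names are assumptions made for illustration. Each frame is represented only by a precomputed color histogram, and the sketch handles hard cuts only, not gradual transitions such as fades or dissolves.

```python
def histogram_difference(h1: list[float], h2: list[float]) -> float:
    """Normalized L1 distance between two histograms (0 = identical, 1 = disjoint)."""
    total = sum(h1) + sum(h2)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total


def detect_shots(histograms: list[list[float]], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Split frame indices into (start, end) shots at large histogram changes."""
    if not histograms:
        return []
    shots, start = [], 0
    for i in range(1, len(histograms)):
        # A large difference between adjacent frames is treated as a shot boundary.
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(histograms) - 1))
    return shots


def middle_key_frames(shots: list[tuple[int, int]], min_length: int = 1) -> list[int]:
    """Pick the middle frame of every shot that has at least `min_length` frames."""
    return [(s + e) // 2 for s, e in shots if e - s + 1 >= min_length]


# Hypothetical 3-bin histograms: three "red" frames, four "green", two "blue".
frames = [[10, 0, 0]] * 3 + [[0, 10, 0]] * 4 + [[0, 0, 10]] * 2
shots = detect_shots(frames)
print(shots)                                   # [(0, 2), (3, 6), (7, 8)]
print(middle_key_frames(shots, min_length=3))  # [1, 4]
```

The `min_length` parameter plays the role of the duration threshold mentioned above: short shots contribute no key-frame.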

Manual Selection by Humans: Image Categorization and Tagging

Image Categorization: Panofsky (1955) placed visual images in three categories: pre-iconography, iconography, and iconology. Shatford (1986) interpreted Panofsky's pre-iconography and iconography as generics and specifics, respectively. While the generics indicate the general subject matter of an image (e.g., bridge), the specifics indicate its specific subject matter (e.g., Golden Gate Bridge). Furthermore, Shatford relates Panofsky's iconology to abstracts, which concern the intrinsic, personal meaning of an image. Eakins and Graham (1999) classified images into three levels of abstraction: primitive features (e.g., color), logical features (e.g., the identity of objects), and abstract features (e.g., the meaning of an image). Greisdorf and O'Connor (2002) suggested that people interact with images and videos at three levels. At the first level, the primitive features of an image are perceived, whereas at the second level, objects are identified. At the third level, inductive interpretation of an image or video is required, with inferences being made about its abstract attributes. Based on Panofsky's three categories and metadata, the categories and subcategories obtained from the six studies are summarized in Table 1.

Table 1. Summary of Previous Studies on Categorization of Images

Image Tagging: Peters and Stock (2007) explained Panofsky's theory using the example of a photo found on Flickr. They observed that Flickr image tags describe Panofsky's three levels (ofness, aboutness, and iconology) as well as aspects of “isness” in the sense of Ingwersen (2002). Beaudoin (2007) constructed an image category model consisting of 18 categories (e.g., event) to look for an underlying pattern in Flickr image tags. That study classified 140 sampled Flickr tags into the 18 categories and found that the most frequently used category was geographical locations (28.21%). The next most used categories were compound (14.05%), thing (11.37%), and event (5.69%). Kim and Kim (2009) developed an image category model consisting of 5 categories (e.g., description, analysis, and interpretation) and 17 subcategories, based on previous studies on images and folksonomy (Panofsky 1955; Peters & Stock 2007). They classified 3,848 sampled Flickr tags into the 5 categories and 17 subcategories and found that the most frequently used category was analysis (43.24%). Within the analysis category, the most frequently used subcategories were location (12.93%), object (12.45%), scene (6.25%), event/action (5.32%), and person (4.19%). Based on their five categories, the tag categories and subcategories obtained from the three studies are summarized in Table 2.

Table 2. Summary of Previous Studies on Image Tag Categories

Preliminary Study: Users' Understanding of Key-frames in a Storyboard

To explore the criteria for video key-frame determination, we first reviewed the literature. As a next step, we conducted a preliminary study that investigated users' cognitive understanding of key-frames in a storyboard.

Research Question

We addressed the following research question: What types of information in the key-frames of a storyboard are critical when users extract the meaning of a video?

Participants and Creation of Storyboard Interfaces

We recruited 11 participants (undergraduate students) from Myongji University. We downloaded twelve video storyboards from the Open Video Project repository located at http://www.open-video.org. We developed an HTML interface that enabled the participants to browse only the storyboards, without seeing other metadata. Table 3 shows the sample videos. The sample videos are educational videos, and their durations range from 2.17 to 15.06 minutes (See Table 4). Most of the sample videos include a narrator's commentary; they are used again in the subsequent experiment for the two-step model evaluation. We asked the participants to summarize the storyboard of each of the twelve videos, and then to describe which key-frames were useful in identifying the content of each storyboard and why they selected those key-frames.

Table 3. Sample Video List

Results of Preliminary Study

The eleven participants' responses to the 12 sample videos (132 cases) were analyzed according to the three categories of Yang and Marchionini (2004), as shown in Table 4. We redefined the textual category by giving it a narrower scope, applying it only to text or symbols in an image, such as overlay text and symbols. “Object” and “event/action” were found to be the most important factors in identifying the content of a video, because they were mentioned most often (35 times or 26.5% for object; 32 times or 24.2% for event/action, as shown in Table 4). The “connection” or association between key-frames was mentioned 21 times (or 15.9%) out of the total of 132 mentions. Making associations among key-frames in a storyboard to extract the meaning of a video is similar to identifying the semantic relations between words in a text-based abstract to understand the content of a document. “Overlay text” was mentioned 19 times (or 14.4%), while “person” was mentioned 13 times (or 9.8%). The item “schematics, graphic, symbols” was mentioned 4 times (or 3.0%). Lastly, “color” was mentioned twice (or 1.5%). No participant mentioned the “implicit” category.
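
As a quick check, the percentages quoted above can be recomputed from the mention counts and the total of 132 responses. The sketch includes only the items named in the paragraph; these counts sum to fewer than 132, and the remaining mentions are presumably covered in Table 4.

```python
TOTAL_RESPONSES = 132

# Mention counts as quoted in the text.
mentions = {
    "object": 35,
    "event/action": 32,
    "connection": 21,
    "overlay text": 19,
    "person": 13,
    "schematics, graphic, symbols": 4,
    "color": 2,
}

# Share of each item as a percentage of all responses, to one decimal place.
shares = {item: round(100 * count / TOTAL_RESPONSES, 1) for item, count in mentions.items()}

for item, share in shares.items():
    print(f"{item}: {mentions[item]} mentions ({share}%)")
```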

Table 4. Results of Preliminary Study (132 Responses)

Constructing the Two-Step Model for Key-frame Determination

We pre-sampled video frames using a mechanical method to produce candidate key-frames. This pre-sampling step reduced the number of candidate key-frames to manageable proportions. Mundur, Rao and Yesha (2006) reported that the quality of the summaries is not affected by pre-sampling. Next, we selected key-frames manually from the candidate key-frames at two levels: video and scene. At the video level, frames with overlay text or narrators were selected; at the scene level, frames with objects, events, persons, or scenery were selected. The eight steps used for selecting key-frames were as follows:

  • 1. Extract candidate key-frames for each video mechanically at an interval of three seconds, using KMPlayer (Konqueror Media Player). This step enabled us to include one or two frames per shot, since each shot lasted 2 to 5 s.
  • 2. Select frames with overlay text, schematics, or symbols (T-frames) from the candidate key-frames at the video level.
  • 3. Select a frame with a narrator (N-frame) from the candidate key-frames at the video level. If the narrator appears redundantly in several frames, select the N-frame that appears first.
  • 4. Group the candidate key-frames into scenes manually; we set scene boundaries at scenery changes, where the spatial background of a scene changed into a different one.
  • 5. Select object frames (O-frames) that indicate the central theme of the video in each scene.
  • 6. Select action/event frames (E-frames) that indicate the central theme of the video in each scene.
  • 7. Select person frames (P-frames) that indicate the central theme of the video in each scene.
  • 8. Select scenery frames (S-frames) that represent the geographical settings or time periods related to the key events or actions in each scene.
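
The flow of steps 2 to 8 can be sketched in code, with the caveat that in this study the selection itself was done manually. The sketch assumes that a human annotator has already assigned each pre-sampled candidate a scene number (step 4) and a set of type labels; the `Candidate` class, the `two_step_select` function, and the rule of keeping one type per frame in the order O, E, P, S are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    index: int    # position among the pre-sampled candidate key-frames
    scene: int    # manually assigned scene number
    labels: set[str] = field(default_factory=set)  # manual tags from {"T", "N", "O", "E", "P", "S"}


def two_step_select(candidates: list[Candidate]) -> list[tuple[int, str]]:
    """Return (index, type) pairs of selected key-frames in temporal order."""
    selected: dict[int, str] = {}

    # Video level: keep every T-frame and only the first N-frame.
    for c in candidates:
        if "T" in c.labels:
            selected.setdefault(c.index, "T")
    first_narrator = next((c for c in candidates if "N" in c.labels), None)
    if first_narrator is not None:
        selected.setdefault(first_narrator.index, "N")

    # Scene level: keep frames tagged as object, event, person, or scenery.
    for c in candidates:
        for kind in ("O", "E", "P", "S"):
            if kind in c.labels:
                selected.setdefault(c.index, kind)
                break

    return sorted(selected.items())


# Hypothetical annotations for seven candidate key-frames in three scenes.
candidates = [
    Candidate(0, 1, {"T"}),
    Candidate(1, 1, {"N"}),
    Candidate(2, 1, {"E"}),
    Candidate(3, 1),
    Candidate(4, 2, {"N"}),        # repeated narrator frame, dropped
    Candidate(5, 2, {"O", "P"}),
    Candidate(6, 3, {"S"}),
]
storyboard = two_step_select(candidates)
print(storyboard)  # [(0, 'T'), (1, 'N'), (2, 'E'), (5, 'O'), (6, 'S')]
```

Returning the result sorted by index keeps the storyboard sequential, which matches the sequential storyboards used in the study.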

Based on the proposed criteria for key-frame extraction, the 12 sample videos were analyzed (See Table 5). As an example, we explain how the sequential storyboard of Video 6 (“Computer Rage,” with a duration of 2.59 min) was constructed, as shown in Figure 1. After pre-sampling, the 4,262 frames included in Video 6 were reduced to 56 candidate key-frames. From these 56 candidate key-frames, we selected a T-frame and an N-frame. After grouping all of the candidate key-frames into 8 scenes, we selected three O-frames from the 4th and 7th scenes; four E-frames from the 1st, 2nd, 4th, and 7th scenes; and three P-frames from the 2nd, 3rd, and 4th scenes. We also chose two S-frames from the 6th and 7th scenes. As a result, as shown on the left side of Figure 1, we constructed a storyboard for Video 6 that consisted of 13 key-frames (we deleted the frame with the title page of the video in order to make the participants focus on the expressive power of the images).
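
The frame counts reported for Video 6 can be tallied in a few lines. The sketch assumes that the deleted title-page frame was one of the selected frames, which is consistent with the reported total of 13 key-frames but is not stated explicitly.

```python
TOTAL_FRAMES = 4262     # frames in Video 6
CANDIDATE_FRAMES = 56   # frames left after pre-sampling

# Number of frames selected per type for Video 6, as reported in the text.
selected_per_type = {"T": 1, "N": 1, "O": 3, "E": 4, "P": 3, "S": 2}

total_selected = sum(selected_per_type.values())
storyboard_size = total_selected - 1  # assumption: the title-page frame is among the selected

print(f"candidates kept by pre-sampling: {CANDIDATE_FRAMES / TOTAL_FRAMES:.1%}")
print(f"selected: {total_selected}, storyboard after removing title frame: {storyboard_size}")
```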

Table 5. Sample Video Information

Experiment Design: Two-step Model Evaluation

We conducted a comparative experiment with 42 participants to investigate the effectiveness of the proposed two-step method. In the experiment, the proposed two-step method was compared with the mechanical method in terms of accuracy in summarizing a video.

Hypothesis

Our next research question is: To what degree does the proposed two-step method produce meaningful key-frames that increase users' ability to summarize the content of a video? This research question leads to the hypothesis that the two-step method will produce more meaningful key-frames for summarizing a video than the mechanical method, in which key-frames are extracted simply at an interval of a few seconds or minutes.

Participants and Sample Videos

We recruited 42 participants from Myongji University; all of them were undergraduate students majoring in library and information science or mass communication. We divided the 42 participants into two groups of 21 (groups A and B), balancing gender, major, and grade as far as possible. We conducted a normality test before the t-tests to ensure that the data did not significantly violate the normality assumption. Both groups were found to be normally distributed at a significance level of 0.05. As sample videos, we used the same 12 videos as those used in the preliminary study of users' cognitive understanding of key-frames (See Table 3).

Creation of Storyboard Interfaces

We developed an HTML interface. Each video was identified only by a video number and had two types of storyboards: one consisted of key-frames extracted by the two-step method, whereas the other consisted of key-frames extracted by the mechanical method. Group A browsed the 12 storyboards made with the two-step method, and group B browsed the 12 storyboards made with the mechanical method. Figure 1 shows the two types of storyboards for Video 6 (‘Computer Rage’). Overall, the storyboard (Two-Step Method) had T- or E-frames, whereas the storyboard (Mechanical Method) had some duplicate key-frames.

Figure 1. Storyboard Interface

Test Procedures

We asked groups A and B to summarize each storyboard (type A or B) of the twelve videos; then, we compared the test results of group A with those of group B. The test was conducted in a university computer lab. The procedure was as follows. First, the participants summarized Video 1 after viewing its storyboard and submitted the summary to us. The participants performed the same task for the remaining eleven videos. We allowed the participants up to 5 minutes to summarize each storyboard; thus, they took a total of up to 60 minutes to complete the tasks for the twelve videos. Second, the summaries written by the participants were independently scored by two researchers in the range of 0.00–1.00. Next, the average of the two scores was calculated for each trial to yield the final scores.
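
The final-score computation described above amounts to averaging the two researchers' scores for each summary. A minimal sketch, with a hypothetical function name and made-up example scores:

```python
def final_scores(rater_a: list[float], rater_b: list[float]) -> list[float]:
    """Average two raters' scores per trial; every score must lie in [0.00, 1.00]."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same number of summaries")
    for score in (*rater_a, *rater_b):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]


# Made-up scores for three summaries (not data from the experiment).
scores = final_scores([0.5, 0.75, 1.0], [0.25, 0.75, 0.5])
print(scores)  # [0.375, 0.75, 0.75]
```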

Data Analysis: Results of Hypothesis Testing

To test the hypothesis that the two-step method produces more meaningful key-frames than the mechanical method, we used independent t-tests. The significance level was set at 0.05; if p is smaller than 0.05, the finding is statistically significant and the null hypothesis is rejected. SPSS was used for the data analysis. The grand mean of the summary scores from the two-step method (252 cases, 12 for each participant) was significantly greater than that from the mechanical method (0.64 vs. 0.50, t = −4.03, p < 0.001), as shown in Table 6. Five of the twelve per-video t-tests were statistically significant, so the hypothesis was partially supported. The mean score of the participants' summaries after viewing the storyboard made with the two-step method was greater than that for the mechanical method on the ‘Computer Rage’ video (0.45 vs. 0.22 respectively, t = −3.52, p < 0.001). The same held for ‘Food Preservation,’ ‘Flexible Wing,’ ‘Bicycle Today - Automobile Tomorrow,’ and ‘Ubiquitous Computing in the Living Room.’ The remaining seven of the twelve t-tests were not statistically significant; for five of these videos the mean scores of the two-step method were still greater than those of the mechanical method, while for two videos (‘Benefits of Sleep’ and ‘Exercise and Nutrition’) the mean scores of the mechanical method were higher than those of the two-step method. We discuss these results in more detail in the Discussion section.
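
For readers unfamiliar with the test, the statistic behind an independent (two-sample) t-test can be sketched with the Python standard library. This is the equal-variance (pooled) form, not the SPSS procedure used in the study, and the example scores are made up. Passing the mechanical group first yields a negative t when the two-step mean is higher; whether that ordering explains the negative t values reported above is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a: list[float], sample_b: list[float]) -> tuple[float, int]:
    """Return Student's t statistic (pooled variance) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    df = n_a + n_b - 2
    # Pooled estimate of the common variance of the two groups.
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Made-up summary scores for one video (not data from the experiment).
mechanical = [0.2, 0.3, 0.4, 0.5, 0.6]
two_step = [0.4, 0.5, 0.6, 0.7, 0.8]
t_value, df = independent_t(mechanical, two_step)
print(f"t = {t_value:.2f}, df = {df}")
```

The p-value is then obtained from the t distribution with `df` degrees of freedom, which a statistics package such as SPSS reports directly.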

Table 6. Results of Independent t-test


Discussion

Preliminary Study

The results of the preliminary study showed that most participants relied heavily not only on visual information (objects or events) but also on verbal information (overlay text or letters on objects) to make sense of the storyboards. It is worth noting that verbal information may sometimes mislead viewers by drawing attention to unimportant details. For example, some participants incorrectly concluded that the theme of a video was chocolate because one key-frame showed a can with “chocolate” carved on it. In fact, the topic of the video was how to preserve food using cans and glass bottles. We also found that connections between key-frames, and color codes, can be used effectively to determine the subject of a video. These results answer the research question: object, event/action, and text are more critical factors than others (e.g., scenery or color code) in identifying the content of a video.

Two-Step Model Evaluation

The experiment for the two-step model evaluation did not show that the mean summary scores of the two-step method were consistently greater than those of the mechanical method: only five of the twelve per-video comparisons met the criterion for statistical significance (p < 0.05). There might be several reasons for this. First, we deleted the frame containing the title page of a video to make participants focus on the expressive power of the images. We excluded three frames with title pages from the 12 storyboards created by the two-step method, whereas no frame with a title page was excluded from those created by the mechanical method. Second, we found that the “Benefits of Sleep” storyboard created by the mechanical method included a T-frame with the overlay text “Enough sleep,” whereas the one created by the two-step method did not. In an interview after the experiment, participants told us that this T-frame provided a clue to the meaning of the “Benefits of Sleep” storyboard. T-frames were thus found to be crucial for understanding the content of a video. Third, some of the key-frames in the storyboards extracted by the two-step and mechanical methods were identical or similar to each other. For example, ten of the fifteen key-frames in the “Exercise and Nutrition” storyboard created by the mechanical method were the same as or similar to those selected by the two-step method. This overlap was large enough for participants to find the two storyboards almost identical.


We proposed and validated a two-step model for video key-frame determination. The model was built on the literature review as well as on the findings of our preliminary study, which identified users' frame recognition patterns. Compared with the mechanical method, which samples the video every few seconds (or minutes) to extract key-frames, the two-step method showed better accuracy in identifying the content of a video. What are the implications of the above results and discussion for video abstracting? Overall, our findings could be used as a theoretical basis for fully automatic video abstracting.


This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2007–327-H00017).