Abstract

  1. Top of page
  2. Abstract
  3. INTRODUCTION
  4. RELATED STUDIES
  5. ANALYSIS RESULTS OF TRANSCRIPTS
  6. A FRAMEWORK FOR GENERATING SPEECH SUMMARIES
  7. EVALUATION OF SPEECH SUMMARIES
  8. CONCLUSION
  9. REFERENCES

We investigated the effectiveness of a tag method for speech summarization and then examined how to integrate the tag method efficiently with two other approaches (an acoustic method and a sentence location approach). We also examined how to present the extracted speech summaries. Last, on the basis of these findings and previous studies, we proposed and evaluated a tag-based framework for generating lecture video speech summaries.


INTRODUCTION


Two fundamental aspects of speech summary generation are the extraction of key speech content and the style in which the extracted synopses are presented. For extracting speech summaries, feature-based and rule-based techniques from text summarization have been utilized, along with acoustic/prosodic features that are not available in text; the feature-based methods use lexical, structural, and discourse attributes, while the rule-based methods employ heuristic or statistical evidence about each sentence and the document.

The features of text-based summarization may not be efficacious in speech summarization, since speech is inherently different from text. In fact, Christensen et al. (2003) find that the most revealing information for text is provided by the sentence position feature, while no such dominant traits exist for speech.

For speech summarization, we can employ user-assigned video tags, which provide external information about the content of a video. Video tags can be used effectively as keywords when selecting significant sentences from lecture speech transcripts, without processing a vast amount of transcript data. Moreover, the tag method functions similarly to named entity extraction, which utilizes the frequency of proper names of persons, locations, and organizations, in that frequently used video tags belong to named entity categories such as people, locations, and objects (Kim, 2011). In this study, we first investigate the effectiveness of the tag method for speech summarization. We then examine how to effectively integrate the tag method with two other methods (an acoustic method and a sentence location approach). Next, we investigate how to present speech summaries to viewers. Last, drawing on these analysis results, we propose and evaluate a tag-based framework for generating lecture video speech summaries.

RELATED STUDIES


In this section, we describe previous studies of video content extraction and the presentation style of video summaries.

Extraction. After statistically analyzing video bookmarks to form video storyboards, Chung et al. (2011) extracted meaningful key frames, assuming that the video frames around user-added bookmarks are representative of and informative enough for video summarization. Their method produced semantically more important summaries than two existing methods that utilize low-level audio-visual features. Hannon et al. (2011) proposed and evaluated two methods, based on the frequency and content of time-stamped Twitter messages, for generating offline video highlights of the World Cup. Maskey and Hirschberg (2005) showed that the best performance in summarizing broadcast news was obtained by combining acoustic features with lexical, structural, and discourse features. Zhang et al. (2007), in a comprehensive study of the acoustic/prosodic, linguistic, and structural features used in speech summarization, contrasted two genres of speech, broadcast news and lecture speech; they found that acoustic and structural features are more important than lexical ones for broadcast news, whereas lexical features are the most important for lecture speech summarization. Similarly, Zhang, Chan, and Fung (2007) suggested that rhetorical structures exist in lecture spoken documents and that acoustic and prosodic features permit modeling this rhetorical information to improve summarization performance.

Presentation. Video summarization can be presented in multimedia, audio, image or text form. We will describe previous studies of the utility of a presentation style of video summaries that employ qualitative investigation of user cognitive processes (Ding et al., 1999; Turner, 1994). Ding et al. (1999) examine three types of video summaries (visual, verbal [keywords/phrases], and joint visual and verbal). Their results show that users favored the combined summary, in which verbal information and images reinforce each other; such information helped them to grasp the overall meaning of the video and specify or clarify the thematic material of the visual surrogates; in comparison, visual information was more apt to convey affect, emotion, and excitement and to draw attention. Turner (1994) suggests that text and images are complementary and interdependent aspects of video information. Marchionini et al. (2009) submit that audio surrogates alone were almost as good as combined audio and imagery surrogates for video gisting tasks.

Our review of previous studies reveals that few of them have investigated the effectiveness of video tags in extracting key sentences from lecture speech transcripts or the utility of particular presentation styles of lecture speech summaries.

Research Questions

Our study addresses the following four research questions:

RQ1: How effective is it to summarize lecture video transcripts using a tag method?

RQ2: What effects do acoustic and sentence location methods have on the summarization performance of the tag method when integrated with it?

RQ3: How effective is it to summarize video speech transcripts using our proposed tag-based framework?

RQ4: How useful is speech summarization in speech form compared to that in text form?

ANALYSIS RESULTS OF TRANSCRIPTS


Sample Videos and Reference Summaries

As sample data, we selected 24 presentation and lecture videos from the YouTube site, since such videos carry a large portion of their informational content in the audio component. Each sample video had about 17 tags, and video durations ranged from 4 min to 23 min. For the evaluation of speech summaries, after watching the sample videos and reading their abstracts, two researchers selected the most representative sentences for each sample video and then cooperatively composed its reference summary.

RQ1: How effective is it to summarize video speech transcripts using a tag method? We assume that the tag method is closely related to named entity extraction; therefore, using the tag method for speech summarization should yield the same or better performance than using frequently occurring content words. To examine the effectiveness of tag-based summaries, we collected a transcript for each of the 24 sample videos and segmented each transcript into sentences at period marks. We then wrote programs that compute a keyword score from the match between tag/title words and the words in a sentence using a cosine similarity measure; we added title words to the tag sets because of tag sparsity problems. Using these programs, we extracted sentences with keyword scores greater than a given threshold to form a summary.
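The keyword-scoring step described above can be sketched as follows. This is a minimal illustration, assuming binary sentence-word vectors and a weighted tag/title keyword vector; the function names and weighting scheme are ours, not from the paper.

```python
import math

def cosine_score(sentence_words, keyword_weights):
    # Cosine similarity between a binary sentence-word vector and a
    # weighted tag/title keyword vector.
    overlap = sum(w for term, w in keyword_weights.items() if term in sentence_words)
    sent_norm = math.sqrt(len(sentence_words))
    key_norm = math.sqrt(sum(w * w for w in keyword_weights.values()))
    if sent_norm == 0 or key_norm == 0:
        return 0.0
    return overlap / (sent_norm * key_norm)

def tag_based_summary(sentences, keyword_weights, threshold):
    # Keep every sentence whose keyword score exceeds the threshold.
    return [s for s in sentences
            if cosine_score(set(s.lower().split()), keyword_weights) > threshold]
```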

Next, to evaluate the tag-based summaries, we made two other summaries for each of the 24 sample videos: a content keyword-based summary and a reference summary. To form the content keyword-based summaries, we used the Extractor system (http://www.extractor.com/) to obtain a list of keywords occurring frequently in each transcript. We then computed a keyword score from the match between these content keywords and the words in a sentence, using the same cosine similarity measure as for the tag/title words, and selected sentences with high keyword scores to form a summary.

Last, for each video, we compared the two summaries generated by the tag- and content keyword-based approaches against the reference summary using the F measure, the harmonic mean of recall and precision. We used a t-test to compare the F measure of the tag-based method with that of the content keyword-based method. The mean F measure of the tag-based method (0.29) was higher than that of the content keyword-based method (0.14), and the difference was statistically significant (0.29 vs. 0.14, p = 0.01 < 0.05).
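The F measure used above is the harmonic mean of precision and recall; a minimal sketch computed over sets of summary sentences (the function name is ours):

```python
def f_measure(system_sents, reference_sents):
    # Harmonic mean of precision and recall over the sentences shared by
    # a system summary and the reference summary.
    system, reference = set(system_sents), set(reference_sents)
    overlap = len(system & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```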

RQ2: What effects do acoustic and sentence location methods have on the summarization performance of the tag method when integrated with it?

Sentence Location Method. To investigate the importance of sentence location for summarizing speech transcripts, we divided each of the 24 transcripts into three segments containing approximately 10% (first segment), 80% (middle segment), and 10% (last segment) of the sentences. We then checked which segment each reference summary sentence appeared in. On average, 30.5% of reference summary sentences belonged to the first segment, 60% to the middle segment, and 9.5% to the last segment. This suggests that most speakers follow a relatively rigid rhetorical structure: a speaker starts with an overview of the topic to be presented. Last, we used a t-test to compare the F measure of the position method, which simply extracts the first four sentences of each transcript as a summary, with that of the position method combined with the tag-based method. The mean F measure of the position method (0.28) was lower than that of the combined method (0.34); in the combined method, we averaged the position weight (a score of 1.0 assigned to the first four sentences, and 0.8 to the remaining sentences) and the keyword score to extract relevant sentences.
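The combined scoring described above, averaging the position weight with the keyword score, can be sketched as follows; the 1.0/0.8 weights come from the text, while the function names are ours:

```python
def position_weight(index):
    # 1.0 for the first four sentences of a transcript, 0.8 for the rest.
    return 1.0 if index < 4 else 0.8

def combined_score(index, keyword_score):
    # The combined method averages a sentence's position weight and its
    # tag/title keyword score.
    return (position_weight(index) + keyword_score) / 2
```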

Acoustic Method. Speaking rate, pitch pattern, and intensity have been used to tag the important content of lecture speech. Wang and Narayanan (2007) suggested that intensity is the most useful feature for discriminating word prominence in speech. In a pre-test using the Praat tool, we found that among three intensity features (maxDB, minDB, and the difference between maxDB and minDB), the difference between maxDB and minDB (the Diff method) was the most effective, so we adopted it as our acoustic feature. We made a speech summary for each of the 24 speeches by selecting sentences with high acoustic scores; the Diff value between maxDB and minDB in a sentence was normalized by the intensity range of the document, so that each feature falls within [0, 1], which simplifies our later comparative study. A t-test examining the efficacy of combining the tag method with the Diff method showed that the mean F measure of the Diff method alone (0.17) is lower than that of the combined tag and Diff methods (0.31) (see Table 1).
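The [0, 1] normalization of the Diff feature can be sketched as below. The paper does not specify the normalization beyond "normalized by the intensity range of a document", so this min-max form and the parameter names are assumptions:

```python
def normalized_diff(sent_diff, doc_min_diff, doc_max_diff):
    # Min-max normalization of a sentence's intensity range (maxDB - minDB)
    # by the document-level range, yielding an acoustic score in [0, 1].
    span = doc_max_diff - doc_min_diff
    if span == 0:
        return 0.0
    return (sent_diff - doc_min_diff) / span
```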

Table 1. Test Results

Method                                         F
tag                                            0.29
content keyword                                0.14
position                                       0.28
acoustic: Diff between maxDB and minDB         0.17
tag + acoustic: Diff between maxDB and minDB   0.31
tag + position                                 0.34

Discussion

We found that the quality of the tag-based summaries is better than that of the keyword-based summaries. This result has several implications. First, tags can be used efficiently as index terms. Second, a user-assigned tag method is very efficient because it does not require processing a vast amount of transcript data to extract content keywords. Based on the analysis results for RQ1 and RQ2, we concluded that it is effective to integrate the tag method with the Diff and sentence location methods.

A FRAMEWORK FOR GENERATING SPEECH SUMMARIES


We need a specific model for speech summarization that utilizes social tags, title words, acoustic information, and sentence position. For this purpose, we propose the w-model for assigning a weight to each sentence of a document.

W-Model

In extractive summarization, the weight score of a sentence Si in the w-model is calculated as follows:

  • w(Si) = (Sim1(Si, K) + wlo(Si) + wac(Si)) / 3

where K is a keyword set comprising tags and title words. To measure the relevance of each sentence to the document, in addition to Sim1(Si, K), which weights a sentence by the keyword set, we added wlo(Si) to weight it by sentence location (a score of 1.0 is assigned to the first four sentences, and 0.8 to the remaining sentences) and wac(Si) to weight it by the Diff value between maxDB and minDB. The w-model score is the average of Sim1(Si, K), wlo(Si), and wac(Si), and sentences with average scores greater than a given threshold are extracted to form a summary.

How to Extract Speech Summaries

We collected a transcript for each of the 24 videos and segmented each transcript into sentences at period marks. Next, we assigned a score to each sentence through the following four tasks.

  • (1)
    Title Word Extraction: We collected a title word set for each of the 24 videos. For example, from the title of Video 3 “Kate Lundy: What I do for open government,” we extracted 4 single words and assigned a weight value of 1.5 to each.
  • (2)
    Tag Filtering and Summarization: We collected a tag set for each of the 24 videos. For example, Video 3 has a final tag list with 4 compound tags and 10 single term tags, after filtering and grouping tags. We assigned a weight value of 1 to each one-word tag and a weight value of 2 to each compound tag.
  • (3)
    Keyword Score Computing: A keyword score (Sim1) was computed from the match between tag/title words and the words in each sentence of a given transcript using cosine similarity. To compute the keyword score of sentence 15 (S15), we created keyword (K) and S15 vectors over 18 terms (4 single title words, 4 compound tags, and 10 single-term tags) and measured their cosine similarity:
    • Sim1(S15, K) = (S15 · K) / (|S15| |K|)
    The keyword score of S15, which contains 2 of the keywords (open, government), is 0.42.
  • (4)
    Total Weight Computing: To obtain the w-model score, in addition to the value of Sim1(Si, K), we needed to compute the values of wlo(Si) and wac(Si). For example, for Video 3, the wlo(S15) score is 0.80 and the wac(S15) score is 0.98, so the w-model value is 0.73 ((0.42+0.80+0.98)/3). As a result, six sentences (S4, S14, S15, S16, S17, S19) with total weight values greater than 0.56 were extracted to form a summary.

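The four tasks above culminate in the w-model average; a minimal sketch, using the worked numbers for S15 of Video 3 from the text (the function name is ours):

```python
def w_model(sim1, w_lo, w_ac):
    # w-model score: the average of the keyword score Sim1(Si, K), the
    # position weight wlo(Si), and the acoustic weight wac(Si).
    return (sim1 + w_lo + w_ac) / 3

# Worked example for sentence S15 of Video 3 (values from the text):
score = w_model(0.42, 0.80, 0.98)
selected = score > 0.56  # above the extraction threshold
```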
How to Present Speech Summaries

The text summaries automatically generated in the previous step can be presented in either text or speech form. To create spoken summaries, the text summaries need to be “spoken” by a text-to-speech synthesizer or by humans.

EVALUATION OF SPEECH SUMMARIES


RQ3: How effective is it to summarize video speech transcripts using our proposed tag-based framework? We used 10 videos selected from the 24 sample videos and an intrinsic evaluation method to evaluate our proposed w-model.

For the intrinsic evaluation, we made 10 speech summaries with each of two methods: our w-model-based tag method and a w-model-based content keyword method. Both use the proposed w-model, but the latter uses content keywords instead of tag/title words to measure the similarity between each sentence and the spoken document. Next, using the reference summaries, we compared the F measure of the w-model-based tag method with that of the w-model-based content keyword method. A Mann-Whitney test showed that the mean F measure of the w-model-based tag method (0.39) was higher than that of the w-model-based content keyword method (0.19), and the difference was statistically significant (0.39 vs. 0.19, p = 0.04 < 0.05).

RQ4: How useful is speech summarization in speech form compared to that in text form? To compare summaries in speech form with those in text form, we built a pilot system serving text and spoken surrogates for 10 sample speeches; the text surrogates, in Korean, were obtained from the TED Talk site, and the spoken surrogates were created by running the text surrogates through a text-to-speech synthesizer. We then asked each of the 46 recruited participants, all undergraduate students majoring in library and information science, to write down the advantages and disadvantages of the spoken and text surrogates after experiencing both surrogates for each of the 10 videos. The analysis results are as follows:

Eighteen participants (39.1%) said that listening to the spoken surrogates was comfortable; fifteen (32.6%) said that they focused well on the spoken ones; eight (17.4%) said that the spoken surrogates played steadily and rapidly; six (13%) said that multitasking was possible while listening (e.g., they could view visual images and listen to spoken surrogates at the same time); and four (8.7%) said that the spoken surrogates preserve acoustic/prosodic information such as pitch and duration, which leads to a better understanding of a video. On the other hand, twenty (43.5%) said that in the spoken surrogates it is hard to jump to the exact part they want to replay; eight (17.4%) said that it became difficult to focus while listening to a long spoken surrogate; and six (13%) said that the spoken surrogates are ephemeral, so they could not remember all of their content.

Regarding the advantages and disadvantages of the text surrogates, twenty-one (45.7%) said that it is very easy to go to the exact part they want to read again; twelve (26.1%) said that the text surrogates gave them an overall understanding of a video through the context of the sentences; eleven (23.9%) said that the text surrogates let them control their reading speed; and six (13%) said that unknown words (or phrases) were easy to understand because they were spelled out. On the other hand, sixteen (34.8%) said that it became difficult to focus while reading a long text surrogate; seven (15.2%) said that reading the text surrogates caused eye fatigue; and six (13%) said that considerably sized interfaces were required to browse the text surrogates properly.

Discussion

We found that our w-model-based tag method is effective for selecting the most relevant sentences. Because each presentation style of speech summary has unique features, the most efficacious way to present extracted speech summaries to viewers depends on the system development environment and user demands.

CONCLUSION


This study showed that our w-model-based tag method produced higher-quality summaries than the w-model-based content keyword method.

REFERENCES

  • Christensen, H., Gotoh, Y., Kolluru, B., & Renals, S. (2003). Are extractive text summarisation techniques portable to broadcast news? In Proceedings of the Automatic Speech Recognition and Understanding Workshop (pp. 489–494). St. Thomas, USA.
  • Chung, M., Wang, T., & Sheu, P. (2011). Video summarisation based on collaborative temporal tags. Online Information Review, 35(4), 653–668.
  • Ding, W., Marchionini, G., & Soergel, D. (1999). Multimodal surrogates for video browsing. In Proceedings of the Fourth ACM Conference on Digital Libraries (DL '99). Berkeley, CA.
  • Hannon, J., McCarthy, K., Lynch, J., & Smyth, B. (2011). Personalized and automatic social summarization of events in video. In Proceedings of the 16th International Conference on Intelligent User Interfaces (pp. 335–338).
  • Kim, H. (2011). Toward video semantic search based on a structured folksonomy. Journal of the American Society for Information Science, 62(3), 478–492.
  • Marchionini, G., Song, Y., & Farrell, R. (2009). Multimedia surrogates for video gisting: Toward combining spoken words and imagery. Information Processing and Management, 45(6), 615–630.
  • Maskey, S., & Hirschberg, J. (2005). Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of Interspeech.
  • Turner, J. (1994). Determining the subject content of still and moving documents for storage and retrieval: An experimental investigation. Unpublished Ph.D. dissertation, University of Toronto.
  • Wang, D., & Narayanan, S. (2007). An acoustic measure for word prominence in spontaneous speech. IEEE Transactions on Audio, Speech, and Language Processing, 15(2), 690–701.
  • Zhang, J., et al. (2007). A comparative study on speech summarization of broadcast news and lecture speech. In INTERSPEECH-2007 (pp. 2781–2784).
  • Zhang, J., Chan, H., & Fung, P. (2007). Improving lecture speech summarization using rhetorical information. In Proceedings of ASRU.