Precision and recall are two widely accepted measures of Information Retrieval (IR) system performance. Different search strategies can yield different precision and recall values for the same IR system, search engine, or website. Therefore, precision and recall can also be used to measure search strategy effectiveness. YouTube® is a popular social media website that hosts the largest number of user-generated videos on the web. The YouTube® search engine provides only a text-word search function, and search results are retrieved by matching the search terms to video descriptions, tags, comments, etc. This paper aims to test whether a search strategy using multiple search terms is effective by analyzing the relevance of the retrieved YouTube® videos on Smokeless Tobacco Products (STP) together with the video usage statistics and community engagement statistics provided by YouTube®.
Seventeen search terms were used to retrieve YouTube® videos on STP, which were recorded in a master file. All duplicate videos were excluded from the retrieved results. Unique videos were then selected based on pre-defined inclusion and exclusion criteria set by the research team. A sample of 440 unique videos was randomly selected for data analysis. SPSS 19.0 was used for data analysis. Descriptive analysis, precision and recall, logistic regression, and odds ratio test results are reported. The study found that only a few of the 17 search terms were more effective in terms of the relevance of the retrieved videos. In addition, the video usage statistics and community engagement statistics provided by YouTube® did not have a significant association with the relevance of the retrieved videos. Recommendations for designing more effective search strategies are provided.
Relevance, a fundamental notion in information science in general and in information retrieval (IR) in particular, has been studied extensively (Saracevic, 2007). The multi-dimensional nature of relevance has been recognized by researchers for decades. From topical relevance (also called system relevance) through cognitive, situational, and psychological relevance to dynamic relevance, the notion has expanded from objective relevance (i.e. system relevance) to subjective relevance (i.e. the other dimensions of relevance).
Because relevance is multi-dimensional, it is also judged with various criteria. For a text-based system, system or topical relevance is evaluated by matching search terms to the textual surrogates contained or indexed in the system, either free-text words or controlled vocabularies. Precision and recall are two widely accepted measures for evaluating topical or system relevance as well as Information Retrieval (IR) system performance (Stokes, Foster, & Urquhart, 2009). Different search strategies can yield different precision and recall values for the same IR system, search engine, or website. Therefore, precision and recall can also be used to measure search effectiveness. Search effectiveness can also be evaluated with other methods, such as collecting user feedback or analyzing click-through data or system log data. Because this study focused mainly on the effectiveness of individual search terms, the precision and recall of each search term were needed. Therefore, precision and recall were chosen to measure search effectiveness in this study.
YouTube®, the largest online resource of user-generated videos, provides only a text-word search function, and search results are retrieved by matching search terms to video descriptions, tags, comments, etc. This text-word-only search, with no boundary on the video collection and very limited filter features, requires that multiple search terms be used to find YouTube® videos on the same topic. In this study, smokeless tobacco products (STP) was chosen as the search topic because of the important influence of YouTube® on health behaviors in the adolescent and young adult population, especially tobacco use. In order to combat the use of smokeless tobacco, we need to be able to retrieve these videos easily so that we can document their messages and counteract them. This study aims to test whether the multiple-search-term strategy was effective by analyzing the relevance of the retrieved videos, based on the pre-defined inclusion and exclusion criteria, together with the video usage statistics and community engagement statistics provided by YouTube®.
What are the precision and recall of each search on STP in YouTube®?
What is the relationship between the likelihood that a video is relevant and the number of search terms for which the video showed up in the search results?
Is a video more relevant if it has a higher number of views, ratings, and comments?
METHODS
Video Retrieval
The basic search terms on STP were devised from the literature describing STP published by the Campaign for Tobacco-Free Kids (www.tobaccofreekids.org/productsreport). Because YouTube® uses a text-matching technique to find videos and does not support truncation search, keywords with synonyms and variations were used in order to obtain comprehensive search results. With these considerations in mind, 17 search terms were determined: chaw, chew, chewing, commercial, commercials, dip, dipping, dissolvable, leaf, lozenge, plug, roll, smokeless, snuff, snuffing, snus, and twist.
In order to calculate the precision and recall of each search, one search term was entered into YouTube® at a time, yielding separate search results for each term. Only videos uploaded within the 6 months prior to the search date were retrieved. For each search, the researcher recorded each video's unique identifier (URL). After the initial search with the 17 terms, a master database was created, from which duplicates were removed.
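The de-duplication step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_master` and the example URLs are hypothetical, and the study does not say which tool was used to build its master database.

```python
from collections import defaultdict


def build_master(results_by_term):
    """Map each unique video URL to the set of search terms that retrieved it.

    results_by_term: {search term: list of video URLs returned for that term}
    """
    master = defaultdict(set)
    for term, urls in results_by_term.items():
        for url in urls:
            # A URL seen under several terms is stored once, with all its terms.
            master[url].add(term)
    return dict(master)


# Hypothetical search results (placeholder URLs, not data from the study).
results_by_term = {
    "dip": ["https://example.org/v/a", "https://example.org/v/b"],
    "dipping": ["https://example.org/v/b", "https://example.org/v/c"],
    "snus": ["https://example.org/v/c"],
}

master = build_master(results_by_term)
print(len(master))  # number of unique videos
for url, terms in sorted(master.items()):
    print(url, sorted(terms))
```

Keeping the set of terms per URL, rather than discarding duplicates outright, preserves the count of search terms under which each video appeared, which the later regression analysis relies on.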
Due to the dynamic nature of the YouTube® website, each unique video in the master database was downloaded with “DVD Video Soft” open source software and saved as a video file, using the video URL as the file name, for future analysis. In addition, the video usage statistics and community engagement statistics (video metadata) provided by YouTube® for each video, which included the number of views, ratings, and number of comments, were captured as screenshots of the webpage for further analysis.
All unique videos were viewed and selected based on predefined inclusion and exclusion criteria set by the research team. In order to assure systematic assessment of content and to minimize subjectivity and coder bias, inter-rater reliability was assessed using the kappa statistic, with a threshold set at .8. Working with a subset of videos through an iterative process, the three coders sought to arrive at a common application of the criteria and reached the threshold for inter-rater reliability. Individual coders then reviewed their assigned videos separately and selected those that met the inclusion criteria. Videos that met these criteria constituted the study sample.
For the purpose of this paper, the videos that met the inclusion criteria were considered relevant.
Precision and recall were calculated for each search term. Precision is the number of relevant videos retrieved by a term divided by the total number of videos retrieved by that term; recall is the number of relevant videos retrieved by a term divided by the total number of relevant videos in the collection. Because recall requires knowing all relevant documents in the collection, both retrieved and not retrieved, and it is impossible to know the total number of relevant documents in a search engine, there is no proper method of calculating the absolute recall of search engines (Shafi & Rather, 2005). Therefore, we used the total number of videos retrieved by all 17 search terms to estimate all the relevant videos on the topic held in YouTube® at the time of the search.
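A minimal Python sketch of per-term precision and relative recall follows. It assumes one reading of the description above, namely that the recall denominator is the number of relevant videos in the pooled results of all search terms. The function name and the toy video ids are hypothetical and are not taken from the study.

```python
def precision_recall(retrieved_by_term, relevant):
    """Return {term: (precision, relative_recall)}.

    retrieved_by_term: {term: set of video ids retrieved by that term}
    relevant: set of video ids judged relevant within the pooled results
              (assumed stand-in for "all relevant videos in the collection")
    """
    scores = {}
    for term, retrieved in retrieved_by_term.items():
        hits = len(retrieved & relevant)  # relevant videos this term retrieved
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        scores[term] = (precision, recall)
    return scores


# Toy data for illustration only.
retrieved_by_term = {
    "dip": {"v1", "v2", "v3", "v4"},
    "snus": {"v4", "v5"},
    "chew": {"v6", "v7"},
}
relevant = {"v1", "v2", "v4", "v6"}

scores = precision_recall(retrieved_by_term, relevant)
for term, (p, r) in sorted(scores.items()):
    print(f"{term}: precision={p:.2f} recall={r:.2f}")
```

Because the denominator comes from the pooled results rather than the whole of YouTube®, the second value is a relative recall and would overstate absolute recall whenever relevant videos were missed by every term.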
The number of views divided by the number of days since a video was posted replaced the absolute number of views shown on the webpage screenshots, in order to remove the effect of how long the video had been on YouTube®. This measure is referred to as the corrected number of views. In addition, the number of comments and the video ratings shown on the webpage screenshots were used directly to measure a video's popularity.
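The corrected number of views can be sketched as follows. The function name, the dates, and the view count are hypothetical, and the one-day floor for videos captured on the day they were posted is an added assumption that the study does not mention.

```python
from datetime import date


def corrected_views(views, posted_on, captured_on):
    """Views per day between the posting date and the date of the screenshot."""
    days = (captured_on - posted_on).days
    # Assumption: count a same-day capture as one day to avoid dividing by zero.
    days = max(days, 1)
    return views / days


# Hypothetical example: 1200 views on a video posted 60 days before capture.
rate = corrected_views(1200, date(2011, 9, 1), date(2011, 10, 31))
print(rate)
```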
All data were analyzed using SPSS 19.0. The total number of included and non-included videos for each search term was calculated. The relationship between the relevance of a video and the number of search terms for which the video appeared was analyzed using logistic regression. The precision and recall of each search term were calculated. An odds ratio was calculated for each search term to test whether that term was more likely to retrieve relevant videos.
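For readers who want to reproduce a per-term odds ratio outside SPSS, the sketch below computes it from a 2x2 table with a Wald 95% confidence interval. The study does not describe how SPSS was configured, so the table layout, the interval method, the zero-cell correction, and the counts are all assumptions made for illustration.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a: relevant videos retrieved by the term
    b: irrelevant videos retrieved by the term
    c: relevant videos not retrieved by the term
    d: irrelevant videos not retrieved by the term
    """
    if min(a, b, c, d) == 0:
        # Assumption: Haldane-Anscombe correction when any cell is empty.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    low = math.exp(math.log(ratio) - 1.96 * se)
    high = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, (low, high)


# Hypothetical counts, not taken from the study's Table 1.
ratio, (low, high) = odds_ratio(40, 10, 60, 90)
print(f"OR={ratio:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

An interval that lies entirely above 1 would indicate that the term retrieves relevant videos at significantly higher odds than the other terms combined.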
Between October and December 2011, a total of 3603 unique videos were retrieved and downloaded. A sample of 440 videos was randomly selected. Among them, 287 unique videos (65.2%) were considered relevant.
Table 1. Precision and Recall for each search term
|Search terms||Retrieved videos||Relevant videos||Precision||Recall|
Among the 17 searches, videos were most commonly retrieved by the search terms “dip,” “chew,” “chewing,” and “snuff”. Most videos retrieved by the search terms “chaw,” “dip,” “dipping,” “snuff,” and “snuffing” were included.
Among the four pairs of search terms that shared the same word root, 49 videos showed up for both “chew” and “chewing,” 53 for both “dip” and “dipping,” 8 for both “commercial” and “commercials,” and 1 for both “snuff” and “snuffing”. The number of retrieved videos, the number of relevant videos, and the precision and recall for each search term are presented in Table 1. Because some unique videos appeared in the search results of multiple terms, the total numbers of retrieved videos and included videos summed across terms were more than 440 and 287, respectively.
Logistic regression analysis (R² = .13 (Cox & Snell), model χ²(1) = 59.90, p < 0.01) found that videos were 2.81 times more likely to be included when they showed up in the results of multiple search terms. The odds ratio test found that the following terms retrieved more relevant videos than the others: “chew,” “dip,” “dipping,” “leaf,” “roll,” “snuff,” and “twist”.
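To make the 2.81 figure concrete, the sketch below assumes it is the odds ratio (Exp(B)) from the logistic model, which the text does not state explicitly, and shows how such a ratio changes a probability. The baseline probability of 0.5 is hypothetical.

```python
import math

reported_odds_ratio = 2.81  # value reported above; assumed to be Exp(B)

# The corresponding logistic regression coefficient is the natural log of it.
beta = math.log(reported_odds_ratio)


def prob_after(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob) * odds_ratio
    return odds / (1 + odds)


# Hypothetical baseline: a 50% chance of inclusion before applying the ratio.
shifted = prob_after(0.5, reported_odds_ratio)
print(f"beta={beta:.3f}, probability rises from 0.50 to {shifted:.3f}")
```

Under this assumption the ratio multiplies the odds, not the probability, so the probability itself rises by less than a factor of 2.81.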
Table 2 shows that, on average, irrelevant videos received more views and were rated higher than relevant videos. However, relevant videos received more comments on average than irrelevant videos.
Table 2. Comparing means of relevant and irrelevant videos
|Metadata||Relevant videos (n=234)||Irrelevant videos (n=125)|
|Corrected No. of views||4.08 (10.26)||33.44 (142.87)|
|No. of comments||4.85 (12.68)||3.77 (8.73)|
The study findings offer several insights into search strategy effectiveness.
The logistic regression showed that videos that appeared under more search terms were more likely to be relevant. This result suggests that in a free-text search, variations of search terms should be considered and used in order to retrieve as many results as possible. Short video descriptions, tags, and comments all come from users, which produces many different ways of expressing the same or similar concepts. Therefore, variations of words, including different spellings, singular and plural forms, and different names for the same concept, must be considered when searching for YouTube® videos. This uncontrolled vocabulary, combined with an unbounded video collection and a text-word-only search with very limited filter features, can make searching for videos on YouTube® more costly in time and effort. This raises two questions: should all variations of the search words be used in order to conduct the most comprehensive search possible, and how can a balance between search comprehensiveness and search efficiency be achieved?
The precision, recall, and odds ratio results suggested that “dip,” “dipping,” “chew,” and “snuff” were the four search terms that yielded higher precision and recall as well as significant odds ratios. These four terms retrieved about 50% of the videos from all 17 searches. The 53 videos shared by “dip” and “dipping” accounted for 73% of the videos retrieved by “dipping,” while the 1 video shared by “snuff” and “snuffing” accounted for 100% of the videos retrieved by “snuffing”. These numbers suggest that “dipping” and “snuffing” were not effective search terms and could be eliminated from the search term list. However, the 49 videos shared by “chew” and “chewing” accounted for only about 50% of the videos retrieved by each search. In this case, both “chew” and “chewing” seem to be needed to avoid missing 50% of the results. This inconsistency makes it difficult to decide which search terms should be used or kept, especially before any search has been conducted. Conducting a pilot search with a small sample and running precision, recall, and odds ratio tests is a potential way to fine-tune the search term list before a formal search starts.
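The overlap comparison used in this paragraph can be expressed as a small set computation, sketched below in Python. The function name and the toy video ids are hypothetical, and the sets are deliberately small rather than a reconstruction of the study's counts.

```python
def overlap_share(first, second):
    """Return the shared videos as a fraction of each term's results.

    first, second: sets of video ids retrieved by two related search terms
    """
    shared = first & second
    share_of_first = len(shared) / len(first) if first else 0.0
    share_of_second = len(shared) / len(second) if second else 0.0
    return share_of_first, share_of_second


# Toy data: a root term with 8 results and a variant with 4, sharing 2.
root_term = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"}
variant_term = {"v7", "v8", "v9", "v10"}

of_root, of_variant = overlap_share(root_term, variant_term)
print(f"shared videos: {of_root:.0%} of root term, {of_variant:.0%} of variant")
```

A variant whose results are almost entirely contained in the root term's results adds little to the search, whereas a variant with a low shared fraction contributes videos that would otherwise be missed.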
Views, ratings, and comments reflect a video's popularity, and ratings and comments are considered part of community engagement (The YouTube® blog). However, in our findings the relevant videos did not receive more views than the irrelevant videos. This result seems to run against the common-sense expectation that the more views a video receives, the more relevant it should be. It may be attributable to how YouTube® counts views: it seems that as long as a user clicks on a video, the click is counted as one view regardless of whether the video was actually watched. “Ratings” combine users' “likes” and “dislikes” of a video, which are subjective judgments and have less association with the video's relevance. This may explain why included videos had lower ratings than non-included videos. However, the study found that relevant videos received more comments than irrelevant videos. Users typically do not leave comments without watching a video, and comments are generally related to the video's topic. When users think a video is related to their needs, they may leave comments on it. In this sense, relevant videos can attract more users to leave comments. YouTube® also provides other basic usage statistics and community engagement data that record users' reactions to posted videos, such as likes, dislikes, and favorites. However, these items reflect a video's popularity more than its relevance. As relevance is user-oriented, subjective, and dynamic (Yang & Marchionini, 2004; Stokes, Foster, & Urquhart, 2009), these objective statistics provided by YouTube® should be used with care when judging a video's relevance.
This study investigated search strategy effectiveness by analyzing the relevance of STP YouTube® videos retrieved with 17 search terms, together with YouTube® video metadata. Although the findings may not generalize because of the specificity of the search topic, they still provide useful observations. Because of the way YouTube® collects and organizes its videos and its limited search features, synonyms and variations of text words should be considered and used when retrieving videos. In order to balance search comprehensiveness and search efficiency, a pilot search with a small sample, followed by precision, recall, and odds ratio tests, is recommended before a formal search. The video usage statistics (e.g. number of views) and community engagement statistics (e.g. ratings) provided by YouTube® should be used with care when judging video relevance. Further studies could provide more insight into YouTube® search performance, for example by comparing the impact of different search strategies on the search results for the same topic, or by interviewing real YouTube® users to evaluate the precision and recall of YouTube® searches.