Does degree of work task completion influence retrieval performance?
In this contribution we investigate the relationship between assessors' perceived completion of the work task at hand and their actual assessments of the usefulness of the retrieved information. The results indicate that the number of useful documents found by assessors does not influence their perception of task completion. Moreover, retrieval performance does not differ to a statistically significant degree across degrees of perceived work task completion or across individual document types, with two exceptions, both measured at rank 10: full-text records and all document types combined.
The present poster investigates assessment behavior with respect to the perceived degree of task completion, and how this factor influences retrieval performance as measured by the degree of usefulness of retrieved documents in a collection of different document types, the iSearch collection.
The iSearch collection (Lykke et al., 2010) integrates approximately 18,000 English monographic records without abstracts from Danish digital libraries, 160,000 papers and articles in full-text PDF format as well as 275,000 abstracts with a varied set of metadata and vocabularies captured from the open access portal arXiv.org. The full-text documents are much longer (4,422 words on average) than the metadata records (272 words on average). The collection currently contains a set of 65 richly described information tasks, including genuine work task statements, created by 23 test experts from university physics departments. The same 23 experts assessed the usefulness, not the topicality, of retrieved documents (up to 200 per task, randomly distributed over the different document types) to their actual work task situation in relation to the 65 information tasks.
For degree of usefulness we applied the four-graded relevance scale as proposed by Sormunen (2002): highly; fairly; marginally; and not useful, as well as Normalized Discounted Cumulated Gain (nDCG) measurements (Järvelin & Kekäläinen, 2002). A post assessment questionnaire (PAQ) on satisfaction with the assessment procedure and search outcomes was filled out for each task.
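As an illustration of how graded usefulness assessments feed into nDCG, the following is a minimal sketch rather than the procedure actually used in the study. It assumes gain values of 3, 2, 1 and 0 for the four grades and a discount of log2(rank) applied from rank 2 onwards; the source states neither choice, and the names `GAINS`, `dcg` and `ndcg` are hypothetical.

```python
import math

# Assumed gain values for the four-graded scale; the source does not state them.
GAINS = {"highly": 3, "fairly": 2, "marginally": 1, "not": 0}


def dcg(gains, k):
    """Discounted cumulated gain at cut-off k with a log2 discount.

    Ranks 1 and 2 are left undiscounted (log2(2) = 1); later ranks are
    divided by log2(rank).
    """
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        total += gain / max(1.0, math.log2(rank))
    return total


def ndcg(ranked_labels, judged_labels, k):
    """nDCG at cut-off k: DCG of the run divided by DCG of the ideal ordering."""
    run_gains = [GAINS[label] for label in ranked_labels]
    ideal_gains = sorted((GAINS[label] for label in judged_labels), reverse=True)
    ideal = dcg(ideal_gains, k)
    return dcg(run_gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranking for one information task.
    run = ["highly", "not", "fairly"]
    print(round(ndcg(run, run, k=3), 3))
```

The ideal ordering here is built from all judged documents for the task, so a ranking that lists them in descending order of usefulness scores 1.0 at every cut-off.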
The analyses are based partly on the actual assessments made across the different document types in iSearch and partly on data captured from the PAQ. We address two research questions.
Research question one assumes that the higher the number of useful documents, the more complete the test persons will perceive the work task.
For research question two we hypothesize that retrieval performance, as measured by usefulness, will be higher the more complete the task is perceived. With respect to document types and performance, we expect full-text PDF documents to perform better than arXiv.org metadata and book records owing to their higher informativeness.
After all the assessments were captured per information task, we calculated the distribution of the set of all positively useful documents over all 65 information tasks and document types. Highly, fairly and marginally useful documents constitute the ‘all positively’ useful items. One central question from the PAQ was selected, concerning the degree of work task completion, measured on a three-point scale: extremely complete; somewhat complete; and not complete. We generated descriptive statistics and cross-tabulated retrieval performance against the perception of task completion. Statistical significance was tested with two-tailed Student's t-tests (α = 0.05).
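The significance test can be sketched as follows. This is a hedged illustration, not the authors' analysis script: it assumes an independent two-sample Student's t-test with pooled variance (the source does not say whether the samples were treated as paired), the per-task scores are invented, and the helper names `student_t`, `t_pdf` and `two_tailed_p` are hypothetical. The p-value is obtained by numerically integrating the t density, so that only the standard library is needed.

```python
import math
from statistics import mean, variance

ALPHA = 0.05  # significance level stated in the text


def student_t(sample_a, sample_b):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


if __name__ == "__main__":
    # Invented per-task nDCG10 scores for two completion categories.
    somewhat = [0.41, 0.28, 0.35, 0.22, 0.30]
    not_complete = [0.18, 0.12, 0.25, 0.09, 0.16]
    t, df = student_t(somewhat, not_complete)
    p = two_tailed_p(t, df)
    print(f"t = {t:.3f}, df = {df}, p = {p:.4f}, significant = {p < ALPHA}")
```

In practice a statistics package would normally be used for the p-value; the numerical integration is shown only to keep the sketch self-contained.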
Influence of Assessment of Usefulness on Task Completion
Table 1, located at the end of the paper, shows the association between the assessments of useful documents, made before the assessors answered the PAQ, and the perceived degree of task completion captured by the PAQ. As expected, for work tasks perceived as ‘Not complete’ the assessors obtain a slightly smaller proportion of highly useful documents (2.0%) than for tasks perceived as ‘Somewhat complete’ (3.2%) or the mean (3.0%). However, there is no significant difference in the average number of useful documents seen by the assessors across the categories of perceived task completion, although the figures indicate a slight negative trend for the ‘Not complete’ category. The same holds for the ‘All Useful’ percentages and document counts in Table 1. Thus, research question 1 is answered in the negative: a lower number of useful documents retrieved does not, statistically for this sample, entail a correspondingly lower sense of task completeness.
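The percentages and per-task averages discussed above can be re-derived from the raw counts in Table 1. The sketch below assumes that the count columns of Table 1 are, in order, highly, fairly, marginally and not useful documents (an interpretation inferred from the row sums, since the table is printed without column headers); the names `COUNTS` and `summarize` are hypothetical.

```python
# Raw counts per perceived-completion category, read from Table 1.
# Column interpretation is an assumption inferred from the row sums.
COUNTS = {
    "Extremely complete": {"tasks": 3, "highly": 69, "fairly": 18,
                           "marginally": 36, "not_useful": 140},
    "Somewhat complete": {"tasks": 25, "highly": 134, "fairly": 282,
                          "marginally": 804, "not_useful": 2930},
    "Not complete": {"tasks": 37, "highly": 134, "fairly": 366,
                     "marginally": 1035, "not_useful": 5118},
}


def summarize(row):
    """Derive totals, percentages and the per-task mean of useful documents."""
    useful = row["highly"] + row["fairly"] + row["marginally"]
    total = useful + row["not_useful"]
    return {
        "all_useful": useful,
        "total": total,
        "pct_highly": round(100 * row["highly"] / total, 1),
        "pct_useful": round(100 * useful / total, 1),
        "useful_per_task": round(useful / row["tasks"], 1),
    }


if __name__ == "__main__":
    for category, row in COUNTS.items():
        print(category, summarize(row))
```

Under this reading, the share of highly useful documents is 3.2% for ‘Somewhat complete’ and 2.0% for ‘Not complete’, matching the figures quoted in the text, while the mean number of useful documents per task (48.8 versus 41.5) shows the slight negative trend mentioned above.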
Task Completion and Retrieval Performance
In Table 2, at the aggregated level named ‘All document Types’, the trend is clear up to rank 30 and statistically significant at nDCG10: when work tasks are perceived as ‘Not Complete’, the usefulness score of the retrieved documents is indeed lower than for tasks felt to be ‘Somewhat Complete’.
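Table 1. Distribution of useful documents over task completion
|Perceived completion||Highly useful (n)||Highly useful (%)||Fairly useful (n)||Fairly useful (%)||Marginally useful (n)||Marginally useful (%)||All useful (n)||All useful (%)||Not useful (n)||Not useful (%)||Total (n)||Useful per task (mean)|
|Extremely complete (N=3)||69||26.2||18||6.8||36||13.7||123||46.8||140||53.2||263||41|
|Somewhat complete (N=25)||134||3.2||282||6.8||804||19.4||1220||29.4||2930||70.6||4150||48.8|
|Not complete (N=37)||134||2.0||366||5.5||1035||15.6||1535||23.1||5118||76.9||6653||41.5|
|Mean (N=65)||5.2|| ||10.2|| ||28.8|| ||44.3|| ||126.0|| ||170.2||44.3|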
Table 2. nDCG scores by perceived task completion. Statistically significant differences (p = .001–.03) are shown in gray italics, relative to the values in italics.
| || ||Extr. Complete||3||0.39||0.28||0.32|
|All doc. Types||Somewhat||23||0.33||0.29||0.25||0.24|
| || ||Not Complete||36||0.27||0.16||0.14|
| || ||Extr. Complete||3||0.46||0.30||0.33|
| || ||Not Complete||25||0.39||0.24||0.27|
| || ||Extr. Complete||1||0.06||0.00||0.00|
|PDF full text||Somewhat||23||0.38||0.32||0.30||0.29|
| || ||Not Complete||32||0.31||0.16||0.17|
| || ||Extr. Complete||2||0.32||0.32||0.32|
| || ||Not Complete||34||0.31||0.16||0.16|
However, at the level of individual document types the analysis displays a much fuzzier picture. For book records the usefulness scores are actually higher when work tasks are perceived as ‘Not Complete’ than in the ‘Somewhat’ category, across all document cut-off values (DCVs). Only for PDFs is the difference in usefulness score between the ‘Somewhat’ and ‘Not Complete’ categories marked (in italics), and it is statistically significant only at nDCG10 (.32 vs. .16). For research question two we may therefore state that the perceived degree of work task completion does influence retrieval performance, measured by the degree of usefulness of the retrieved documents: tasks perceived as less complete score lower, but the difference is significant only within rank 10, for PDF documents and for all document types combined.
These observations concerning the research questions are quite interesting from an interactive IR evaluation point of view. The results show that the number of useful documents observed by assessors prior to their perception of the degree of task completeness does not statistically influence the feeling of task completion.