First, a sub-scale was developed for each attribute of engagement: aesthetics, affect, challenge, motivation, interest, novelty, feedback, perceived control, and attention. To develop these sub-scales systematically, existing scales and instruments used in the research literature were examined; items were also derived from the interview transcripts of a previous study (O'Brien & Toms, 2008).
With regard to existing measures of the attributes, this investigation spanned the marketing, psychology, information science, and human-computer interaction literature. Appendix A shows, for each of the sub-scales, the source from which items were derived. In total, 100 items were considered for inclusion in the final scale. These items varied in focus and physical make-up. For example, some were developed to study users' interactions with computers (e.g. users' aesthetic appraisals of websites [Pace, 2004]), whereas others were intended for general environments that may or may not involve computers (e.g. The Subjective Leisure Scale [Unger & Kernan, 1983]). Among the instruments designed for use in human-computer interaction, some were specific to a particular domain. For instance, the novelty attribute was measured in online shopping (Mathwick & Rigdon, 2004; Novak, Hoffman, & Yung, 2000) and in general web searching (Huang, 2003; Zhang & von Dran, 2000). Other measures were not domain specific (e.g. The Situational Motivation Scale [Guay, Vallerand, & Blanchard, 2000]). The focus of some existing measures was on people's psychological states (Witmer & Singer, 1998) or personality traits (Litman & Spielberger, 2003; Reio, 1997), but other items were specific to the computer interface. For instance, Schmidt et al. (2003) looked at features of computer interfaces that drew or detracted from attention, such as advertisements or pop-up windows. In some cases, the attribute was the sole phenomenon being investigated, as with Lavie and Tractinsky's (2004) aesthetics instrument. Measures of feedback (Schmidt, Bauerly, Liu, & Sridharan, 2003; Zhang & von Dran, 2000), affect (Ghani, Supnick, & Ryan, 2001; Mathwick & Rigdon, 2004; Pace, 2004; Novak, Hoffman, & Yung, 2001), attention (e.g. Choi & Kim, 2004; Huang, 2003; Novak, Hoffman, & Yung, 2000; Webster & Ho, 1997; Webster, Trevino, & Ryan, 1993), and challenge (Mathwick & Rigdon, 2004; Pace, 2004; Novak, Hoffman, & Yung, 2001) were located in scales that purported to measure broader constructs, such as play and flow.
Since overall impressions of system use influence perceptions of other attributes of experience (DeLone & McLean, 1992), we incorporated items pertaining to ‘loyalty’ to an application (e.g. Choi & Kim, 2001) and ‘intention to use’ (Webster & Ahuja, 2004) into the overall scale. We also felt it was important to acknowledge previous attempts to measure engagement and incorporated Webster & Ho's (1997) seven items.
In total, 450 items (100 from existing instruments, 350 from the interview transcripts) were included. Items were initially screened according to their potential to be used across multiple computer domains. First, those specific to one domain, along with duplicate items, were removed. Second, items were made uniform in their formatting. Some existing measures and interview citations consisted of one-word descriptions. For instance, Park, Choi, & Kim (2004) used adjectives (e.g. balanced, complex) to derive 13 aesthetic dimensions, while Huang (2003) plotted users' perceptions of websites along a continuum of opposite pairs of terms (e.g. nice-awful or entertaining-weary). As might be expected with human speech, interview statements were not always complete sentences: e.g. ‘not all clumped together’ (referring to interface layout) and ‘just engrossed in what I was doing.’
The items were further examined with respect to their wording and formatting. Ambiguous terms (e.g. could, should, or might) and vague quantifiers (e.g. occasionally, most, or very) were removed to avoid confounds arising from differing interpretations by respondents. In addition, the ‘tone’ of the questions was checked to ensure that both negatively and positively phrased items were present, to avoid the position effect (DeVellis, 2003; Peterson, 2000). This latter step was carried out in conjunction with an independent coder, who also judged the saliency and semantic clarity of each item. This process reduced the set to 124 items.