Advantages and Disadvantages
Overall, we found both advantages and disadvantages for crowdsourced usability testing (see Table 1).
Table 1. Advantages and disadvantages of crowdsourced usability tests relative to lab usability tests
| Advantages | Disadvantages |
| --- | --- |
| More Participants | Lower Quality Feedback |
| High Speed | Less Interaction |
| Various Backgrounds | Less Focused User Groups |
Advantages: Recruiting participants from crowdsourcing platforms is much easier than asking people to come to a lab to perform a usability test, so it is easier to obtain more data from crowdsourced usability tests.
Lab usability tests usually take about an hour per session and cannot be run in parallel (unless multiple lab spaces and facilitators are available), so the whole process might take days or even weeks. Crowdsourced usability testing eliminates travel, greeting, and setup time, and sessions can run simultaneously, so the whole process can be completed within hours.
The potential cost savings for crowdsourced usability tests are significant as well. While we used unpaid student volunteers for the traditional lab usability test, lab usability tests typically entail paying each participant for a one-hour session a sum larger than their hourly wage. In comparison, the hourly rate for crowd workers is typically about $1.25. Of course, the total cost of a usability test is not only participant compensation: time and monetary costs for test facilitators, labs, equipment, and travel can all be potentially lower in crowdsourced tests.
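As a rough illustration of the arithmetic, the sketch below compares participant-related costs under the two approaches. The crowd-side figure (about $0.77 per tester) comes from this study; the lab-side facilitator wage and per-session overhead are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope cost comparison. The crowd-side cost per tester is taken
# from this study (Table 2); the lab-side figures are illustrative assumptions.

LAB_FACILITATOR_HOURLY = 30.00    # assumed facilitator cost per hour
LAB_SESSION_HOURS = 1.0           # roughly one hour per lab session
LAB_OVERHEAD_PER_SESSION = 10.00  # assumed scheduling/greeting/setup overhead

CROWD_COST_PER_TESTER = 0.77      # average payment per crowd worker (Table 2)

def lab_cost(participants: int) -> float:
    """Estimated facilitator-side cost of a lab test with unpaid volunteers."""
    return participants * (LAB_FACILITATOR_HOURLY * LAB_SESSION_HOURS
                           + LAB_OVERHEAD_PER_SESSION)

def crowd_cost(participants: int) -> float:
    """Estimated worker-payment cost of a crowdsourced test."""
    return participants * CROWD_COST_PER_TESTER

print(f"Lab test, 5 participants:     ${lab_cost(5):.2f}")
print(f"Crowd test, 105 participants: ${crowd_cost(105):.2f}")
```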
Because the time and monetary costs of crowdsourced usability tests are relatively low, the tests can be iterated more affordably. When a usability test is first designed, it can be run as a pilot to check for problems with the test itself and then improved before being launched to more participants, as we did in this study. Because of its high speed and low cost, crowdsourced usability testing may also be easier to run repeatedly throughout the development and maintenance of a website.
Because crowd workers participate from all around the world, it is remarkably easy to conduct a test with participants from various backgrounds. This is especially beneficial to websites whose users are geographically dispersed. Indeed, it would be an easy matter to launch parallel, crowdsourced usability tests, each with a different user audience specified. In lab settings, the time and monetary costs rise significantly if companies want to test participants in other locations. While remote usability testing is possible (Bias & Huang, 2010), the setup and test times are still additive.
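As a hedged sketch of what specifying different audiences might look like in practice, mTurk's built-in Locale qualification can restrict a HIT to workers in a given country. The fragment below (not part of this study's procedure) builds one qualification-requirement list per target audience; each list would be passed as the `QualificationRequirements` argument of `create_hit` (see the `create_hit` sketch later in this section). The country codes are arbitrary examples.

```python
# Minimal sketch (not the procedure used in this study): mTurk's built-in
# Worker_Locale qualification lets a requester restrict a HIT to one country,
# so parallel tests can each target a different audience.
LOCALE_QUAL_ID = "00000000000000000071"  # mTurk's built-in Locale qualification type

def locale_requirement(country_code: str) -> dict:
    """Build a QualificationRequirement limiting a HIT to workers in one country."""
    return {
        "QualificationTypeId": LOCALE_QUAL_ID,
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": country_code}],
    }

# One requirement list per parallel test; each would be supplied as the
# QualificationRequirements argument when the corresponding HIT is created.
audience_requirements = {country: [locale_requirement(country)]
                         for country in ["US", "IN", "DE"]}
```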
Disadvantages: The quantity of feedback from a single crowdsourced participant is much lower than that from a single lab test participant. Many crowd workers seem to want to finish HITs as fast as possible in order to get paid, with little concern for quality.
As such, the quality of usability test results from crowdsourcing is noticeably lower than that of lab test results; workers seemed much less engaged in the test.
mTurk offers no built-in way to interact with workers while they are doing the job. (One can instead run an external HIT hosted on one's own website; this requires substantially more work but allows any functionality one wants.) For internal HITs on mTurk, we cannot provide further instructions in real time if workers are unclear about any of the tasks or questions (they can only send email). Similarly, there is no way to ask participants to "think out loud" while they perform the tasks, and if anything in their feedback is unclear or interesting, it is very difficult to ask them to elaborate.
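For readers unfamiliar with the external-HIT route mentioned above, the following is a minimal sketch using the boto3 MTurk client (not the tooling used in this study) of creating a HIT that frames a usability-test page hosted on the requester's own site. The URL, reward, assignment count, and durations are placeholders, and AWS credentials are assumed to be configured.

```python
# Sketch of an "external HIT": the HIT frames a page on the requester's own
# site, where arbitrary interactive functionality (logging, follow-up prompts,
# etc.) can be implemented. Values below are illustrative placeholders.
import boto3

EXTERNAL_QUESTION = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/usability-test</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

mturk = boto3.client("mturk", region_name="us-east-1")

hit = mturk.create_hit(
    Title="Website usability test",
    Description="Complete three tasks on the site, then answer a short questionnaire.",
    Keywords="usability, website, survey",
    Reward="0.75",                        # payment per assignment (placeholder)
    MaxAssignments=50,                    # number of workers desired
    AssignmentDurationInSeconds=45 * 60,  # time allotted per worker
    LifetimeInSeconds=3 * 24 * 60 * 60,   # how long the HIT stays available
    Question=EXTERNAL_QUESTION,
)
print("HIT group:", hit["HIT"]["HITGroupId"])
```

Submitted results then arrive through the usual mTurk assignment-retrieval workflow, so the extra engineering effort lies mainly in building and hosting the test page itself.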
mTurk workers seem unlikely to spend time giving substantial answers to open-ended questions; a few words or a single sentence is the most likely response. Deriving useful feedback for usability design from such answers can be quite challenging.
Specific user groups are difficult to identify, and some may not have a useful presence among online crowd workers at present. For instance, users with low computer literacy are unlikely to have an account on a crowdsourcing platform such as mTurk or uTest.
Spamming also appears to be common on mTurk. Because the purpose of usability testing is to find problems users may have, or mistakes they may make when using a website, it can be challenging to define good Gold Unit questions to detect spammers. While one can manually inspect participant responses to detect cheating, this is far from ideal and the criteria are hard to define a priori. Such a manual process also erodes the time savings that are one of the main motivations for employing crowdsourcing.
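One illustrative way such Gold Unit screening might be automated is sketched below; the question fields, answers, and pass threshold are hypothetical, and the difficulty noted above lies precisely in choosing gold questions whose answers do not depend on the subjective usability judgments being collected.

```python
# Illustrative Gold Unit screening (not this study's procedure): embed a few
# questions with known answers alongside the real tasks and flag submissions
# that miss too many of them. Field names and threshold are assumptions.

GOLD_ANSWERS = {
    "gold_page_title": "graduate admissions",  # e.g., visible in the page header
    "gold_menu_count": "5",                    # e.g., number of top-level menu items
}
PASS_THRESHOLD = 0.5  # fraction of gold questions that must be answered correctly

def is_probable_spammer(response: dict) -> bool:
    """Flag a submission whose gold-question accuracy falls below the threshold."""
    correct = sum(
        response.get(field, "").strip().lower() == answer
        for field, answer in GOLD_ANSWERS.items()
    )
    return correct / len(GOLD_ANSWERS) < PASS_THRESHOLD

# Example: one plausible submission and one likely spam submission.
submissions = [
    {"worker": "A1", "gold_page_title": "Graduate Admissions", "gold_menu_count": "5"},
    {"worker": "A2", "gold_page_title": "asdf", "gold_menu_count": "1"},
]
for s in submissions:
    print(s["worker"], "spam" if is_probable_spammer(s) else "ok")
```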
Comparison with Traditional Usability Testing
Although this was not a controlled experiment with the same tasks and user audiences tested in each method, we wished to compare the results obtained with each method qualitatively, in hopes of continuing the dialogue about which usability methods are best employed in which circumstances. The results of the crowdsourced usability study and the lab usability study showed notable similarities as well as differences. The two methods differed in number of participants, participant demographics, time spent on tests, specific tasks, and monetary costs (Table 2).
The time spent by participants in the crowdsourced usability test was significantly less than the time spent by participants in the lab usability test. In the lab usability tests, it took approximately 30 minutes for each participant to perform the test. It also took time for the participant and test facilitator to schedule the test.
Crowdsourced usability test participants had a wide variety of backgrounds. From the demographic questionnaire results, the participants' ages ranged from 19 to 51. Most of them (68%) had Bachelor's degrees, but there were also workers with Associate degrees, Master's degrees or Doctoral degrees. In comparison, participants in the lab usability test ranged in age from 24 to 33 and all had at least some graduate-level education.
The usability problems of the website identified by both groups overlapped significantly despite their differences (Table 3). Major problems such as menu overlap and irrelevant pictures were identified by both lab test participants and crowd workers.
Lab usability test participants and crowdsourced usability test participants each identified problems that the other group did not. For example, lab test participants identified the lack of sort function in some pages, while crowd workers identified difficulty in finding the search box. The identification of different problems could easily be explained by the different tasks each group performed and their relative familiarity with the website.
Table 2. Comparison between lab usability test and crowdsourced usability test
| | Lab Usability Test | Crowdsourced Usability Test |
| --- | --- | --- |
| Participants | 5 | 105 (18 spammers) |
| Age | 24 to 33 | 19 to 65 |
| Education level | Bachelor's degree and Master's degree | All levels |
| Experience with similar websites | Yes: 100% | Yes: 62%; No: 38% |
| Speed | Approximately 30 min. per session | Less than 5 hours total |
| Participant costs | None | $2.92 for pilot test; $23.41 for second-round test; $55 for third-round test (avg. $0.77/tester) |
Table 3. Usability problems found from lab usability test and crowdsourced usability test
| Major Problems Identified | Lab Usability Test | Crowdsourced Usability Test |
| --- | --- | --- |
| Font size too small | ✓ | ✓ |
| Information not cross-linked | ✓ | |
| Lack of sort function | ✓ | |
| Search box difficult to locate | | ✓ |
Another issue we encountered was that in the lab tests, participants could ask for more instructions whenever a task or question was not sufficiently clear. With crowdsourced tests, in contrast, workers could not request clarification in real time (only via email, which never happened). When uncertain, crowd workers must therefore act on their best guess, which may be wrong. Task designs for usability tests, especially crowdsourced ones, must be specific and unambiguous.
The same issue arises in interpreting feedback from test participants. In lab tests, we can always ask participants for more details if they say something like "The navigation menus are confusing." In crowdsourced tests, it is more difficult (though not impossible) to send workers follow-up questions asking what they meant by a given response. Unclear feedback is less helpful than more specific feedback.