Assessing the quality and trustworthiness of citizen science data

Correspondence to: Jane Hunter, The University of Queensland, Brisbane QLD 4072, Australia.

E-mail: j.hunter@uq.edu.au

SUMMARY

The Internet, Web 2.0 and social networking technologies are enabling citizens to actively participate in ‘citizen science’ projects by contributing data to scientific programmes via the Web. However, the limited training, knowledge and expertise of contributors can lead to poor quality, misleading or even malicious data being submitted. Consequently, the scientific community often perceives citizen science data as not worthy of being used in serious scientific research, which in turn leads to poor retention rates for volunteers. In this paper, we describe a technological framework that combines data quality improvements and trust metrics to enhance the reliability of citizen science data. We describe how online social trust models can provide a simple and effective mechanism for measuring the trustworthiness of community-generated data. We also describe filtering services that remove unreliable or untrusted data and enable scientists to confidently reuse citizen science data. The resulting software services are evaluated in the context of the CoralWatch project, a citizen science project that uses volunteers to collect comprehensive data on coral reef health. Copyright © 2012 John Wiley & Sons, Ltd.

1 INTRODUCTION

Citizen science projects have grown dramatically in recent years. They combine Web-based social networks with community-based information systems to harness collective intelligence and apply it to specific scientific problems. Online communities of volunteers are now contributing data to projects that range from astronomy [1] to bird watching [2] and air quality [3]. Such projects are democratizing science in that they enable public citizens to actively participate in scientific programmes and allow them to access and use both their own data and the collective data generated by others.

However, there are some inherent weaknesses in citizen science and crowdsourcing projects. The limited training, knowledge and expertise of contributors, and their relative anonymity, can lead to poor quality, misleading or even malicious data being submitted [4]. The absence of the ‘scientific method’ [5] and the use of non-standardized and poorly designed methods of data collection [6] often lead to incomplete or inaccurate data. Also, the lack of commitment from volunteers in collecting field data [4, 7] can leave gaps in the data across time and space. Consequently, these issues have caused many in the scientific community to perceive citizen science data as not worthy of consideration in serious scientific research [8].

Data cleansing and data quality improvement technologies can improve the data quality to a limited extent. For example, fairly simple techniques can be applied to validate data input (e.g. syntax, format and values) by checking compliance against schemas. More complex data quality assessment is possible by comparing the data sets with alternative sources or historical trends. However, these approaches are limited if there is no comparable or longitudinal data for trend analysis. An alternative and complementary approach to data quality enhancement services is to exploit social network analysis tools to assess the perceived trustworthiness of the contributor. A number of trust models and trust metrics have been developed by researchers in the context of Web 2.0 [9-11]—but to date, none have been applied to citizen science data. Our hypothesis is that trust and reputation metrics (such as those developed to provide recommender services in online social networks) can provide a simple and effective mechanism to filter unreliable data. Moreover, by combining trust/reputation metrics with data validation services, we can significantly improve the quality and reliability of the community-generated data, enabling its confident reuse.

In this paper, we describe a technological framework that combines data quality metrics and trust metrics to provide a measure of the reliability of citizen science data. This framework provides mechanisms for improving and measuring the quality of citizen science data through both subjective and objective assessments of the data. It also enables trust between individuals in an online citizen science community to be measured, inferred and aggregated to generate trust metrics for contributed data based on its provenance. In addition, the system provides filtering, querying, visualization and reporting services for scientists that take into account the reliability of the data and its source.

2 OBJECTIVES

The primary objective of this work is to develop software services for improving the quality and measuring the trust and reliability of citizen science data, so it can be confidently reused by scientists. More specifically, the aims are as follows:

  • to identify a set of criteria for measuring data quality in citizen science projects;
  • to develop a set of services for improving data quality in citizen science projects;
  • to evaluate, analyse, refine and optimize these data quality enhancement services—in the context of a citizen science project;
  • to identify a set of criteria or attributes for measuring trust of citizen science data. For example, these might include the following:
    • the contributor's role and qualifications (primary/secondary/PhD student, volunteer, council worker, scientist),
    • the contributor's ranking by other members (direct, inferred or calculated using social trust algorithms),
    • the quality and amount of past data that contributors have submitted,
    • the frequency and period of contributing,
    • the extent of training programmes that a contributor has completed, and
    • consistency with past data or trends,
  • to compare alternative trust models and algorithms for measuring trust and identify those approaches most applicable to citizen science projects;
  • to develop tools for capturing the trust-related attributes and for calculating aggregate trust values (e.g. the optimum weightings that should be applied to the previously mentioned criteria to determine the most accurate measure of the data's trust);
  • to evaluate, analyse, refine and optimize these trust measurement algorithms, tools and services—in the context of a specific citizen science project;
  • to measure the improvements in data quality that result from using trust metrics to filter or remove untrusted data or untrusted contributors; and
  • to investigate and identify optimum mechanisms for displaying and communicating the trust, quality of data and reliability of contributors to other members of the community, especially scientists who are considering reusing the community-generated data.

3 CASE STUDY

The CoralWatch project is a citizen science project managed by the University of Queensland that aims to ‘improve the extent of information on coral bleaching events and coral bleaching trends’ [12]. Currently, the CoralWatch project has over 1300 members from 80 countries around the world. As of May 2011, its members had contributed over 30 000 surveys. CoralWatch provides simple colour charts (Figure 1) that can be used by anyone (scientists, tourists, divers, school students) to provide useful monitoring data on the extent of coral bleaching at particular reefs. Data collected through the CoralWatch programme include: reef name, latitude and longitude, coral species, coral colour, water temperature, date and time, and the method by which the data are collected (e.g. snorkelling, reef walking or fishing). As well as collecting monitoring data, the project also aims to educate the public about the causes and impact of bleaching on coral reefs.

Figure 1.

Use of Coral Health Chart in the field.

New members register through the CoralWatch website. Once registered, a member can request a Do It Yourself (DIY) Coral Health Monitoring Kit. The kit provides a field guide for recording observations. Each observation records the coral type, species and colour intensity of the coral (assessed by comparing it with the chart). ‘The colour charts are based on the actual colours of bleached and healthy corals. Each colour square corresponds to a concentration of symbionts contained in the coral tissue. The concentration of symbionts is directly linked to the health of the coral’ [13]. The user generates an online survey by recording observations (species, colour, latitude, longitude, etc.) along specific transects and inputting the data to the CoralWatch database via an online data entry page.

4 DATA QUALITY ISSUES

A detailed analysis of a subset of the legacy CoralWatch data (approx. 18 560 records, collected between July 2003 and September 2009) was carried out to assess its quality. A significant number of errors were identified. Figure 2 illustrates the distribution of error types and the extent of errors in the data. It shows that most errors were associated with the Global Positioning System (GPS) data (~64% of records) and that the most common problems were missing latitude and longitude values, incorrect signs on those values, or transposed latitude and longitude values. Such errors are easily identified and corrected; left uncorrected, however, the observations are practically useless from a scientific perspective. There were also a significant number of errors in the volunteers' contact details, making it difficult to attribute errors to individuals, to hold individuals responsible for the data or to contact volunteers to clarify, confirm or correct outlying data. The majority of the errors were caused by the following:

  • lack of validation and consistency checking;
  • lack of automated metadata/data extraction;
  • lack of user authentication and automatic attribution of data to individuals;
  • absence of a precisely defined and extensible data model and corresponding schema;
  • lack of data quality assessment measures;
  • lack of feedback to volunteers on their data; and
  • lack of graphing, trend analysis and visualization tools that enable errors or missing data to be easily detected.
Figure 2.

Survey of errors in the legacy CoralWatch data.

By our estimation, over 70% of the errors could be prevented by developing and incorporating new services within the CoralWatch Portal that focus on the issues described earlier, including validation and verification services that are invoked at the time of data input, prior to ingest.
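As an illustration, many of the GPS errors can be detected (and often repaired) automatically when the reef named in a survey has a known bounding box. The following is a minimal sketch in Java; the class, record and category names are ours for illustration, not the deployed CoralWatch code:

```java
/**
 * Illustrative sketch (not the deployed CoralWatch code) of detecting the
 * most common legacy GPS errors: missing values, sign errors and transposed
 * latitude/longitude, given an expected bounding box for the named reef.
 */
public final class GpsErrorChecker {

    /** Expected bounding box for a reef, in decimal degrees. */
    public record BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        boolean contains(double lat, double lon) {
            return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
        }
    }

    public enum GpsError { NONE, MISSING, SIGN_FLIPPED, TRANSPOSED, OUT_OF_RANGE }

    public static GpsError check(Double lat, Double lon, BoundingBox reef) {
        if (lat == null || lon == null) {
            return GpsError.MISSING;
        }
        if (reef.contains(lat, lon)) {
            return GpsError.NONE;
        }
        // Sign error, e.g. +23.44 entered instead of -23.44 for a southern reef.
        if (reef.contains(-lat, lon) || reef.contains(lat, -lon)) {
            return GpsError.SIGN_FLIPPED;
        }
        // Latitude and longitude entered in each other's fields.
        if (reef.contains(lon, lat)) {
            return GpsError.TRANSPOSED;
        }
        return GpsError.OUT_OF_RANGE;
    }
}
```

For example, a survey for a Southern Hemisphere reef submitted with a positive latitude would be flagged as SIGN_FLIPPED and could be repaired by negating the latitude, or referred back to the contributor for confirmation.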

5 APPROACH AND METHODOLOGY

In this section, we describe the different components of the proposed framework (Figure 3) that enable the objectives described in Section 2 to be achieved. Evaluation of these tools and services is carried out in the context of the CoralWatch project, described in Section 3.

Figure 3.

Overview of the proposed framework.

5.1 Data quality and data validation

Wand et al. [14] define data quality as a multidimensional measure of accuracy, completeness, consistency and timeliness. These dimensions can be used to specify whether data are of a high quality by measuring specific deficiencies in the mapping of the data from the real-world system state to the information system state. Such dimensions can be used: to develop data quality audit guidelines and procedures for improving the data quality; to guide the data collection process in the field; and to compare the outcomes of different studies.

Currently, most organizations develop data quality measures on an ad hoc basis to solve specific data quality issues, and practical, usable data quality metrics are lacking [15]. In many cases, these data quality measures are applied as a one-off static process either before or as the data enter the database. This is also apparent in citizen science projects, where data quality measures are generally performed during the submission process only. Lee et al. [16] recommend that data quality metrics should be viewed as dynamic, continuous and embedded in an overall data quality improvement process as part of a data collection system. To achieve data quality improvement in a citizen science context, it is necessary to identify the criteria for high quality data. To do this, we employ a data quality measure cycle (sketched in code after the following list) that includes the following:

  1. identifying the data quality dimensions;
  2. performing data quality measures;
  3. analysing the results and identifying discrepancies; and
  4. implementing tools that provide necessary actions to improve the quality of data.
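A minimal sketch of how this four-step cycle might be organized in Java, using illustrative interface and method names (assumptions for exposition, not the deployed CoralWatch implementation):

```java
import java.util.List;

/**
 * Minimal sketch of the four-step data quality cycle described above.
 * Dimension names follow Table 1; the interfaces and method names are
 * illustrative rather than the deployed implementation.
 */
interface QualityDimension<T> {
    String name();                      // e.g. "Completeness", "Free-of-error"
    double score(T record);             // step 2: measure, normalized to [0,1]
}

interface CorrectiveAction<T> {
    T apply(T record);                  // step 4: attempt an improvement
}

final class DataQualityCycle<T> {
    private final List<QualityDimension<T>> dimensions;  // step 1: chosen dimensions
    private final List<CorrectiveAction<T>> actions;
    private final double threshold;

    DataQualityCycle(List<QualityDimension<T>> dimensions,
                     List<CorrectiveAction<T>> actions, double threshold) {
        this.dimensions = dimensions;
        this.actions = actions;
        this.threshold = threshold;
    }

    /** Runs measure -> analyse -> improve until the record passes or no action helps. */
    T run(T record) {
        while (aggregateScore(record) < threshold) {          // step 3: discrepancy found
            T improved = record;
            for (CorrectiveAction<T> action : actions) {
                improved = action.apply(improved);
            }
            if (aggregateScore(improved) <= aggregateScore(record)) {
                break;                                        // no further progress possible
            }
            record = improved;
        }
        return record;
    }

    private double aggregateScore(T record) {
        return dimensions.stream().mapToDouble(d -> d.score(record)).average().orElse(0.0);
    }
}
```

Modelling each dimension and corrective action as its own object keeps the cycle dynamic, continuous and embedded in the collection system, as Lee et al. [16] recommend: new checks can be added without changing the cycle itself.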

To identify the necessary data quality dimensions for a citizen science project, we conducted questionnaire surveys and interviews with the stakeholders of the data. Table 1 [17] lists the data quality dimensions that were deemed most relevant to citizen science data.

Table 1. A set of data quality dimensions [17].
Accessibility: The extent to which information is available, or easily and quickly retrievable.
Appropriate amount of information: The extent to which the volume of the information is appropriate for the task at hand.
Believability: The extent to which the information is regarded as true and credible.
Completeness: The extent to which the information is not missing and is of sufficient breadth and depth for the task at hand.
Concise representation: The extent to which the information is compactly represented.
Consistent representation: The extent to which the information is presented in the same format.
Ease of manipulation: The extent to which the information is easy to manipulate and apply to different tasks.
Free-of-error: The extent to which information is correct and reliable.
Interpretability: The extent to which information is in appropriate languages, symbols and units, and the definitions are clear.
Objectivity: The extent to which the information is unbiased, unprejudiced and impartial.
Relevancy: The extent to which the information is applicable and helpful for the task at hand.
Reputation: The extent to which the information is highly regarded in terms of source or content.
Security: The extent to which access to information is restricted appropriately to maintain its security.
Timeliness: The extent to which the information is sufficiently up-to-date for the task at hand.
Understandability: The extent to which the information is easily comprehended.
Value-added: The extent to which the information is beneficial and provides advantages from its use.

In the case of CoralWatch, the syntactic aspects of data quality are easy to measure and, in many cases, easy to correct. They include problems with latitude and longitude ranges, spelling errors, invalid temperature values and formatting errors. To reduce syntactic errors, we implemented a metadata and data suggestion/validation process that employs XML schemas and controlled vocabularies to restrict input to permitted values/ranges and to validate the data. Registered, authenticated members submit their data through a user-friendly Web page that performs form validation and checking before the data are ingested. For example, country lists and reef names are validated against the GPS data provided by the member, using the BioGeomancer georeferencing service [18]. Input data are run through the data quality cycle on submission, and the data are assigned a rating value based on the outcome of the quality measurement process. Data that do not pass the quality assessment are marked ‘unvalidated’.
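The following sketch shows the kind of server-side syntactic checks run on submission; the vocabulary, value ranges and method names are our assumptions for exposition rather than the actual CoralWatch schema:

```java
import java.util.Set;

/**
 * Illustrative server-side validation pass run on submission (names and
 * ranges are assumptions, not the actual CoralWatch API).
 */
final class SurveyValidator {

    // Coral growth-form categories used on the colour chart (illustrative set).
    private static final Set<String> CORAL_TYPES =
            Set.of("Branching", "Boulder", "Plate", "Soft");

    /** Returns true if the observation passes all syntactic checks; failing data are marked 'unvalidated'. */
    static boolean validate(String coralType, double lat, double lon,
                            double waterTempCelsius, int colourScore) {
        return CORAL_TYPES.contains(coralType)
                && lat >= -90 && lat <= 90                          // valid latitude range
                && lon >= -180 && lon <= 180                        // valid longitude range
                && waterTempCelsius >= -2 && waterTempCelsius <= 40 // plausible sea temperature
                && colourScore >= 1 && colourScore <= 6;            // chart saturation scale
    }
}
```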

Checking the syntax of the data is simple compared with checking the correctness of the actual values. For example, comparing contributed data with ground truth is difficult when no ground truth data are available for comparison. In the case of the CoralWatch data, it can, with some difficulty and imprecision, be roughly correlated against related datasets such as ReefCheck data [19], NASA MODIS satellite imagery [20] and AIMS bleaching events data [21]. These organizations collect data using other techniques, such as sensors, satellite imagery and sea surface temperature measurements, to assess the health of coral reefs. Hence, such datasets provide an imperfect benchmark against which we may be able to identify outliers or generate a rough indication of data quality.
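For example, a contributed water temperature might be flagged against a harvested benchmark value as follows (a minimal sketch; the three-standard-deviation threshold is our illustrative choice, not a rule from the CoralWatch system):

```java
/**
 * Rough outlier check against an external benchmark. The reference mean and
 * standard deviation would come from a harvested dataset (e.g. satellite sea
 * surface temperature for the survey's location and date).
 */
final class BenchmarkComparator {
    /** Flags a submitted value that deviates implausibly from the benchmark. */
    static boolean isOutlier(double submitted, double referenceMean, double referenceStdDev) {
        return Math.abs(submitted - referenceMean) > 3.0 * referenceStdDev;
    }
}
```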

5.2 Exploiting social trust metrics

A considerable amount of research effort has recently been focused on trust, reputation and recommender systems in the context of e-commerce (eBay), social networking sites and social tagging sites [22]. The measured reputation value in these systems is usually a measure of the reliability or quality of users, products (books, films, music), posts, services or user activities. The methods by which these systems calculate and represent a reputation value vary significantly. For example, online marketplace sites such as eBay and Amazon treat reputation as a single, context-independent value (represented as a number, star or bar rating). The information used to calculate the reputation value is derived from other agents that have interacted previously with the target agent [9]. None of these previous systems [9-11, 22] has investigated the application of trust metrics to citizen science data. In this paper, we demonstrate how we apply and extend the reputation model developed by Golbeck [9] to calculate reputation within the context of a citizen science project.

5.2.1 Calculating reputation

Within citizen science projects, trust can be measured by assessing a range of attributes. These include the following:

  • direct rating between members;
  • inferred ranking or measure of trustworthiness—inferred across the social network using social trust algorithm;
  • direct rating of observations and surveys;
  • the contributor's role and qualifications (primary student, secondary student, PhD student, volunteer, council worker, scientist);
  • the quality of past data that has been contributed;
  • the extent of training programmes that the contributor has completed;
  • the amount of past data contributed; and
  • the frequency and period of contributing.

To calculate an estimate of reputation for entities (both users and data), we apply a weighting to each attribute (such as those listed previously) based on its importance and aggregate the weighted values. Figure 4 illustrates our model for calculating a unified reputation value for an object (rateeObj) based on the criteria listed earlier.

Figure 4.

Unified reputation calculator model.

Each time an object is created in the system, the reputationCalculatorService is called to create a reputationProfile for that object. This contains the currentReputationValue for the object, which is calculated by executing an algorithm over the provided criteria. A reputationProfile can use more than one criterion to derive a reputationValue. The algorithm includes a function that extracts different attributes of the rateeObj, such as the following:

  • total number of contributions;
  • duration of contributing;
  • the volunteer's role (e.g. scientist = 5 stars); and
  • direct or inferred rating.

The system generates reputationValues for the attributes associated with each entity, as an ordered list from lowest to highest. Typically, reputation values are either numbers (e.g. {1, 2, 3, 4, 5}) or strings (e.g. {bad, fine, good, excellent}). In the case of CoralWatch, we use a 5 star rating (1–5 stars). The reputationCalculatorService also registers all the rater objects (raterObj, which can be another user in the system or an automatic rating agent) in the reputationProfile for a rateeObj. An automatic rating agent can be a process that detects the quality of the submitted data and uses the reputationCalculatorService to calculate the currentReputationValue for a user based on their latest submission. The service also keeps track of pastReputationValues recorded in the reputationProfile.
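The following sketch shows the weighted aggregation behind a currentReputationValue, assuming each attribute has already been normalized to [0, 1]; the attribute names, the requirement that weights sum to 1 and the linear mapping onto stars are our illustrative assumptions:

```java
import java.util.Map;

/**
 * Sketch of the unified reputation calculation: a weighted aggregation of
 * the attributes listed above, scaled to the 1-5 star range used by
 * CoralWatch. Attribute names and weights are illustrative assumptions.
 */
final class ReputationCalculator {

    /** Each attribute is pre-normalized to [0,1]; weights are assumed to sum to 1. */
    static double currentReputationValue(Map<String, Double> normalizedAttributes,
                                         Map<String, Double> weights) {
        double weighted = 0.0;
        for (Map.Entry<String, Double> attribute : normalizedAttributes.entrySet()) {
            weighted += weights.getOrDefault(attribute.getKey(), 0.0) * attribute.getValue();
        }
        return 1.0 + 4.0 * weighted;   // map [0,1] onto a 1-5 star rating
    }
}
```

For example, a volunteer with normalized attributes {role = 0.4, directRating = 0.8, contributions = 0.6} and equal weights of 1/3 would score 0.6, giving 1 + 4 × 0.6 = 3.4, displayed as roughly 3.5 stars.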

5.2.2 Inferring reputation across the network

Golbeck [9] used a recommender system to calculate reputation based on user profile similarity. A collaborative filtering algorithm is used to calculate a predicted movie rating that best matches the user's attributes. If the user does not have a direct trust value for a person who rated a particular movie, the system moves one step out in the trust network to find connections to users who rated the movie. The process is repeated until a path is found between the user (user i) and a person who rated the movie (user s). Using the TidalTrust algorithm, shown below, a predictive trust value is calculated between user i and user s, where tij is the trust between user i and user j and tjs is the trust between user j and user s.

$$
t_{is} = \frac{\sum_{j \in \mathrm{adj}(i)} t_{ij} \, t_{js}}{\sum_{j \in \mathrm{adj}(i)} t_{ij}}
$$

where adj(i) denotes the set of users whom user i has directly rated.

We use a similar approach for generating inferred trust values for each member of the CoralWatch trust network. For example, to calculate an aggregate trust measure between a particular user and a particular contributor who do not know each other, we perform the following steps (a simplified sketch follows the list):

  1. enable individuals to rate their trust of other individuals in the network and record these trust ratings;
  2. for the given user and given contributor, look for a direct path between them in the network;
  3. if no direct connection is found between the given user and contributor, the process moves out one step to find neighbouring users in the network, until a path is found between the user and the contributor;
  4. the process then uses the aforementioned algorithm to estimate a measure of inferred trust between the user and the contributor; and
  5. the system then calculates the aggregate trust value for the contributor (from the specific user's perspective) by weighting the trust value calculated earlier and combining it with the other weighted attributes (e.g. role, amount of contributed data, duration of contributing).
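A simplified, cycle-safe sketch of this inference, after Golbeck's TidalTrust [9]; for brevity it omits TidalTrust's shortest-path and path-strength (max-flow) restrictions and simply takes a trust-weighted average over neighbours:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Simplified variant of TidalTrust-style inference (after Golbeck [9]).
 * If no direct rating exists, the inferred trust from source to sink is the
 * trust-weighted average of the neighbours' inferred values:
 *     t_is = sum_j(t_ij * t_js) / sum_j(t_ij)
 */
final class TrustInference {

    /** ratings: rater -> (ratee -> direct trust rating, e.g. 1-5 stars). */
    static double inferTrust(Map<String, Map<String, Double>> ratings,
                             String source, String sink, Set<String> visited) {
        Map<String, Double> direct = ratings.getOrDefault(source, Map.of());
        Double directRating = direct.get(sink);
        if (directRating != null) {
            return directRating;                    // a direct rating exists
        }
        visited.add(source);
        double numerator = 0.0, denominator = 0.0;
        for (Map.Entry<String, Double> neighbour : direct.entrySet()) {
            if (visited.contains(neighbour.getKey())) {
                continue;                           // avoid cycles in the network
            }
            double tij = neighbour.getValue();      // trust in the neighbour
            double tjs = inferTrust(ratings, neighbour.getKey(), sink,
                                    new HashSet<>(visited));
            if (tjs > 0) {
                numerator += tij * tjs;             // move one step out and aggregate
                denominator += tij;
            }
        }
        return denominator > 0 ? numerator / denominator : 0.0;  // 0 = no path found
    }
}
```

A call such as inferTrust(ratings, "alice", "carol", new HashSet<>()) returns alice's direct rating of carol if one exists, and otherwise the weighted average over alice's neighbours.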

Figure 5 illustrates the application of this social trust inferencing algorithm to calculate ‘trustworthiness’ between members of the CoralWatch network, some of whom know each other directly and some of whom do not.

Figure 5.

Visualization of the CoralWatch Social Trust Network.

6 SYSTEM IMPLEMENTATION

6.1 System architecture

Figure 6 provides an overview of the architecture of the revised CoralWatch system. The system uses the PostgreSQL object-relational database management system for storing and processing CoralWatch data. The PL/R language extension allows R statistical functions to be executed within the database (e.g. statistical analysis of CoralWatch data to determine whether a bleaching event has occurred). PostGIS [23] provides geospatial operations, such as high-speed spatial queries and shape union and difference, and geometry types, such as points, polygons, multipolygons and geometry collections. The CoralWatch portal provides the interface by which users upload their data, view surveys and reports, download data and interact with other users.

Figure 6.

System architecture for the CoralWatch project.

The server component is built using Java and JSP. The server interfaces with third party systems and clients through the following:

  1. a Web browser (e.g. Internet Explorer) with an OpenLayers mapping interface (e.g. Bing maps) and a SIMILE timeline, which together enable spatio-temporal search and browsing across the data;
  2. a Smartphone interface (e.g. an iPhone app) that enables data to be uploaded directly from the field (with automatic extraction of GPS data, date and time); and
  3. customized integration tools that harvest data from related repositories (e.g. IMOS, NASA MODIS satellite imagery, AIMS Coral Bleaching data) and map it to a common ontology, so it can be correlated against CoralWatch data. These datasets provide the benchmark or ‘ground truth’ for measuring the quality of the volunteers' data.

6.2 User interface for assigning and displaying trust

Users first need to register via the CoralWatch Web site. Registration requires users to enter their contact details, current role (secondary student, undergrad, postgrad, post-doc, research fellow, professor, teacher, volunteer), expertise and professional qualifications. Once a user has registered, their profile is stored and they are assigned a user ID and password.

Authenticated users create a new survey by first entering the metadata for the survey. The metadata includes the survey's location (reef name, latitude and longitude), date/time, weather conditions and water temperature. A validation process then checks compliance of the input data against an XML schema and controlled vocabularies. Once the user has created a new survey, they can enter the set of observations of coral species and colour (Figure 7) along the transect.

Figure 7.

Submitting data via the CoralWatch Online Data Input/Upload Form.

Every time the user submits an observation, the data are immediately analysed on the server side. The charts generated from this analysis show the colour distribution across the transect. Users can determine whether a bleaching event has occurred by analysing the change in colour over time.
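A sketch of the kind of colour-trend check involved; the 1.5-point drop threshold on the chart's 1-6 saturation scale is our illustrative assumption, not a CoralWatch-defined rule:

```java
import java.util.List;

/**
 * Illustrative check for a possible bleaching signal: compare the mean
 * colour score of the current survey with that of the previous survey
 * along the same transect.
 */
final class BleachingIndicator {

    static double meanColourScore(List<Integer> scores) {
        return scores.stream().mapToInt(Integer::intValue).average().orElse(Double.NaN);
    }

    /** A marked drop in mean colour score suggests a possible bleaching event. */
    static boolean possibleBleaching(List<Integer> previousSurvey, List<Integer> currentSurvey) {
        return meanColourScore(previousSurvey) - meanColourScore(currentSurvey) >= 1.5;
    }
}
```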

Once the survey data have been entered, the next step is to calculate trust metrics for it. To date, we have developed simple tagging tools whereby members of the network can assign trust rankings to other members. The aggregate community trust value for a member is calculated by weighting and combining both direct and inferred trust values together with additional attributes (e.g. role, expertise, quality of past data, frequency and extent of past contributions), as described in Section 5.2. The calculated aggregate trust value is displayed as a 1 to 5 star rating in the user's profile (Figure 8); this information is visible only to the system administrator. Users are not shown the trust value assigned to them or associated with the data they upload.

Figure 8.

User profile showing trust as 1–5 star rating.

6.3 Trust-aware querying and visualization

Finally, we have developed simple filtering, querying and presentation methods that take into account the quality and trustworthiness of the individual surveys. Figure 9 shows the interactive user interface that enables users to browse, query and analyse the data. Surveys are displayed simultaneously on both the map and the timeline above the map (via drawing pin markers). When the timeline is dragged horizontally to a specific date, the surveys that were conducted around that date are displayed on the map. Users can click on the survey markers on either the timeline or the map to display the detailed survey metadata and observational data.

Figure 9.

Interactive user interface for exploring the CoralWatch survey data.

Users are also able to specify the level of trust required. For example, they can enter queries such as: ‘Show me all CoralWatch observations for Masthead Reef between 2007 and 2009 with a ranking of 3 or more stars’. Alternatively, users can display all surveys, but they are colour-coded according to their trust metric (red = 1 star, purple = 2 stars, yellow = 3 stars, white = 4 stars and green = 5 stars).
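Expressed against a hypothetical relational schema (the table and column names are our assumptions, not the actual CoralWatch database layout), the example query above might look like this in the Java server code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Sketch of a trust-aware query: surveys for a named reef, within a date
 * range, with a minimum star rating. Schema names are illustrative.
 */
final class TrustAwareQuery {

    static void printTrustedSurveys(Connection conn) throws SQLException {
        String sql = """
                SELECT s.id, s.reef_name, s.survey_date, s.trust_rating
                FROM surveys s
                WHERE s.reef_name = ?
                  AND s.survey_date BETWEEN DATE '2007-01-01' AND DATE '2009-12-31'
                  AND s.trust_rating >= ?
                ORDER BY s.survey_date
                """;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "Masthead Reef");
            ps.setInt(2, 3);                        // minimum star rating
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%d %s %s %d stars%n",
                            rs.getLong("id"), rs.getString("reef_name"),
                            rs.getDate("survey_date"), rs.getInt("trust_rating"));
                }
            }
        }
    }
}
```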

7 EVALUATION

Following the implementation of the revised CoralWatch system described earlier, evaluations were carried out on different aspects of the system to determine the extent of improvements to data quality and reliability, and the effectiveness of the user interface, system functionality and performance:

  • We compared the type and distribution of syntactic errors in the submitted data before and after the implementation of the data validation services. As anticipated, the use of pull-down menus, controlled vocabularies, range checking and XML schema compliance reduced the number of syntactic errors in the input data by over 70%.
  • A survey of feedback from the users and administrators of the CoralWatch data indicated that the trust metrics associated with individual users should be hidden—so as not to deter volunteers—but that trust metrics associated with specific datasets should be explicit. Poor trust metrics associated with individual volunteers could be used to target online training modules. Good trust metrics could be used to reward, encourage and retain volunteers.
  • The response from users to the ranking/tagging tools and the improved filtering, search and browse interfaces was that these tools were relatively simple to use and greatly improved their ability to understand the temporal, seasonal and spatial trends in coral bleaching events.
  • Deliberate submission of consistently poor data by dummy users was eventually picked up by other members of the network, who assigned low rankings to these contributors. However, there was a delay period during which the data and the user were unknown and assigned an ‘average’ ranking, which was not sufficient to filter the data out.

Further effort is required in order to:

  • identify the best algorithms, weightings and approaches for measuring trust attributes and for calculating overall trust;
  • measure the performance, accuracy, efficiency and scalability of the trust metric tools as the size of the community and the database grows; and
  • monitor the impact of the system on the number and frequency of contributing volunteers, the retention of existing volunteers and the attraction of new volunteers.

8 FUTURE WORK

In future, we would like to investigate the application and viability of attack-resistant trust metrics [12] in the context of citizen science data. An attack-resistant trust metric is designed to filter bogus or malicious users out of a social network, thus reducing the submission of invalid or deliberately misleading data. A Friend of a Friend (FOAF) role-based access control standard [24] can be adopted to define the relationships between members in a citizen science project. These named relationships will form the basis for the certification levels used in this approach. A simple relationship model of a trust network with named edges is shown in Figure 10.

Figure 10.

Named relationships/arcs within a social network.

Each edge between nodes will be assigned a certification level, which will be used to calculate the capacities of accounts. Periodic execution of this trust metric will remove bad nodes (uncertified accounts) from the network, ensuring that only certified, genuine volunteers remain in the system.
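A heavily simplified sketch of the planned pruning pass follows. A full Advogato-style attack-resistant metric computes flow capacities; this sketch captures only the reachability idea, under the assumption that every account appears as a key in the graph map:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of periodic pruning: starting from trusted seed accounts, walk
 * only edges whose certification level meets a minimum, and flag every
 * account that is never reached as uncertified.
 */
final class CertificationPruner {

    record Edge(String to, int certificationLevel) {}

    static Set<String> uncertifiedAccounts(Map<String, List<Edge>> graph,
                                           Set<String> seeds, int minLevel) {
        Set<String> reached = new HashSet<>(seeds);
        Deque<String> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (Edge edge : graph.getOrDefault(node, List.of())) {
                // Follow only sufficiently certified relationships.
                if (edge.certificationLevel() >= minLevel && reached.add(edge.to())) {
                    queue.add(edge.to());
                }
            }
        }
        Set<String> unreached = new HashSet<>(graph.keySet());
        unreached.removeAll(reached);
        return unreached;   // candidate accounts for removal
    }
}
```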

We also plan to extend and evaluate the tagging tools to enable ranking/tagging of geo-located citizen science data. A good example is the approach of Rezel et al. [25], which enables users to add tags to data/observations through a mapping interface. For example, users will be able to attach ranking tags and other annotations to specific observations to highlight data quality issues.

9 CONCLUSIONS

Citizen science is democratizing science in that it enables public citizens and the scientific community to work together in monitoring, managing, maintaining and understanding the environment around us. A literature review has revealed that there is a critical need for methods to improve the quality and trust of citizen science data—and that there exists a range of technologies from the data quality and social trust fields that can potentially be combined to maximize the quality and reuse of citizen science data.

Using the CoralWatch project as a case study, we have implemented a system that demonstrates that it is possible to significantly improve the quality of community-generated observational data through a set of validation and verification tools. We have also shown that it is possible to calculate a measure of the reliability or trustworthiness of citizen science data using a weighted aggregation of both direct and inferred attributes. By explicitly enabling this metric to be displayed to users, and taken into account by querying, filtering and reporting services, we have enhanced the potential reuse of citizen science data by scientists.
