Internet biosurveillance utilizes unstructured data from diverse web-based sources to provide early warning and situational awareness of public health threats. The scope of source coverage ranges from local media in the vernacular to international media in widely read languages. Internet biosurveillance is a timely modality that is available to government and public health officials, healthcare workers, and the public and private sector, serving as a real-time complementary approach to traditional indicator-based public health disease surveillance methods. Internet biosurveillance also supports the broader activity of epidemic intelligence. This overview covers the current state of the field of Internet biosurveillance, and provides a perspective on the future of the field.
Internet biosurveillance, or digital disease detection, utilizes unstructured data from diverse web-based sources to provide early warning and situational awareness of human, animal and plant infectious diseases, as well as chemical, radiological and nuclear threats. The discipline emerged in the mid-1990s, relying primarily on text media for its information, and has evolved into a globally recognized field [3, 4]. With the increasing volume of information and new media types available via the Internet, the field has grown to include social media, participatory sources, and non-text-based sources. The scope of source coverage ranges from local media in the vernacular to international media in widely read languages. Online official reporting sources are typically used to supplement and verify such informal Internet sources.
Internet biosurveillance is a timely modality that is available to government and public health officials, healthcare workers, and the public and private sector, serving as a real-time complementary approach to traditional indicator-based public health disease surveillance methods [5, 6]. Internet biosurveillance also supports the broader activity of epidemic intelligence (EI). This review covers the current state of the field, and provides a perspective on its future.
This is not a ‘systematic review’; rather, this article outlines a general process of Internet biosurveillance according to established best practices, and discusses common technologies employed in extant systems. Each step of the process is collectively described, drawing upon personal experiences of system builders and practitioners, as well as published studies. The authors contributing to this article are either affiliated with Internet biosurveillance systems, are end-users of Internet biosurveillance systems, and/or have published recently in the field. Authors from the following active Internet biosurveillance systems are represented: BioCaster, the Global Public Health Intelligence Network (GPHIN), HealthMap, the Medical Information System (MedISys) (Steinberger et al., IDRC, 2008, Short and Extended Abstracts, pp. 612–614, http://publications.jrc.ec.europa.eu/repository/handle/111111111/13078 (accessed 9 February 2013)), the Program for Monitoring Emerging Diseases (ProMED-mail), and the Pattern Understanding and Learning System (PULS).
The process of Internet biosurveillance varies, but, in general, includes: (i) the collection and storage of data from the Internet; (ii) processing those data to produce information; (iii) assembling that information into analyses; and (iv) dissemination of analyses to end-users (Fig. 1). Each part of the process can entail many technical steps, which are described below. Information vetting can occur through fully automated, human-moderated or partially moderated approaches throughout the process. Multilingual data are managed via human linguists, machine translation, and natural language-processing technology.
Collection and storage
Internet biosurveillance systems rely on data from a variety of sources. Publicly available, informal sources include text-based news sites (e.g. New York Times and Thanh Nien News) and social media sources (e.g. Twitter, Facebook, and blogs); more recently, sources that utilize public input (e.g. FluTrackers, Flu Near You, and crowdsourcing platforms) have gained popularity and credibility. Information from these sources is often available in real time as an event is developing. This information is validated and supplemented by official, publicly available information sources (e.g. public health agencies, ministries of health, the WHO, the World Organization for Animal Health, and the Food and Agriculture Organization). Systems also may utilize sources with paid content (e.g. newswires and news aggregators). Audio and video sources provide non-text-based information. Sources range widely in geographical coverage, from local to international, and cover all languages with publicly available media.
Data are retrieved from the Internet via two predominant modalities: media aggregators and system-specific web monitoring. As an example of the latter, Internet biosurveillance systems monitor the web by scraping (that is, specific web pages are accessed and stored) or crawling (that is, in addition to storing one specific web page, links on that page and links of links are accessed and stored).
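The distinction between scraping and crawling can be sketched in a few lines. The following is a minimal illustration only, not the implementation used by any of the systems named in this article: the in-memory FAKE_WEB dictionary, the fetch helper and the max_depth parameter are assumptions made so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real system would issue HTTP requests instead.
FAKE_WEB = {
    "site/a": ("Dengue cases reported", ["site/b", "site/c"]),
    "site/b": ("Health ministry statement", ["site/d"]),
    "site/c": ("Sports results", []),
    "site/d": ("Follow-up report", ["site/a"]),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (text, links) or None."""
    return FAKE_WEB.get(url)


def scrape(urls):
    """Scraping: access and store only the listed pages."""
    store = {}
    for url in urls:
        page = fetch(url)
        if page is not None:
            store[url] = page[0]
    return store


def crawl(seed, max_depth):
    """Crawling: store the seed page, then follow links (and links of
    links) breadth-first, up to max_depth hops from the seed."""
    store = {}
    queue = deque([(seed, 0)])
    seen = {seed}  # avoids revisiting pages that link back to each other
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        store[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return store
```

Calling scrape(["site/a"]) stores one page, whereas crawl("site/a", 2) also stores the pages reachable within two links of it.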
Systems re-visit a list of predefined sites at regular intervals (typically, once to several times each day) in order to process data in a timely manner for early alerting. For paid or access-limited content, items might be accessed via a secure connection. News items from online news sites and social media are converted to a common format after retrieval, to enable searching and content mining. Public health agencies and ministries of health often provide their own feeds with official information. Feeds from aggregator news sites (e.g. Google and Yahoo) can be used to provide additional coverage. Content is extracted from the HTML code, with proper removal of advertisements and any other irrelevant text.
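As a rough illustration of content extraction, the sketch below uses Python's standard html.parser module to keep visible text while discarding scripts, style sheets and a few container elements that commonly hold navigation or advertising. The choice of tags in SKIP_TAGS is an assumption for this example; production systems generally need site-specific or statistical rules to remove advertisements reliably.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    # Assumed set of elements treated as non-article content.
    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Text produced in this way can then be converted to whatever common format a system uses for searching and content mining.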
Social media data stem mostly from Twitter and Facebook, and can be retrieved via the providers' application programming interfaces. Access may be limited to a certain volume, and is subject to change according to the provider's Terms of Service. As some social media users are unaware that they publish their opinions worldwide, privacy issues arise under some jurisdictions, even with the publicly available data. Participatory data can be included via dedicated apps (e.g. iPhone and Android) or websites where users can leave comments (e.g. http://www.flutrackers.com/; http://www.healthmap.org/outbreaksnearme/) [15, 16].
Once data are retrieved from the Internet, they must be processed to make them amenable for analysis. We emphasize that, because different types of users have different needs, there is no single, overarching goal for the data-processing step. Nevertheless, the following categories represent important steps in biosurveillance data processing: translation, relevancy ranking, ontology, event extraction, and de-duplication.
Although Arabic, Chinese, English, French, Spanish and Portuguese dominate the world's online news media, news of an outbreak event can appear in any language, and is often reported first in a local language. Systems have choices to make regarding the approach to translation. For example, they can build customized pipelines for a few languages, or they can translate each source language into a common target language. The decision is influenced by factors such as the availability of resources in each language, the time available to maintain each resource, and the translation quality required. For example, BioCaster employs full text translation first and uses only English language selection algorithms, whereas MedISys and HealthMap are language-specific in terms of the keywords employed to search Internet data. GPHIN employs both language-specific keywords and algorithms to extract relevant data from the Internet and news aggregator databases, whereas PULS employs language-specific linguistic analysis and ontologies and inference rules to extract relevant data.
The next stage in processing is to assess the relevancy of the report according to some measure of the user's interest. Defining the user's interest as a set of guidelines, a decision tree or as a collection of examples is a crucial stage in system building, and provides a reference standard against which to evaluate various algorithms. Once this has been done, various approaches can be implemented, including supervised classifiers such as Naïve Bayes or Support Vector Machines with learn-to-rank, and Boolean keyword searches, which include logical operators such as AND and OR. These techniques are language-specific, but it is also possible to deploy automated methods that are language-independent, such as clustering followed by automated labelling.
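To make the supervised option concrete, the following is a minimal sketch of a multinomial Naïve Bayes relevance classifier with add-one smoothing, written in plain Python. It is illustrative only: the class name, the whitespace tokenizer, the two labels and the tiny training set are assumptions, and it does not reproduce the classifier of any system discussed here.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    def fit(self, docs, labels):
        """Count documents per label and word occurrences per label."""
        self.doc_counts = Counter(labels)
        self.word_counts = {}
        for doc, label in zip(docs, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood for each label.
        Words never seen in training are ignored."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log(
                        (counts[word] + 1) / (total_words + len(self.vocab))
                    )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

    def log_odds(self, text, positive="relevant", negative="irrelevant"):
        """Higher values mean more relevant; usable as a ranking key."""
        scores = self.log_scores(text)
        return scores[positive] - scores[negative]


# Invented training examples standing in for a reference standard.
model = NaiveBayes().fit(
    [
        "dengue outbreak cases confirmed",
        "cholera outbreak deaths reported",
        "football match results",
        "stock market results reported",
    ],
    ["relevant", "relevant", "irrelevant", "irrelevant"],
)
```

Sorting incoming reports by model.log_odds gives a simple relevancy ranking; a real deployment would need far more training data and an evaluation against the reference standard described above.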
Ontologies have proven useful in many domains (e.g. the life sciences) for structuring relationships between concepts. Biosurveillance requires a conceptual knowledge of diseases, microorganisms, signs and symptoms, and geography. A number of ontological resources have been developed or re-used for public health, although these are not generally as well known as those in experimental biology or clinical fields, such as the Unified Medical Language System. Among those developed specifically for public health are GIDEON (commercial, openly available), BioCaster (open source), and GPHIN (non-commercial, limited access). Such ontologies provide knowledge needed by Internet biosurveillance systems to make intelligent judgements about the terms appearing in news reports. For example, a mention of Yersinia pestis may imply that the disease under consideration is bubonic plague. However, not all ambiguities can be resolved with the static knowledge contained in an ontology. One of the most practical problems is toponym disambiguation (i.e. place names). For example, a mention of a disease outbreak in ‘Cambridge’ might resolve to any of several places worldwide, including the UK or the USA.
Once a set of topics of potential interest has been identified, specific biological events are extracted from the data. This can be accomplished in different ways. As one example, simple keyword recognition algorithms are often used to categorize incoming news items. In this approach, an article is categorized according to predefined keywords (see example in Table 1). Boolean combinations (e.g. AND, OR, NOT) and proximity searches (i.e. search for articles where two or more separately matching term occurrences are within a specified word or character distance) can then be applied.
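A minimal sketch of Boolean and proximity matching follows. The function names and the example keywords are hypothetical and are not taken from the MedISys keyword lists; distance is measured in words rather than characters.

```python
import re


def tokens(text):
    """Lower-cased word tokens, punctuation removed."""
    return re.findall(r"\w+", text.lower())


def contains_all(text, terms):
    """Boolean AND: every term occurs in the text."""
    present = set(tokens(text))
    return all(term in present for term in terms)


def contains_any(text, terms):
    """Boolean OR: at least one term occurs in the text."""
    present = set(tokens(text))
    return any(term in present for term in terms)


def within_distance(text, term_a, term_b, max_words):
    """Proximity search: some occurrence of term_a lies within
    max_words word positions of some occurrence of term_b."""
    toks = tokens(text)
    pos_a = [i for i, tok in enumerate(toks) if tok == term_a]
    pos_b = [i for i, tok in enumerate(toks) if tok == term_b]
    return any(abs(i - j) <= max_words for i in pos_a for j in pos_b)


def is_dengue_report(text):
    """Example rule: 'dengue' near 'outbreak', AND NOT a vaccine story."""
    return (
        within_distance(text, "dengue", "outbreak", 3)
        and not contains_any(text, ["vaccine", "vaccination"])
    )
```

Rules of this kind are cheap to evaluate and easy to audit, but they must be written separately for each language.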
Table 1. Examples of multilingual keywords used for identification of dengue fever in MedISys
More detailed aspects of an outbreak can be extracted by event meta-data extraction, in which the aspects of interest are known and defined a priori. Examples of commonly detected aspects include the name of the disease, the species affected, the date of the outbreak, the numbers of cases and deaths, and the location of the outbreak. Event meta-data extraction uses the extensively researched technology known as information extraction, which is the basis of PULS and BioCaster. Less common aspects include distal indicators of political and social response, such as ward closures or the deployment of international organizations to the affected region. Often, the techniques used are linguistic patterns developed with specific rule systems, but supervised, semi-supervised and unsupervised machine-learning approaches have also been evaluated.
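The pattern-based approach can be illustrated with a few regular expressions. This is a deliberately simplified sketch and not the rule system of PULS or BioCaster: the disease lexicon, the gazetteer and the two patterns are assumptions, and the pattern only recognizes counts written as digits.

```python
import re

# Hypothetical, tiny lexicon and gazetteer for illustration.
DISEASES = {"dengue", "cholera", "influenza"}
LOCATIONS = {"madeira", "lisbon"}

# "<number> [confirmed|suspected] cases" and "<number> death(s)".
CASE_PATTERN = re.compile(
    r"(\d+)\s+(?:(confirmed|suspected)\s+)?cases", re.IGNORECASE
)
DEATH_PATTERN = re.compile(r"(\d+)\s+deaths?", re.IGNORECASE)


def extract_event(text):
    """Fill a fixed, predefined event template from one report."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "disease": sorted(words & DISEASES),
        "location": sorted(words & LOCATIONS),
        "cases": [
            (int(number), status.lower() if status else "unspecified")
            for number, status in CASE_PATTERN.findall(text)
        ],
        "deaths": [int(number) for number in DEATH_PATTERN.findall(text)],
    }
```

Applied to an invented sentence such as "Health authorities in Madeira reported 34 confirmed cases and 22 suspected cases of dengue.", the function returns the disease, the location and two case counts with their status.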
Effective de-duplication is essential for events with wide coverage, so that nearly identical stories appearing in many sources do not overwhelm the user. De-duplication may involve the detection of reports that are identical in content, which are handled in practice with clustering techniques as outlined above. Reports may also be identical in the aspects of the outbreak that they report. De-duplicating these reports in practice is challenging, and can require deeper-meaning analysis. Nevertheless, there are often subtle but important aspects of an event that may not be easily captured, such as the revision of victim numbers, the change in a patient's condition, or a comparison between a novel and a known agent. De-duplication should ideally be sensitive to these grey areas, and pass forward such articles for human analysis.
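One simple way to detect near-identical reports is to compare sets of overlapping word sequences (shingles) using Jaccard similarity. The sketch below uses this technique in place of the clustering methods mentioned above; the shingle length k and the threshold of 0.8 are assumptions that would need tuning on real data.

```python
import re


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    toks = re.findall(r"\w+", text.lower())
    if len(toks) < k:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(reports, threshold=0.8, k=3):
    """Keep a report only if it is not a near-duplicate of one
    already kept (similarity below the threshold for all of them)."""
    kept = []  # list of (text, shingle set)
    for text in reports:
        current = shingles(text, k)
        if all(jaccard(current, previous) < threshold for _, previous in kept):
            kept.append((text, current))
    return [text for text, _ in kept]
```

With these settings, two reports that differ only in capitalization or punctuation collapse into one, whereas "34 cases of dengue confirmed on Madeira" and "517 cases of dengue confirmed on Madeira" score about 0.67 and are both retained, so a revised case count still reaches the analyst.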
At this stage of the process, a biosurveillance system will have produced a structured collection of events that are potentially relevant to end-users. However, only a subset of these may be highly useful, given a particular user's interests. For example, a case of seasonal influenza in a celebrity, although widely reported, may be less relevant than a few reports of a cluster of novel influenza among farmers. Given the conflict between the volume of data to be analysed and the limited ability of humans to review large amounts of information quickly, it is often desirable to process the articles through an automated trend and anomaly detection capability in order to increase throughput and timeliness. The objective is to infer which events are more urgent or unusual in a timely manner, so that the user can investigate further and potentially initiate risk analysis. The challenge is to model what is already known (i.e. what is normal or expected), and to decide whether the current event is significantly at variance as early as possible. We focus on two complementary classes of approach in this section: trend analysis and anomaly detection.
The temporal nature of Internet biosurveillance data produces longitudinal patterns and trends. Precursors and indicators of outbreaks can be tracked over time to show the precedence of an event before symptoms or the populace pass thresholds for warning. Timelines can also be used to track classifiers, keywords, locations, or terms, and indicate temporal traces of events for significance against predefined baselines. Visualizing topical trends and shifts over time based on such lexicons can facilitate the detection of unexpected disease events. Standard time-series algorithms and other signal-processing techniques are often used to model these temporal trends [21-23].
Anomaly detection attempts to put the features of the event into context in order to determine some level of significance. Context is usually considered to be spatial and/or temporal or a mixture of the two, and can be based on simple event counts of a particular disease type or on multiple features of the event. However, in situations where terminology begins to specialize or diverge (e.g. ‘mad cow’ to ‘bovine spongiform encephalopathy’, or ‘swine flu’ to ‘H1N1’), the anomaly detection can be attenuated.
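A count-based temporal detector can be sketched as a comparison of each day's report count with a trailing baseline. This is one simple option among many, not the method of any particular system: the seven-day window, the threshold of three standard deviations, the floor on the standard deviation and the synonym table are all assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical synonym table: mapping diverging terms to one canonical
# label before counting keeps a single time series per disease.
SYNONYMS = {
    "swine flu": "h1n1",
    "mad cow": "bovine spongiform encephalopathy",
}


def canonical(term):
    """Return the canonical label for a term (lower-cased)."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def flag_anomalies(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices whose count exceeds the mean of the preceding
    `window` values by more than `threshold` standard deviations.
    min_sigma prevents very flat baselines from flagging tiny changes.
    The window must contain at least two values."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = max(stdev(baseline), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

For daily counts of [2, 3, 2, 4, 3, 2, 3, 3, 40, 5], only the day with 40 reports is flagged. Note that the spike then inflates the baseline for the following days, which is one reason more robust methods are used in practice.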
Achieving the ultimate public health goals of biosurveillance systems—to facilitate early outbreak detection, thereby allowing timely interventions, limiting the severity and extent of spread—depends on the clear and rapid distribution of information. Internet-based biosurveillance systems use different means of disseminating information, depending on user needs and resources and the nature of the information.
Most systems use a combination of actively ‘pushing’ material to users and allowing users to ‘pull’ material when desired. ProMED-mail, one of the earliest Internet-based biosurveillance systems, uses mailing lists (e-mail) and listserv software, where users can subscribe to specific resources (e.g. animal or plant diseases). GPHIN uses a pushing function to send alerts about events that have been identified as significant to subscribers. Some services (e.g. HealthMap) allow users to specify parameters for pushed information, such as specific diseases, categories of disease, and geographical locations. SMS text messages, mobile telephone networks and social networks (e.g. Twitter) actively send information to anyone subscribing to a feed.
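The difference between pushed and pulled information can be expressed as two small functions over the same alert records. The Alert and Subscription structures below are hypothetical and do not describe the data model of ProMED-mail, GPHIN or HealthMap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    disease: str
    country: str
    headline: str


@dataclass
class Subscription:
    user: str
    diseases: set = field(default_factory=set)   # empty set = any disease
    countries: set = field(default_factory=set)  # empty set = any country

    def matches(self, alert):
        return (
            (not self.diseases or alert.disease in self.diseases)
            and (not self.countries or alert.country in self.countries)
        )


def push(alert, subscriptions):
    """Push: the system decides who is notified about a new alert."""
    return [sub.user for sub in subscriptions if sub.matches(alert)]


def pull(archive, disease=None, country=None):
    """Pull: a user queries the archive on demand with optional filters."""
    return [
        alert
        for alert in archive
        if (disease is None or alert.disease == disease)
        and (country is None or alert.country == country)
    ]
```

In this sketch, a subscriber who lists no diseases or countries receives every alert, which corresponds to a general mailing list, whereas narrower subscriptions correspond to user-specified parameters.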
In addition, most Internet biosurveillance systems have a dedicated website where users may query and filter material on demand. Although they are passive, websites allow users to obtain specific information when it is needed, and they usually provide the capacity to search for specific data (e.g. specific disease categories, locations, or time periods). Geographical mapping, which is automatically generated and displayed by several current systems, allows users to visualize clustering of events over time and space. More recently, smartphone apps have been developed that allow a combination of active and passive dissemination of information (and also allow users to report data back to the system).
With the rationale that it is not always possible to predict who will need a specific piece of information, many systems make their data available freely to anyone. Other systems make their information available to selected groups or individuals. Selectivity of dissemination may be based on the need to restrict access to confidential information (e.g. the Epi-X system of the US CDC, which is available only to vetted public health officials), or a paid subscription model may be used in order to recoup the costs of creating and maintaining the system.
Illustration of Internet biosurveillance: Madeira Island dengue fever outbreak, October 2012
To illustrate how an event is detected and observed to evolve through the lens of an Internet biosurveillance system, consider the October 2012 dengue fever outbreak in the Autonomous Region (island) of Madeira, a Portuguese territory located approximately 1000 km from the mainland. It was the first dengue outbreak in Europe since 1928. With the keyword-based approach outlined in Table 1, MedISys identified several Portuguese media articles on 5 September 2012, reporting that ‘the mosquito Aedes aegypti struck again in force on Madeira’ and ‘left pharmacies without repellents and ointments’ (peak A in Fig. 2) [3, 26].
The data showed a sudden increase in dengue fever reporting in the Portuguese press, and MedISys issued an alert on Wednesday 3 October 2012 (peak B in Fig. 2). In more than 40 news articles, two confirmed and 22 suspected cases of dengue were reported. The story was run in newspapers in other European Union (EU) countries (Spain, Finland, etc.) on 4 October (peak C). On 5 October, 34 cases were reported as confirmed. The story was reported in the French and Belgian press on 10 October and in the UK press on 12 October, following a Reuters news wire story. An update from the Portuguese health authorities (Direcção-Geral da Saúde) was broadly discussed in the news on 8 November (peak D), and 517 confirmed cases were mentioned. The publication of the European Centre for Disease Prevention and Control (ECDC) Rapid Risk Assessment (RRA) update on 20 November met wide coverage, with over 80 articles being published within and outside the EU on 21 November (peak E).
Internet biosurveillance played an important role in triggering an early public health response to this event (the grey bar in Fig. 2). On 3 October, the ECDC noticed a MedISys automated alert, and immediately began the process of verification by contacting the national health authorities of Portugal and gathering additional information from external experts in order to finalize an RRA for the EU population. Following this action, on 4 October, preliminary information about the outbreak was confidentially shared by the Portuguese health authorities with the EU/European Economic Area member states through the Early Warning and Reporting System (EWRS). The EWRS is the EU's official, restricted-access web platform for communication, and enables national authorities to exchange information on confirmed communicable disease events of potential international concern.
Early in the outbreak (near peak C in Fig. 2) on 6 October, the first ECDC RRA was internally finalized, and it was shared a few days later (10 October) with the EU/European Economic Area national health authorities through the EWRS. On 11 October, as agreed with the Portuguese authorities, the ECDC RRA was also made available online for the general public on the ECDC website. In this outbreak, Internet biosurveillance played an important role in making international public health agencies aware of a potential outbreak earlier than would have been the case otherwise. This resulted in an early warning about the risk of infection in travellers returning from Madeira, where tourism is an important part of the economy. It also highlighted the risk of importation of dengue virus to continental Europe via air and sea cargo at the onset of the outbreak.
Outbreak data for human, animal and plant disease, available through informal media channels via the Internet, have been demonstrated to provide detection of anomalous disease events prior to official reporting [30-32]. In general, Internet media have the advantage of being timely, comprehensive, and available in any language from local and international sources. Such information can help to focus traditional surveillance efforts, and provides key data that can be used for a range of important public health purposes. The value and pertinence of Internet biosurveillance have been demonstrated [34-36], and the approach has been integrated into the revised International Health Regulations. Internet biosurveillance therefore contributes to early warning and situational awareness, and aims to trigger public health responses to mitigate outbreaks of infectious disease.
Biosurveillance as an input to EI
Internet biosurveillance has influenced the way in which EI is gathered. To meet its objective of early warning, EI typically combines one or more Internet biosurveillance systems that are complementary to one another, in order to gain a broad view of topics and regions of interest. EI is widely used by national and trans-national public health organizations (e.g. the US CDC, the ECDC, the Public Health Agency of Canada, the French Institute for Public Health Surveillance (InVS), and the WHO) to strengthen their early detection functions [38-40]. The scope of EI and its final objective are broad, and vary according to the mandate and objectives of the implementing institution. For example, EI can be adapted to specific goals, including the early detection of public health emergencies, of specific infectious diseases only, and of public health events during mass gatherings. Nevertheless, the core functions are shared, and EI can be defined as the process of early detection, collection, verification, analysis and organization of information in relation to public health events [42, 43]. EI processes integrate both formal and informal sources of information (e.g. Internet biosurveillance and traditional public health surveillance).
From the end-user perspective, the first EI step is the detection of pertinent raw signals. Official sources of health information (e.g. ministries of health, and surveillance networks) are typically easily identified, and their content is meant to support public health analysis. However, access to these may be difficult and constrained (for example, the information may be available only in the national language, and access to the information may be restricted), and their frequency of publication may not be appropriate for early disease detection. Therefore, informal sources (e.g. Internet media, discussion forums, and social networks) often represent the main source of signals. To collect and process large volumes of such material requires the use of Internet biosurveillance systems.
From the many raw signals observed from Internet biosurveillance systems, EI teams select information according to selection criteria defined by their public health institution. Following this, signals are verified; it is this verification phase that discriminates biosurveillance from EI. Verification consists of confirming and supplementing available information from additional and reliable sources, which are mainly networks of public health experts such as public health institutes, international institutions such as the WHO, World Organization for Animal Health, and ECDC, regional networks such as EpiSouth, laboratories, and non-governmental organizations.
Once verified, events are analysed to assess potential public health significance and potential national and/or international implications. Each is considered within its context and in the light of available scientific knowledge regarding spread, severity, and the efficacy of appropriate control measures. Finally, following this analysis, the detected health threats are communicated to alert health authorities and to inform the public health community.
Needs for future research
Above, we have described the current state of the field of Internet biosurveillance, from data collection to data utilization for EI. Internet technology has significantly advanced the disease surveillance landscape; however, gaps in biosurveillance processes exist, and many challenges lie ahead in the field; some of those are described below.
Real-time signal detection
Sifting through the vast array of multimedia information on the Internet in real time is challenging. The noise of non-specific reports and misinformation complicates signal detection. Moreover, identifying anomalous activity without an established multi-year baseline of reporting for a given disease in a particular region is an obstacle. Anomaly detection is a capability in some biosurveillance systems at present, but there is a need for more robust anomaly detection approaches, including better entity extraction, visual analytical modalities, clustering methods, etc. Moreover, more work is needed on capturing and analysing the data from multilingual sources through linguistic algorithms or automated translation.
Internet biosurveillance data typically cannot be analysed with traditional epidemiological approaches, owing to a lack of timely data verification and validation. For example, recognizing false-positive and false-negative events is problematic, owing to the lack of official comparison data or delays in diagnostic testing. Frequencies of reports or events are often used for anomaly detection. However, identifying a common denominator (e.g. reports, events, articles, and sources) for analysis, and assigning a weight to sources based on accuracy, scope, and publication frequency, are not well established.
Collaboration, networking, and participatory epidemiology
Public self-reporting of events is increasingly recognized as benefiting disease detection. Extracting the data from participatory platforms (e.g. FluNearYou, Twitter, and Facebook) and utilizing them for early detection and surveillance is a critical area of current focus. For example, DIZIE, a project developed at the National Institute of Informatics in Tokyo, Japan, is used to visualize the extent to which Twitter data can detect/track infectious disease outbreaks. More work is needed in this area, as health information sharing on social networking platforms has become prolific. Users and public health experts can utilize these data in real time to track and assess disease situations.
Platforms with user-customizable features based on their specific needs and interests may make participatory modalities more attractive to a wider range of users. Also, more interactive functions for users (e.g. scoring option and comment field) may facilitate user interactions and information dissemination. An example of sharing and networking is the fully functional system for early alerting and reporting of potential chemical, biological, radiological, and nuclear events that has been developed by the Global Health Security Action Group through an extensive collaboration between the Joint Research Centre of the European Commission and a team of risk assessment specialists from the G7+ Mexico countries.
The authors declare that they have no conflicts of interest.