An architecture for scaling federated search



Federated search has tremendous potential to make a wide range of diverse information and viewpoints available to scientists, researchers and the public. Traditional federated search engines provide access to a relatively small number of content sources, typically several dozen or fewer. But, depending on the discipline, there may exist hundreds of databases with relevant content. And, given the value of searching content sources in fields that are seemingly unrelated, a researcher will benefit from simultaneously searching hundreds or thousands of sources. The greater the number of relevant and diverse high-quality sources a researcher can access, the faster he or she will make discoveries that advance science and improve the quality of our lives.

The current paradigm for federated search suffers from a number of problems that hinder the development of large and scalable federated search engines. Search speed, relevance ranking, and source selection all suffer in today's paradigm as the number of sources increases. Deep Web Technologies, a federated search technology company based in Santa Fe, New Mexico, is pioneering the effort to build applications that overcome these obstacles. Deep Web Technologies has created a hierarchical “divide-and-conquer” architecture that distributes the federated search workflow to eliminate the traditional bottlenecks and allow for massive scalability.

Deep Web Technologies built the search engine behind WorldWideScience, a global gateway to government-produced and government-supported science research information. WorldWideScience employs Deep Web Technologies' hierarchical approach to search sources that are themselves federated search engines, and it searches 140 sources through this approach. Deep Web Technologies is building, by mid-2009, a 500-source science research portal. Deep Web Technologies will describe, in the ASIS 2009 poster session, its architecture and how it facilitates large-scale scalability of federated search applications.

Background

WorldWideScience is a noteworthy example of a federated search application that promotes diversity of ideas at an international level by aggregating high-quality science information from government and government-sanctioned sources from 55 countries contributing 49 databases [now 52 databases] and portals and representing approximately 73% of the world's population [1]. Medical Librarian Hope Leman, in a review of WorldWideScience [2], illustrates beautifully the value of an international science portal for casting a wide research net and for finding important information in unexpected places:

“Now here is an example of why this project can lead to improvements in the quality of life for the ill worldwide. One of the results I got for ALS was from the journal Internal Medicine published by the Japanese Society of Internal Medicine. Now, I am quite interested in Japanese views of ALS, given that they have a much higher rate of full ventilation of patients than is true in the US. That is an interesting phenomenon in itself, suggesting that Japanese caregivers and clinicians have a greater willingness to care for patients under these often demanding conditions. And the article I found, “Salivary Chromogranin A: Useful and Quantitative Biochemical Marker of Affective State in Patients with Amyotrophic Lateral Sclerosis,” might sound arcane. But it actually had the very moving conclusion that it is imperative that ways of measuring mood be found for ALS patients, many of whom lose the ability to speak and some of whom become locked in. “Useful biochemical markers of the affective state in advanced patients have not yet been developed.” What a wonderful world we live in where search engines like WorldWideScience render findable scholarship produced in societies not one's own that sets you to thinking about issues that had not before entered your ken.”

WorldWideScience is an excellent example of a federated search application that employs an innovative approach, hierarchical federation, to efficiently search numerous content sources. Hierarchical federation allows multiple federated search engines to be combined, each of which performs a portion of the searching, aggregating, and relevance ranking of content.


Figure 1. Hierarchical federation of sources

Figure 1 illustrates the hierarchical approach. WorldWideScience is a federated search portal that searches 52 sources, one of which is itself a federated search portal. That second portal searches 40 sources, one of which is also a federated search portal, the E-print Network. From a single search page on WorldWideScience, a user can search 140 sources.
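The nesting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Deep Web Technologies' implementation: the class names (`LeafSource`, `FederatedEngine`), the toy sources, and the relevance scores are all hypothetical, and the sketch assumes that scores returned by different engines are directly comparable. Each engine searches its own children, merges their already-ranked lists, and passes only its top results upward, so no single node has to aggregate every source.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field
from itertools import islice


@dataclass(frozen=True)
class Hit:
    score: float  # hypothetical relevance score; higher is better
    title: str
    source: str


@dataclass
class LeafSource:
    """A single database (hypothetical stand-in for a real content source)."""
    name: str
    docs: dict[str, float]  # title -> toy relevance weight

    def search(self, query: str, limit: int) -> list[Hit]:
        hits = [Hit(weight, title, self.name)
                for title, weight in self.docs.items()
                if query.lower() in title.lower()]
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:limit]

    def leaf_count(self) -> int:
        return 1


@dataclass
class FederatedEngine:
    """A node whose children are leaf sources or other federated engines."""
    name: str
    children: list = field(default_factory=list)

    def search(self, query: str, limit: int) -> list[Hit]:
        # Every child returns a list already sorted by descending score,
        # so this node only needs a k-way merge of those lists.
        ranked_lists = [child.search(query, limit) for child in self.children]
        merged = heapq.merge(*ranked_lists, key=lambda h: h.score, reverse=True)
        return list(islice(merged, limit))

    def leaf_count(self) -> int:
        # Number of underlying databases reachable from this node.
        return sum(child.leaf_count() for child in self.children)


# A three-level toy hierarchy mirroring the shape (not the size) of Figure 1.
eprints = FederatedEngine("eprints", [
    LeafSource("physics-archive", {"Dark matter survey": 0.9}),
    LeafSource("math-archive", {"Matter of proof": 0.4}),
])
national = FederatedEngine("national", [
    LeafSource("energy-db", {"Dark matter detectors": 0.7}),
    eprints,
])
world = FederatedEngine("world", [
    LeafSource("journal-db", {"Matter and antimatter": 0.8}),
    national,
])

results = world.search("matter", limit=3)
for hit in results:
    print(f"{hit.score:.1f}  {hit.title}  ({hit.source})")
print("reachable leaf sources:", world.leaf_count())
```

In this toy hierarchy the top-level engine lists only two children, yet a single query reaches four underlying databases. The same effect, at a much larger scale, is what lets one search page cover far more sources than it federates directly.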

Deep Web Technologies is expanding its science research portal to federate 500 sources using the hierarchical approach by mid-2009. To make large-scale federated search a viable paradigm that other vendors and organizations can implement, Deep Web Technologies is actively researching the following elements of a complete solution:

  1. Distributed computing to spread the computation and network loads, i.e., to balance load. In particular, aggregating search results from different sources and ranking them by relevance lends itself to distributed computing.
  2. A mechanism for providing failover to redundant hardware components.
  3. Automated source selection.
  4. A streamlined approach for creating, testing, monitoring, and updating thousands of sources.
  5. A mechanism to query and select a subset of sources from a federated search engine. This is required to avoid searching the same source more than once when it is available through multiple engines.
  6. Standards for building hierarchical federations so that federated search engines from different vendors can interoperate.
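To make the subset-selection element concrete, the following sketch shows one simple way a parent engine could avoid searching the same source through more than one child engine. It is an assumption-laden illustration rather than a described feature: the function `plan_source_selection`, the engine names, and the source identifiers are hypothetical, and the sketch assumes each engine can report its sources under identifiers that are shared across engines.

```python
def plan_source_selection(catalogs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Assign each source to the first engine that offers it.

    `catalogs` maps a (hypothetical) engine name to the source identifiers
    that engine can search. The returned plan maps each engine to the subset
    of its sources that the parent should ask it to search.
    """
    seen: set[str] = set()
    plan: dict[str, list[str]] = {}
    for engine, sources in catalogs.items():
        selected = []
        for source_id in sources:
            if source_id not in seen:  # skip sources already covered elsewhere
                seen.add(source_id)
                selected.append(source_id)
        plan[engine] = selected
    return plan


# Two engines whose catalogs overlap on "patents" and "eprints".
catalogs = {
    "engine-a": ["medical-db", "patents", "eprints"],
    "engine-b": ["eprints", "theses", "patents"],
}
plan = plan_source_selection(catalogs)
print(plan)
```

A first-come assignment like this is the simplest possible policy. A real planner might instead prefer the engine with the fastest or most complete access to a shared source, which is where the proposed interoperability standards would matter.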

Poster Session Proposal

Deep Web Technologies proposes to discuss the hierarchical federation architecture, the obstacles encountered, and accomplishments to date. Additionally, Deep Web Technologies will share its vision of popularizing large-scale federated search applications and explain what it believes it would take to achieve that vision. The poster would illustrate the key components of hierarchical federation and how they interact.