
Cloud-based parallel solution for estimating statistical significance of megabyte-scale DNA sequences

Authors

  • Ahmad M. Hosny,

    Corresponding author
    1. Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
    • Correspondence to: Ahmad M. Hosny, Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Abbassia, Cairo, 11566, Egypt.

      E-mail: ahmad.hosny@yahoo.com

  • Howida A. Shedeed,

    1. Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
  • Ashraf S. Hussein,

    1. Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt
  • Mohamed F. Tolba

    1. Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

  • A preliminary version of this paper, with preliminary results, appeared in the Informatics and Systems Conference Proceedings of INFOS 2012 [1].

Abstract

Estimating confidence in a pairwise local sequence alignment is a fundamental problem in bioinformatics. For huge DNA sequences, this problem is highly compute-intensive because it involves evaluating hundreds of local alignments to construct an empirical score distribution. Recent parallel solutions support only kilobyte-scale sequences and/or rely on sophisticated infrastructures that are not available to most research labs. This paper presents an efficient parallel solution for evaluating the statistical significance of an alignment between a pair of huge DNA sequences using cloud infrastructures. The solution can receive requests from various researchers via a web portal and allocate resources according to their demand, so that the benefits of cloud-based services can be realized. The fundamental innovation of this work is an efficient solution that uses both shared and distributed memory architectures via cloud technology to improve the performance of evaluating statistical significance for a pair of DNA sequences. As a result, the restriction on sequence size is relaxed to the megabyte scale, which was not previously supported for the statistical significance problem. The performance of the proposed solution was evaluated on Microsoft's cloud and compared with existing parallel solutions. The results show that its processing speed outperforms recent cluster solutions that target the same problem. In addition, the performance metrics exhibit linear behavior over the numbers of instances considered. Copyright © 2012 John Wiley & Sons, Ltd.
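To make the idea of an empirical score distribution concrete, the following is a minimal serial sketch, not the authors' implementation. It assumes a shuffle-based null model, a linear gap penalty, and illustrative scoring values; the function names `smith_waterman_score` and `empirical_p_value` are hypothetical, and none of the cloud or parallel machinery described in the abstract is shown.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty).

    Uses two rows of the dynamic-programming table, so memory is O(len(b)).
    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def empirical_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate significance by re-aligning against shuffled copies of b.

    Assumed null model: random permutations of b, which preserve its base
    composition. Returns (observed_score, p_value), with a +1 correction so
    the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    chars = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if smith_waterman_score(a, "".join(chars)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)
```

Each shuffled alignment is independent of the others, which is what makes this kind of computation a natural candidate for distribution across many workers; the quadratic cost of every alignment is why megabyte-scale inputs are expensive.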
