Data management for eScience in Brazil

Authors

  • Fabio Porto (corresponding author), Data Extreme Lab (DEXL), National Laboratory of Scientific Computing, Petropolis-RJ, Brazil
  • Bruno Schulze, Distributed Scientific Computing (COMCIDIS), National Laboratory of Scientific Computing, Petropolis-RJ, Brazil

In the 5th edition, submitted papers underwent a peer review process with up to three reviews per submission. The edition was organized into three tracks of research paper presentations, a keynote talk, and a poster session. The scientific applications track included two papers. The paper by Bustos et al. [1] used time series analysis to forecast fluids in oil reservoirs. The paper by Cugler et al. [2], whose extended version was invited for this special issue, explores the new field of managing animal sound recordings using database technology. The adoption of techniques from high performance computing (HPC) in eScience was the theme of track 2. The first two papers discussed techniques involved in the modeling and simulation of natural phenomena. The first paper [3] discusses the adoption of an HPC environment in the modeling of protein sequences. Next, the paper by Sabino et al. [4] presents the challenges of using clusters of graphics processing units to compute 3-D wave propagation simulations using finite difference methods. Lastly, the paper by Chirigati et al. [5], also selected to submit an extended version to this special issue, discusses techniques for deploying scientific workflows in HPC environments. Finally, the eScience services track included papers along roughly two lines. In the first line, Costa and colleagues [7] explore the feasibility of deploying eScience in the cloud as services. In the second line, two other works address data modeling, including quality metrics for bio-ontologies [8] and a database integration strategy for ecological data sources [9].

The 5th BreSci workshop also welcomed Prof. Luiz Nicolaci da Costa, an astronomer at the National Observatory in Brazil. Prof. da Costa heads the LIneA laboratory, responsible for the storage, analysis, and publishing of astronomy catalogs produced by large surveys. In his talk, Prof. Nicolaci presented the Portal, a software infrastructure through which scientists can query catalogs and run scientific pipelines over them. The workshop ended with a poster session.

After the workshop, the steering committee of BreSci indicated the best papers, whose authors were invited to extend their contributions for submission to this special issue. Four invited papers were submitted as extended versions, of which two were finally accepted for publication and form part of this special issue.

SPECIAL ISSUE PAPERS

The first paper, by Cugler et al. [2], ‘An architecture for semantic retrieval of animal sound recordings’, presents innovative research in biodiversity data management. The authors investigate techniques to support the management of recordings of animal sounds captured in natura and mixed with contextual information, such as environmental conditions and social events. The approach proposes retrieving animal sound recordings on the basis of the analysis of contextual information associated with the recordings as metadata and the use of controlled vocabularies with ontological inference support. The authors present a first prototype that implements part of the ideas proposed in the paper.
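
To make the retrieval idea concrete, the following is a minimal sketch, in Python, of metadata-based retrieval guided by a controlled vocabulary. The toy ‘is-a’ hierarchy, the metadata fields, and the function names are illustrative assumptions; they do not reproduce the architecture of Cugler et al. [2].

```python
# Minimal sketch of metadata-driven retrieval with a controlled vocabulary.
# The "is-a" hierarchy, metadata fields, and function names are illustrative
# assumptions and do not reproduce the architecture of Cugler et al. [2].

# Toy "is-a" hierarchy acting as a controlled vocabulary.
IS_A = {
    "tree frog": "frog",
    "frog": "amphibian",
    "toad": "amphibian",
}

# Each recording carries contextual metadata captured in natura.
RECORDINGS = [
    {"id": "rec-001", "taxon": "tree frog", "habitat": "rainforest", "rainfall_mm": 12.0},
    {"id": "rec-002", "taxon": "toad", "habitat": "wetland", "rainfall_mm": 0.0},
    {"id": "rec-003", "taxon": "cricket", "habitat": "grassland", "rainfall_mm": 0.0},
]

def narrower_terms(term):
    """Return the query term plus all terms that specialize it (simple inference)."""
    terms = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in terms and child not in terms:
                terms.add(child)
                changed = True
    return terms

def retrieve(taxon, min_rainfall=0.0):
    """Select recordings matching the (vocabulary-expanded) taxon and the context filter."""
    accepted = narrower_terms(taxon)
    return [r for r in RECORDINGS
            if r["taxon"] in accepted and r["rainfall_mm"] >= min_rainfall]

if __name__ == "__main__":
    # 'amphibian' also matches 'tree frog' and 'toad' through the vocabulary.
    print(retrieve("amphibian", min_rainfall=5.0))  # only rec-001 satisfies both filters
```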

The second accepted paper, entitled ‘Chiron: A Parallel Engine for Algebraic Scientific Workflows’ by Ogasawara et al. [6], describes a parallel workflow engine designed to efficiently process data-centric scientific workflows. The system assigns known algebraic operators to workflow activities, leveraging the operators' processing semantics. On the basis of the analysis of the algebraic operators' semantics, the system can compute an optimized workflow. Moreover, during execution, an activity and its data compose a processing unit, called an activation, which can be freely scheduled on an HPC cluster. Finally, Chiron implements two activation dispatching modes: static and dynamic. Chiron has been tested with complex scientific workflows from the oil and gas industry and has shown very promising results, placing itself as a good candidate for composing the eScience workflow ecosystem.
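
As an aside, the following is a minimal sketch, in Python, of the activation idea: an activity plus the data fragment it consumes forms a self-contained, freely schedulable unit, here dispatched dynamically to a pool of workers. The class and function names are assumptions for illustration and are not Chiron's actual API.

```python
# Illustrative sketch of the "activation" idea: an activity plus the data
# fragment it consumes is a self-contained, freely schedulable unit. The class
# and function names are assumptions for illustration, not Chiron's API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Activation:
    """One activity applied to one input chunk."""
    activity: Callable[[Sequence[float]], List[float]]
    chunk: Sequence[float]

def square_all(chunk):
    """A 'Map'-like algebraic activity: one output tuple per input tuple."""
    return [x * x for x in chunk]

def make_activations(activity, dataset, chunk_size):
    """Fragment the input relation into chunks; each chunk yields one activation."""
    return [Activation(activity, dataset[i:i + chunk_size])
            for i in range(0, len(dataset), chunk_size)]

def run(activation):
    return activation.activity(list(activation.chunk))

def dispatch_dynamic(activations, workers=4):
    """Dynamic dispatch: idle workers pull the next available activation
    (threads stand in for cluster nodes in this toy example)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, activations))

if __name__ == "__main__":
    data = [float(x) for x in range(10)]
    activations = make_activations(square_all, data, chunk_size=3)
    print(dispatch_dynamic(activations))
```

A static dispatcher would instead pre-assign activations to workers before execution starts, trading runtime load balancing for a simpler scheduler.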

DATA MANAGEMENT FOR ESCIENCE

The 5th BreSci offered an up-to-date window onto the status of the development of eScience in Brazil. One can observe that some disciplines have invested more eagerly in transposing their experiments to in silico environments. Notable are the cases of biology, astronomy, biodiversity, and oil and gas. These disciplines were pushed by the development of new digital instruments and computational simulations, whose products are numerical results. In the case of biology, traditional qualitative and descriptive research is leaving the stage for quantitative analysis of DNA information. High-throughput sequencing machines are the basis for the interpretation of genomes in support of agronomic research at EMBRAPA, as well as various genome and metagenome projects at the LabInfo laboratory of the National Laboratory of Scientific Computing. Astronomy has also moved a huge part of its scientific life cycle to in silico environments. Important international collaboration projects, called surveys, use photometric images from ground-based telescopes to capture sky shots and use computing algorithms to analyze the images and automatically identify billions of stars and galaxies, composing a catalog of sky objects. More recent projects are expected to produce terabytes of data per observation night that must be managed, analyzed, and published. Since the Sloan Digital Sky Survey, the community has understood the importance of storing catalogs in relational databases. However, as the amount of collected data grows, current relational database systems are no longer sufficient to efficiently manage the data [10]. There are many parallel initiatives trying to bridge this gap. SciDB [11] is a clustered multi-array database system that supports multidimensional, space-time astronomy data. The LSST project [12] is building its own system, QServ, to scale to thousands of database partitions and to support distributed parallel query processing. The case of astronomy has raised important issues in data management for huge science projects.

  • Data must be partitioned across hundreds or thousands of physical nodes (a minimal sketch of such partitioning follows this list).
  • Data are consumed by two types of applications: form-based queries and scientific workflows.
  • Data and processing should be co-located on the same node, avoiding data transfer.
  • Data are usually not updated in place; complete new releases substitute for previous data.
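
The following sketch, in Python, illustrates the first and last points under purely illustrative assumptions: catalog rows are hash-partitioned by object identifier, and every new release replaces the previous one wholesale. The node count, record layout, and function names are hypothetical and are not those of QServ, SciDB, or any particular survey.

```python
# Toy hash partitioning of a sky-object catalog across nodes, with whole-release
# publication instead of in-place updates. Node count and record layout are
# illustrative assumptions only.
NUM_NODES = 4

def node_for(object_id: int) -> int:
    """Hash-partition by object identifier: every node owns a disjoint slice."""
    return hash(object_id) % NUM_NODES

def partition_release(catalog):
    """Split one catalog release into per-node partitions."""
    partitions = {n: [] for n in range(NUM_NODES)}
    for row in catalog:
        partitions[node_for(row["object_id"])].append(row)
    return partitions

def publish_release(store, release_tag, catalog):
    """Data are not updated in place: a new release replaces the previous one."""
    store[release_tag] = partition_release(catalog)
    return store

if __name__ == "__main__":
    catalog_v1 = [{"object_id": i, "ra": i * 0.1, "dec": -i * 0.05} for i in range(12)]
    store = publish_release({}, "DR1", catalog_v1)
    print({node: len(rows) for node, rows in store["DR1"].items()})
```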

In this context, systems developed to support huge volumes of data, such as astronomy surveys, shall consider data partitioning as a data storage strategy. This is, in some respects, new to scientific computing infrastructures running tightly coupled methods, which have resorted to HPC platforms where nodes have very little permanent storage and storage devices are accessed via high-throughput network connections. In a scientific workflow environment, for instance, in which workflow activities communicate through files, the huge data transfer may jeopardize the gains obtained from activity parallelism. As astronomy data are stored in databases and, eventually, processed by scientific workflows, another issue appears in the lack of integration between scientific workflow management systems and distributed relational database systems. As a matter of fact, the two systems run in complete independence, precluding any integrated optimization, such as placing activities on the same nodes as the data partitions they shall process. Some parallel database systems have been extended to integrate parallelism à la MapReduce to process over partitioned data. Whereas such solutions are a nice response to queries over huge databases, they are incapable of running general program pipelines, as in scientific workflows.
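
To illustrate the co-location point, the sketch below assumes a partition map recording which node owns each data partition and groups workflow activations by that node, so that activities run where their data already reside. The scheduler, partition names, and activity names are hypothetical and do not correspond to any existing workflow system or database API.

```python
# Sketch of locality-aware placement: each workflow activation is dispatched to
# the node that already stores the partition it reads, avoiding data transfer.
# The partition map, node names, and activity names are illustrative assumptions.
from collections import defaultdict

# Which node holds which catalog partition (e.g., produced by a hash scheme).
PARTITION_MAP = {"part-0": "node-a", "part-1": "node-b", "part-2": "node-a"}

def schedule(activations, partition_map):
    """Group activations by the node that owns their input partition."""
    plan = defaultdict(list)
    for act in activations:
        node = partition_map[act["input_partition"]]
        plan[node].append(act["activity"])
    return dict(plan)

if __name__ == "__main__":
    activations = [
        {"activity": "extract_sources", "input_partition": "part-0"},
        {"activity": "extract_sources", "input_partition": "part-1"},
        {"activity": "cross_match",     "input_partition": "part-2"},
    ]
    # Activities land on the node that holds their data; only results travel.
    print(schedule(activations, PARTITION_MAP))
```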

Given such a panorama, the challenges for eScience in Brazil and worldwide are huge. The papers in this special issue deal with data management and scientific workflow processing and thereby help bridge the gap between science and eScience. We expect the Brazilian eScience community to continue facing these challenges in future editions of the Brazilian eScience workshop.

ACKNOWLEDGEMENTS

We would like to thank the authors for contributing papers on their research on data management for eScience to this special issue, and all the reviewers for providing constructive reviews and helping to shape this special issue. Finally, we would like to thank Prof. Geoffrey Fox for providing us the opportunity to bring this special issue to the research community.
