A study on using uncertain time series matching algorithms for MapReduce applications

Authors

  • Nikzad Babaii Rizvandi,

    Corresponding author
    1. Network group, National ICT Australia (NICTA), Sydney, Australia
    2. Center for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, Sydney, Australia
  • Javid Taheri,

    1. Center for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, Sydney, Australia
  • Reza Moraveji,

    1. Center for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, Sydney, Australia
    2. Network group, National ICT Australia (NICTA), Sydney, Australia
  • Albert Y. Zomaya

    1. Center for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, Sydney, Australia

Correspondence to: Nikzad Babaii Rizvandi, Center for Distributed and High Performance Computing, School of Information Technologies, University of Sydney, Sydney, Australia.

E-mail: nikzad@it.usyd.edu.au

SUMMARY

In this paper, we study the CPU utilization time patterns of several MapReduce applications. The patterns extracted from these applications, along with their statistical information, are saved in a reference database and later used to tune system parameters so that future, unknown applications execute efficiently. To this end, the CPU utilization pattern of a new application, along with its statistical information, is compared with the known patterns in the reference database to find/predict its most probable execution pattern. Because the patterns differ in length, dynamic time warping (DTW) is used for the comparison; a statistical analysis is then applied to the DTW outcomes to select the most suitable candidates. Furthermore, under a hypothesis, we propose another algorithm to classify applications with similar CPU utilization patterns. Finally, we study the dependency between the minimum distance/maximum similarity of applications and scalability (in both input size and number of virtual nodes). We use widely used applications (WordCount, Distributed Grep, and Terasort) as well as an Exim MainLog parsing application to evaluate our hypothesis on automatically tweaking MapReduce configuration parameters when executing similar applications, scaling both the size of the input data and the number of virtual nodes. The results are very promising and show the effectiveness of our approach on a private cloud with up to 25 virtual nodes. Concurrency and Computation: Practice and Experience, 2012. Copyright © 2012 John Wiley & Sons, Ltd.
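
The DTW comparison mentioned in the summary can be illustrated with a minimal sketch. It assumes the textbook DTW recurrence with an absolute-difference local cost, which may differ from the variant used in the paper. The function names, the dictionary-based reference database, and the sample values are hypothetical, and the nearest-match step merely stands in for the statistical selection, which the summary does not detail.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two sequences of possibly different length.

    Assumption: the local cost is the absolute difference of two samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an already used sample of b
                cost[i][j - 1],      # b[j-1] is matched to an already used sample of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def closest_reference(new_pattern, reference_db):
    """Return the name of the stored pattern with the smallest DTW distance.

    Hypothetical helper: a plain nearest-match rule, not the paper's
    statistical candidate selection.
    """
    return min(reference_db, key=lambda name: dtw_distance(new_pattern, reference_db[name]))


# Illustrative CPU utilization samples (percent); not measured data.
reference_db = {
    "wordcount": [10, 50, 50, 90, 50, 10],
    "grep": [90, 90, 10, 10],
}
best_match = closest_reference([10, 50, 90, 50, 10], reference_db)
```

Because DTW may match one sample to several samples of the other sequence, two patterns with the same shape but different durations can still obtain a small distance, which is why it suits traces of unequal length.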
