Programmers can no longer rely solely on microarchitectural and technology improvements to make their programs run faster. On today's multicore chips, parallel code must be written explicitly to benefit from the extra processing power. Decoupled software pipelining (DSWP), a recently proposed technique for parallelizing the loops of general-purpose programs at the binary level, has shown good performance only under the assumption of a fast hardware intercore communication queue. In this paper, we propose Java-DSWP, a source-level, DSWP-based parallelization technique that is much simpler than the original DSWP and can be used to parallelize Java applications effectively. In addition, we propose and evaluate a software intercore communication scheme that enables code parallelized with Java-DSWP to run on commodity machines; unlike DSWP, it does not require a hardware intercore communication queue to be efficient. We analyze three memory-based communication queue implementations and present experimental results showing an average speedup of 48% on some SPECjvm2008 benchmarks. Copyright © 2012 John Wiley & Sons, Ltd.
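
To make the idea of a software intercore communication queue concrete, the following is a minimal sketch of a loop split into two pipeline stages that run on separate threads and communicate through a single-producer/single-consumer ring buffer held in ordinary memory. It is not the implementation evaluated in the paper: the class `DswpSketch`, the queue `SpscQueue`, its capacity, and the sum-of-squares loop body are all hypothetical names and choices made only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a two-stage software pipeline over a simple loop.
// All names here are assumptions, not taken from the paper.
public class DswpSketch {

    /** Single-producer/single-consumer ring buffer kept in ordinary memory. */
    static final class SpscQueue {
        private final long[] buffer;
        private final int mask;
        // head is advanced only by the consumer, tail only by the producer.
        private final AtomicLong head = new AtomicLong();
        private final AtomicLong tail = new AtomicLong();

        SpscQueue(int capacity) {
            if (Integer.bitCount(capacity) != 1) {
                throw new IllegalArgumentException("capacity must be a power of two");
            }
            buffer = new long[capacity];
            mask = capacity - 1;
        }

        void produce(long value) {
            long t = tail.get();
            while (t - head.get() == buffer.length) {
                Thread.yield(); // queue full: wait for the consumer
            }
            buffer[(int) (t & mask)] = value;
            tail.set(t + 1); // volatile write publishes the filled slot
        }

        long consume() {
            long h = head.get();
            while (tail.get() == h) {
                Thread.yield(); // queue empty: wait for the producer
            }
            long value = buffer[(int) (h & mask)];
            head.set(h + 1); // volatile write releases the slot for reuse
            return value;
        }
    }

    /** Original sequential loop: compute a value, then accumulate it. */
    static long runSequential(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            long x = (long) i * i;
            sum += x;
        }
        return sum;
    }

    /** Same loop split into two stages connected by the software queue. */
    static long runPipeline(int n) {
        SpscQueue queue = new SpscQueue(1024);
        long[] result = new long[1];

        // Stage 1 computes each value and forwards it to the next stage.
        Thread stage1 = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                queue.produce((long) i * i);
            }
        });
        // Stage 2 receives the values and performs the accumulation.
        Thread stage2 = new Thread(() -> {
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += queue.consume();
            }
            result[0] = sum;
        });

        stage1.start();
        stage2.start();
        try {
            stage1.join();
            stage2.join(); // join makes result[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(runSequential(1000) == runPipeline(1000));
    }
}
```

In this sketch the dependence between the two stages flows in one direction only, so the first stage can run ahead of the second by up to the queue capacity. Whether such a split pays off in practice depends on the cost of each queue operation relative to the work per iteration, which is the kind of trade-off the queue implementations discussed in the paper address.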