Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
MapReduce is a popular programming paradigm for running large-scale, data-intensive computations, and many frameworks that implement it have been developed in recent years. To leverage such frameworks, however, developers need to familiarize themselves with each framework's API and rewrite their code. We present CORA, a new tool that automatically translates sequential Java programs to the MapReduce paradigm. Rather than relying on a compiler built from tediously designed pattern-matching rules that identify which code fragments in the input to translate, CORA translates the input program in two steps. First, CORA uses program synthesis to identify input code fragments and to search for a program summary (i.e., a functional specification) of each fragment. The summary is expressed in a high-level intermediate language resembling the MapReduce paradigm. Next, each summary found is verified to be semantically equivalent to the original fragment using a theorem prover. CORA then generates executable code from the summary, using the Hadoop, Spark, or Flink API. We have evaluated CORA by automatically converting real-world sequential Java benchmarks to MapReduce. The resulting benchmarks run up to 32.2x faster than the originals, and all are translated without designing any pattern-matching rules.
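To make the idea of a sequential fragment and its MapReduce-style counterpart concrete, here is a minimal hand-written sketch. It is not CORA's output and does not use CORA's intermediate language; the class name `WordCountSketch` and both method names are hypothetical, and plain Java streams stand in for a framework such as Hadoop, Spark, or Flink. The `main` method only checks that the two versions agree on one sample input, which is far weaker than the theorem-prover verification described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Sequential fragment: the kind of loop a translator would start from.
    public static Map<String, Integer> sequential(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.put(w, counts.getOrDefault(w, 0) + 1);
        }
        return counts;
    }

    // Map/reduce-style equivalent (assumption: written by hand for illustration).
    // "Map" step: each word w becomes the pair (w, 1).
    // "Reduce" step: values that share a key are combined with Integer::sum.
    public static Map<String, Integer> mapReduce(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        // Spot check on a single input; prints "true" when both versions agree.
        System.out.println(sequential(words).equals(mapReduce(words)));
    }
}
```

The reduce step here is associative and commutative, which is what lets a real framework apply it to partitions of the data in parallel and in any order.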