Luke Hoersten is sharing code with you

Bitbucket is a code hosting site. Unlimited public and private repositories. Free for small teams.

Don't show this again

LukeHoersten / Java Disco Worker http://discoproject.org/

Disco (http://discoproject.org) Java worker implementation with a focus on keeping things simple and type-safe. The goal is to allow map-reduce jobs to be submitted and run from a single Java jar.

Clone this repository (size: 154.0 KB): HTTPS / SSH
hg clone https://bitbucket.org/LukeHoersten/java-disco-worker
hg clone ssh://hg@bitbucket.org/LukeHoersten/java-disco-worker

Java Disco Worker overview

Recent commits See more »

Author Revision Comments Message Labels Date
Luke Hoersten d0c6e30341f3 Added jobName to map and reduce function callbacks.
Luke Hoersten f3a1b7f2d740 Added more unit tests.
Luke Hoersten 9d1c2b424ad5 Backed out changeset: d1b43aab2387
Luke Hoersten 5a4310982ea2 Forgot to move unit tests to the discoproject.org namespace.
Luke Hoersten 01be435d1327 Removed json.jar from build script and added some raw util unit tests.

Java Disco Worker

Overview

Disco Java Worker implementation with a focus on keeping things simple and type-safe. The goal is to allow map-reduce jobs to be submitted and run from a single Java jar.

NOTE: This project is still very beta and in active development. I'm operating under the "release early, release often" principle.

Usage

Overview

MapReduce basically has three steps. The first step, called the "client" by Disco, defines the MapReduce "job" (input space and references to the map and reduce functions) and submits the "job" to Disco. The second step is the actual "map" function which runs for each input. The third step is the reduce which takes the results of all the map functions and combines them if needed.

The disco command line tool must be installed on the box used to submit the job.

Implementing

  1. Define the map and reduce phases by implementing the DiscoMapFunction interface which has a map function. This function will do the work to be done in parallel on the cluster. Do the same for DiscoReduceFunction.

  2. Make a main function which creates the DiscoJob. This is the "client".

    1. The DiscoJob describes the input space and takes three arguments:

      • jobName
      • inputs - Each element of this list will have a map instance run.
      • args - Fixed arguments to pass to each map run.
    2. Call setMapFunction on the DiscoJob instance with the class containing the map function to run for each input. Do the same with setReduceFunction.

    3. Submit the job with a call to submit on the DiscoJob instance.

  3. Set the DISCO_MASTER environment variable to point to the master node. For example, something like export DISCO_MASTER=disco://localhost:8989.

  4. Pack the java-disco-worker-*.jar, DiscoMapFunction implementation, and DiscoReduceFunction implementation into the same jar with your client main function. Run the jar with java -cp <your.jar> yourMainClass. The VM args and Classpath will be passed on to the worker as well.

Hacking

Build Dependencies

To-Do List

Generic

  • Reduce phase doesn't correctly handle the dir:// input scheme.
  • Outputs aren't correctly reported as relative to the jobhome.
  • Java JobPack implementation is incomplete. Currently the CLI disco job command is used to submit jobs.

DiscoTaskInputFetcher

  • Doesn't support including and excluding based on inputs which have already been downloaded
  • Doesn't handle fail, retry, or pause messages.
  • Doesn't support failed inputs.
  • Doesn't implement the local filesystem optimization instead of HTTP access when input files are already locally available.

Repositories & Issues

Please submit bugs to the GitHub Issue Tracker