Recent commits
| Revision | Message |
|---|---|
| d0c6e30341f3 | Added jobName to map and reduce function callbacks. |
| f3a1b7f2d740 | Added more unit tests. |
| 9d1c2b424ad5 | Backed out changeset: d1b43aab2387 |
| 5a4310982ea2 | Forgot to move unit tests to the discoproject.org namespace. |
| 01be435d1327 | Removed json.jar from build script and added some raw util unit tests. |
Java Disco Worker
Overview
Disco Java Worker implementation with a focus on keeping things simple and type-safe. The goal is to allow map-reduce jobs to be submitted and run from a single Java jar.
NOTE: This project is still very beta and in active development. I'm operating under the "release early, release often" principle.
Usage
Overview
MapReduce has three steps. The first step, called the "client" by Disco, defines the MapReduce "job" (the input space and references to the map and reduce functions) and submits it to Disco. The second step is the "map" function, which runs once for each input. The third step is the "reduce" function, which takes the results of all the map runs and combines them if needed.
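The three steps can be illustrated without a cluster. The following is a minimal local sketch in plain Java, assuming a word-count job; it does not use the Disco worker API, and the class and method names (`LocalMapReduceSketch`, `map`, `reduce`, `runJob`) are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local illustration of the client / map / reduce steps.
// NOTE: hypothetical names; this is not the java-disco-worker API.
public class LocalMapReduceSketch {

    // "Map" step: runs once per input and counts the words in that input only.
    static Map<String, Integer> map(String input) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: combines the results of all map runs into one result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapResults) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : mapResults) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // "Client" step: defines the input space and wires map and reduce together.
    // On a real cluster the map runs would execute in parallel on worker nodes.
    static Map<String, Integer> runJob(List<String> inputs) {
        List<Map<String, Integer>> mapResults = new ArrayList<>();
        for (String input : inputs) {
            mapResults.add(map(input));
        }
        return reduce(mapResults);
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(runJob(List.of("a b a", "b c")));
    }
}
```

Here each element of `inputs` gets its own map run, which mirrors how each element of a job's input list gets a map instance.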
The disco command line tool must be installed on the box used to submit the job.
Implementing
- Define the map and reduce phases by implementing the `DiscoMapFunction` interface, which has a `map` function. This function will do the work to be done in parallel on the cluster. Do the same for `DiscoReduceFunction`.
- Make a `main` function which creates the `DiscoJob`. This is the "client".
  - The `DiscoJob` describes the input space and takes three arguments:
    - `jobName`
    - `inputs` - Each element of this list will have a map instance run.
    - `args` - Fixed arguments to pass to each map run.
  - Call `setMapFunction` on the `DiscoJob` instance with the class containing the map function to run for each input. Do the same with `setReduceFunction`.
  - Submit the job with a call to `submit` on the `DiscoJob` instance.
- Set the `DISCO_MASTER` environment variable to point to the master node. For example, something like `export DISCO_MASTER=disco://localhost:8989`.
- Pack the `java-disco-worker-*.jar`, `DiscoMapFunction` implementation, and `DiscoReduceFunction` implementation into the same jar with your client `main` function. Run the jar with `java -cp <your.jar> yourMainClass`. The VM args and classpath will be passed on to the worker as well.
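Since submission depends on `DISCO_MASTER` being set, a client `main` could check the variable before building the job. The sketch below is one possible pre-flight check using only the JDK; the class name `DiscoMasterCheck`, the method `parseMaster`, and the requirement that the value contain a host are assumptions made for illustration, not part of this project.

```java
import java.net.URI;

// Hypothetical pre-flight check for the DISCO_MASTER environment variable.
// NOTE: not part of java-disco-worker; names and validation rules are assumptions.
public class DiscoMasterCheck {

    // Parses a value such as disco://localhost:8989 and rejects missing or host-less values.
    static URI parseMaster(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("DISCO_MASTER is not set");
        }
        URI uri = URI.create(value.trim());
        if (uri.getHost() == null) {
            throw new IllegalStateException(
                "DISCO_MASTER should look like disco://host:port, got: " + value);
        }
        return uri;
    }

    public static void main(String[] args) {
        try {
            URI master = parseMaster(System.getenv("DISCO_MASTER"));
            System.out.println("Master node: " + master.getHost() + ":" + master.getPort());
        } catch (IllegalStateException e) {
            // Report the problem instead of attempting to submit a job.
            System.err.println(e.getMessage());
        }
    }
}
```

Failing early like this gives a clearer message than a submission error, but whether it is worth adding depends on how the client is deployed.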
Hacking
Build Dependencies
To-Do List
Generic
- Reduce phase doesn't correctly handle the `dir://` input scheme.
- Outputs aren't correctly reported as relative to the jobhome.
- Java JobPack implementation is incomplete. Currently the CLI `disco job` command is used to submit jobs.
DiscoTaskInputFetcher
- Doesn't support including and excluding based on inputs which have already been downloaded.
- Doesn't handle fail, retry, or pause messages.
- Doesn't support failed inputs.
- Doesn't implement the local filesystem optimization, which would read input files directly instead of over HTTP when they are already available locally.
Repositories & Issues
Please submit bugs to the GitHub Issue Tracker