The Job Pack

The job pack contains all the information needed for creating and running a Disco job.

The first time any task of a job executes on a Disco node, the node retrieves the job pack from the master and unzips The Job Home into a job-specific directory.

See also

The Python disco.job.JobPack class.

File format:

+---------------- 4
| magic / version |
|---------------- 8 -------------- 12 ------------- 16 ------------- 20
| jobdict offset  | jobenvs offset | jobhome offset | jobdata offset |
|------------------------------------------------------------------ 128
|                           ... reserved ...                         |
|--------------------------------------------------------------------|
|                               jobdict                              |
|--------------------------------------------------------------------|
|                               jobenvs                              |
|--------------------------------------------------------------------|
|                               jobhome                              |
|--------------------------------------------------------------------|
|                               jobdata                              |
+--------------------------------------------------------------------+

The current supported jobpack version is 0x0002. Limited support is provided for jobpacks of version 0x0001.
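
A job pack can be assembled with standard tools once its four sections are serialized. The following is a minimal sketch in Python that follows the offset layout above; the big-endian unsigned 32-bit header fields and the magic/version constant are assumptions here, and the authoritative encoding is in disco.job.JobPack.

    import json
    import struct

    HEADER_SIZE = 128       # header plus reserved area ends at byte 128
    MAGIC_VERSION = 0x0002  # placeholder; the real magic constant is defined by disco.job.JobPack

    def pack_jobpack(jobdict, jobenvs, jobhome_zip, jobdata=b''):
        """Concatenate header + jobdict + jobenvs + jobhome + jobdata."""
        jobdict_bytes = json.dumps(jobdict).encode('utf-8')
        jobenvs_bytes = json.dumps(jobenvs).encode('utf-8')

        # Each section starts where the previous one ends.
        jobdict_offset = HEADER_SIZE
        jobenvs_offset = jobdict_offset + len(jobdict_bytes)
        jobhome_offset = jobenvs_offset + len(jobenvs_bytes)
        jobdata_offset = jobhome_offset + len(jobhome_zip)

        header = struct.pack('!5I', MAGIC_VERSION, jobdict_offset,
                             jobenvs_offset, jobhome_offset, jobdata_offset)
        header = header.ljust(HEADER_SIZE, b'\0')   # zero-fill the reserved area

        return header + jobdict_bytes + jobenvs_bytes + jobhome_zip + jobdata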

The Job Dict

The job dict is a JSON dictionary.

jobdict.pipeline

A list of tuples (tuples are lists in JSON), with each tuple specifying a stage in the pipeline in the following form:

stage_name, grouping

Stage names in a pipeline must be unique. grouping must be one of split, group_label, group_all, group_node, or group_node_label.
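
As an illustration, a map/reduce-style pipeline could be expressed as follows; the stage names and grouping choices are invented for the example:

    "pipeline": [["map", "split"],
                 ["map_shuffle", "group_node"],
                 ["reduce", "group_label"]]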

jobdict.input

A list of inputs, with each input specified in a tuple of the following form:

label, size_hint, url_location_1, url_location_2, ...

The label and size_hint are specified as integers, while each url_location is a string. The size_hint indicates the approximate size of this input and is used only to optimize scheduling.
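
As an illustration, an input list with two labeled inputs might look like the following; the labels, size hints, and URLs are invented for the example:

    "input": [[0, 1024, "http://node01/data/part-0", "disco://node02/data/part-0"],
              [1, 2048, "http://node03/data/part-1"]]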

jobdict.worker

The path to the worker binary, relative to the job home. The master will execute this binary after unpacking The Job Home.

jobdict.prefix

String giving the prefix the master should use for assigning a unique job name.

Note

Only characters in [a-zA-Z0-9_] are allowed in the prefix.

jobdict.owner

String name of the owner of the job.

jobdict.save_results

Boolean that, when set to true, tells Disco to save the job results to DDFS. The output of the job is then the DDFS tag name containing the job results.

New in version 0.5.
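
Putting the attributes above together, an illustrative minimal version 0x0002 job dict might look like this; all values are invented for the example:

    {
        "pipeline": [["map", "split"], ["reduce", "group_label"]],
        "input": [[0, 0, "http://node01/data/input-0"]],
        "worker": "job.py",
        "prefix": "WordCount",
        "owner": "disco_user",
        "save_results": false
    }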

Note

The following applies to jobdict attributes in jobpack version 0x0001. Support for this version might be removed in a future release.

jobdict.input

A list of urls or a list of lists of urls. Each url is a string.

Note

An inner list of urls gives replica urls for the same data. This lets you specify redundant versions of an input file. If a list of redundant inputs is specified, the scheduler chooses the input located on the node with the lowest load at the time of scheduling. Redundant inputs are tried one by one until the task succeeds. Using redundant inputs requires that jobdict.map? is set.

Note

In the pipeline model, the label associated with each of these inputs is 0, and all inputs are assumed to have a size_hint of 0.

Deprecated since version 0.5.

jobdict.map?

Boolean telling whether or not this job should have a map phase.

Deprecated since version 0.5.

jobdict.reduce?

Boolean telling whether or not this job should have a reduce phase.

Deprecated since version 0.5.

jobdict.nr_reduces

Non-negative integer that used to tell the master how many reduces to run. Now, if the value is not 1, then the number of reduces actually run by the pipeline depends on the labels output by the tasks in the map stage.

Deprecated since version 0.5.

jobdict.scheduler

A dictionary of scheduler options:
  • max_cores - use at most this many cores (applies to both map and reduce). Default is 2**31.
  • force_local - always run task on the node where input data is located; never use HTTP to access data remotely.
  • force_remote - never run task on the node where input data is located; always use HTTP to access data remotely.

New in version 0.5.2.
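
Assuming these options are passed as a JSON sub-dictionary of the job dict, an illustrative entry might be:

    "scheduler": {"max_cores": 8, "force_local": false, "force_remote": false}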

Job Environment Variables

A JSON dictionary of environment variables (with string keys and string values). The master will set these in the environment before running the jobdict.worker.
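
An illustrative jobenvs section, with variable names and values invented for the example:

    {"PYTHONPATH": "lib", "LC_ALL": "C"}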

The Job Home

The job home directory serialized into ZIP format. The master will unzip this before running the jobdict.worker. The worker is run with this directory as its working directory.

In addition to the worker executable, the job home can be populated with files that the worker needs at runtime, such as shared libraries, helper scripts, or parameter data.

Note

The .disco subdirectory of the job home is reserved by Disco.

The job home is shared by all tasks of the same job on the same node. That is, if the job requires two map task executions and two reduce task executions on a particular node, the job home is unpacked only once on that node, but the worker executable is executed four times in the job home directory, possibly with some of those executions running concurrently. The worker should therefore take care to use unique filenames as needed.
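
A job home can be built with any ZIP tool. The following is a minimal sketch in Python, with the file names invented for the example:

    import io
    import zipfile

    def zip_jobhome(files):
        """Serialize a job home into ZIP format; files maps archive names to local paths."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
            for arcname, path in files.items():
                zf.write(path, arcname)
        return buf.getvalue()

    # e.g. zip_jobhome({'job.py': 'job.py', 'lib/helper.py': 'helper.py'})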

Additional Job Data

Arbitrary data included in the job pack, used by the worker. A running worker can access the job pack at the path specified by jobfile in the response to the TASK message.
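
As a sketch, a worker could locate the job data by reading the header from the jobfile path and seeking to the jobdata offset, again assuming big-endian unsigned 32-bit header fields as in the earlier example:

    import struct

    def read_jobdata(jobfile):
        with open(jobfile, 'rb') as f:
            header = f.read(20)   # magic/version plus the four section offsets
            _magic, _jd, _je, _jh, jobdata_offset = struct.unpack('!5I', header)
            f.seek(jobdata_offset)
            return f.read()       # jobdata extends to the end of the file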

Creating and submitting a Job Pack

The jobpack can be constructed and submitted using the disco job command.