The Job Pack¶
The job pack contains all the information needed for creating and running a Disco job.
The first time any task of a job executes on a Disco node, the job pack for the job is retrieved from the master, and the Job Home is unzipped into a job-specific directory.
See also
The Python disco.job.JobPack class.
File format:
+---------------- 4
| magic / version |
|---------------- 8 -------------- 12 ------------- 16 ------------- 20
| jobdict offset | jobenvs offset | jobhome offset | jobdata offset |
|------------------------------------------------------------------ 128
| ... reserved ... |
|--------------------------------------------------------------------|
| jobdict |
|--------------------------------------------------------------------|
| jobenvs |
|--------------------------------------------------------------------|
| jobhome |
|--------------------------------------------------------------------|
| jobdata |
+--------------------------------------------------------------------+
The currently supported jobpack version is 0x0002. Limited support is provided for jobpacks of version 0x0001.
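As an illustration, a job pack can be split into its four sections by reading the header offsets. This is a minimal sketch, assuming the magic/version field and the four offsets are 32-bit big-endian unsigned integers (as the byte positions 4, 8, 12, 16 and 20 in the diagram suggest); the function name and dictionary keys are not part of Disco's API:

import struct

def split_jobpack(data):
    # The first five header fields: magic/version plus the four
    # section offsets (assumed to be 32-bit big-endian unsigned integers).
    _magic_version, jobdict_off, jobenvs_off, jobhome_off, jobdata_off = \
        struct.unpack('!IIIII', data[:20])
    return {'jobdict': data[jobdict_off:jobenvs_off],   # JSON job dict
            'jobenvs': data[jobenvs_off:jobhome_off],   # JSON environment dict
            'jobhome': data[jobhome_off:jobdata_off],   # zipped job home
            'jobdata': data[jobdata_off:]}              # additional job data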
The Job Dict¶
The job dict is a JSON dictionary.
jobdict.pipeline¶
A list of tuples (tuples are lists in JSON), with each tuple specifying a stage in the pipeline in the following form:
(stage_name, grouping)
Stage names in a pipeline have to be unique. The grouping should be one of split, group_label, group_all, group_node and group_node_label.
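For example, a simple two-stage pipeline could be expressed in the job dict as follows (the stage names here are purely illustrative):

"pipeline": [["map", "split"],
             ["reduce", "group_label"]]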
jobdict.input¶
A list of inputs, with each input specified in a tuple of the following form:
(label, size_hint, url_location_1, url_location_2, ...)
The label and size_hint are specified as integers, while each url_location is a string. The size_hint is a hint indicating the size of this input, and is only used to optimize scheduling.
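For instance, an input with label 0, a size hint of one megabyte and two replica locations could be written as (the URLs are purely illustrative):

"input": [[0, 1048576,
           "http://node1:8989/data/part-0",
           "http://node2:8989/data/part-0"]]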
jobdict.worker¶
The path to the worker binary, relative to the job home. The master will execute this binary after it unpacks the Job Home.
jobdict.prefix¶
String giving the prefix the master should use for assigning a unique job name.
Note
Only characters in [a-zA-Z0-9_] are allowed in the prefix.
jobdict.save_results¶
Boolean that when set to true tells Disco to save the job results to DDFS. The output of the job is then the DDFS tag name containing the job results.
New in version 0.5.
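Putting the version 0x0002 attributes together, a minimal job dict could look roughly like this (stage names, paths and URLs are illustrative only):

{"pipeline": [["map", "split"], ["reduce", "group_label"]],
 "input": [[0, 0, "http://node1:8989/data/part-0"]],
 "worker": "worker.py",
 "prefix": "ExampleJob",
 "save_results": false}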
Note
The following applies to jobdict attributes in jobpack version 0x0001. Support for this version might be removed in a future release.
- jobdict.input
A list of urls or a list of lists of urls. Each url is a string.
Note
An inner list of urls gives replica urls for the same data. This lets you specify redundant versions of an input file. If a list of redundant inputs is specified, the scheduler chooses the input that is located on the node with the lowest load at the time of scheduling. Redundant inputs are tried one by one until the task succeeds. Redundant inputs require that map? is specified.
Note
In the pipeline model, the label associated with each of these inputs is 0, and all inputs are assumed to have a size_hint of 0.
Deprecated since version 0.5.
- jobdict.map?
Boolean telling whether or not this job should have a map phase.
Deprecated since version 0.5.
- jobdict.reduce?
Boolean telling whether or not this job should have a reduce phase.
Deprecated since version 0.5.
- jobdict.nr_reduces¶
Non-negative integer that used to tell the master how many reduces to run. Now, if the value is not 1, the number of reduces actually run by the pipeline depends on the labels output by the tasks in the map stage.
Deprecated since version 0.5.
- jobdict.scheduler¶
  - max_cores - use at most this many cores (applies to both map and reduce). Default is 2**31.
  - force_local - always run task on the node where input data is located; never use HTTP to access data remotely.
  - force_remote - never run task on the node where input data is located; always use HTTP to access data remotely.
  New in version 0.5.2.
Job Environment Variables¶
A JSON dictionary of environment variables (with string keys and string values). The master will set these in the environment before running the jobdict.worker.
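For example, a worker that ships its own Python modules in the job home might be given an environment dictionary such as (the values are illustrative only):

{"PYTHONPATH": "lib",
 "LC_ALL": "C"}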
The Job Home¶
The job home directory serialized into ZIP format. The master will unzip this before running the jobdict.worker. The worker is run with this directory as its working directory.
In addition to the worker executable, the job home can be populated with files that are needed at runtime by the worker. These could either be shared libraries, helper scripts, or parameter data.
Note
The .disco subdirectory of the job home is reserved by Disco.
The job home is shared by all tasks of the same job on the same node. That is, if the job requires two map task executions and two reduce task executions on a particular node, then the job home will be unpacked only once on that node, but the worker executable will be executed four times in the job home directory, and it is also possible for some of these executions to be concurrent. Thus, the worker should take care to use unique filenames as needed.
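As a sketch of how a job home could be put together on the client side, the following builds a ZIP archive containing a worker executable and a helper module; the file names are hypothetical, and this is not the disco.job.JobPack interface:

import zipfile

def build_jobhome(archive_path):
    # Zip up everything the worker needs at runtime; the resulting
    # archive becomes the jobhome section of the job pack.
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.write('worker.py', 'worker.py')      # referenced by jobdict.worker
        zf.write('lib/util.py', 'lib/util.py')  # helper module for the worker

build_jobhome('jobhome.zip')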
Additional Job Data¶
Arbitrary data included in the job pack, used by the worker.
A running worker can access the job pack at the path specified by jobfile in the response to the TASK message.
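Assuming the jobfile path has been obtained from the TASK response, a worker could recover just the jobdata section by reading the corresponding header offset, as in the sketch below (none of these names are part of Disco's worker protocol):

import struct

def read_jobdata(jobfile):
    # The jobdata offset is the fifth 32-bit big-endian header field
    # (bytes 16-20); everything from there to the end is jobdata.
    with open(jobfile, 'rb') as f:
        jobdata_off = struct.unpack('!IIIII', f.read(20))[4]
        f.seek(jobdata_off)
        return f.read()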
Creating and submitting a Job Pack¶
The jobpack can be constructed and submitted using the disco job command.
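The binary layout described above can also be assembled by hand. The following sketch packs the four sections behind a 128-byte header of 32-bit big-endian fields; the exact encoding of the magic/version field is not specified here, so it is left as a parameter (see the disco.job.JobPack class for the authoritative implementation):

import json
import struct

HEADER_SIZE = 128  # header is reserved up to byte 128 (see the diagram above)

def build_jobpack(magic_version, jobdict, jobenvs, jobhome_zip, jobdata):
    # Serialize the two JSON sections and compute each section's offset.
    sections = [json.dumps(jobdict).encode('utf-8'),
                json.dumps(jobenvs).encode('utf-8'),
                jobhome_zip,   # bytes of the zipped job home
                jobdata]       # arbitrary additional job data (bytes)
    offsets, offset = [], HEADER_SIZE
    for body in sections:
        offsets.append(offset)
        offset += len(body)
    header = struct.pack('!IIIII', magic_version, *offsets)
    header += b'\x00' * (HEADER_SIZE - len(header))  # reserved area, zero-filled
    return header + b''.join(sections)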