disco.worker.pipeline – Pipeline Disco Worker Interface¶
disco.worker.pipeline.worker – Pipeline Disco Runtime Environment¶
When a Job is constructed using the
Worker defined in this
module, Disco runs the
disco.worker.pipeline.worker module for
every job task. This module reconstructs the
Worker on the
node where it is run, in order to execute the pipeline of
stages that were specified in it.
Stage(name='', init=None, process=None, done=None, input_hook=<function input_hook>, input_chain=, output_chain=, combine=False, sort=False)¶
The name argument specifies the name of the stage. The init argument, if specified, is the first function called in the task. This function initializes the state for the task, and this state is passed to the other task entry points.
The process function is the main entry of the task. It is called once for every input to the task. The order of the inputs can be controlled to some extent by the input_hook function, which allows the user to specify the order in which the input labels should be iterated over.
The done function, if specified, is called once after all inputs have been passed to the process function.
Hence, the order of invocation of the task entry points of a stage are: input_hook, init, process, and done, where init and done are optional and called only once, while process is called once for every task input.
- name (string) – the name of the stage
- init (func) – a function that initializes the task state for later entry points.
- func – a function that gets called once for every input
- done (func) – a function that gets called once after all the inputs have been processed by process
- input_hook (func) – a function that is passed the input labels for the task, and returns the sequence controlling the order in which the labels should be passed to the process function.
- input_chain (list of
input_stream()functions) – this sequence of input_streams controls how input files are parsed for the process function
- output_chain (list of
output_stream()functions) – this sequence of output_streams controls how data output from the process function is marshalled into byte sequences persisted in output files.
TaskInfo(jobname, host, stage, group, label)¶
Alias for field number 3
Alias for field number 1
Alias for field number 0
Alias for field number 4
Alias for field number 2
disco.pipeline.Worker, which implements the pipelined Disco job model.
Parameters: pipeline (list of pairs) – sequence of pairs of grouping operations and stage objects. See Pipeline Data Flow in Disco Jobs.
The default input label hook for a stage does no re-ordering of the labels.