disco.worker.pipeline – Pipeline Disco Worker Interface

disco.worker.pipeline.worker – Pipeline Disco Runtime Environment

When a Job is constructed using the Worker defined in this module, Disco runs the disco.worker.pipeline.worker module for every job task. This module reconstructs the Worker on the node where it is run, in order to execute the pipeline of stages that were specified in it.

class disco.worker.pipeline.worker.DiscoTask

DiscoTask(output,)

output

Alias for field number 0

class disco.worker.pipeline.worker.Stage(name='', init=None, process=None, done=None, input_hook=function input_hook, input_chain=, []output_chain=, []combine=False, sort=False)

A Stage specifies various entry points for a task in the stage.

The name argument specifies the name of the stage. The init argument, if specified, is the first function called in the task. This function initializes the state for the task, and this state is passed to the other task entry points.

The process function is the main entry of the task. It is called once for every input to the task. The order of the inputs can be controlled to some extent by the input_hook function, which allows the user to specify the order in which the input labels should be iterated over.

The done function, if specified, is called once after all inputs have been passed to the process function.

Hence, the order of invocation of the task entry points of a stage are: input_hook, init, process, and done, where init and done are optional and called only once, while process is called once for every task input.

Parameters:
  • name (string) – the name of the stage
  • init (func) – a function that initializes the task state for later entry points.
  • func – a function that gets called once for every input
  • done (func) – a function that gets called once after all the inputs have been processed by process
  • input_hook (func) – a function that is passed the input labels for the task, and returns the sequence controlling the order in which the labels should be passed to the process function.
  • input_chain (list of input_stream() functions) – this sequence of input_streams controls how input files are parsed for the process function
  • output_chain (list of output_stream() functions) – this sequence of output_streams controls how data output from the process function is marshalled into byte sequences persisted in output files.
class disco.worker.pipeline.worker.TaskInfo

TaskInfo(jobname, host, stage, group, label)

group

Alias for field number 3

host

Alias for field number 1

jobname

Alias for field number 0

label

Alias for field number 4

stage

Alias for field number 2

class disco.worker.pipeline.worker.Worker(**kwargs)

A disco.pipeline.Worker, which implements the pipelined Disco job model.

Parameters:pipeline (list of pairs) – sequence of pairs of grouping operations and stage objects. See Pipeline Data Flow in Disco Jobs.
jobdict(job, **jobargs)

Creates The Job Dict for the Worker.

Makes use of the following parameters, in addition to those defined by the Worker itself:

Uses getitem() to resolve the values of parameters.

Returns:the job dict.
disco.worker.pipeline.worker.input_hook(state, input_labels)

The default input label hook for a stage does no re-ordering of the labels.