Glossary

blob

An arbitrary file stored in the Disco Distributed Filesystem.

See also Blobs.

client
The program which submits a job to the master.
data locality

Performing computation over a set of data near where the data is located. Disco preserves data locality whenever possible, since transferring data over a network can be prohibitively expensive when operating on massive amounts of data.

See locality of reference.

DDFS
See Disco Distributed Filesystem.
Erlang
See Erlang.
garbage collection (GC)
DDFS is a tag-based filesystem, which means that a given blob can be addressed via multiple tags. A blob can therefore only be deleted once the last tag referring to it is deleted. DDFS uses a garbage collection procedure to detect and delete such unreferenced data.
grouping

A grouping operation is performed on the inputs to a stage; each resulting group becomes the input to a single task in that stage. A grouping operation is what connects two adjacent stages in a Disco pipeline.

The possible grouping operations are split, group_node, group_label, group_node_label, and group_all; each is defined below, and a short pipeline sketch illustrating them follows the split entry.

group_all

A grouping operation that groups all the inputs to a stage into a single group, regardless of the labels and nodes of the inputs.

This grouping is typically used to define reduce stages that contain a single reduce task.

group_label

A grouping operation that groups all the inputs with the same label into a single group, regardless of the nodes the inputs reside on. Thus, the number of tasks that run in a group_label stage is controlled by the number of labels generated by the tasks in the previous stage.

This grouping is typically used to define reduce stages that contain a reduce task for each label.

group_node

A grouping operation that groups all the inputs on the same node into a single group, regardless of the labels of the inputs. Thus, the number of tasks that run in a group_node stage depends on the number of distinct cluster nodes on which the output-producing tasks of the previous stage executed.

This grouping can be used to condense the intermediate data that a stage's tasks generate on a cluster node, reducing the network traffic needed to transfer that data to the tasks in the subsequent stage.

This grouping is typically used to define shuffle stages.

group_node_label

A grouping operation that groups all the inputs with the same label on the same node into a single group.

This grouping can be used to condense the intermediate data that a stage's tasks generate on a cluster node, reducing the network traffic needed to transfer that data to the tasks in the subsequent stage.

This grouping is typically used to define shuffle stages.

split

A grouping operation that puts each input into its own group, regardless of its label or the node it resides on. Thus, the number of tasks that run in a split stage is equal to the number of inputs to that stage.

This grouping is typically used to define map stages.
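
To make these grouping operations concrete, here is a minimal word-count sketch modeled on Disco's pipeline worker (disco.worker.pipeline.worker). The stage names, function names, and exact process signature are illustrative assumptions based on that worker's documented API, not a definitive implementation:

    from disco.core import Job
    from disco.worker.pipeline.worker import Worker, Stage

    def map_words(interface, state, label, inp):
        # Emit every word with label 0; labels steer the next grouping.
        out = interface.output(0)
        for line in inp:
            for word in str(line).split():
                out.add(word, 1)

    def sum_counts(interface, state, label, inp):
        # group_label runs one task per distinct label; here all
        # words share label 0, so a single reduce task sums them.
        out = interface.output(0)
        totals = {}
        for word, count in inp:
            totals[word] = totals.get(word, 0) + count
        for word, total in totals.items():
            out.add(word, total)

    class WordCount(Job):
        worker = Worker()
        # 'split': one map task per input; 'group_label': one reduce
        # task per label produced by the map stage.
        pipeline = [("split", Stage("map", process=map_words)),
                    ("group_label", Stage("reduce", process=sum_counts))]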

immutable
See immutable object.
job

A set of map and/or reduce tasks, coordinated by the Disco master. When the master receives a disco.job.JobPack, it assigns a unique name to the job and assigns the tasks to workers until they are all completed.

See also disco.job

job functions
Job functions are the functions that the user can specify for a disco.worker.classic.worker. For example, disco.worker.classic.func.map(), disco.worker.classic.func.reduce(), disco.worker.classic.func.combiner(), and disco.worker.classic.func.partition() are job functions.
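
For example, a minimal word-count job wires these functions together. This mirrors the canonical example from the Disco tutorial; the input URL is a placeholder:

    from disco.core import Job, result_iterator

    def map(line, params):
        # Called once per input line; emits (word, 1) pairs.
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        # kvgroup groups a sorted stream of (key, value) pairs by key.
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://example.com/text.txt"],
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait()):
            print(word, count)
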
job dict

The first field in a job pack, which contains parameters needed by the master for job execution.

See also The Job Dict and disco.job.JobPack.jobdict.

job home

The working directory in which a worker is executed. The master creates the job home from a job pack, by unzipping the contents of its jobhome field.

See also The Job Home and disco.job.JobPack.jobhome.

job pack

The packed contents sent to the master when submitting a new job. Includes the job dict and job home, among other things.

See also The Job Pack and disco.job.JobPack.

JSON

JavaScript Object Notation.

See Introducing JSON.

label
Each output file created by a task is annotated with an integer label chosen by the task. This label is used by grouping operations in the pipeline.
map

The first phase of a conventional mapreduce job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed.

Also refers to an individual task in this phase, which produces records that may be partitioned and reduced. Generally there is one map task per input.

mapreduce

A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality. The framework handles these issues so that jobs can focus on what is specific to their application.

See MapReduce.

master

Distributed core that takes care of managing jobs, garbage collection for DDFS, and other central processes.

See also Technical Overview.

partitioning
The process of dividing output records into a set of labeled bins, much like tags in DDFS. Typically, the output of map is partitioned, and each reduce operates on a single partition.
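
As an illustration, a partition job function hashes each key into one of the labeled bins. This sketch mirrors the classic worker's default behavior but is not necessarily the exact built-in:

    def partition(key, nr_partitions, params):
        # Map each key to one of nr_partitions labeled bins; all
        # records sharing a key land in the same partition.
        return hash(str(key)) % nr_partitions
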
pid

A process identifier. In Disco this usually refers to the worker pid.

See process identifier.

pipeline
The structure of a Disco job as a linear sequence of stages.
reduce

The last phase of a conventional mapreduce job, in which non-local computation is usually performed.

Also refers to an individual task in this phase, which usually has access to all values for a given key produced by the map phase. Grouping data for reduce is achieved via partitioning.

replica
Multiple copies (replicas) of blobs are stored on different cluster nodes, so that blobs remain available despite a small number of nodes going down.
re-replication
When a node goes down, the system tries to create additional replicas to replace the copies that were lost with that node.
SSH

Network protocol used by Erlang to start slaves.

See SSH.

shuffle

The implicit middle phase of a conventional mapreduce job, in which, for each label, all the inputs with that label generated by the tasks in a map stage are combined into a single logical input for a reduce task.

This phase typically creates intensive network activity between the cluster nodes. In a Disco pipeline, this load can be reduced by judicious use of node-local grouping operations, which condense the intermediate data generated on a node before it is transmitted across the network.

slave

The process started by the Erlang slave module.

See also Technical Overview.

stage
A stage consists of a task definition and a grouping operation. The grouping operation is performed on the inputs of the stage; each resulting group becomes the input to a single task.
stdin

The standard input file descriptor. The master responds to the worker over stdin; see the protocol sketch after the stderr entry.

See standard streams.

stdout

The standard output file descriptor. Initially redirected to stderr for a Disco worker.

See standard streams.

stderr

The standard error file descriptor. The worker sends messages to the master over stderr.

See standard streams.
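
The three standard streams carry a line-based message protocol between worker and master. Below is a minimal sketch of one exchange, assuming the "<name> <payload-len> <json-payload>" framing described in the Disco worker protocol documentation; the version string is illustrative:

    import json, os, sys

    def send(name, payload):
        # Worker -> master on stderr: "<name> <payload-len> <payload>\n"
        body = json.dumps(payload)
        sys.stderr.write("%s %d %s\n" % (name, len(body), body))
        sys.stderr.flush()
        # Master -> worker reply arrives on stdin in the same framing.
        reply = sys.stdin.readline().split(' ', 2)
        return reply[0], json.loads(reply[2])

    # Announce the worker to the master before requesting a task.
    send("WORKER", {"version": "1.1", "pid": os.getpid()})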

tag

A labeled collection of data in DDFS.

See also Tags.
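
Tags can also be manipulated from Python. A minimal sketch assuming the disco.ddfs.DDFS client API, with a hypothetical tag name and file:

    from disco.ddfs import DDFS

    ddfs = DDFS()  # talks to the master named in the Disco settings
    # Store a local file as a blob and reference it from a tag.
    ddfs.push('data:words', ['./words.txt'])
    # Each entry of the tag is a list of replica URLs for one blob.
    for replicas in ddfs.blobs('data:words'):
        print(replicas)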

task

A task is essentially a unit of work provided to a worker.

See also disco.task.

worker

A worker is responsible for carrying out a task. A Disco job specifies the executable that is the worker. Workers are scheduled to run on nodes close to the data they are supposed to process.

ZIP

Archive/compression format, used e.g. for the job home.

See ZIP.