Glossary
- blob
An arbitrary file stored in Disco Distributed Filesystem.
See also Blobs.
- client
The program which submits a job to the master.
- data locality
Performing computation over a set of data near where the data is located. Disco preserves data locality whenever possible, since transferring data over a network can be prohibitively expensive when operating on massive amounts of data.
- DDFS
See Disco Distributed Filesystem.
- Erlang
See Erlang.
- garbage collection (GC)
DDFS has a tag-based filesystem, which means that a given blob can be addressed via multiple tags. Consequently, a blob can only be deleted once the last reference to it is removed. DDFS uses a garbage collection procedure to detect and delete such unreferenced data.
- grouping
A grouping operation is performed on the inputs to a stage; each resulting group becomes the input to a single task in that stage. A grouping operation is what connects two adjacent stages in a Disco pipeline.
The possible grouping operations are split, group_node, group_label, group_node_label, and group_all; a sketch of their semantics follows the split entry below.
- group_all
A grouping operation that groups all the inputs to a stage into a single group, regardless of the labels and nodes of the inputs.
This grouping is typically used to define reduce stages that contain a single reduce task.
- group_label
A grouping operation that groups all the inputs with the same label into a single group, regardless of the nodes the inputs reside on. Thus, the number of tasks that run in a group_label stage is controlled by the number of labels generated by the tasks in the previous stage.
This grouping is typically used to define reduce stages that contain a reduce task for each label.
- group_node
A grouping operation that groups all the inputs on the same node into a single group, regardless of the labels of the inputs. Thus, the number of tasks that run in a group_node stage depends on the number of distinct cluster nodes on which the output-producing tasks of the previous stage ran.
This grouping can be used to condense the intermediate data generated on a cluster node by the tasks in a stage, in order to reduce the potential network resources used to transfer this data across the cluster to the tasks in the subsequent stage.
This grouping is typically used to define shuffle stages.
- group_node_label
A grouping operation that groups all the inputs with the same label on the same node into a single group.
This grouping can be used to condense the intermediate data generated on a cluster node by the tasks in a stage, in order to reduce the potential network resources used to transfer this data across the cluster to the tasks in the subsequent stage.
This grouping is typically used to define shuffle stages.
- split
A grouping operation that groups each single input into its own group, regardless of its label or the node it resides on. Thus, the number of tasks that run in a split stage is equal to the number of inputs to that stage.
This grouping is typically used to define map stages.
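The five grouping operations differ only in which attributes of an input (its label, the node it resides on, both, or neither) determine its group. The sketch below is purely illustrative Python, not part of the Disco API; the (node, label, data) tuples and helper names are hypothetical:

```python
from collections import defaultdict

# Hypothetical inputs: (node, label, data) triples describing the outputs
# of a previous stage. These names are illustrative, not Disco's API.
inputs = [
    ("node1", 0, "a"), ("node1", 1, "b"),
    ("node2", 0, "c"), ("node2", 1, "d"),
]

def group(inputs, key):
    """Group inputs by the given key function; each group feeds one task."""
    groups = defaultdict(list)
    for node, label, data in inputs:
        groups[key(node, label)].append(data)
    return dict(groups)

# split: one group (and hence one task) per input.
split = {i: [data] for i, (_, _, data) in enumerate(inputs)}

# group_label: one group per distinct label, across all nodes.
group_label = group(inputs, lambda node, label: label)

# group_node: one group per node, regardless of label.
group_node = group(inputs, lambda node, label: node)

# group_node_label: one group per (node, label) pair.
group_node_label = group(inputs, lambda node, label: (node, label))

# group_all: a single group containing every input.
group_all = group(inputs, lambda node, label: "all")

print(group_label)  # {0: ['a', 'c'], 1: ['b', 'd']} -> two reduce tasks
```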
- immutable
See immutable object.
- job
A set of map and/or reduce tasks, coordinated by the Disco master. When the master receives a disco.job.JobPack, it assigns a unique name for the job, and assigns the tasks to workers until they are all completed.
See also disco.job.
- job functions
Job functions are the functions that the user can specify for a disco.worker.classic.worker. For example, disco.worker.classic.func.map(), disco.worker.classic.func.reduce(), disco.worker.classic.func.combiner(), and disco.worker.classic.func.partition() are job functions.
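For a concrete picture of these job functions in use, here is a minimal word-count sketch modelled on the Disco tutorial; it assumes a running Disco master, and the input URL is a placeholder:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit a (word, 1) record for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Sum the counts for each word; kvgroup expects sorted input.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # The input is a placeholder; any URL or DDFS tag reachable by the
    # cluster would do.
    job = Job().run(input=["http://example.com/input.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```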
- job dict
The first field in a job pack, which contains parameters needed by the master for job execution.
See also The Job Dict and disco.job.JobPack.jobdict.
- job home
The working directory in which a worker is executed. The master creates the job home from a job pack, by unzipping the contents of its jobhome field.
See also The Job Home and disco.job.JobPack.jobhome.
- job pack
The packed contents sent to the master when submitting a new job. Includes the job dict and job home, among other things.
See also The Job Pack and disco.job.JobPack.
- JSON
JavaScript Object Notation.
See Introducing JSON.
- label
Each output file created by a task is annotated with an integer label chosen by the task. This label is used by grouping operations in the pipeline.
- map
The first phase of a conventional mapreduce job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed.
Also refers to an individual task in this phase, which produces records that may be partitioned and reduced. Generally there is one map task per input.
- mapreduce
A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality. The framework handles these issues so that jobs can focus on what is specific to their application.
See MapReduce.
- master
Distributed core that takes care of managing jobs, garbage collection for DDFS, and other central processes.
See also Technical Overview.
- partitioning
The process of dividing output records into a set of labeled bins, much like tags in DDFS. Typically, the output of map is partitioned, and each reduce operates on a single partition.
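A typical partitioning function simply hashes the record key into one of nr_partitions labeled bins. The sketch below is illustrative only; its signature mirrors the classic worker's partition() job function listed above, but the hashing choice is an assumption:

```python
import hashlib

def partition(key, nr_partitions, params):
    """Illustrative partition function: map a key to a labeled bin.

    The signature follows the classic worker's partition job function;
    a stable hash keeps the assignment consistent across processes.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % nr_partitions

# Records with the same key always land in the same partition (label),
# so a single reduce task later sees all values for that key.
assert partition("disco", 8, None) == partition("disco", 8, None)
```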
- pid
A process identifier. In Disco this usually refers to the worker pid.
See process identifier.
- pipeline
The structure of a Disco job as a linear sequence of stages.
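As a mental model only (this is not the disco.worker.pipeline API), a pipeline can be pictured as an ordered list pairing each stage's grouping operation with the work its tasks perform:

```python
# Illustrative only: a map/shuffle/reduce-shaped pipeline expressed as a
# linear sequence of (grouping, task function) pairs. The function names
# are hypothetical placeholders.
def map_task(inputs): ...
def condense_task(inputs): ...
def reduce_task(inputs): ...

pipeline = [
    ("split", map_task),            # one task per input
    ("group_node", condense_task),  # condense per-node intermediate data
    ("group_label", reduce_task),   # one reduce task per label
]
```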
- reduce
The last phase of a conventional mapreduce job, in which non-local computation is usually performed.
Also refers to an individual task in this phase, which usually has access to all values for a given key produced by the map phase. Grouping data for reduce is achieved via partitioning.
- replica
Multiple copies (or replicas) of blobs are stored on different cluster nodes so that blobs remain available in spite of a small number of nodes going down.
- re-replication
When a node goes down, the system tries to create additional replicas to replace the copies that were lost with that node.
- SSH
Network protocol used by Erlang to start slaves.
See SSH.
- shuffle
The implicit middle phase of a conventional mapreduce job, in which a single logical input for a reduce task is created for each label from all the inputs with that label generated by the tasks in a map stage.
This phase typically creates intensive network activity between the cluster nodes. In a Disco pipeline, this load can be reduced by judicious use of node-local grouping operations, which condense the intermediate data generated on a node before it is transmitted across the network.
- slave
The process started by the Erlang slave module.
See also Technical Overview.
- stage
A stage consists of a task definition and a grouping operation. The grouping operation is performed on the inputs of a stage; each resulting input group becomes the input to a single task.
- stdin
The standard input file descriptor. The master responds to the worker over stdin.
See standard streams.
- stdout
The standard output file descriptor. Initially redirected to stderr for a Disco worker.
See standard streams.
- stderr
The standard error file descriptor. The worker sends messages to the master over stderr.
See standard streams.
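As a rough illustration of how these streams are used together, the sketch below frames a message the way The Disco Worker Protocol describes (a name, the payload length, and a JSON payload on one line); the exact framing and the WORKER announcement shown here are assumptions to check against that document:

```python
import json
import os
import sys

def send(name, payload):
    # Assumed framing: "<name> <payload-length> <payload>\n", sent over stderr.
    body = json.dumps(payload)
    sys.stderr.write("%s %d %s\n" % (name, len(body), body))
    sys.stderr.flush()

# Announce the worker to the master; the master replies over stdin.
send("WORKER", {"version": "1.1", "pid": os.getpid()})
reply = sys.stdin.readline()  # e.g. an OK-style response from the master
```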
- tag
A labeled collection of data in DDFS.
See also Tags.
- task
A task is essentially a unit of work provided to a worker.
See also disco.task.
- worker
A worker is responsible for carrying out a task. A Disco job specifies the executable that is the worker. Workers are scheduled to run on the nodes, close to the data they are supposed to be processing.
See also
- ZIP
Archive/compression format, used e.g. for the job home.
See ZIP.