Glossary

client
The program which submits a job to the master.
blob

An arbitrary file stored in Disco Distributed Filesystem.

See also Blobs.

data locality

Performing computation over a set of data near where the data is located. Disco preserves data locality whenever possible, since transferring data over a network can be prohibitively expensive when operating on massive amounts of data.

See locality of reference.

DDFS
See Disco Distributed Filesystem.
Erlang
See Erlang.
garbage collection (GC)
DDFS has a tag-based filesystem, which means that a given blob may be addressed via multiple tags. As a result, a blob can only be deleted once the last reference to it is removed. DDFS uses a garbage collection procedure to detect and delete such unreferenced data.
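The idea can be illustrated with a toy in-memory model (the names and data structures here are hypothetical; DDFS's real GC operates over distributed storage): tags map to lists of blobs, and any blob reachable from no tag is garbage.

```python
# A minimal sketch of tag-based garbage collection, assuming a toy
# in-memory model where tags map tag names to lists of blob names.
# (Illustrative only; DDFS's actual GC works over distributed storage.)

def unreferenced_blobs(tags, blobs):
    """Return the blobs not reachable from any tag."""
    referenced = {blob for blob_list in tags.values() for blob in blob_list}
    return set(blobs) - referenced

tags = {
    "data:logs": ["blob-1", "blob-2"],
    "data:archive": ["blob-2"],   # blob-2 is referenced by two tags
}
blobs = ["blob-1", "blob-2", "blob-3"]

# blob-3 has no remaining tag references, so GC may delete it:
print(unreferenced_blobs(tags, blobs))  # {'blob-3'}
```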
immutable
See immutable object.
map

The first phase of a job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed.

Also refers to an individual task in this phase, which produces records that may be partitioned and reduced. Generally there is one map task per input.

master

Distributed core that takes care of managing jobs, garbage collection for DDFS, and other central processes.

See also Technical Overview.

job

A set of map and/or reduce tasks, coordinated by the Disco master. When the master receives a disco.job.JobPack, it assigns a unique name for the job, and assigns the tasks to workers until they are all completed.

See also disco.job.

job functions
Job functions are the functions that the user can specify for a disco.worker.classic.worker. For example, disco.worker.classic.func.map(), disco.worker.classic.func.reduce(), disco.worker.classic.func.combiner(), and disco.worker.classic.func.partition() are job functions.
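As a hedged sketch of the shapes involved, the canonical word-count job functions can be run as plain Python, with no cluster: a map function yields (key, value) records and a reduce function consumes them grouped by key. Here itertools.groupby stands in for Disco's own grouping helpers, and the driving code at the bottom only simulates the data flow between one map task and one reduce task.

```python
from itertools import groupby

def fun_map(line, params):
    # one (word, 1) record per word, as in the word-count example
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # group sorted records by key, then sum the counts per key
    for word, counts in groupby(sorted(iter), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in counts)

# Simulate one map task feeding one reduce task:
mapped = [kv for line in ["a b a", "b c"] for kv in fun_map(line, None)]
print(dict(fun_reduce(mapped, None)))  # {'a': 2, 'b': 2, 'c': 1}
```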
job dict

The first field in a job pack, which contains parameters needed by the master for job execution.

See also The Job Dict and disco.job.JobPack.jobdict.

job home

The working directory in which a worker is executed. The master creates the job home from a job pack, by unzipping the contents of its jobhome field.

See also The Job Home and disco.job.JobPack.jobhome.

job pack

The packed contents sent to the master when submitting a new job. Includes the job dict and job home, among other things.

See also The Job Pack and disco.job.JobPack.

JSON

JavaScript Object Notation.

See Introducing JSON.

mapreduce

A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality. The framework handles these issues so that jobs can focus on what is specific to their application.

See MapReduce.

partitioning
The process of dividing output records into a set of labelled bins, much like tags in DDFS. Typically, the output of map is partitioned, and each reduce operates on a single partition.
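A partition function of this kind can be sketched as follows, assuming the classic-worker shape of (key, number of partitions, params) returning a bin label; a stable hash (crc32) is used here because Python's built-in hash() is randomized for strings across processes.

```python
from zlib import crc32

def partition(key, nr_partitions, params):
    # map a key to a stable bin label in [0, nr_partitions)
    return crc32(str(key).encode()) % nr_partitions

# Every record with the same key lands in the same bin,
# so each reduce sees all values for its keys:
assert partition("apple", 4, None) == partition("apple", 4, None)
assert 0 <= partition("apple", 4, None) < 4
```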
pid

A process identifier. In Disco this usually refers to the worker pid.

See process identifier.

reduce

The last phase of a job, in which non-local computation is usually performed.

Also refers to an individual task in this phase, which usually has access to all values for a given key produced by the map phase. Grouping data for reduce is achieved via partitioning.

replica
Multiple copies (or replicas) of blobs are stored on different cluster nodes, so that blobs remain available in spite of a small number of nodes going down.
re-replication
When a node goes down, the system creates additional replicas to replace the copies that were lost with that node.
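The process can be sketched with a toy placement model (the node names, data structures, and placement policy here are illustrative, not DDFS's actual algorithm): after a node is lost, each blob that dropped below the desired replica count gets a new copy on a live node that does not already hold one.

```python
def rereplicate(placement, lost_node, nodes, nr_replicas):
    """placement: blob name -> set of nodes holding a replica."""
    live = [n for n in nodes if n != lost_node]
    for blob, holders in placement.items():
        holders.discard(lost_node)      # replicas on the lost node are gone
        for node in live:
            if len(holders) >= nr_replicas:
                break
            if node not in holders:
                holders.add(node)       # create a replacement replica
    return placement

placement = {"blob-1": {"n1", "n2", "n3"}, "blob-2": {"n1", "n4", "n5"}}
rereplicate(placement, "n1", ["n1", "n2", "n3", "n4", "n5"], 3)
# both blobs are back to 3 replicas, none of them on the lost node n1
```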
SSH

Network protocol used by Erlang to start slaves.

See SSH.

slave

The process started by the Erlang slave module.

See also Technical Overview.

stdin

The standard input file descriptor. The master responds to the worker over stdin.

See standard streams.

stdout

The standard output file descriptor. Initially redirected to stderr for a Disco worker.

See standard streams.

stderr

The standard error file descriptor. The worker sends messages to the master over stderr.

See standard streams.

tag

A labelled collection of data in DDFS.

See also Tags.

task

A task is essentially a unit of work provided to a worker. A Disco job is made up of map and reduce tasks.

See also disco.task.

worker

A worker is responsible for carrying out a task. A Disco job specifies the executable that is the worker. Workers are scheduled to run on nodes close to the data they are supposed to process.

ZIP

Archive/compression format, used e.g. for the job home.

See ZIP.
