Pushing Chunked Data to DDFS
ddfs chunk data:bigtxt /path/to/bigfile.txt
Hint
If the local file /path/to/bigfile.txt is in the current directory,
you must specify it as ./bigfile.txt, or another path containing a /
character; otherwise ddfs will assume you are specifying a tag.
The creation of chunks is record-aware; that is, chunks are created on
record boundaries, and ddfs chunk will never split a single record
across separate chunks. The default record parser breaks records on
line boundaries; you can specify your own record parser using the
reader argument to the ddfs.chunk function, or the -R argument to
ddfs chunk.
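As an illustration, a custom record parser could split records on NUL
bytes instead of newlines. This is a hypothetical sketch: the function
name is invented, and the generator signature (open stream, size, URL)
is an assumption modeled on Disco's reader convention, so check it
against your Disco version before use.

```python
import io

def nul_delimited_reader(stream, size, url):
    """Hypothetical record parser: yield records terminated by NUL bytes.

    Passed via the reader argument of ddfs.chunk (or named with -R on the
    command line), a parser like this ensures chunks are split only on
    the record boundaries it defines.
    """
    buf = b''
    while True:
        block = stream.read(8192)
        if not block:
            break
        buf += block
        # Emit every complete (NUL-terminated) record in the buffer.
        while b'\x00' in buf:
            record, buf = buf.split(b'\x00', 1)
            yield record
    if buf:
        # Trailing data without a terminator is treated as a final record.
        yield buf

# Standalone usage with an in-memory stream:
records = list(nul_delimited_reader(io.BytesIO(b'a\x00bb\x00ccc'), 8, 'url'))
# records is [b'a', b'bb', b'ccc']
```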
The chunked data in DDFS is stored in Disco's internal format, which
means that when your job reads chunked data, it must use
disco.worker.task_io.chain_reader. Hence, in the typical case where
your map tasks read chunked data, specify
map_reader=disco.worker.task_io.chain_reader in your job.
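A minimal job definition might look like the following sketch. It
assumes a running Disco cluster and a tag named data:bigtxt created by
the ddfs chunk command above; the word-count style map function is
purely illustrative. Since it needs a live cluster, it is a
configuration sketch rather than a runnable example.

```python
from disco.core import Job
from disco.worker.task_io import chain_reader

def map(line, params):
    # chain_reader delivers one record (here, one line of the
    # original file) per call to this map function.
    yield line.strip(), 1

if __name__ == '__main__':
    # map_reader=chain_reader is required because the input tag
    # refers to chunked data in Disco's internal format.
    job = Job().run(input=['tag://data:bigtxt'],
                    map=map,
                    map_reader=chain_reader)
```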