Hadoop Streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
How Streaming Works
In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
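The default splitting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Hadoop's own code; the function name split_key_value is hypothetical.

```python
def split_key_value(line, separator="\t"):
    """Split one output line into (key, value) at the first separator.

    Illustrative only: mirrors the default rule described in the text.
    """
    line = line.rstrip("\n")
    if separator not in line:
        # No separator: the whole line is the key and the value is null.
        return line, None
    key, value = line.split(separator, 1)
    return key, value


print(split_key_value("apple\t3\tred\n"))
print(split_key_value("no-tab-here\n"))
```

Only the first tab matters: any later tabs stay inside the value.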
You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
You can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status count as a failure or a success, respectively. By default, streaming tasks that exit with a non-zero status are considered failed tasks.
Streaming Command Options
Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.
Note: Be sure to place the generic options before the streaming options; otherwise the command will fail. For an example, see Making Archives Available to Tasks.
bin/hadoop command [genericOptions] [streamingOptions]
The Hadoop streaming command options are listed here:
Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for mapper
-output directoryname | Required | Output location for reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass environment variable to streaming commands
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when map task fails
-reducedebug | Optional | Script to call when reduce task fails
Specifying a Java Class as the Mapper/Reducer
You can supply a Java class as the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
You can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status count as a failure or a success, respectively. By default, streaming tasks that exit with a non-zero status are considered failed tasks.
Packaging Files With Job Submissions
You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as a part of job submission. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py
The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of job submission.
In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc.) that may be used by the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt
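As a rough sketch of how a mapper might use a packaged auxiliary file: files shipped with -file are placed in the task's working directory, so a script can open them by bare name. The contents of myPythonScript.py are not given in this document, so the filtering logic and the function names below are hypothetical.

```python
def load_dictionary(path):
    """Read one word per line from a file shipped alongside the script."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def map_lines(lines, dictionary):
    """Yield 'word<TAB>1' for every input word found in the dictionary."""
    for line in lines:
        for word in line.split():
            if word in dictionary:
                yield word + "\t1"


# In a streaming task this would be roughly:
#   words = load_dictionary("myDictionary.txt")   # shipped via -file
#   for out in map_lines(sys.stdin, words): print(out)
# Here the same logic runs on in-memory data instead.
demo_dictionary = {"hadoop", "streaming"}
for out in map_lines(["hadoop streaming rocks", "hello hadoop"], demo_dictionary):
    print(out)
```

Keeping the per-line logic in a function that accepts any iterable makes the mapper easy to try locally before submitting a job.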
Specifying Other Plugins for Jobs
Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:
-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner streamingCommand or JavaClassName
The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.
The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.
Setting Environment Variables
To set an environment variable in a streaming command use:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
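Inside the mapper or reducer, a variable passed with -cmdenv can be read like any other environment variable. A minimal Python sketch, assuming the EXAMPLE_DIR setting above; the helper names and the file name words.txt are hypothetical.

```python
import os


def example_dir(default="."):
    """Return the directory passed via -cmdenv EXAMPLE_DIR=..., or a fallback."""
    return os.environ.get("EXAMPLE_DIR", default)


def dictionary_path(filename):
    # Build a path inside the configured directory.
    return os.path.join(example_dir(), filename)


print(dictionary_path("words.txt"))
```

Falling back to a default keeps the script usable when it is run outside a streaming job.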
Generic Command Options
Streaming supports streaming command options as well as generic command options. The general command line syntax is shown below.
Note: Be sure to place the generic options before the streaming options; otherwise the command will fail. For an example, see Making Archives Available to Tasks.
bin/hadoop command [genericOptions] [streamingOptions]
The Hadoop generic command options you can use with streaming are listed here:
Parameter | Optional/Required | Description
-conf configuration_file | Optional | Specify an application configuration file
-D property=value | Optional | Use value for given property
-fs host:port or local | Optional | Specify a namenode
-jt host:port or local | Optional | Specify a job tracker
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars | Optional | Specify comma-separated jar files to include in the classpath
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines
Specifying Configuration Variables with the -D Option
You can specify additional configuration variables by using "-D <property>=<value>".
Specifying Directories
To change the local temp directory use:
-D dfs.data.dir=/tmp
To specify additional local temp directories use:
-D mapred.local.dir=/tmp/local
-D mapred.system.dir=/tmp/system
-D mapred.temp.dir=/tmp/temp
Note: For more details on jobconf parameters see: mapred-default.html
Specifying Map-Only Jobs
Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
-D mapred.reduce.tasks=0
To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
Specifying the Number of Reducers
To specify the number of reducers, for example two, use:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
Customizing How Lines are Split into Key/Value Pairs
As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n >= 1) occurrence of that separator, rather than the first occurrence in a line (the default), as the boundary between the key and the value. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
In the above example, "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).
Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.
Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for Map/Reduce inputs. By default the separator is the tab character.
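The effect of a custom separator and key field count can be illustrated with a short Python sketch. It only mimics the rule as described in this section, it is not Hadoop's implementation, and the function name split_at_nth is hypothetical.

```python
def split_at_nth(line, separator=".", num_key_fields=4):
    """Split a line into (key, value) at the nth occurrence of separator."""
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Fewer than n separators: the whole line is the key, the value is empty.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_at_nth("11.12.1.2.rest.of.line"))
print(split_at_nth("11.12.1"))
```

With the settings from the example command (separator "." and four key fields), the first call yields the key 11.12.1.2 and the value rest.of.line, while the second keeps the whole line as the key.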
Working with Large Files and Archives
The -files and -archives options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.
Note: The -files and -archives options are generic options. Be sure to place the generic options before the command options; otherwise the command will fail. For an example, see The -archives Option. Also see Other Supported Options.
Making Files Available to Tasks
The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.
In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.
-files hdfs://host:fs_port/user/testfile.txt
You can specify a different symlink name for -files using #.
-files hdfs://host:fs_port/user/testfile.txt#testfile
Multiple entries can be specified like this:
-files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
Making Archives Available to Tasks
The -archives option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files.
In this example, Hadoop automatically creates a symlink named testfile.jar in the current working directory of tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.
-archives hdfs://host:fs_port/user/testfile.jar
You can specify a different symlink name for -archives using #.
-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir
In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar" is a symlink to the archived directory, which has the files "cache.txt" and "cache2.txt".
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \
    -D mapred.map.tasks=1 \
    -D mapred.reduce.tasks=1 \
    -D mapred.job.name="Experiment" \
    -input "/user/me/samples/cachefile/input.txt" \
    -output "/user/me/samples/cachefile/out" \
    -mapper "xargs cat" \
    -reducer "cat"
$ ls test_jar/
cache.txt cache2.txt
$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt(in = 30) (out= 29)(deflated 3%)
adding: cache2.txt(in = 37) (out= 35)(deflated 5%)
$ hadoop dfs -put cachedir.jar samples/cachefile
$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
cachedir.jar/cache.txt
cachedir.jar/cache2.txt
$ cat test_jar/cache.txt
This is just the cache string
$ cat test_jar/cache2.txt
This is just the second cache string
$ hadoop dfs -ls /user/me/samples/cachefile/out
Found 1 items
/user/me/samples/cachefile/out/part-00000 <r 3> 69
$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string
This is just the second cache string
More Usage Examples
Hadoop Partitioner Class
Hadoop has a library class, KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in the previous example. The two variables are used by streaming to identify the key/value pair of the mapper.
The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the -D mapred.text.key.partitioner.options=-k1,2 option. Here, -D map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting. A simple illustration is shown here:
Output of map (the keys)
11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2
Partition into 3 reducers (the first 2 fields are used as keys for partition)
11.11.4.1
-----------
11.12.1.2
11.12.1.1
-----------
11.14.2.3
11.14.2.2
Sorting within each partition for the reducer (all 4 fields used for sorting)
11.11.4.1
-----------
11.12.1.1
11.12.1.2
-----------
11.14.2.2
11.14.2.3
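The illustration above can be reproduced with a small Python sketch. It groups keys by their first two fields and sorts each group on the full key. This is a simplification: the real partitioner hashes the selected fields to pick a reducer, so several prefixes may land on the same reducer. The function name is hypothetical.

```python
from collections import defaultdict


def partition_and_sort(keys, separator=".", num_partition_fields=2):
    """Group keys by their first fields, then sort each group by the full key."""
    groups = defaultdict(list)
    for key in keys:
        prefix = separator.join(key.split(separator)[:num_partition_fields])
        groups[prefix].append(key)
    # Plain string sort within each partition, using all fields of the key.
    return {prefix: sorted(members) for prefix, members in groups.items()}


map_output = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
for prefix, members in sorted(partition_and_sort(map_output).items()):
    print(prefix, members)
```

Each printed group corresponds to one of the three partitions in the illustration.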
Hadoop Comparator Class
Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications. This class provides a subset of features provided by the Unix/GNU Sort. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.comparator.options=-k2,2nr \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will sort the outputs by the second field of the keys using the -D mapred.text.key.comparator.options=-k2,2nr option. Here, -n specifies that the sorting is numerical sorting and -r specifies that the result should be reversed. A simple illustration is shown below:
Output of map (the keys)
11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2
Sorting output for the reducer (where second field used for sorting)
11.14.2.3
11.14.2.2
11.12.1.2
11.12.1.1
11.11.4.1
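The -k2,2nr ordering can be mimicked in Python by sorting on the second field as an integer, in reverse. This is only a sketch of the ordering, not the comparator itself; in a real job the relative order of keys that tie on the second field is up to the framework, whereas Python's sort keeps their input order. The function name is hypothetical.

```python
def sort_by_second_field_desc(keys, separator="."):
    """Sort keys numerically on their second field, largest first."""
    return sorted(keys, key=lambda k: int(k.split(separator)[1]), reverse=True)


map_output = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
for key in sort_by_second_field_desc(map_output):
    print(key)
```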
Hadoop Aggregate Package
Hadoop has a library package called Aggregate. Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as "sum", "max", "min" and so on over a sequence of values. Aggregate allows you to define a mapper plugin class that is expected to generate "aggregatable items" for each input key/value pair of the mappers. The combiner/reducer will aggregate those aggregatable items by invoking the appropriate aggregators.
To use Aggregate, simply specify "-reducer aggregate":
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py
The Python program myAggregatorForKeyCount.py looks like:
#!/usr/bin/python
import sys

def generateLongCountToken(id):
    # "LongValueSum:" asks the aggregate reducer to sum the values for this id.
    # Key and value are separated by a tab, the default streaming separator.
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    # Iterating over stdin stops cleanly at end of file.
    for line in sys.stdin:
        line = line.rstrip("\n")
        fields = line.split("\t")
        print(generateLongCountToken(fields[0]))

if __name__ == "__main__":
    main(sys.argv)
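To see what the mapper output amounts to, the following Python sketch simulates the summing step locally. It is a stand-in for the aggregate reducer under the assumption that each mapper line has the form LongValueSum:<id>, a tab, and a count; it is not the Aggregate package itself, and the function name is hypothetical.

```python
from collections import defaultdict


def sum_long_values(lines):
    """Add up the counts emitted for each id by LongValueSum tokens."""
    totals = defaultdict(int)
    for line in lines:
        token, _, amount = line.rstrip("\n").partition("\t")
        aggregator, _, key = token.partition(":")
        if aggregator == "LongValueSum":
            totals[key] += int(amount)
    return dict(totals)


mapper_output = ["LongValueSum:alice\t1", "LongValueSum:bob\t1", "LongValueSum:alice\t1"]
print(sum_long_values(mapper_output))
```

Running the mapper script on a sample file and feeding its output through a function like this is a quick way to check the token format before submitting a job.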
Hadoop Field Selection Class
Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the Unix "cut" utility. The map function defined in the class treats each input key/value pair as a list of fields. You can specify the field separator (the default is the tab character). You can select an arbitrary list of fields as the map output key, and an arbitrary list of fields as the map output value. Similarly, the reduce function defined in the class treats each input key/value pair as a list of fields. You can select an arbitrary list of fields as the reduce output key, and an arbitrary list of fields as the reduce output value. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D map.output.key.field.separator=. \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -D mapred.data.field.separator=. \
    -D map.output.key.value.fields.spec=6,5,1-3:0- \
    -D reduce.output.key.value.fields.spec=0-2:5- \
    -D mapred.reduce.tasks=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
The option "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. Key selection spec and value selection spec are separated by ":". In this case, the map output key will consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all fields (0- means field 0 and all the subsequent fields).
The option "-D reduce.output.key.value.fields.spec=0-2:5-" specifies key/value selection for the reduce outputs. In this case, the reduce output key will consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1). The reduce output value will consist of all fields starting from field 5 (corresponding to all the original fields).
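How a selection spec picks fields can be sketched in Python. This is a simplified reading of the spec syntax as described above (single indices, closed ranges such as 1-3, and open ranges such as 0-); it is not the FieldSelectionMapReduce code, and all function names are hypothetical.

```python
def parse_field_list(spec):
    """Parse a list such as '6,5,1-3' or '0-' into (indices, open_start)."""
    indices, open_start = [], None
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.endswith("-"):
            open_start = int(part[:-1])  # e.g. '0-' means field 0 onward
        elif "-" in part:
            first, last = part.split("-")
            indices.extend(range(int(first), int(last) + 1))
        else:
            indices.append(int(part))
    return indices, open_start


def select_fields(fields, spec):
    """Pick the listed fields, then append any open-ended tail."""
    indices, open_start = parse_field_list(spec)
    chosen = [fields[i] for i in indices if i < len(fields)]
    if open_start is not None:
        chosen.extend(fields[open_start:])
    return chosen


def apply_spec(fields, key_value_spec):
    """Split 'keySpec:valueSpec' and apply each half to the field list."""
    key_spec, value_spec = key_value_spec.split(":")
    return select_fields(fields, key_spec), select_fields(fields, value_spec)


record = "a.b.c.d.e.f.g".split(".")
key, value = apply_spec(record, "6,5,1-3:0-")
print(key)
print(value)
```

For the seven-field sample record, the key is built from fields 6, 5, 1, 2, 3 and the value from every field, matching the map-side spec in the example command.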
Frequently Asked Questions
How do I use Hadoop Streaming to run an arbitrary set of (semi) independent tasks?
Often you do not need the full power of Map/Reduce, but only need to run multiple instances of the same program, either on different parts of the data, or on the same data but with different parameters. You can use Hadoop Streaming to do this.
How do I process files, one per map?
As an example, consider the problem of zipping (compressing) a set of files across the Hadoop cluster. You can achieve this using either of these methods:
1. Hadoop Streaming and custom mapper script:
o Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
o Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory.
2. The existing Hadoop Framework:
o Add these commands to your main function:
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);
    conf.setInputFormat(NonSplitableTextInputFormat.class);
    conf.setNumReduceTasks(0);
o Write your map function:
    public void map(WritableComparable key,
                    Writable value,
                    OutputCollector output,
                    Reporter reporter)
            throws IOException {
        output.collect((Text)value, null);
    }
o Note that the output filename will not be the same as the original filename.
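The first method (a custom mapper script) could be sketched roughly as follows in Python. The function only builds the three commands for one file; a real mapper would read file names from stdin and run each command, for example with subprocess.check_call. The function name and the sample paths are hypothetical, and error handling is omitted.

```python
import posixpath


def gzip_commands(hdfs_path, output_dir):
    """Return the commands (as argument lists) to gzip one HDFS file."""
    name = posixpath.basename(hdfs_path)
    return [
        ["hadoop", "dfs", "-get", hdfs_path, name],         # copy to local disk
        ["gzip", name],                                     # produces name + ".gz"
        ["hadoop", "dfs", "-put", name + ".gz", output_dir],  # copy back to HDFS
    ]


for command in gzip_commands("/user/me/input/part1.txt", "/user/me/zipped"):
    print(" ".join(command))
```

Building the commands as argument lists, rather than one shell string, avoids quoting problems when file names contain spaces.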
How many reducers should I use?
See the Hadoop Wiki for details: Reducer
If I set up an alias in my shell script, will that work after -mapper?
For example, say I do: alias c1='cut -f1'. Will -mapper "c1" work?
Using an alias will not work, but variable substitution is allowed as shown in this example:
$ hadoop dfs -cat samples/student_marks
alice 50
bruce 70
charlie 80
dan 75
$ c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.job.name='Experiment' \
    -input /user/me/samples/student_marks \
    -output /user/me/samples/student_out \
    -mapper "$c2" -reducer 'cat'
$ hadoop dfs -ls samples/student_out
Found 1 items
/user/me/samples/student_out/part-00000 <r 3> 16
$ hadoop dfs -cat samples/student_out/part-00000
50
70
75
80
Can I use UNIX pipes?
For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
Currently this does not work and gives a "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.
What do I do if I get the "No space left on device" error?
For example, when I run a streaming job by distributing large executables (for example, 3.6G) through the -file option, I get a "No space left on device" error.
The jar packaging happens in a directory pointed to by the configuration variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the value to a directory with more space:
-D stream.tmpdir=/export/bigspace/...
How do I specify multiple input directories?
You can specify multiple input directories with multiple '-input' options:
hadoop jar hadoop-streaming.jar -input '/user/foo/dir1' -input '/user/foo/dir2'
How do I generate output files with gzip format?
Instead of plain text files, you can generate gzip files as your output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
How do I provide my own input/output format with streaming?
At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default Hadoop streaming jar.
How do I parse XML documents using streaming?
You can use the record reader StreamXmlRecordReader to process XML documents.
hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" ..... (rest of the command)
Anything found between BEGIN_STRING and END_STRING would be treated as one record for map tasks.
How do I update counters in streaming applications?
A streaming process can use stderr to emit counter information. To update a counter, reporter:counter:<group>,<counter>,<amount> should be sent to stderr.
How do I update status in streaming applications?
A streaming process can use stderr to emit status information. To set a status, reporter:status:<message> should be sent to stderr.
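Both the counter and the status message formats can be produced from a Python mapper or reducer with a couple of small helpers. This is a minimal sketch; the helper names, the group MyGroup, and the counter BadRecords are hypothetical.

```python
import sys


def counter_line(group, counter, amount):
    """Format a counter update in the reporter:counter syntax."""
    return "reporter:counter:{0},{1},{2}\n".format(group, counter, amount)


def status_line(message):
    """Format a status update in the reporter:status syntax."""
    return "reporter:status:{0}\n".format(message)


def report(line, stream=None):
    """Write one reporter line to stderr (or another stream, for testing)."""
    if stream is None:
        stream = sys.stderr
    stream.write(line)
    stream.flush()


report(counter_line("MyGroup", "BadRecords", 1))
report(status_line("processing input"))
```

Each message goes on its own line, and flushing after every write keeps updates from sitting in a buffer while the task is still running.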
How do I get the JobConf variables in a streaming job's mapper/reducer?
During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.
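A small Python sketch of the name transformation follows. It assumes the transformed names are exposed to the streaming process as environment variables, which this section does not state explicitly; the helper names are hypothetical.

```python
import os


def jobconf_env_name(property_name):
    """Turn a JobConf property name into its underscore form."""
    return property_name.replace(".", "_")


def get_jobconf(property_name, default=None):
    # Assumption: the transformed name is available in the environment.
    return os.environ.get(jobconf_env_name(property_name), default)


print(jobconf_env_name("mapred.job.id"))
print(get_jobconf("mapred.job.id", "not running under streaming"))
```

Supplying a default lets the same script run locally, where no job configuration is present.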