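The annotations on the questions below are my own. If any of them are wrong, I hope someone more experienced will point it out. Thanks.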
Source: http://www.cnblogs.com/jarlean/archive/2013/04/08/3008308.html
Q1. Name the most common InputFormats defined in Hadoop. Which one is the default? (TextInputFormat is the default format.)
The following 3 are the most common InputFormats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat (named KeyValueTextInputFormat in the Hadoop API)
- SequenceFileInputFormat

Q2. What is the difference between the TextInputFormat and KeyValueInputFormat classes?
TextInputFormat: Reads text files line by line and gives the Mapper the byte offset of each line as the key and the line itself as the value. (Text uses the offset as the key and the actual content as the value.)
KeyValueInputFormat: Reads a text file and parses each line into a key/value pair. Everything up to the first tab character is sent to the Mapper as the key, and the remainder of the line is sent as the value. (Data in this format is a key/value combination separated by a tab.)

Q3. What is an InputSplit in Hadoop?
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit. (An input split is the block of data handed to one map task.)

Q4. How is the splitting of a file invoked in the Hadoop framework?
The Hadoop framework invokes it by calling the getSplits() method of the InputFormat class (such as FileInputFormat) that the user configured for the job. (Hadoop performs the splitting by running getSplits() on the class the user specified.)

Q5. Consider this scenario: in an M/R system,
- the HDFS block size is 64 MB
- the input format is FileInputFormat
- we have 3 files of size 64 KB, 65 MB and 127 MB
How many input splits will the Hadoop framework make?
Hadoop will make 5 splits, as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
(A file smaller than the block size takes up no extra space. For a file larger than the block size, Hadoop fills one block first and then writes the remaining data into the next empty block.)
This count assumes every split is cut exactly at a block boundary. The stock FileInputFormat tolerates a final split up to 10% larger than the split size, so in practice the 65 MB file may stay in a single split, for 4 splits in total.
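
The split arithmetic in Q5 can be checked with a short plain-Java sketch. It does not use Hadoop itself. It counts splits two ways: by simple ceiling division against the block size, which reproduces the block-by-block answer of 5, and with a loop that models the 10% slack (`SPLIT_SLOP = 1.1`) that Hadoop's stock FileInputFormat is understood to apply before cutting a new split. The class and method names are hypothetical, and the split size is assumed to equal the 64 MB block size.

```java
public class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    // Assumed slack factor, modelled on FileInputFormat's 10% slop.
    static final double SPLIT_SLOP = 1.1;

    // Block-by-block count: ceil(fileSize / splitSize).
    public static long naiveSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    // Count with slack: keep cutting full-size splits while the remainder
    // is more than 1.1x the split size, then emit whatever is left as one split.
    public static long slopSplits(long fileSize, long splitSize) {
        long remaining = fileSize;
        long count = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            count++;
        }
        if (remaining > 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        long[] files = {64 * 1024L, 65 * MB, 127 * MB}; // 64 KB, 65 MB, 127 MB

        long naive = 0;
        long slop = 0;
        for (long size : files) {
            naive += naiveSplits(size, block);
            slop += slopSplits(size, block);
        }
        System.out.println("block-based count: " + naive);
        System.out.println("with 10% slack:    " + slop);
    }
}
```

Under these assumptions only the 65 MB file differs between the two counts: 65/64 is below the 1.1 threshold, so the slack-aware loop leaves it as one split instead of cutting off a 1 MB tail.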