zoukankan      html  css  js  c++  java
  • MapReduce(2): How does Mapper work

    In the previous post, we've illustrated how Hadoop MapReduce prepares input for Mappers. Long story short, InputSplit convert physical storaged data into many logical unit, and each one will be processed by a RecordReader, who will generate input (K,V) pairs for Mapper. I used to be confused about how (K,V) pairs are generated, but actually it just breaks a 128M file into single lines (just an example), and each line is a (K,V) pair. A mapper process these pairs one by one untill the end of the file.

    A user-defined mapper, takes input (K,V) pairs from RecordReader, generate new key/value pair set at the output side.Usually we call the new (K,V) pairs as 'immediate (K,V) pairs'. For example: in the post (Using MapReduce on Azure), we define a Mapper as following:

    #!/usr/bin/env python
    """mapper.py"""
    
    import sys
    
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print '%s	%s' % (word, 1)
    

     We can see, this mapper just breaks a line into words set, and ouput immediate (K,V) pairs, in which key is the word and value is 1.

    A funny but intuitive illustration for this process is cutting a car into pieces:

  • 相关阅读:
    北风设计模式课程---13、享元模式
    北风设计模式课程---观察者模式2
    北风设计模式课程---观察者模式
    北风设计模式课程---12、观察者模式
    Linux下yum订购具体解释
    不黑诺基亚,他们觉得萌萌达
    System.Threading.ThreadStateException
    SRS微信号和QQ组
    hibernate它 10.many2many单向
    UML 简单的总结
  • 原文地址:https://www.cnblogs.com/rhyswang/p/10946727.html
Copyright © 2011-2022 走看看