  • Hadoop Combiners

    In the last post and in the preceding one we saw how to write a MapReduce program for finding the top-n items of a data set. The difference between the two was that the first program (which we call basic) emitted to the reducers every single item read from input, while the second (which we call enhanced) made a partial computation and emitted only a subset of the input. The enhanced top-n optimizes network transmissions (the fewer key-value pairs emitted, the less network is used to transmit them from mapper to reducer) and reduces the number of keys shuffled and sorted; but this is obtained at the cost of rewriting the mapper. 

    If we look at the code of the mapper of the enhanced top-n, we can see that it implements the idea behind the reducer: it uses a Map to make a partial count of the words and emits every word only once; looking at the reducer's code, we see that it implements the same idea. If we could execute the code of the reducer of the basic top-n after the mapper has run on every machine (with its subset of data), we would obtain exactly the same result as rewriting the mapper as in the enhanced version. This is exactly what Hadoop combiners do: they're executed just after the mapper on every machine to improve performance. To tell Hadoop which class to use as a combiner, we call the Job.setCombinerClass() method. 
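
    As a sketch of how this is wired up (the class names TopNDriver, TopNMapper and TopNReducer are hypothetical placeholders, not taken from the original posts), a driver that reuses its reducer as a combiner could look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TopNDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "top-n with combiner");
            job.setJarByClass(TopNDriver.class);
            job.setMapperClass(TopNMapper.class);
            // the reducer is reused as a combiner: it runs on each mapper's local output
            job.setCombinerClass(TopNReducer.class);
            job.setReducerClass(TopNReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }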

    Caution: using the reducer as a combiner works only if the function we're computing is both commutative (a + b = b + a) and associative (a + (b + c) = (a + b) + c). 
    Let's look at an example. Suppose we're analyzing the traffic of a website and we have an input file with the number of visits per day, in the format "YYYYMMDD visits":

    20140401 100
    20140331 1000
    20140330 1300
    20140329 5100
    20140328 1200
    

    We want to find the day with the highest number of visits. 
    Let's say that we have two mappers; the first one receives the first three lines and the second receives the last two. If we write the mapper to emit every line, the reducer will evaluate something like this:

    max(100, 1000, 1300, 5100, 1200) -> 5100
    

    and the max is 5100. 
    If we use the reducer as a combiner, the reducer will evaluate something like this:

    max( max(100, 1000, 1300), max(5100, 1200)) -> max( 1300, 5100) -> 5100
    

    because each of the two mappers will evaluate the max function locally on its own subset of the data. In this case the result will be 5100 as well, since the function we're evaluating (the max function) is both commutative and associative. 
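
    A minimal sketch of this job (the class names MaxVisits, MaxMapper and MaxReducer are illustrative, and for brevity it only tracks the maximum visit count, not the date it belongs to) could look like this, with MaxReducer registered both via setReducerClass() and setCombinerClass():

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxVisits {

        public static class MaxMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Text KEY = new Text("max");

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // each line looks like "20140401 100": date, then number of visits
                String[] fields = line.toString().trim().split("\\s+");
                if (fields.length == 2) {
                    context.write(KEY, new IntWritable(Integer.parseInt(fields[1])));
                }
            }
        }

        // since max is commutative and associative, this class is safe to use
        // both as the reducer and as the combiner
        public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable value : values) {
                    max = Math.max(max, value.get());
                }
                context.write(key, new IntWritable(max));
            }
        }
    }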

    Let's say that now we need to compute the average number of visits per day. If we write the mapper to emit every line of the input file, the reducer will evaluate this:

    mean(100, 1000, 1300, 5100, 1200) -> 1740
    

    which is 1740. 
    If we use the reducer as a combiner, the reducer will evaluate something like this:

    mean( mean(100, 1000, 1300), mean(5100, 1200)) -> mean( 800, 3150) -> 1975
    

    because each of the two mappers will evaluate the mean function locally on its own subset of the data. In this case the result will be 1975, which is obviously wrong: the correct mean over the whole data set is 1740. 

    So, if we're computing a commutative and associative function and we want to improve the performance of our job, we can use our reducer as a combiner; if we want to improve performance but we're computing a function that is not both commutative and associative, we have to rewrite the mapper or write a new combiner from scratch.
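
    One hedged sketch of such a rewrite (the class names and the "sum,count" value format are illustrative, not from the original posts) is to make the intermediate values associative: the mapper emits "visits,1" for every line, the combiner pre-aggregates those pairs into a partial "sum,count", and only the final reducer performs the division:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MeanVisits {

        // combiner: merges each mapper's "sum,count" pairs without dividing
        public static class MeanCombiner extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                long count = 0;
                for (Text value : values) {
                    String[] parts = value.toString().split(",");
                    sum += Long.parseLong(parts[0]);
                    count += Long.parseLong(parts[1]);
                }
                context.write(key, new Text(sum + "," + count));
            }
        }

        // reducer: merges the partial pairs and divides only once, at the very end
        public static class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                long count = 0;
                for (Text value : values) {
                    String[] parts = value.toString().split(",");
                    sum += Long.parseLong(parts[0]);
                    count += Long.parseLong(parts[1]);
                }
                context.write(key, new DoubleWritable((double) sum / count));
            }
        }
    }

    With this scheme the combiner's output carries enough information for the reducer to compute the exact mean, no matter how the input lines are split across mappers.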

    from: http://andreaiacono.blogspot.com/2014/03/hadoop-combiners.html
