Hadoop: Add third-party libraries to MapReduce job

    Source: http://hadoopi.wordpress.com/2014/06/05/hadoop-add-third-party-libraries-to-mapreduce-job/

    Anybody working with Hadoop has probably already faced the same common issue: how to add third-party libraries to a MapReduce job.

    Add libjars option

    The first solution, probably the most common one, consists of adding libraries using the -libjars parameter on the CLI. To make it work, your class MyClass must use the GenericOptionsParser class. The easiest way is to implement the Hadoop Tool interface, as described in the post Hadoop: Implementing the Tool interface for MapReduce driver (a minimal driver sketch follows the command below).

    $ export LIBJARS=/path/jar1,/path/jar2
    $ hadoop jar /path/to/my.jar com.wordpress.hadoopi.MyClass -libjars ${LIBJARS} value
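
    For reference, a minimal driver implementing the Tool interface might look like the sketch below. The class name matches the command above, but the job wiring is a placeholder and not code from the original post:

        package com.wordpress.hadoopi;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class MyClass extends Configured implements Tool {

            @Override
            public int run(String[] args) throws Exception {
                // getConf() already reflects -libjars and the other generic
                // options, because ToolRunner ran GenericOptionsParser for us
                Job job = new Job(getConf(), "my-job");
                job.setJarByClass(MyClass.class);
                // ... set mapper, reducer, input and output paths from args ...
                return job.waitForCompletion(true) ? 0 : 1;
            }

            public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(
                    new Configuration(), new MyClass(), args));
            }
        }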
    

    This will obviously work only when playing with the CLI, so how can we add such external jar files when not using the CLI?

    Add jar files to Hadoop classpath

    You could certainly upload external jar files to each tasktracker and update HADOOP_CLASSPATH accordingly, but are you really willing to bother the Ops team each time you need to add a new jar? This works well on a single node, but are you going to upload such jars across all 10, 100 or even more Hadoop nodes? This approach does not scale at all!

    Create a fat jar

    Another approach is to create a fat jar, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details). Be aware that this jar will not only contain your classes, but might also include all of your project dependencies (such as the Hadoop libraries) unless you explicitly exclude them (using the provided scope).
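
    For instance, the Hadoop dependency itself can be kept out of the fat jar by marking it as provided in your pom.xml. A minimal sketch (the artifact and version below are illustrative, not from the original post):

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>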
    Here is an example of the Maven plugin you will need to set up:

                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <archive>
                            <manifest>
                                <mainClass></mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
    

    Following a “mvn clean package” command, your fat JAR will be located in the Maven project’s target directory, as follows:

    drwxr-xr-x  2 antoine  staff        68 Jun 10 09:30 archive-tmp
    drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 classes
    drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-sources
    drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-test-sources
    drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 maven-archiver
    drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 myproject-1.0-SNAPSHOT
    -rw-r--r--  1 antoine  staff  63880020 Jun 10 09:30 myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
    drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 surefire-reports
    drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 test-classes
    

    In the above example, note the actual size of the JAR file (61 MB). Quite fat, isn’t it?
    You can make sure all dependencies have been added by listing the jar’s content with the below command:

    $ jar -tf myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
    
    META-INF/
    META-INF/MANIFEST.MF
    com/aamend/hadoop/allMyClasses.class
    ...
    com/others/allMyDependencies.class
    ...
    

    Use Distributed cache

    I always follow this approach when using third-party libraries in my MapReduce jobs. One could say this approach is not elegant, but I can work without annoying anyone from the Ops team :). I first create a directory “lib” in my HDFS home directory (“/user/hadoopi/”). You could even use “/tmp”, it does not matter. I then create a static method that:

    1. Locates the jar file that includes the class I need
    2. Uploads this jar to HDFS
    3. Adds the uploaded jar file to the Hadoop distributed cache

    Simply add the following method (with its imports) to some Utils class:

        // Imports needed at the top of the Utils class:
        // import java.io.File;
        // import java.io.IOException;
        // import org.apache.hadoop.conf.Configuration;
        // import org.apache.hadoop.filecache.DistributedCache;
        // import org.apache.hadoop.fs.FileSystem;
        // import org.apache.hadoop.fs.Path;

        private static void addJarToDistributedCache(
                Class<?> classToAdd, Configuration conf)
            throws IOException {

            // Locate the jar file that contains classToAdd
            String jar = classToAdd.getProtectionDomain().
                    getCodeSource().getLocation().
                    getPath();
            File jarFile = new File(jar);

            // Declare the new HDFS location
            Path hdfsJar = new Path("/user/hadoopi/lib/"
                    + jarFile.getName());

            // Get a handle on HDFS
            FileSystem hdfs = FileSystem.get(conf);

            // Copy the jar file to HDFS (keep the source, overwrite the target)
            hdfs.copyFromLocalFile(false, true,
                new Path(jar), hdfsJar);

            // Add the uploaded jar to the distributed classpath
            DistributedCache.addFileToClassPath(hdfsJar, conf);
        }
    

    The only thing you need to remember is to call this method prior to job submission…

        public static void main(String[] args) throws Exception {
    
            // Create Hadoop configuration
            Configuration conf = new Configuration();
    
            // Add 3rd-party libraries
            addJarToDistributedCache(MyFirstClass.class, conf);
            addJarToDistributedCache(MySecondClass.class, conf);
    
            // Create my job
            Job job = new Job(conf, "Hadoop-classpath");
            .../...
        }
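
    As a side note, on recent Hadoop 2.x releases DistributedCache is deprecated, and the same effect can be obtained through the Job API directly. A minimal sketch, assuming the jar has already been uploaded to HDFS as above (the jar name is just an example):

        // Hadoop 2.x alternative: register the HDFS jar on the task
        // classpath through the Job API instead of DistributedCache
        Job job = Job.getInstance(conf, "Hadoop-classpath");
        job.addFileToClassPath(new Path("/user/hadoopi/lib/myjar.jar"));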
    

    Here you are: your MapReduce job is now able to use any external JAR file.
