    Hadoop: Setup Maven project for MapReduce in 5mn

    I am sure I am not the only one who has struggled to install the Hadoop Eclipse plugin. The plugin depends heavily on your environment (Eclipse, Ant, JDK) and on your Hadoop distribution and version. Moreover, it only supports the old MapReduce API.
    Creating a Maven project for Hadoop is so simple that spending time building this plugin is pointless. In this article I describe how to set up a first Maven Hadoop project for Cloudera CDH4 in Eclipse.

    Prerequisites

    Maven 3
    JDK 1.6
    Eclipse with the m2eclipse plugin installed
    

    Add Cloudera repository

    Cloudera jar files are not available from the default Maven Central repository. You need to add the Cloudera repository explicitly to your settings.xml (${HOME}/.m2/settings.xml).

    <?xml version="1.0" encoding="UTF-8"?>
    <settings>
        <profiles>
            <profile>
                <id>standard-extra-repos</id>
                <activation>
                    <activeByDefault>true</activeByDefault>
                </activation>
                <repositories>
                    <repository>
                        <!-- Central Repository -->
                        <id>central</id>
                        <url>https://repo1.maven.org/maven2/</url>
                        <releases>
                            <enabled>true</enabled>
                        </releases>
                        <snapshots>
                            <enabled>true</enabled>
                        </snapshots>
                    </repository>
                    <repository>
                        <!-- Cloudera Repository -->
                        <id>cloudera</id>
                        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                        <releases>
                            <enabled>true</enabled>
                        </releases>
                        <snapshots>
                            <enabled>true</enabled>
                        </snapshots>
                    </repository>
                </repositories>
            </profile>
        </profiles>
    </settings>
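
    To confirm that Maven will actually see the Cloudera repository, you can list the repositories declared in your settings.xml. The following is only a minimal sketch: the class SettingsRepoCheck and its method are hypothetical helpers (not part of Maven), and it assumes a JDK 8 or later rather than the JDK 1.6 used for the Hadoop project itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Maven): lists the <repository> entries of a settings.xml.
class SettingsRepoCheck {

    // Returns repository id -> URL, in document order.
    static Map<String, String> listRepositories(String settingsXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(settingsXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> repos = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("repository");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element repo = (Element) nodes.item(i);
                repos.put(text(repo, "id"), text(repo, "url"));
            }
            return repos;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse settings.xml", e);
        }
    }

    // Text of the first descendant element with the given tag name, or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the default location of the user settings file.
        Path settings = Paths.get(System.getProperty("user.home"), ".m2", "settings.xml");
        String xml = new String(Files.readAllBytes(settings), StandardCharsets.UTF_8);
        Map<String, String> repos = listRepositories(xml);
        System.out.println(repos);
        if (!repos.containsKey("cloudera")) {
            System.err.println("No repository with id 'cloudera' found");
        }
    }
}
```

    The sketch only reads <repository> elements; mirrors or plugin repositories declared elsewhere in settings.xml are ignored.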

    Create Maven project

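    In Eclipse, create a new Maven project using the m2eclipse wizard. The build output later in this article shows the coordinates used here: groupId com.aamend.hadoop, artifactId MapReduce, version 0.0.1-SNAPSHOT.

    [Three screenshots of the Eclipse new Maven project wizard]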

    Add Hadoop Nature

    For the Cloudera CDH4 distribution, open the pom.xml file and add the following dependencies:

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>jdk.tools</groupId>
                <artifactId>jdk.tools</artifactId>
                <version>1.6</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-auth</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-core</artifactId>
                <version>2.0.0-mr1-cdh4.0.1</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit-dep</artifactId>
                <version>4.8.2</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    Download dependencies

    Now that you have added the Cloudera repository and created your project, download the dependencies. This is easily done by right-clicking on your Eclipse project and choosing “update Maven dependencies”.
    All these dependencies should now be present in your local .m2 repository.

    [developer@localhost ~]$ find .m2/repository/org/apache/hadoop -name "*.jar" 
    .m2/repository/org/apache/hadoop/hadoop-tools/1.0.4/hadoop-tools-1.0.4.jar
    .m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0-sources.jar
    .m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0.jar
    .m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1-sources.jar
    .m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1.jar
    .m2/repository/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.0.0/hadoop-hdfs-2.0.0-cdh4.0.0.jar
    .m2/repository/org/apache/hadoop/hadoop-streaming/1.0.4/hadoop-streaming-1.0.4.jar
    .m2/repository/org/apache/hadoop/hadoop-auth/2.0.0-cdh4.0.0/hadoop-auth-2.0.0-cdh4.0.0.jar
    [developer@localhost ~]$ 
    

    Create WordCount example

    Create your driver code

    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
     
    public class WordCount {
     
        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {
     
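            // Fail early with a usage message instead of an ArrayIndexOutOfBoundsException
            if (args.length != 2) {
                System.err.println("Usage: WordCount <input path> <output dir>");
                System.exit(2);
            }

            Path inputPath = new Path(args[0]);
            Path outputDir = new Path(args[1]);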
     
            // Create configuration
            Configuration conf = new Configuration(true);
     
            // Create job
            Job job = new Job(conf, "WordCount");
            job.setJarByClass(WordCountMapper.class);
     
            // Setup MapReduce
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setNumReduceTasks(1);
     
            // Specify key / value
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
     
            // Input
            FileInputFormat.addInputPath(job, inputPath);
            job.setInputFormatClass(TextInputFormat.class);
     
            // Output
            FileOutputFormat.setOutputPath(job, outputDir);
            job.setOutputFormatClass(TextOutputFormat.class);
     
            // Delete output if exists
            FileSystem hdfs = FileSystem.get(conf);
            if (hdfs.exists(outputDir))
                hdfs.delete(outputDir, true);
     
            // Execute job
            int code = job.waitForCompletion(true) ? 0 : 1;
            System.exit(code);
     
        }
     
    }

    Create your Mapper class

    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
     
    public class WordCountMapper extends
            Mapper<Object, Text, Text, IntWritable> {
     
        private final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();
     
        // Emits (token, 1) for every comma-separated token of the input line
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
     
            String[] csv = value.toString().split(",");
            for (String str : csv) {
                word.set(str);
                context.write(word, ONE);
            }
        }
    }
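
    Because the mapper splits each line on commas only, tokens keep any surrounding whitespace and consecutive commas produce empty tokens (String.split drops only the trailing empty ones). If that is not what you want, the tokens can be cleaned before they are emitted. The sketch below is plain Java with no Hadoop dependency; CsvTokens is a hypothetical helper name and is not part of the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a stricter variant of the split used in WordCountMapper.
class CsvTokens {

    // Splits on commas, trims each token and skips empty ones.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split(",")) {
            String token = raw.trim();
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The raw split yields four tokens: "a", " b", "" and "a"
        System.out.println(Arrays.asList("a, b,,a".split(",")).size()); // prints 4
        // The cleaned version keeps only the non-empty, trimmed tokens
        System.out.println(tokens("a, b,,a")); // prints [a, b, a]
    }
}
```

    Inside the mapper, the same loop would call word.set(token) and context.write(word, ONE) for each cleaned token.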

    Create your Reducer class

    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
     
    public class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
     
        // Sums all the counts emitted for a given token
        @Override
        public void reduce(Text text, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(text, new IntWritable(sum));
        }
    }
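
    Before running anything on a cluster, it can help to check what the mapper and reducer are expected to produce together. The following sketch mimics the data flow in plain Java on in-memory lines: the same comma split as WordCountMapper, then the same per-key sum as WordCountReducer. LocalWordCount is a hypothetical class name, the code uses no Hadoop API, and it assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the job: no Hadoop classes involved.
class LocalWordCount {

    // "Map" step: split each line on commas; "reduce" step: sum the ones per token.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap keeps keys sorted, which approximates the key order a single reducer sees
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(",")) {
                sums.merge(token, 1, Integer::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a,b,a", "b,c"))); // prints {a=2, b=2, c=1}
    }
}
```

    With TextOutputFormat, the real job writes one line per key, with key and value separated by a tab.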

    Build project

    Exporting a jar file works out of the box with Maven. Execute the following command:

    mvn clean install
    

    You should see output similar to the following:

    .../...
    
    [INFO] 
    [INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ MapReduce ---
    [INFO] Building jar: /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar
    [INFO] 
    [INFO] --- maven-install-plugin:2.3.1:install (default-install) @ MapReduce ---
    [INFO] Installing /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.jar
    [INFO] Installing /home/developer/Workspace/hadoop/MapReduce/pom.xml to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.pom
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 9.159s
    [INFO] Finished at: Sat May 25 00:35:56 GMT+02:00 2013
    [INFO] Final Memory: 16M/212M
    [INFO] ------------------------------------------------------------------------
    

    Your jar file should now be available in the project’s target directory (and also in your ${HOME}/.m2 local repository).

    This jar is ready to be executed on your Hadoop environment.

    hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.aamend.hadoop.MapReduce.WordCount input output
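
    The hadoop jar command above can only find the driver if the class was really packaged under the name you pass on the command line. As a quick sanity check, the sketch below looks for a class entry inside a jar using only the JDK. JarClassCheck is a hypothetical helper name, and the code assumes Java 8 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

// Hypothetical helper: checks whether a jar contains a given class.
class JarClassCheck {

    // com.example.Foo -> com/example/Foo.class
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Scans the jar entries and returns true if the class file is present.
    static boolean containsClass(InputStream jar, String className) {
        String wanted = entryNameFor(className);
        try (JarInputStream in = new JarInputStream(jar)) {
            for (JarEntry e = in.getNextJarEntry(); e != null; e = in.getNextJarEntry()) {
                if (e.getName().equals(wanted)) {
                    return true;
                }
            }
            return false;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the jar, args[1]: fully qualified class name
        try (InputStream jar = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(containsClass(jar, args[1]));
        }
    }
}
```

    For the project built here, the arguments would be target/MapReduce-0.0.1-SNAPSHOT.jar and com.aamend.hadoop.MapReduce.WordCount.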
    

    Each time I need to create a new Hadoop project, I simply copy the pom.xml template described above, and that’s it.

  • Original article: https://www.cnblogs.com/haoliansheng/p/5130234.html