zoukankan      html  css  js  c++  java
  • Hadoop: Setup Maven project for MapReduce in 5mn

    Hadoop: Setup Maven project for MapReduce in 5mn

    I am sure I am not the only one who ever struggled with Hadoop eclipse plugin installation. This plugin strongly depends on your environment (eclipse, ant, jdk) and hadoop distribution and version. Moreover, it only provides the Old API for MapReduce.
    It is so simple to create a maven project for Hadoop that wasting time trying to build this plugin becomes totally useless. I am describing on this article how to setup a first maven hadoop project for Cloudera CDH4 on eclipse.

    Prerequisite

    maven 3
    jdk 1.6
    eclipse with m2eclipse plugin installed
    

    Add Cloudera repository

    Cloudera jar files are not available on default Maven central repository. You need to explicitly add cloudera repo in your settings.xml (under ${HOME}/.m2/settings.xml).

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    <?xml version="1.0" encoding="UTF-8"?>
    <settings>
        <profiles>
            <profile>
                <id>standard-extra-repos</id>
                <activation>
                    <activeByDefault>true</activeByDefault>
                </activation>
                <repositories>
                    <repository>
                        <!-- Central Repository -->
                        <id>central</id>
                        <url>http://repo1.maven.org/maven2/</url>
                        <releases>
                            <enabled>true</enabled>
                        </releases>
                        <snapshots>
                            <enabled>true</enabled>
                        </snapshots>
                    </repository>
                    <repository>
                        <!-- Cloudera Repository -->
                        <id>cloudera</id>
                        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                        <releases>
                            <enabled>true</enabled>
                        </releases>
                        <snapshots>
                            <enabled>true</enabled>
                       </snapshots>
                    </repository>
                </repositories>
            </profile>
        </profiles>
    </settings>

    Create Maven project

    On eclipse, create a new Maven project as follow

    maven

    maven2

    maven3

    Add Hadoop Nature

    For Cloudera distribution CDH4, open pom.xml file and add the following dependencies

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>jdk.tools</groupId>
                <artifactId>jdk.tools</artifactId>
                <version>1.6</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-auth</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.0.0-cdh4.0.0</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-core</artifactId>
                <version>2.0.0-mr1-cdh4.0.1</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit-dep</artifactId>
                <version>4.8.2</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    Download dependencies

    Now that you have added your Cloudera repository and created your project, download dependencies. This can be easily done by right-clicking on your eclipse project, “update Maven dependencies”.
    All these dependencies must have been added on your .m2 repository.

    [developer@localhost ~]$ find .m2/repository/org/apache/hadoop -name "*.jar" 
    .m2/repository/org/apache/hadoop/hadoop-tools/1.0.4/hadoop-tools-1.0.4.jar
    .m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0-sources.jar
    .m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0.jar
    .m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1-sources.jar
    .m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1.jar
    .m2/repository/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.0.0/hadoop-hdfs-2.0.0-cdh4.0.0.jar
    .m2/repository/org/apache/hadoop/hadoop-streaming/1.0.4/hadoop-streaming-1.0.4.jar
    .m2/repository/org/apache/hadoop/hadoop-auth/2.0.0-cdh4.0.0/hadoop-auth-2.0.0-cdh4.0.0.jar
    [developer@localhost ~]$ 
    

    Create WordCount example

    Create your driver code

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
     
    public class WordCount {
     
        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {
     
            Path inputPath = new Path(args[0]);
            Path outputDir = new Path(args[1]);
     
            // Create configuration
            Configuration conf = new Configuration(true);
     
            // Create job
            Job job = new Job(conf, "WordCount");
            job.setJarByClass(WordCountMapper.class);
     
            // Setup MapReduce
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setNumReduceTasks(1);
     
            // Specify key / value
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
     
            // Input
            FileInputFormat.addInputPath(job, inputPath);
            job.setInputFormatClass(TextInputFormat.class);
     
            // Output
            FileOutputFormat.setOutputPath(job, outputDir);
            job.setOutputFormatClass(TextOutputFormat.class);
     
            // Delete output if exists
            FileSystem hdfs = FileSystem.get(conf);
            if (hdfs.exists(outputDir))
                hdfs.delete(outputDir, true);
     
            // Execute job
            int code = job.waitForCompletion(true) ? 0 : 1;
            System.exit(code);
     
        }
     
    }

    Create Mapper class

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
     
    public class WordCountMapper extends
            Mapper<Object, Text, Text, IntWritable> {
     
        private final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();
     
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
     
            String[] csv = value.toString().split(",");
            for (String str : csv) {
                word.set(str);
                context.write(word, ONE);
            }
        }
    }

    Create your Reducer class

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    package com.aamend.hadoop.MapReduce;
     
    import java.io.IOException;
     
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
     
    public class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
     
        public void reduce(Text text, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(text, new IntWritable(sum));
        }
    }

    Build project

    Exporting jar file is actually out of the box using maven. Execute the following command

    mvn clean install
    

    You should see same output as below

    .../...
    
    [INFO] 
    [INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ MapReduce ---
    [INFO] Building jar: /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar
    [INFO] 
    [INFO] --- maven-install-plugin:2.3.1:install (default-install) @ MapReduce ---
    [INFO] Installing /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.jar
    [INFO] Installing /home/developer/Workspace/hadoop/MapReduce/pom.xml to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.pom
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 9.159s
    [INFO] Finished at: Sat May 25 00:35:56 GMT+02:00 2013
    [INFO] Final Memory: 16M/212M
    [INFO] ------------------------------------------------------------------------
    

    And your jar file must be available on project’s target directory (additionally in your ${HOME}/.m2 local repository).

    maven5

    This jar is ready to be executed on your Hadoop environment.

    hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.aamend.hadoop.MapReduce.WordCount input output
    

    Each time I need to create a new Hadoop project, I simply copy pom.xml template described above, and that’s it..

  • 相关阅读:
    精选30道Java笔试题解答
    ASM
    Java Decompiler Plugin For Eclipse IDE
    AMQ5540, AMQ5541 and AMQ5542, application did not supply a user ID and password, 2035 MQRC_NOT_AUTHORIZED
    Shell脚本中的export
    Linux set unset命令
    shell中${}的妙用
    ubuntu alsa2
    ubuntu alsa
    计算机启动boot
  • 原文地址:https://www.cnblogs.com/haoliansheng/p/5130234.html
Copyright © 2011-2022 走看看