zoukankan      html  css  js  c++  java
  • Hadoop 学习笔记3 Develping MapReduce

    小笔记:

    Mavon是一种项目管理工具,通过xml配置来设置项目信息。

    Mavon POM(project of model).

     

    Steps:

    1. set up and configure the development environment.

    2. writing your map and reduce functions and run them in local (standalone) mode from the command line or within your IDE.

    3. unit test --> test on small dataset --> test on the full dataset after unleash in a cluster

        --> tuning 

    1. Configuration API

    • Components in Hadoop are configured using Hadoop’s own configuration API.
    • org.apache.hadoop.conf package
    • Configurations read their properties from resources — XML files with a simple structure for defining name-value pairs.

    For example, write a configuration-1.xml like:

    <?xml version="1.0"?>
    <configuration>
      <property>
         <name>color</name>
         <value>yellow</value>
         <description>Color</description>
      </property>
      <property>
         <name>size</name>
         <value>10</value>
         <description>Size</description>
      </property>
      <property>
         <name>weight</name>
         <value>heavy</value>
         <final>true</final>
         <description>Weight</description>
      </property>
      <property>
         <name>size-weight</name>
         <value>${size},${weight}</value>
         <description>Size and weight</description>
      </property>
    </configuration>
    

    then access it by coding below:

    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml"); // more than one resource are added orderly, and the latter will overwrite the former.

    assertThat(conf.get("color"), is("yellow")); assertThat(conf.getInt("size", 0), is(10)); assertThat(conf.get("breadth", "wide"), is("wide"));

    Note:

    • type information is not stored in the XML file;
    • instead, properties can be interpreted as a given type when they are read.
    • Also, the get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
    • more than one resource are added orderly, and the latter properties will overwrite the former.
    • However, properties that are marked as final cannot be overridden in later definitions.
    • system properties take priority:
    System.setProperty("size", "14")  
    
    • Options specified with -D take priority over properties from the configuration files.

          This will override the number of reducers set on the cluster or set in any client-side configuration files.

    % hadoop ConfigurationPrinter -D color=yellow | grep color
    

      

    2. Set up dev enviroment

    The Maven POMs (Project Object Model) are used to show the dependencies needed for building and testing MapReduce programs. Actually a xml file.

    • hadoop-client dependency, which contains all the Hadoop client-side classes needed to interact with HDFS and MapReduce.
    • For running unit tests, we use junit,
    • for writing MapReduce tests, we use mrunit.
    • The hadoop-minicluster library contains the “mini-” clusters that are useful for testing with Hadoop clusters running in a single JVM.

    Many IDEs can read Maven POMs directly, so you can just point them at the directory containing the pom.xml file and start writing code.

    Alternatively, you can use Maven to generate configuration files for your IDE. For example, the following creates Eclipse configuration files so you can import the project into Eclipse:

    % mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
    

    3. Managing switching

    It is common to switch between running the application locally and running it on a cluster.

    • have Hadoop configuration files containing the connection settings for each cluster
    • we assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoopcluster.xml
    • For example, the following command shows a directory listing on the HDFS serverrunning in pseudodistributed mode on localhost:

            - conf  

    % hadoop  fs  -conf  conf/hadoop-localhost.xml  -ls
    
    Found 2 items
    drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 input
    drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 output
    

    4.  Starts MapReduce example:

    Mapper:  to get year and temperature from an input string

    public class MaxTemperatureMapper
         extends Mapper<LongWritable, Text, Text, IntWritable> {
    
    @Override
    public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
           String line = value.toString();
           String year = line.substring(15, 19);
           int airTemperature = Integer.parseInt(line.substring(87, 92));
    
           context.write(new Text(year), new IntWritable(airTemperature));
          }
    }
    

    Unit test for the Mapper:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.*;
    
    public class MaxTemperatureMapperTest {
       @Test
       public void processesValidRecord() throws IOException, InterruptedException {
            Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                          // Year ^^^^
                    "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                                          // Temperature ^^^^^
    
           new MapDriver<LongWritable, Text, Text, IntWritable>()
             .withMapper(new MaxTemperatureMapper())
             .withInput(new LongWritable(0), value)
             .withOutput(new Text("1950"), new IntWritable(-11))
             .runTest();
        }
    }
    

    Reducer:  to get the maxmium

    public class MaxTemperatureReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    
       @Override
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
    
          int maxValue = Integer.MIN_VALUE;
      
          for (IntWritable value : values) {
               maxValue = Math.max(maxValue, value.get());
           }
    
          context.write(key, new IntWritable(maxValue));
        }
    }
    

    Unit test for the Reducer:

    @Test
    public void returnsMaximumIntegerInValues() throws IOException, InterruptedException {
    
       new ReduceDriver<Text, IntWritable, Text, IntWritable>()
           .withReducer(new MaxTemperatureReducer())
           .withInput(new Text("1950"),
                           Arrays.asList(new IntWritable(10), new IntWritable(5)))
           .withOutput(new Text("1950"), new IntWritable(10))
           .runTest();
    }
    

    5 . a write job driver 

    Using the Tool interface , it’s easy to write a driver to run a MapReduce job.

    Then run the driver locally.

    % mvn compile
    % export HADOOP_CLASSPATH=target/classes/
    % hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml 
    input/ncdc/micro output
    

    % hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
    

    The local job runner uses a single JVM to run a job, so as long as all the classes that your job needs are on its classpath, then things will just work.

    6. Running on a cluster

    • a job’s classes must be packaged into a job JAR file to send to the cluster

     

     

  • 相关阅读:
    spring boot使用自定义注解+AOP实现对Controller层指定方法的日志记录
    spring事务管理中,注解方式和xml配置方式优先级谁高?
    synchronized修饰类中不同方法,调用的时候方法互斥吗
    java(spring boot)实现二维码生成(可以插入中间log和底部文字)
    java借助Robot给微信好友自动发消息(可发送表情包)
    js中Map类型的使用
    【转】Intellij笔记
    Tomcat6.0webappsevopWEB-INFclasses (系统找不到指定的路径)
    多线程多进程之其他
    文件操作
  • 原文地址:https://www.cnblogs.com/skyEva/p/5567276.html
Copyright © 2011-2022 走看看