  • Hadoop Fully Distributed Installation Tutorial

     

     

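    I. Software Versions

    Hadoop: hadoop-2.6.0.tar.gz

    VMware: VMware-workstation-full-11.0.0-2305329

    Ubuntu: ubuntu-14.04.1-desktop-i386 (other versions also work)

    JDK: jdk-6u45-linux-i586.bin

    The exact versions of the last three items are not critical. If you plan to run HBase 1.0.0, however, JDK 6 is too old and a newer JDK is required (JDK 7 or later).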

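    II. Installation

    1. Installing VMware

    VMware Workstation is virtualization software. Once installed, it lets you create virtual machines, install an operating system on each of them, and then install applications on that guest system; everything behaves as it would on a physical computer.

    Download the software directly from the official VMware site:

    http://www.vmware.com/cn/products/workstation/workstation-evaluation

    If the link above has changed because the site was reorganized, search for VMware in a search engine to find the current download page. Downloading from unofficial sites is not recommended.

    The evaluation version comes with a 30-day trial period.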

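    2. Installing Ubuntu

    Open VMware and click "Create a New Virtual Machine".

    Choose "Typical".

    Click "Browse".

    Select Ubuntu.

    For now, create only two virtual machines and name them Ubuntu1 and Ubuntu2. You can choose other names, but many of the configuration files later in this tutorial would then have to be changed to match, which is extra trouble.

    Remember the password you set; it is used often later on.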

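    3. Installing VMware Tools

    Ubuntu shows that a disc has been inserted into the drive.

    Double-click the disc to open it and copy VMwareTools-9.6.1-1378637.tar.gz to the desktop, much as you would on Windows.

    Right-click the archive and choose Extract Here.

    Open a terminal from the Ubuntu menu and run:

    cd Desktop/vmware-tools-distrib/

    sudo ./vmware-install.pl

    Enter your password when sudo asks for it, press Enter to accept each default, and then restart the system.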

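    Note: on a fresh Ubuntu installation the root account is locked by default; you can neither log in as root nor "su" to root.

    Allowing su to root is simple: give root a password with sudo passwd root.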

    Note: after installing Ubuntu, refresh the package lists so that installing software later goes smoothly:

    cd /etc/apt

    sudo apt-get update

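    4. Creating the User

    Create the hadoop group: sudo addgroup hadoop

    Create the hduser user: sudo adduser --ingroup hadoop hduser

    Give hduser the same password as your main user.

    Give hduser sudo rights: run sudo gedit /etc/sudoers and, below the line root ALL=(ALL) ALL, add

    hduser ALL=(ALL) ALL

    Then reboot the machine: sudo reboot

    Log in as hduser.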

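    5. Host Configuration

    The Hadoop cluster consists of 2 nodes, with 1 master and 2 slaves: the virtual machine Ubuntu1 acts as both master and slave, and Ubuntu2 acts only as a slave.

    Set the hostname: on Ubuntu, edit the machine name with sudo gedit /etc/hostname and change it to Ubuntu1. Reboot, then run the hostname command to check that the new name took effect.

    At this point you can create the second machine by cloning the first (shut it down first, then in the VMware menu choose VM -> Manage -> Clone).

    Note: change the hostname of the clone to Ubuntu2.

    Configure the hosts file: find the IP addresses of Ubuntu1 and Ubuntu2 with ifconfig.

    On each machine, open the hosts file with sudo gedit /etc/hosts and add the following:

       192.168.xxx.xxx Ubuntu1

       192.168.xxx.xxx Ubuntu2

    Replace the IP addresses with the ones your own machines actually have.

    On Ubuntu1, run ping Ubuntu2. If the ping succeeds, the configuration is correct.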
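The hosts entries can also be sanity-checked in code. The following is a minimal sketch, not part of the original procedure: it parses hosts-file text and reports node names that have no entry or that map to a loopback address. The class name HostsCheck and its methods are hypothetical, the sample addresses are the ones that appear in the cluster report later in this tutorial, and the loopback check reflects a commonly reported pitfall (a 127.0.1.1 line for the machine's own name) rather than anything stated in the steps above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostsCheck {

    /** Parses hosts-file text into hostname -> IP; comments and blank lines are ignored. */
    public static Map<String, String> parse(String hostsText) {
        Map<String, String> byName = new LinkedHashMap<String, String>();
        for (String raw : hostsText.split("\\r?\\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.length() == 0) {
                continue;
            }
            String[] fields = line.split("\\s+");
            for (int i = 1; i < fields.length; i++) {
                if (!byName.containsKey(fields[i])) {
                    byName.put(fields[i], fields[0]); // keep the first address listed for a name
                }
            }
        }
        return byName;
    }

    /** Lists required names that are missing or mapped to a loopback (127.x) address. */
    public static List<String> problems(Map<String, String> byName, String... required) {
        List<String> result = new ArrayList<String>();
        for (String name : required) {
            String ip = byName.get(name);
            if (ip == null) {
                result.add(name + ": no entry");
            } else if (ip.startsWith("127.")) {
                result.add(name + ": mapped to loopback " + ip);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text only; on a real node you would read /etc/hosts instead.
        String sample = "127.0.0.1 localhost\n"
                + "192.168.159.131 Ubuntu1\n"
                + "192.168.159.132 Ubuntu2\n";
        Map<String, String> hosts = parse(sample);
        System.out.println(hosts);
        System.out.println("problems: " + problems(hosts, "Ubuntu1", "Ubuntu2"));
    }
}
```

An empty problem list means both node names resolve to non-loopback addresses, which is what the ping test above checks from the network side.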

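    6. Passwordless SSH

    Install the SSH server (the SSH client is installed by default): sudo apt-get install openssh-server

    Generate a public/private key pair on Ubuntu1: ssh-keygen -t rsa -P ""

    Check that the directory /home/hduser/.ssh now contains id_rsa and id_rsa.pub.

    Append the public key to authorized_keys: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    Test passwordless login to the local machine: ssh localhost

    For passwordless login to Ubuntu2, run on Ubuntu1: ssh-copy-id Ubuntu2, then check that /home/hduser/.ssh on Ubuntu2 contains authorized_keys.

    On Ubuntu1, run: ssh Ubuntu2. A password is needed only the first time, when the key is copied; after that, logins need no password.

    To let Ubuntu2 log in to Ubuntu1 without a password, repeat the same steps on Ubuntu2.

    Note: if passwordless login does not work, the cause is very likely the permissions on the directory or files. sshd ignores keys whose directory or files are too open, so tighten the permissions rather than loosening them: chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys.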

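    7. Java Environment

    Make /opt writable: sudo chmod 777 /opt

    Copy the JDK installer into /opt/ and run it from that directory with root privileges: sudo ./jdk-6u45-linux-i586.bin (if the file is not executable yet, run chmod +x jdk-6u45-linux-i586.bin first).

    Set the JDK environment variables: run sudo gedit /etc/profile, then paste in the following and save.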

       # java

       export JAVA_HOME=/opt/jdk1.6.0_45

       export JRE_HOME=$JAVA_HOME/jre

       export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH

       export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

      

    Apply the configuration: source /etc/profile

    Run java -version. If the Java version is printed, the installation succeeded.
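Besides java -version, a tiny program can confirm that the JVM actually in use is the one JAVA_HOME points to. This is a minimal sketch under that assumption; the class name JavaEnvCheck and its methods are hypothetical, and it relies only on the standard java.version and java.home system properties and the JAVA_HOME environment variable.

```java
import java.io.File;

public class JavaEnvCheck {

    /** Describes the running JVM and the JAVA_HOME value it was started with. */
    public static String describe() {
        String javaHomeEnv = System.getenv("JAVA_HOME"); // null when the variable is unset
        return "java.version=" + System.getProperty("java.version")
                + "\njava.home=" + System.getProperty("java.home")
                + "\nJAVA_HOME=" + (javaHomeEnv == null ? "(not set)" : javaHomeEnv);
    }

    /** True when runtimeHome is javaHome itself or a directory beneath it (such as its jre folder). */
    public static boolean runsFrom(String javaHome, String runtimeHome) {
        if (javaHome == null || runtimeHome == null) {
            return false;
        }
        String base = new File(javaHome).getAbsolutePath();
        String actual = new File(runtimeHome).getAbsolutePath();
        return actual.equals(base) || actual.startsWith(base + File.separator);
    }

    public static void main(String[] args) {
        System.out.println(describe());
        boolean ok = runsFrom(System.getenv("JAVA_HOME"), System.getProperty("java.home"));
        System.out.println(ok ? "JVM matches JAVA_HOME" : "JVM does not come from JAVA_HOME");
    }
}
```

Compile and run it with javac JavaEnvCheck.java and java JavaEnvCheck. A mismatch usually means another Java installation comes earlier on the PATH.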

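    8. Installing the Hadoop Cluster

    8.1 Installation

    Put the Hadoop archive hadoop-2.6.0.tar.gz in /home/hduser, extract it there, and rename the extracted directory to hadoop. Then set the Hadoop environment variables: run sudo gedit /etc/profile and paste the following into the file: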

        #hadoop

    export HADOOP_HOME=/home/hduser/hadoop   

    export PATH=$HADOOP_HOME/bin:$PATH

    Then run: source /etc/profile

    Note: carry out the steps above on both Ubuntu1 and Ubuntu2.

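    8.2 Configuration

    Seven configuration files are involved, all in /home/hduser/hadoop/etc/hadoop. They can be edited with gedit.

    1) Go to the Hadoop configuration directory:

    cd  /home/hduser/hadoop/etc/hadoop/

    2) Edit hadoop-env.sh and set JAVA_HOME:

    gedit hadoop-env.sh

    Add the following: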

    # The java implementation to use.

    export JAVA_HOME=/opt/jdk1.6.0_45

    3) Edit yarn-env.sh and set JAVA_HOME:

    Add the following:

    # some Java parameters

    export JAVA_HOME=/opt/jdk1.6.0_45

    4) Edit the slaves file and list the slave nodes

    (delete the original localhost entry).

    Add the following:

    Ubuntu1

    Ubuntu2

    5) Edit core-site.xml to add the core Hadoop settings

    (the HDFS port is 9000; the temporary directory is file:/home/hduser/hadoop/tmp).

    Add the following:

    <configuration>
     <property>
      <name>fs.defaultFS</name>
      <value>hdfs://Ubuntu1:9000</value>
     </property>

     <property>
      <name>io.file.buffer.size</name>
      <value>131072</value>
     </property>

     <property>
      <name>hadoop.tmp.dir</name>
      <value>file:/home/hduser/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
     </property>

     <property>
      <name>hadoop.native.lib</name>
      <value>true</value>
      <description>Should native hadoop libraries, if present, be used.</description>
     </property>
    </configuration>

    6) Edit hdfs-site.xml to add the HDFS settings

    (NameNode and DataNode ports and directory locations):

    <configuration>
     <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>Ubuntu1:9001</value>
     </property>

      <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:/home/hduser/hadoop/dfs/name</value>
     </property>

     <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/hduser/hadoop/dfs/data</value>
      </property>

     <property>
      <name>dfs.replication</name>
      <value>2</value>
     </property>

     <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
     </property>
    </configuration>

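    7) Edit mapred-site.xml to add the MapReduce settings

    (use the YARN framework; JobHistory server address and web address). If mapred-site.xml does not exist yet, create it by copying mapred-site.xml.template from the same directory.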

    <configuration>
      <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>
     <property>
      <name>mapreduce.jobhistory.address</name>
      <value>Ubuntu1:10020</value>
     </property>
     <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>Ubuntu1:19888</value>
     </property>
    </configuration>

    8) Edit yarn-site.xml to enable YARN:

    <configuration>
      <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
      </property>
      <property>
       <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
       <name>yarn.resourcemanager.address</name>
       <value>Ubuntu1:8032</value>
      </property>
      <property>
       <name>yarn.resourcemanager.scheduler.address</name>
       <value>Ubuntu1:8030</value>
      </property>
      <property>
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>Ubuntu1:8035</value>
      </property>
      <property>
       <name>yarn.resourcemanager.admin.address</name>
       <value>Ubuntu1:8033</value>
      </property>
      <property>
       <name>yarn.resourcemanager.webapp.address</name>
       <value>Ubuntu1:8088</value>
      </property>

    </configuration>
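A stray space inside a <value> element is easy to introduce when pasting these files and hard to spot by eye. The following minimal sketch reads a Hadoop-style <configuration> document with the JDK's built-in XML parser and flags values that have leading or trailing whitespace. The class name SiteXmlCheck and its methods are hypothetical, and the code does not use Hadoop's own Configuration class, so it only approximates how Hadoop reads these files.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiteXmlCheck {

    /** Reads each <property> of a <configuration> document into name -> raw value. */
    public static Map<String, String> readProperties(String xml) {
        Map<String, String> props = new LinkedHashMap<String, String>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null && value != null) {
                    props.put(name.trim(), value); // the value is kept untrimmed on purpose
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed configuration document", e);
        }
        return props;
    }

    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent();
    }

    /** True when the value starts or ends with whitespace. */
    public static boolean hasStrayWhitespace(String value) {
        return !value.equals(value.trim());
    }

    public static void main(String[] args) {
        // Inline sample; on a real node you would load the text of hdfs-site.xml etc.
        String xml = "<configuration>"
                + "<property><name>dfs.replication</name><value>2</value></property>"
                + "<property><name>dfs.datanode.data.dir</name>"
                + "<value> file:/home/hduser/hadoop/dfs/data</value></property>"
                + "</configuration>";
        for (Map.Entry<String, String> e : readProperties(xml).entrySet()) {
            String flag = hasStrayWhitespace(e.getValue()) ? "  <-- stray whitespace" : "";
            System.out.println(e.getKey() + " = [" + e.getValue() + "]" + flag);
        }
    }
}
```

Running the same check over all four *-site.xml files before copying them to Ubuntu2 catches this class of typo once, instead of on every node.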

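    9) Copy the configured directory /home/hduser/hadoop/etc/hadoop from Ubuntu1 to the same location on Ubuntu2 (delete the existing /home/hduser/hadoop/etc/hadoop directory on Ubuntu2 first):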

    scp -r /home/hduser/hadoop/etc/hadoop/ hduser@Ubuntu2:/home/hduser/hadoop/etc/

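    8.3 Verification

    Verify that the Hadoop configuration is correct as follows.

    1) Format the NameNode (only Ubuntu1 runs the NameNode, so formatting there is the step that matters):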

    hduser@Ubuntu1:~$ cd hadoop

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs namenode -format

    hduser@Ubuntu2:~$ cd hadoop

    hduser@Ubuntu2:~/hadoop$ ./bin/hdfs namenode -format

    2) Start HDFS:

    hduser@Ubuntu1:~/hadoop$ ./sbin/start-dfs.sh

    15/04/27 04:18:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    Starting namenodes on [Ubuntu1]

    Ubuntu1: starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-Ubuntu1.out

    Ubuntu1: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu1.out

    Ubuntu2: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu2.out

    Starting secondary namenodes [Ubuntu1]

    Ubuntu1: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-Ubuntu1.out

    15/04/27 04:19:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    Check the Java processes with jps (the Java Virtual Machine Process Status Tool):

    hduser@Ubuntu1:~/hadoop$ jps

    8008 NameNode

    8443 Jps

    8158 DataNode

    8314 SecondaryNameNode

    3) Stop HDFS:

    hduser@Ubuntu1:~/hadoop$ ./sbin/stop-dfs.sh

    Stopping namenodes on [Ubuntu1]

    Ubuntu1: stopping namenode

    Ubuntu1: stopping datanode

    Ubuntu2: stopping datanode

    Stopping secondary namenodes [Ubuntu1]

    Ubuntu1: stopping secondarynamenode

    Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    8850 Jps

    4) Start YARN:

    hduser@Ubuntu1:~/hadoop$ ./sbin/start-yarn.sh

    starting yarn daemons

    starting resourcemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-resourcemanager-Ubuntu1.out

    Ubuntu2: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu2.out

    Ubuntu1: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu1.out

    Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    8911 ResourceManager

    9247 Jps

    9034 NodeManager

    5) Stop YARN:

    hduser@Ubuntu1:~/hadoop$  ./sbin/stop-yarn.sh

    stopping yarn daemons

    stopping resourcemanager

    Ubuntu1: stopping nodemanager

    Ubuntu2: stopping nodemanager

    no proxyserver to stop

    Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    9542 Jps

    6) Check the cluster status:

    First start the cluster: ./sbin/start-dfs.sh

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfsadmin -report

    Configured Capacity: 39891361792 (37.15 GB)

    Present Capacity: 28707627008 (26.74 GB)

    DFS Remaining: 28707569664 (26.74 GB)

    DFS Used: 57344 (56 KB)

    DFS Used%: 0.00%

    Under replicated blocks: 0

    Blocks with corrupt replicas: 0

    Missing blocks: 0

    -------------------------------------------------

    Live datanodes (2):

    Name: 192.168.159.132:50010 (Ubuntu2)

    Hostname: Ubuntu2

    Decommission Status : Normal

    Configured Capacity: 19945680896 (18.58 GB)

    DFS Used: 28672 (28 KB)

    Non DFS Used: 5575745536 (5.19 GB)

    DFS Remaining: 14369906688 (13.38 GB)

    DFS Used%: 0.00%

    DFS Remaining%: 72.05%

    Configured Cache Capacity: 0 (0 B)

    Cache Used: 0 (0 B)

    Cache Remaining: 0 (0 B)

    Cache Used%: 100.00%

    Cache Remaining%: 0.00%

    Xceivers: 1

    Last contact: Mon Apr 27 04:26:09 PDT 2015

    Name: 192.168.159.131:50010 (Ubuntu1)

    Hostname: Ubuntu1

    Decommission Status : Normal

    Configured Capacity: 19945680896 (18.58 GB)

    DFS Used: 28672 (28 KB)

    Non DFS Used: 5607989248 (5.22 GB)

    DFS Remaining: 14337662976 (13.35 GB)

    DFS Used%: 0.00%

    DFS Remaining%: 71.88%

    Configured Cache Capacity: 0 (0 B)

    Cache Used: 0 (0 B)

    Cache Remaining: 0 (0 B)

    Cache Used%: 100.00%

    Cache Remaining%: 0.00%

    Xceivers: 1

    Last contact: Mon Apr 27 04:26:08 PDT 2015
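The figures in this report are related by simple arithmetic: for each DataNode, Configured Capacity = DFS Used + Non DFS Used + DFS Remaining, and DFS Remaining% is DFS Remaining divided by Configured Capacity. The sketch below reproduces those relations from the numbers printed above. The class name DfsReportMath and its methods are hypothetical, and the formulas are inferred from the report itself rather than taken from Hadoop's source.

```java
import java.util.Locale;

public class DfsReportMath {

    /** Non DFS Used = Configured Capacity - DFS Used - DFS Remaining (all in bytes). */
    public static long nonDfsUsed(long configured, long dfsUsed, long remaining) {
        return configured - dfsUsed - remaining;
    }

    /** Formats part/whole as a percentage with two decimals, e.g. "12.34%". */
    public static String percent(long part, long whole) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * part / whole);
    }

    /** Formats a byte count the way the report does, in units of 2^30 bytes. */
    public static String gb(long bytes) {
        return String.format(Locale.ROOT, "%.2f GB", bytes / (1024.0 * 1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        // Values copied from the Ubuntu2 entry of the report above.
        long configured = 19945680896L;
        long dfsUsed = 28672L;
        long remaining = 14369906688L;

        System.out.println("Non DFS Used:   " + nonDfsUsed(configured, dfsUsed, remaining));
        System.out.println("DFS Remaining%: " + percent(remaining, configured));
        // The cluster-wide capacity is the sum over both DataNodes.
        System.out.println("Cluster total:  " + gb(2 * configured));
    }
}
```

For Ubuntu2 this yields 5575745536 bytes of non-DFS usage and 72.05% remaining, matching the report, and the two DataNodes together account for the 37.15 GB of configured capacity in the summary.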

    7) Browse HDFS in a web browser: http://Ubuntu1:50070/

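    III. Running the WordCount Program

    1) Create a directory named file:

    hduser@Ubuntu1:~/hadoop$ mkdir file

    2) In file, create file1.txt and file2.txt and type in their contents (using the graphical interface).

    Enter the following contents:

    file1.txt: Hello world hi HADOOP

    file2.txt: Hello hadoop hi CHINA

    Check the files after creating them: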

    hduser@Ubuntu1:~/hadoop$ cat file/file1.txt

    Hello world hi HADOOP

    hduser@Ubuntu1:~/hadoop$ cat file/file2.txt

    Hello hadoop hi CHINA

    3) Create the /input2 directory in HDFS:

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -mkdir /input2

    4) Copy file1.txt and file2.txt to the /input2 directory in HDFS:

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -put file/file*.txt /input2

    5) Check that file1.txt and file2.txt are now in HDFS:

    hduser@Ubuntu1:~/hadoop$ bin/hadoop fs -ls /input2/

    Found 2 items

    -rw-r--r--   2 hduser supergroup         21 2015-04-27 05:54 /input2/file1.txt

    -rw-r--r--   2 hduser supergroup         24 2015-04-27 05:54 /input2/file2.txt

    6) Run the WordCount program:

    Start HDFS and YARN first (./sbin/start-dfs.sh and ./sbin/start-yarn.sh).

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input2/ /output2/wordcount1

    15/04/27 05:57:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    15/04/27 05:57:17 INFO client.RMProxy: Connecting to ResourceManager at Ubuntu1/192.168.159.131:8032

    15/04/27 05:57:19 INFO input.FileInputFormat: Total input paths to process : 2

    15/04/27 05:57:19 INFO mapreduce.JobSubmitter: number of splits:2

    15/04/27 05:57:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1430138907536_0001

    15/04/27 05:57:20 INFO impl.YarnClientImpl: Submitted application application_1430138907536_0001

    15/04/27 05:57:20 INFO mapreduce.Job: The url to track the job: http://Ubuntu1:8088/proxy/application_1430138907536_0001/

    15/04/27 05:57:20 INFO mapreduce.Job: Running job: job_1430138907536_0001

    15/04/27 05:57:32 INFO mapreduce.Job: Job job_1430138907536_0001 running in uber mode : false

    15/04/27 05:57:32 INFO mapreduce.Job:  map 0% reduce 0%

    15/04/27 05:57:43 INFO mapreduce.Job:  map 100% reduce 0%

    15/04/27 05:57:58 INFO mapreduce.Job:  map 100% reduce 100%

    15/04/27 05:57:59 INFO mapreduce.Job: Job job_1430138907536_0001 completed successfully

    15/04/27 05:57:59 INFO mapreduce.Job: Counters: 49

           File System Counters

                  FILE: Number of bytes read=84

                  FILE: Number of bytes written=317849

                  FILE: Number of read operations=0

                  FILE: Number of large read operations=0

                  FILE: Number of write operations=0

                  HDFS: Number of bytes read=247

                  HDFS: Number of bytes written=37

                  HDFS: Number of read operations=9

                  HDFS: Number of large read operations=0

                  HDFS: Number of write operations=2

           Job Counters

                  Launched map tasks=2

                  Launched reduce tasks=1

                  Data-local map tasks=2

                  Total time spent by all maps in occupied slots (ms)=16813

                  Total time spent by all reduces in occupied slots (ms)=12443

                  Total time spent by all map tasks (ms)=16813

                  Total time spent by all reduce tasks (ms)=12443

                  Total vcore-seconds taken by all map tasks=16813

                  Total vcore-seconds taken by all reduce tasks=12443

                  Total megabyte-seconds taken by all map tasks=17216512

                  Total megabyte-seconds taken by all reduce tasks=12741632

           Map-Reduce Framework

                  Map input records=2

                  Map output records=8

                  Map output bytes=75

                  Map output materialized bytes=90

                  Input split bytes=202

                  Combine input records=8

                  Combine output records=7

                  Reduce input groups=5

                  Reduce shuffle bytes=90

                  Reduce input records=7

                  Reduce output records=5

                  Spilled Records=14

                  Shuffled Maps =2

                  Failed Shuffles=0

                  Merged Map outputs=2

                  GC time elapsed (ms)=622

                  CPU time spent (ms)=2000

                  Physical memory (bytes) snapshot=390164480

                  Virtual memory (bytes) snapshot=1179254784

                  Total committed heap usage (bytes)=257892352

           Shuffle Errors

                  BAD_ID=0

                  CONNECTION=0

                  IO_ERROR=0

                  WRONG_LENGTH=0

                  WRONG_MAP=0

                  WRONG_REDUCE=0

           File Input Format Counters

                  Bytes Read=45

           File Output Format Counters

                  Bytes Written=37

    7) View the result:

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfs -cat /output2/wordcount1/*

    CHINA   1

    Hello      2

    hadoop    2

    hi         2

    world      1

    --------------------------------

    If you see the result above, you have installed Hadoop successfully!
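To see why the job produces these counts, the map, shuffle and reduce phases can be imitated in plain Java without a cluster. This is a minimal sketch, not the Hadoop example itself; the class name LocalWordCount is hypothetical. Tokens are compared case-sensitively, so the text listed in step 2 (HADOOP in one file, hadoop in the other) would be counted as two different words. The sketch assumes both files spell hadoop in lowercase, which is what the result above implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    /** Map phase: split one line into words; each word stands for a (word, 1) pair. */
    public static List<String> map(String line) {
        List<String> words = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }

    /** Shuffle and reduce: group equal words in sorted key order and sum their ones. */
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String word : map(line)) {
                Integer old = totals.get(word);
                totals.put(word, old == null ? 1 : old + 1);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed input: lowercase "hadoop" in both lines (see the note above).
        List<String> lines = Arrays.asList("Hello world hi hadoop", "Hello hadoop hi CHINA");
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The TreeMap stands in for the sort that happens during the shuffle, which is why uppercase words come before lowercase ones, just as in the job output.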

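    Setting Up the Eclipse Development Environment

    1) Download Eclipse.

    2) Get the Hadoop plugin. The most reliable approach is to download the source from

    https://github.com/winghc/hadoop2x-eclipse-plugin

    and build it yourself. A prebuilt plugin can also be used.

    3) Copy the built jar into the Eclipse plugins directory and restart Eclipse.

    4) Set the Hadoop installation directory:

    Window -> Preferences -> Hadoop Map/Reduce -> Hadoop installation directory

    5) Set up the Map/Reduce views:

    Window -> Open Perspective -> Other -> Map/Reduce -> click "OK"

    Window -> Show View -> Other -> Map/Reduce Locations -> click "OK"

    6) On the "Map/Reduce Locations" tab, click the elephant icon with the plus sign, or right-click in the empty area and choose "New Hadoop location…". In the "New hadoop location…" dialog that opens, fill in the settings.

    The MR Master and DFS Master settings must agree with the configuration files such as mapred-site.xml and core-site.xml.

    7) Open Project Explorer to browse the HDFS file system.

    8) Create a Map/Reduce project.

    The Hadoop services must be started first.

    File -> New -> Project -> Map/Reduce Project -> Next

    Write the WordCount class: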

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // key and value are the input key/value pair (value is one line of text);
        // context collects the output key/value pairs.
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
          }
        }
      }

      public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // reduce mirrors map, except that the values arrive as an Iterable:
        // the input is one key together with all the values emitted for it.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result); // e.g. (Hello, 2)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // job name
        job.setJarByClass(WordCount.class);            // class used to locate the job jar
        job.setMapperClass(TokenizerMapper.class);     // map function
        job.setCombinerClass(IntSumReducer.class);     // combiner: pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);      // reduce function
        job.setOutputKeyClass(Text.class);             // output key type
        job.setOutputValueClass(IntWritable.class);    // output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }


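    Inverted Index of Music Play Records

    MapReduce Program Development

    1. The task:

    We have a list of music play records. Each record names a user and a song that user played:

    tom          LittleApple
    jack         YesterdayOnceMore
    Rose         MyHeartWillGoOn
    jack         LittleApple
    John         MyHeartWillGoOn
    kissinger    LittleApple
    kissinger    YesterdayOnceMore

    2. The expected output:

    A text file containing the inverted index, as follows:

    LittleApple          tom|jack|kissinger
    YesterdayOnceMore    jack|kissinger
    MyHeartWillGoOn      Rose|John

    3. The algorithm:

    Split the source file into lines. In the mapper, use the song name (LittleApple) as the key and the user name (tom) as the value. In the reducer, collect all the values that share the same song name and output them as the inverted index.

    For example, for the input records

    tom          LittleApple
    jack         YesterdayOnceMore
    Rose         MyHeartWillGoOn

    the map function emits these <key,value> pairs:

    <LittleApple, tom>
    <YesterdayOnceMore, jack>
    <MyHeartWillGoOn, Rose>

    The reduce function then groups the users by song. For LittleApple, over the full input, it receives:

    LittleApple    tom
                   jack
                   kissinger

    The final result written to HDFS is:

    LittleApple          tom|jack|kissinger
    YesterdayOnceMore    jack|kissinger
    MyHeartWillGoOn      Rose|John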
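The algorithm can be tried out in plain Java before writing the MapReduce version. This is a minimal sketch under stated assumptions: the class name LocalInvertedIndex is hypothetical, records are split on runs of whitespace, users are joined with "|" in the order they are read, and songs come out in sorted key order. On a real cluster the order of the values within one key is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalInvertedIndex {

    /** Builds song -> "user1|user2|..." from "user song" records. */
    public static Map<String, String> build(List<String> records) {
        // Map phase plus shuffle: emit (song, user) and group the users by song.
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String record : records) {
            String[] fields = record.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // malformed record: skip it
            }
            List<String> users = grouped.get(fields[1]);
            if (users == null) {
                users = new ArrayList<String>();
                grouped.put(fields[1], users);
            }
            users.add(fields[0]);
        }
        // Reduce phase: join the users of each song with "|".
        Map<String, String> index = new TreeMap<String, String>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            StringBuilder joined = new StringBuilder();
            for (String user : e.getValue()) {
                if (joined.length() > 0) {
                    joined.append("|");
                }
                joined.append(user);
            }
            index.put(e.getKey(), joined.toString());
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "tom        LittleApple",
                "jack       YesterdayOnceMore",
                "Rose       MyHeartWillGoOn",
                "jack       LittleApple",
                "John       MyHeartWillGoOn",
                "kissinger  LittleApple",
                "kissinger  YesterdayOnceMore");
        for (Map.Entry<String, String> e : build(records).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Splitting on a run of whitespace matters here because the sample records separate the two columns with several spaces; splitting on a single space would yield empty fields.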

    4. Annotated source of the inverted index program:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Test_1 extends Configured implements Tool
    {
      enum Counter
      {
        LINESKIP, // lines skipped because they are malformed
      }

      public static class Map extends Mapper<LongWritable,Text,Text,Text>
      {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
          String line = value.toString(); // one input record as a string

          // Split on runs of whitespace, e.g. "tom   LittleApple" -> ["tom", "LittleApple"]
          String[] lineSplit = line.trim().split("\\s+");
          if (lineSplit.length < 2)
          {
            // Malformed line: count it and skip it
            context.getCounter(Counter.LINESKIP).increment(1);
            return;
          }

          String anum = lineSplit[0]; // user name, e.g. tom
          String bnum = lineSplit[1]; // song name, e.g. LittleApple
          context.write(new Text(bnum), new Text(anum)); // emit <LittleApple, tom>
        }
      }

      public static class Reduce extends Reducer<Text,Text,Text,Text>
      {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
        {
          // Join all users of the same song with "|", without a trailing separator
          StringBuilder out = new StringBuilder();
          for (Text value : values)
          {
            if (out.length() > 0)
            {
              out.append("|");
            }
            out.append(value.toString());
          }
          context.write(key, new Text(out.toString()));
        }
      }

      @Override
      public int run(String[] args) throws Exception
      {
        Configuration conf = this.getConf();

        Job job = Job.getInstance(conf, "Test_1"); // job name
        job.setJarByClass(Test_1.class); // class used to locate the job jar
        FileInputFormat.addInputPath(job, new Path(args[0])); // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        job.setMapperClass(Map.class); // use the Map class above for the map task
        job.setReducerClass(Reduce.class); // use the Reduce class above for the reduce task
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class); // output key type
        job.setOutputValueClass(Text.class); // output value type

        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception
      {
        // Run the job
        int res = ToolRunner.run(new Configuration(), new Test_1(), args);
        System.exit(res);
      }
    }

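    5. Remember to set the input and output paths:

    The program can be run directly from Eclipse, or packaged as a jar and run from the command line.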

  • Original article: https://www.cnblogs.com/zeussbook/p/8683149.html