Hadoop Fully Distributed Installation Tutorial

     

     

I. Software Versions

Hadoop version: hadoop-2.6.0.tar.gz

VMware version: VMware-workstation-full-11.0.0-2305329

Ubuntu version: ubuntu-14.04.1-desktop-i386 (other versions also work)

JDK version: jdk-6u45-linux-i586.bin

The last three are not strict version requirements; however, if you use HBase 1.0.0, JDK 1.8 or above is needed.

II. Installation Tutorial

1. Installing VMware

VMware Workstation is an application that lets you create virtual machines, install an operating system inside each virtual machine, and then install software on that virtual system; everything works just as it would on a real computer.

Download the software directly from the official VMware website:

http://www.vmware.com/cn/products/workstation/workstation-evaluation

If the link above stops working because the official site has changed, search for "VMware" in a search engine to find the download page; downloading from unofficial sites is not recommended.

The trial version can be used for 30 days after installation.

2. Installing Ubuntu

Open VMware and click "Create a New Virtual Machine".

Choose "Typical".

Click "Browse".

Select the Ubuntu ISO.

For now, create only two virtual machines and name them Ubuntu1 and Ubuntu2. You may use other names, but many of the later configuration files would then have to be changed accordingly, which causes extra work.

Remember the password you set; it will be needed frequently later.

3. Installing VMware Tools

Ubuntu will show that a CD has been inserted into the virtual CD drive.

Double-click the CD and copy VMwareTools-9.6.1-1378637.tar.gz from it to the desktop; copying works much like it does on Windows.

Right-click the archive and choose "Extract Here".

Open a terminal from the menu and run:

cd Desktop/vmware-tools-distrib/

sudo ./vmware-install.pl

Enter your password when prompted, press Enter through all the prompts, and reboot the system.

Note: after Ubuntu is installed, the root account is locked by default; you cannot log in as root or "su" to root.

Allowing "su" to root is very simple; the method is shown below.
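
A minimal sketch of the usual way to unlock root on Ubuntu (the original text omits the exact commands, so treat this as an assumption): give root a password, after which su works:

sudo passwd root    # set a password for the root account
su                  # switch to root with that password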

Note: after installing Ubuntu, update the package lists so that installing software later is easier:

cd /etc/apt

sudo apt-get update

    4、用户创建

    创建hadoop用户组: sudo addgroup hadoop 

       创建hduser用户:sudo adduser -ingroup hadoop hduser

       注意这里为hduser用户设置同主用户相同的密码

       为hadoop用户添加权限:sudo gedit /etc/sudoers,在root ALL=(ALL) ALL下添加

    hduser ALL=(ALL) ALL。

    设置好后重启机器:sudo reboot

    切换到hduser用户登录;
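
A consolidated sketch of the steps above (assuming the hduser/hadoop names used throughout this tutorial; visudo is a safer way to edit /etc/sudoers than gedit):

sudo addgroup hadoop                    # create the hadoop group
sudo adduser --ingroup hadoop hduser    # create hduser and put it in that group
sudo visudo                             # then add the line below under "root ALL=(ALL) ALL"
                                        #   hduser ALL=(ALL) ALL
sudo reboot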

5. Host Configuration

The Hadoop cluster uses two virtual machines covering one master and two slave roles: Ubuntu1 acts as both master and slave, while Ubuntu2 is a slave only.

Configure the hostname: on Ubuntu, edit the machine name with sudo gedit /etc/hostname and change it to Ubuntu1. After rebooting, run the hostname command to check that the new name took effect.

At this point you can clone the virtual machine to create the second one (shut it down first, then use the VMware menu: VM -> Manage -> Clone).

Note: change the cloned machine's hostname to Ubuntu2.

      

Configure the hosts file: check the IP addresses of Ubuntu1 and Ubuntu2 with ifconfig.

Open the hosts file with sudo gedit /etc/hosts and add the following:

192.168.xxx.xxx Ubuntu1

192.168.xxx.xxx Ubuntu2

Note: replace these IP addresses with the actual addresses of your own machines.

On Ubuntu1, run ping Ubuntu2; if the ping succeeds, the configuration is correct.

6. Passwordless SSH Configuration

Install the SSH server (the client is installed by default): sudo apt-get install openssh-server

On Ubuntu1, generate a key pair: ssh-keygen -t rsa -P ""

Check that /home/hduser/.ssh now contains id_rsa and id_rsa.pub.

Append the public key to authorized_keys: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test passwordless login locally: ssh localhost

To log in to Ubuntu2 without a password, run on Ubuntu1: ssh-copy-id Ubuntu2, then check that /home/hduser/.ssh on Ubuntu2 contains authorized_keys.

On Ubuntu1, run ssh Ubuntu2; the first login asks for a password, but subsequent logins should not.

If you also want Ubuntu2 to log in to Ubuntu1 without a password, repeat the same steps on Ubuntu2.

Note: if passwordless login does not work, the most likely cause is wrong permissions on the .ssh directory or its files. Fix the permissions: ~/.ssh should be 700 and ~/.ssh/authorized_keys 600 (avoid chmod 777, which sshd rejects for key authentication).
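
A consolidated sketch of the passwordless-SSH setup on Ubuntu1 (assuming the hduser account and hostnames used above):

ssh-keygen -t rsa -P ""                                 # accept the default key location
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys         # allow key login to this machine
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys    # permissions sshd will accept
ssh localhost                                           # should not ask for a password
ssh-copy-id hduser@Ubuntu2                              # push the key to Ubuntu2 (asks once)
ssh Ubuntu2                                             # should now log in without a password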

7. Configuring the Java Environment

Make the /opt folder writable: sudo chmod 777 /opt

Put the JDK package in /opt and run the installer there: sudo ./jdk-6u45-linux-i586.bin

Configure the JDK environment variables: run sudo gedit /etc/profile, then append the following and save:

       # java

       export JAVA_HOME=/opt/jdk1.6.0_45

       export JRE_HOME=$JAVA_HOME/jre

       export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH

       export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

      

Apply the configuration: source /etc/profile

Run java -version; if the Java version number is printed, the installation succeeded.
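
A consolidated sketch of the JDK installation; the chmod on the installer is implied above but worth making explicit:

sudo chmod 777 /opt                       # make /opt writable for this tutorial
cd /opt
sudo chmod u+x jdk-6u45-linux-i586.bin    # the .bin installer must be executable
sudo ./jdk-6u45-linux-i586.bin            # unpacks to /opt/jdk1.6.0_45
sudo gedit /etc/profile                   # append the JAVA_HOME/JRE_HOME/PATH lines above
source /etc/profile
java -version                             # should print the 1.6.0_45 version string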

8. Installing the Hadoop Cluster

8.1 Installation

Put hadoop-2.6.0.tar.gz in /home/hduser, extract it there, and rename the extracted directory to hadoop. Then configure the Hadoop environment variables: run sudo gedit /etc/profile and add the following:

        #hadoop

    export HADOOP_HOME=/home/hduser/hadoop   

    export PATH=$HADOOP_HOME/bin:$PATH

Apply it: source /etc/profile

Note: perform the steps above on both Ubuntu1 and Ubuntu2.
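
A consolidated sketch of the extraction and environment setup (assuming the archive has been placed in /home/hduser):

cd /home/hduser
tar -zxvf hadoop-2.6.0.tar.gz    # extract the distribution
mv hadoop-2.6.0 hadoop           # rename to the path used in this tutorial
sudo gedit /etc/profile          # append the HADOOP_HOME/PATH lines above
source /etc/profile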

8.2 Configuration

Seven configuration files are involved. They are all under /home/hduser/hadoop/etc/hadoop and can be edited with gedit.

1) Go to the Hadoop configuration directory:

cd /home/hduser/hadoop/etc/hadoop/



2) Configure hadoop-env.sh (set JAVA_HOME):

gedit hadoop-env.sh

Add (or edit) the following:

    # The java implementation to use.

    export JAVA_HOME=/opt/jdk1.6.0_45

3) Configure yarn-env.sh (set JAVA_HOME):

Add (or edit) the following:

    # some Java parameters

    export JAVA_HOME=/opt/jdk1.6.0_45

4) Configure the slaves file (add the slave nodes; delete the existing localhost entry).

Add the following:

    Ubuntu1

    Ubuntu2

5) Configure core-site.xml (core Hadoop settings: the HDFS port is 9000 and the temporary directory is file:/home/hduser/hadoop/tmp).

Add the following:

    <configuration>
     <property>
      <name>fs.defaultFS</name>
      <value>hdfs://Ubuntu1:9000</value>
     </property>

     <property>
      <name>io.file.buffer.size</name>
      <value>131072</value>
     </property>
     <property>
      <name>hadoop.tmp.dir</name>
      <value>file:/home/hduser/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
     </property>

 <property>
  <name>hadoop.native.lib</name>
  <value>true</value>
  <description>Should native hadoop libraries, if present, be used.</description>
 </property>

    </configuration>

6) Configure hdfs-site.xml (HDFS settings: the secondary namenode address, the namenode and datanode directories, and the replication factor).

Add the following:

    <configuration>
     <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>Ubuntu1:9001</value>
     </property>

      <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:/home/hduser/hadoop/dfs/name</value>
     </property>

     <property>
      <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/hadoop/dfs/data</value>
      </property>

     <property>
      <name>dfs.replication</name>
      <value>2</value>
     </property>

     <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
     </property>
    </configuration>

7) Configure mapred-site.xml (use the YARN framework; set the jobhistory address and its web address). In Hadoop 2.6.0 only mapred-site.xml.template exists by default, so copy it to mapred-site.xml first.

Add the following:

    <configuration>
      <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>
     <property>
      <name>mapreduce.jobhistory.address</name>
      <value>Ubuntu1:10020</value>
     </property>
     <property>
      <name>mapreduce.jobhistory.webapp.address</name>
  <value>Ubuntu1:19888</value>
     </property>
    </configuration>

8) Configure yarn-site.xml (YARN settings). Add the following:

    <configuration>
      <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
      </property>
      <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
       <name>yarn.resourcemanager.address</name>
       <value>Ubuntu1:8032</value>
      </property>
      <property>
       <name>yarn.resourcemanager.scheduler.address</name>
       <value>Ubuntu1:8030</value>
      </property>
      <property>
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>Ubuntu1:8035</value>
      </property>
      <property>
       <name>yarn.resourcemanager.admin.address</name>
       <value>Ubuntu1:8033</value>
      </property>
      <property>
       <name>yarn.resourcemanager.webapp.address</name>
       <value>Ubuntu1:8088</value>
      </property>

    </configuration>

9) Copy the configured hadoop/etc/hadoop folder from Ubuntu1 to the same location on Ubuntu2 (delete Ubuntu2's existing hadoop/etc/hadoop folder first):

    scp -r /home/hduser/hadoop/etc/hadoop/ hduser@Ubuntu2:/home/hduser/hadoop/etc/
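
If you later add more slaves, the same copy can be done in a loop; Ubuntu3 below is a hypothetical extra slave, shown only as an example:

for host in Ubuntu2 Ubuntu3; do
  scp -r /home/hduser/hadoop/etc/hadoop/ hduser@$host:/home/hduser/hadoop/etc/
done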

8.3 Verification

Now verify that the Hadoop configuration is correct.

1) Format the namenode:

    hduser@Ubuntu1:~$ cd hadoop

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs namenode -format

    hduser@Ubuntu2:~$ cd hadoop

hduser@Ubuntu2:~/hadoop$ ./bin/hdfs namenode -format

(Strictly speaking, only the namenode on Ubuntu1 needs to be formatted; running the command on Ubuntu2 as well is harmless here but not required.)

2) Start HDFS:

    hduser@Ubuntu1:~/hadoop$ ./sbin/start-dfs.sh

    15/04/27 04:18:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    Starting namenodes on [Ubuntu1]

    Ubuntu1: starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-Ubuntu1.out

    Ubuntu1: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu1.out

    Ubuntu2: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-Ubuntu2.out

    Starting secondary namenodes [Ubuntu1]

    Ubuntu1: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-Ubuntu1.out

    15/04/27 04:19:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Check the Java processes with jps (the Java Virtual Machine Process Status Tool):

    hduser@Ubuntu1:~/hadoop$ jps

    8008 NameNode

    8443 Jps

    8158 DataNode

    8314 SecondaryNameNode

3) Stop HDFS:

    hduser@Ubuntu1:~/hadoop$ ./sbin/stop-dfs.sh

    Stopping namenodes on [Ubuntu1]

    Ubuntu1: stopping namenode

    Ubuntu1: stopping datanode

    Ubuntu2: stopping datanode

    Stopping secondary namenodes [Ubuntu1]

    Ubuntu1: stopping secondarynamenode

Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    8850 Jps

4) Start YARN:

    hduser@Ubuntu1:~/hadoop$ ./sbin/start-yarn.sh

    starting yarn daemons

    starting resourcemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-resourcemanager-Ubuntu1.out

    Ubuntu2: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu2.out

    Ubuntu1: starting nodemanager, logging to /home/hduser/hadoop/logs/yarn-hduser-nodemanager-Ubuntu1.out

Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    8911 ResourceManager

    9247 Jps

    9034 NodeManager

5) Stop YARN:

    hduser@Ubuntu1:~/hadoop$  ./sbin/stop-yarn.sh

    stopping yarn daemons

    stopping resourcemanager

    Ubuntu1: stopping nodemanager

    Ubuntu2: stopping nodemanager

    no proxyserver to stop

Check the Java processes:

    hduser@Ubuntu1:~/hadoop$ jps

    9542 Jps

6) Check the cluster status:

First start the cluster: ./sbin/start-dfs.sh

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfsadmin -report

    Configured Capacity: 39891361792 (37.15 GB)

    Present Capacity: 28707627008 (26.74 GB)

    DFS Remaining: 28707569664 (26.74 GB)

    DFS Used: 57344 (56 KB)

    DFS Used%: 0.00%

    Under replicated blocks: 0

    Blocks with corrupt replicas: 0

    Missing blocks: 0

    -------------------------------------------------

    Live datanodes (2):

    Name: 192.168.159.132:50010 (Ubuntu2)

    Hostname: Ubuntu2

    Decommission Status : Normal

    Configured Capacity: 19945680896 (18.58 GB)

    DFS Used: 28672 (28 KB)

    Non DFS Used: 5575745536 (5.19 GB)

    DFS Remaining: 14369906688 (13.38 GB)

    DFS Used%: 0.00%

    DFS Remaining%: 72.05%

    Configured Cache Capacity: 0 (0 B)

    Cache Used: 0 (0 B)

    Cache Remaining: 0 (0 B)

    Cache Used%: 100.00%

    Cache Remaining%: 0.00%

    Xceivers: 1

    Last contact: Mon Apr 27 04:26:09 PDT 2015

    Name: 192.168.159.131:50010 (Ubuntu1)

    Hostname: Ubuntu1

    Decommission Status : Normal

    Configured Capacity: 19945680896 (18.58 GB)

    DFS Used: 28672 (28 KB)

    Non DFS Used: 5607989248 (5.22 GB)

    DFS Remaining: 14337662976 (13.35 GB)

    DFS Used%: 0.00%

    DFS Remaining%: 71.88%

    Configured Cache Capacity: 0 (0 B)

    Cache Used: 0 (0 B)

    Cache Remaining: 0 (0 B)

    Cache Used%: 100.00%

    Cache Remaining%: 0.00%

    Xceivers: 1

    Last contact: Mon Apr 27 04:26:08 PDT 2015

7) View the HDFS web UI at http://Ubuntu1:50070/

III. Running the WordCount Program

1) Create a file directory:

hduser@Ubuntu1:~/hadoop$ mkdir file

2) In the file directory, create file1.txt and file2.txt and add their contents (this can be done in the GUI):

Fill in the following:

file1.txt contents: Hello world hi HADOOP

file2.txt contents: Hello hadoop hi CHINA

After creating them, check the contents:

hduser@Ubuntu1:~/hadoop$ cat file/file1.txt

    Hello world hi HADOOP

hduser@Ubuntu1:~/hadoop$ cat file/file2.txt

    Hello hadoop hi CHINA
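
If you prefer to stay in the terminal, the two files can also be created with echo (same contents as above):

hduser@Ubuntu1:~/hadoop$ echo "Hello world hi HADOOP" > file/file1.txt
hduser@Ubuntu1:~/hadoop$ echo "Hello hadoop hi CHINA" > file/file2.txt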

3) Create the /input2 directory on HDFS:

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -mkdir /input2

4) Copy file1.txt and file2.txt to the /input2 directory on HDFS:

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop fs -put file/file*.txt /input2

5) Check that file1.txt and file2.txt are now on HDFS:

    hduser@Ubuntu1:~/hadoop$ bin/hadoop fs -ls /input2/

    Found 2 items

    -rw-r--r--   2 hduser supergroup         21 2015-04-27 05:54 /input2/file1.txt

    -rw-r--r--   2 hduser supergroup         24 2015-04-27 05:54 /input2/file2.txt

6) Run the WordCount example.

Start HDFS and YARN first, then run:

    hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input2/ /output2/wordcount1

    15/04/27 05:57:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    15/04/27 05:57:17 INFO client.RMProxy: Connecting to ResourceManager at Ubuntu1/192.168.159.131:8032

    15/04/27 05:57:19 INFO input.FileInputFormat: Total input paths to process : 2

    15/04/27 05:57:19 INFO mapreduce.JobSubmitter: number of splits:2

    15/04/27 05:57:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1430138907536_0001

    15/04/27 05:57:20 INFO impl.YarnClientImpl: Submitted application application_1430138907536_0001

    15/04/27 05:57:20 INFO mapreduce.Job: The url to track the job: http://Ubuntu1:8088/proxy/application_1430138907536_0001/

    15/04/27 05:57:20 INFO mapreduce.Job: Running job: job_1430138907536_0001

    15/04/27 05:57:32 INFO mapreduce.Job: Job job_1430138907536_0001 running in uber mode : false

    15/04/27 05:57:32 INFO mapreduce.Job:  map 0% reduce 0%

    15/04/27 05:57:43 INFO mapreduce.Job:  map 100% reduce 0%

    15/04/27 05:57:58 INFO mapreduce.Job:  map 100% reduce 100%

    15/04/27 05:57:59 INFO mapreduce.Job: Job job_1430138907536_0001 completed successfully

    15/04/27 05:57:59 INFO mapreduce.Job: Counters: 49

           File System Counters

                  FILE: Number of bytes read=84

                  FILE: Number of bytes written=317849

                  FILE: Number of read operations=0

                  FILE: Number of large read operations=0

                  FILE: Number of write operations=0

                  HDFS: Number of bytes read=247

                  HDFS: Number of bytes written=37

                  HDFS: Number of read operations=9

                  HDFS: Number of large read operations=0

                  HDFS: Number of write operations=2

           Job Counters

                  Launched map tasks=2

                  Launched reduce tasks=1

                  Data-local map tasks=2

                  Total time spent by all maps in occupied slots (ms)=16813

                  Total time spent by all reduces in occupied slots (ms)=12443

                  Total time spent by all map tasks (ms)=16813

                  Total time spent by all reduce tasks (ms)=12443

                  Total vcore-seconds taken by all map tasks=16813

                  Total vcore-seconds taken by all reduce tasks=12443

                  Total megabyte-seconds taken by all map tasks=17216512

                  Total megabyte-seconds taken by all reduce tasks=12741632

           Map-Reduce Framework

                  Map input records=2

                  Map output records=8

                  Map output bytes=75

                  Map output materialized bytes=90

                  Input split bytes=202

                  Combine input records=8

                  Combine output records=7

                  Reduce input groups=5

                  Reduce shuffle bytes=90

                  Reduce input records=7

                  Reduce output records=5

                  Spilled Records=14

                  Shuffled Maps =2

                  Failed Shuffles=0

                  Merged Map outputs=2

                  GC time elapsed (ms)=622

                  CPU time spent (ms)=2000

                  Physical memory (bytes) snapshot=390164480

                  Virtual memory (bytes) snapshot=1179254784

                  Total committed heap usage (bytes)=257892352

           Shuffle Errors

                  BAD_ID=0

                  CONNECTION=0

                  IO_ERROR=0

                  WRONG_LENGTH=0

                  WRONG_MAP=0

                  WRONG_REDUCE=0

           File Input Format Counters

                  Bytes Read=45

           File Output Format Counters

                  Bytes Written=37

7) View the results:

    hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfs -cat /output2/wordcount1/*

    CHINA   1

    Hello      2

    hadoop    2

    hi         2

    world      1

    ——————————————

If you see the output above, you have successfully installed Hadoop!

Setting Up the Eclipse Development Environment

1. Download Eclipse.

2. You also need the Hadoop Eclipse plugin. The definitive solution is to download and build it from

https://github.com/winghc/hadoop2x-eclipse-plugin

or to use a prebuilt copy if one is provided.

3. Copy the built jar into Eclipse's plugins directory and restart Eclipse.

4. Configure the Hadoop installation directory:

Window -> Preferences -> Hadoop Map/Reduce -> Hadoop installation directory

     

5. Configure the Map/Reduce views:

Window -> Open Perspective -> Other -> Map/Reduce -> OK

Window -> Show View -> Other -> Map/Reduce Locations -> OK

6. In the "Map/Reduce Locations" tab, click the blue elephant icon (or right-click the empty area and choose "New Hadoop location…"); in the "New hadoop location…" dialog,

fill in the settings. The MR Master and DFS Master entries must match mapred-site.xml, core-site.xml and the other configuration files.

7. Open the Project Explorer to browse the HDFS file system.

8. Create a Map/Reduce project.

The Hadoop services must be started first.

File -> New -> Project -> Map/Reduce Project -> Next

Write the WordCount class:

    import java.io.IOException;

    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;

    import org.apache.hadoop.fs.Path;

    import org.apache.hadoop.io.IntWritable;

    import org.apache.hadoop.io.Text;

    import org.apache.hadoop.mapreduce.Job;

    import org.apache.hadoop.mapreduce.Mapper;

    import org.apache.hadoop.mapreduce.Reducer;

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper

           extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);

        private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

// key and value are the input key/value pair; results are written out through the Context

          StringTokenizer itr = new StringTokenizer(value.toString());

          while (itr.hasMoreTokens()) {

            word.set(itr.nextToken());

            context.write(word, one);

          }

        }

      }

      public static class IntSumReducer

           extends Reducer<Text,IntWritable,Text,IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,

                           Context context

                           ) throws IOException, InterruptedException {

//reduce is similar to map, but the values arrive as an Iterable<IntWritable>: each call receives one key together with all of that key's values

          int sum = 0;

          for (IntWritable val : values) {

            sum += val.get();

          }

          result.set(sum);

      context.write(key, result); // e.g. writes "Hello 2"

        }

      }

      public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "word count"); // job name

    job.setJarByClass(WordCount.class);        // jar containing this class
    job.setMapperClass(TokenizerMapper.class); // map function

    job.setCombinerClass(IntSumReducer.class); // local combine step

    job.setReducerClass(IntSumReducer.class);  // reduce function

    job.setOutputKeyClass(Text.class);         // output key type

    job.setOutputValueClass(IntWritable.class); // output value type

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory

    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);

  }
}
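
A minimal sketch of compiling this class from the command line and running it on the cluster, in case you prefer not to use Eclipse. It assumes WordCount.java has been saved in /home/hduser/hadoop; the jar name wc.jar and the output path /output2/wordcount2 are only examples:

hduser@Ubuntu1:~/hadoop$ mkdir wordcount_classes
hduser@Ubuntu1:~/hadoop$ javac -classpath $(./bin/hadoop classpath) -d wordcount_classes WordCount.java
hduser@Ubuntu1:~/hadoop$ jar -cvf wc.jar -C wordcount_classes/ .
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar wc.jar WordCount /input2 /output2/wordcount2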


Inverted Index of Music Play Records

MapReduce Program Development

1. The task:

We have a batch of music play records, each listing a user and the song they played:

tom          LittleApple

jack         YesterdayOnceMore

Rose         MyHeartWillGoOn

jack         LittleApple

John         MyHeartWillGoOn

kissinger    LittleApple

kissinger    YesterdayOnceMore

2. The required output:

A text file containing the inverted index:

LittleApple          tom| jack| kissinger

YesterdayOnceMore    jack| kissinger

MyHeartWillGoOn      Rose| John

3. The algorithm:

Split the source file line by line. In the mapper, use the song name (e.g. LittleApple) as the key and the user name (e.g. tom) as the value; in the reducer, all records for the same song are gathered together, and the output is the inverted index.

For input lines such as

tom          LittleApple

jack         YesterdayOnceMore

Rose         MyHeartWillGoOn

the map function emits the <key, value> pairs

<LittleApple, tom>

<YesterdayOnceMore, jack>

<MyHeartWillGoOn, Rose>

The reduce function then groups the users by song; for LittleApple, for example, it receives tom, jack and kissinger.

The final result written to HDFS is

LittleApple          tom| jack| kissinger

YesterdayOnceMore    jack| kissinger

MyHeartWillGoOn      Rose| John

4. The annotated inverted-index source code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;

    import org.apache.hadoop.conf.Configured;

    import org.apache.hadoop.fs.Path;

    import org.apache.hadoop.io.*;

    import org.apache.hadoop.mapreduce.*;

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    import org.apache.hadoop.util.Tool;

    import org.apache.hadoop.util.ToolRunner;

    public class Test_1 extends Configured implements Tool

    {

      enum Counter

      {

    LINESKIP, // lines that could not be parsed

      }

      public static class Map extends Mapper<LongWritable,Text,Text,Text>

      {

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException

        {

    String line = value.toString(); // read one line of input as a string

             try

             {                  

       // parse the line

           String[] lineSplit = line.split("\\s+"); // split on whitespace (handles multiple spaces/tabs), e.g. "tom LittleApple"

           String anum = lineSplit[0]; // the user, e.g. tom

           String bnum = lineSplit[1]; // the song, e.g. LittleApple

           context.write(new Text(bnum), new Text(anum));

// the key/value pair written to the context is <LittleApple, tom>

              }

         catch (java.lang.ArrayIndexOutOfBoundsException e) // skip lines that cannot be split into two fields

             {

               context.getCounter(Counter.LINESKIP).increment(1);

               return;

             }

         }

       }

       public static class Reduce extends Reducer<Text,Text,Text,Text>

       {

          public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException

          {

            String valueString;

            String out = "";

           

        for (Text value : values)

        {

          valueString = value.toString();

          // append this listener; listeners of the same song are separated by |
          out = out.isEmpty() ? valueString : out + "|" + valueString;

          // System.out.println("Reduce: key=" + key + " value=" + value);

        }

            context.write(key, new Text(out));

          }

       }

       @Override

       public int run(String[] args) throws Exception

       {

         Configuration conf = this.getConf();

        

     Job job = new Job(conf, "Test_1"); // job name

     job.setJarByClass(Test_1.class); // jar containing this class

     FileInputFormat.addInputPath(job, new Path(args[0])); // input path

     FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path

     job.setMapperClass(Map.class); // use the Map class above as the mapper

     job.setReducerClass(Reduce.class); // use the Reduce class above as the reducer

     job.setOutputFormatClass(TextOutputFormat.class); // plain text output

     job.setOutputKeyClass(Text.class); // output key type

     job.setOutputValueClass(Text.class); // output value type

         job.waitForCompletion(true);

         return job.isSuccessful()?0:1;

        }

        public static void main(String[] args) throws Exception

        {

      // run the job

          int res = ToolRunner.run(new Configuration(), new Test_1(), args);

          System.exit(res);

        }

    }

5. Remember to set the input and output paths.

The program can be run directly from Eclipse, or packaged as a jar and run on the cluster.
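
A sketch of the jar-based run, assuming the source above is saved as Test_1.java in /home/hduser/hadoop and the play records have been uploaded to an HDFS directory; the paths /input3 and /output3 and the jar name test1.jar are only examples:

hduser@Ubuntu1:~/hadoop$ mkdir test1_classes
hduser@Ubuntu1:~/hadoop$ javac -classpath $(./bin/hadoop classpath) -d test1_classes Test_1.java
hduser@Ubuntu1:~/hadoop$ jar -cvf test1.jar -C test1_classes/ .
hduser@Ubuntu1:~/hadoop$ ./bin/hadoop jar test1.jar Test_1 /input3 /output3
hduser@Ubuntu1:~/hadoop$ ./bin/hdfs dfs -cat /output3/*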
