1-> Download hadoop-1.2.1.tar.gz
tar -zxvf hadoop-1.2.1.tar.gz to extract it. Here we assume the extracted files are under /root/soft.
2-> Create the hadoop account
groupadd hadoop
useradd -g hadoop -d /home/hadoop -m hadoop
mv /root/soft/hadoop-1.2.1 /home/hadoop/hadoop-1.2.1
chown -R hadoop:hadoop /home/hadoop
3-> Passwordless SSH setup for the hadoop account
su - hadoop
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
cd .ssh && chmod 600 authorized_keys
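To verify the passwordless setup, ssh to localhost as the hadoop user; it should log in without asking for a password (the very first connection may still prompt you to accept the host key):
ssh localhost
exit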
4-> Install the JDK
Download jdk-8u40-linux-i586.rpm
Install it on Linux: rpm -ivh jdk-8u40-linux-i586.rpm
Once installed, it lands in /usr/java/jdk1.8.0_40.
Set the environment variables.
Append to /etc/profile:
export JAVA_HOME=/usr/java/jdk1.8.0_40
export PATH=$JAVA_HOME/bin:$PATH
In conf/hadoop-env.sh under the Hadoop root directory, add JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.8.0_40
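As a quick check that the JDK edits took effect (assuming the /etc/profile lines above were saved):
source /etc/profile
java -version
echo $JAVA_HOME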
5-> Set the Hadoop environment variables
Append to /etc/profile:
export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$PATH
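Likewise, a quick sanity check that hadoop is now on the PATH:
source /etc/profile
hadoop version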
Go into conf under the Hadoop root directory and edit three configuration files.
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
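Before the first start, the HDFS namenode must be formatted. This is a one-time step; rerunning it later would wipe the HDFS metadata:
hadoop namenode -format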
6-> Start Hadoop
su - hadoop
start-all.sh
or run start-dfs.sh and then start-mapred.sh.
After it starts up, you can list the files in the HDFS filesystem's directories; you can think of HDFS roughly as an FTP server.
hadoop fs -ls /
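You can also confirm all the daemons came up with jps; on a pseudo-distributed node it should list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:
jps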
Now run the hadoop-examples-1.2.1.jar test program in the Hadoop root directory. Invoking hadoop jar hadoop-examples-1.2.1.jar with no arguments shows that the jar bundles several example programs:
[root@localhost hadoop-1.2.1]# hadoop jar hadoop-examples-1.2.1.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
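Running one of these program names with no further arguments prints a usage hint; for example:
hadoop jar hadoop-examples-1.2.1.jar wordcount
should print something like Usage: wordcount <in> <out>.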
Let's test wordcount. This MapReduce program scans the text files under a given HDFS directory and counts every word that appears, along with each word's number of occurrences.
First, create a directory on the HDFS filesystem. The / here lives on the filesystem we defined via fs.default.name in conf/core-site.xml (hdfs://localhost:9000). Strictly speaking you would write hadoop fs -mkdir hdfs://localhost:9000/test, but the shorthand below is equivalent:
hadoop fs -mkdir /test
hadoop fs -mkdir /test/input
In the commands below, /test/input could equally be written hdfs://localhost:9000/test/input.
[hadoop@localhost hadoop-1.2.1]$ hadoop fs -put /home/hadoop/hadoop-1.2.1/conf/*.xml /test/input
[hadoop@localhost hadoop-1.2.1]$ hadoop fs -lsr /
drwxr-xr-x - hadoop supergroup 0 2015-03-31 14:36 /test
drwxr-xr-x - hadoop supergroup 0 2015-03-31 14:38 /test/input
-rw-r--r-- 1 hadoop supergroup 7457 2015-03-31 14:38 /test/input/capacity-scheduler.xml
-rw-r--r-- 1 hadoop supergroup 294 2015-03-31 14:38 /test/input/core-site.xml
-rw-r--r-- 1 hadoop supergroup 327 2015-03-31 14:38 /test/input/fair-scheduler.xml
-rw-r--r-- 1 hadoop supergroup 4644 2015-03-31 14:38 /test/input/hadoop-policy.xml
-rw-r--r-- 1 hadoop supergroup 274 2015-03-31 14:38 /test/input/hdfs-site.xml
-rw-r--r-- 1 hadoop supergroup 2033 2015-03-31 14:38 /test/input/mapred-queue-acls.xml
-rw-r--r-- 1 hadoop supergroup 285 2015-03-31 14:38 /test/input/mapred-site.xml
drwxr-xr-x - hadoop supergroup 0 2015-03-31 13:21 /tmp
drwxr-xr-x - hadoop supergroup 0 2015-03-31 13:21 /tmp/hadoop-hadoop
drwxr-xr-x - hadoop supergroup 0 2015-03-31 13:59 /tmp/hadoop-hadoop/mapred
drwx------ - hadoop supergroup 0 2015-03-31 13:59 /tmp/hadoop-hadoop/mapred/system
-rw------- 1 hadoop supergroup 4 2015-03-31 13:59 /tmp/hadoop-hadoop/mapred/system/jobtracker.info
Run the job:
hadoop jar hadoop-examples-1.2.1.jar wordcount /test/input /test/output
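Once the job finishes, the result can be read straight from HDFS. The part file name below is the usual one for this example but may differ on your run, so list the output directory first:
hadoop fs -ls /test/output
hadoop fs -cat /test/output/part-r-00000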
You can browse the HDFS filesystem via http://192.168.2.88:50070/.
Here you can see input holds 7 files:
Name | Type | Size | Replication | Block Size | Modification Time | Permission | Owner | Group
capacity-scheduler.xml | file | 7.28 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
core-site.xml | file | 0.29 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
fair-scheduler.xml | file | 0.32 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
hadoop-policy.xml | file | 4.54 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
hdfs-site.xml | file | 0.27 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
mapred-queue-acls.xml | file | 1.99 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
mapred-site.xml | file | 0.28 KB | 1 | 64 MB | 2015-03-31 14:38 | rw-r--r-- | hadoop | supergroup
You can check MapReduce job status via http://192.168.2.88:50030/.
User: hadoop
Job Name: word count
Job File: hdfs://localhost:9000/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201503311441_0001/job.xml
Submit Host: localhost.localdomain
Submit Host Address: 127.0.0.1
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Tue Mar 31 14:45:01 PDT 2015
Finished at: Tue Mar 31 14:45:40 PDT 2015
Finished in: 38sec
Job Cleanup: Successful
Kind | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts
map | 100.00% | 7 | 0 | 0 | 7 | 0 | 0 / 0
reduce | 100.00% | 1 | 0 | 0 | 1 | 0 | 0 / 0
This result makes map and reduce concrete. The input directory holds 7 files, so 7 tasks were launched, that is, 7 maps: each map independently counts the words, and each word's occurrences, in its own file.
Then a single reduce pass merges the 7 maps' partial results; sorted together, all identical words are tallied into one count. That merge is the reduce step.
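As a rough single-machine analogy (not how Hadoop actually runs, just the same tokenize/sort/tally pipeline), the whole job is conceptually:
cat /home/hadoop/hadoop-1.2.1/conf/*.xml | tr -s '[:space:]' '\n' | sort | uniq -c
Here tr plays the maps (emit one word per line), sort plays the shuffle, and uniq -c plays the reduce.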
The map tasks here are run by the TaskTracker, and the files a TaskTracker processes live on a DataNode, so in practice the TaskTracker and DataNode usually sit on the same machine.
The JobTracker can be understood as the TaskTrackers' coordinator: how many tasks are launched, and where, is scheduled by the JobTracker.
For a cluster installation, see:
http://www.cnblogs.com/xia520pi/archive/2012/05/16/2503949.html