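The Best Installation Manual
1. Install the JDK
2. Configure the Java environment variables
$ vi /etc/profile
# Append the following at the end of the file
# JAVA_HOME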
export JAVA_HOME=/usr/java/jre1.8.0_131
export PATH=$PATH:$JAVA_HOME/bin
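After reloading the profile (for example with `source /etc/profile`), a quick check can confirm that the variable took effect. The following is only a sketch: the function name `check_java_home` is hypothetical and not part of the original notes, and it checks nothing beyond the presence of an executable `$JAVA_HOME/bin/java`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original notes): verify that JAVA_HOME
# points at a directory containing an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java" >&2
        return 2
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}
# Usage: source /etc/profile && check_java_home
```

A return status of 1 means the variable is unset; 2 means the path does not contain a Java launcher.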
3. Download and install Hadoop
This walkthrough installs version 2.5.0.
Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows:
# Set to the root of your Java installation (change the JAVA_HOME path to match your system)
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop (set this to your own install path; this step is optional)
export HADOOP_PREFIX=/usr/local/hadoop
With that, Hadoop is installed.
I Standalone Operation (single node)
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
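# Run bin/hadoop with the jar subcommand to launch the grep example from the examples jar, passing the input directory, the output directory, and the regular expression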
$ cat output/*
II Pseudo-Distributed Operation
Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following (edit these files):
etc/hadoop/core-site.xml (complete contents):
<configuration>
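<!-- Set the NameNode host and port. The teacher said port 8020 is customary and that the hostname should be changed too; left as localhost for now. -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>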
</property>
<!-- Set the directory for temporary files -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/modules/hadoop-2.5.0/data/tmp</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml (HDFS configuration file, complete contents):
<!-- Set the replication factor to 1: with only one machine, the default of 3 replicas serves no purpose -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
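Both files share the same `<configuration>` / `<property>` layout, so they can also be generated from a script instead of edited by hand. A minimal sketch, assuming a hypothetical helper named `write_site_xml` (not part of Hadoop) and assuming that names and values contain no characters that need XML escaping. Note that it overwrites the target file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Hadoop *-site.xml file from name=value pairs.
# Assumes names and values need no XML escaping. Overwrites the target file.
write_site_xml() {
    local file=$1 pair
    shift
    {
        echo '<?xml version="1.0"?>'
        echo '<configuration>'
        for pair in "$@"; do
            echo '  <property>'
            echo "    <name>${pair%%=*}</name>"    # text before the first '='
            echo "    <value>${pair#*=}</value>"   # text after the first '='
            echo '  </property>'
        done
        echo '</configuration>'
    } > "$file"
}
# Usage, with the values from these notes:
#   write_site_xml etc/hadoop/core-site.xml \
#       fs.defaultFS=hdfs://localhost:8020 \
#       hadoop.tmp.dir=/opt/modules/hadoop-2.5.0/data/tmp
#   write_site_xml etc/hadoop/hdfs-site.xml dfs.replication=1
```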
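Passphraseless SSH to localhost must be set up before continuing, because sbin/start-dfs.sh launches the daemons over SSH.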
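These notes do not list the SSH commands. One common approach is to authorize your own public key for logins to this machine. The sketch below assumes an RSA key with an empty passphrase stored under `~/.ssh`; the function name `setup_passwordless_ssh` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a passphrase-less RSA key pair (if absent) and
# authorize it for logins to this machine. The directory defaults to ~/.ssh.
setup_passwordless_ssh() {
    local dir=${1:-$HOME/.ssh}
    mkdir -p "$dir" || return 1
    chmod 700 "$dir" || return 1
    if [ ! -f "$dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -P '' -f "$dir/id_rsa" || return 1
    fi
    # Append the public key only if it is not already authorized.
    touch "$dir/authorized_keys"
    if ! grep -qxFf "$dir/id_rsa.pub" "$dir/authorized_keys"; then
        cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"
    fi
    chmod 600 "$dir/authorized_keys"
}
# Usage: setup_passwordless_ssh && ssh localhost
```

If an SSH server is running locally, `ssh localhost` should then log in without asking for a password.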
Execution
The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.
1. Format the filesystem (this apparently only needs to be run once, not on later startups):
$ bin/hdfs namenode -format
2. Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
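or (the teacher recommends this second method)
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
At this point, running jps lists the Java processes, which should include NameNode and DataNode.
If something goes wrong here, the likely causes are configuration files that were not updated or a faulty JDK installation.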
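The jps check can be scripted. The sketch below assumes the usual jps output of one `<pid> <ClassName>` pair per line; the function name `check_daemons` is hypothetical. It reads jps output on standard input and reports any named daemon that is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output on stdin and report which of the
# named daemons are not running. Names must match exactly, so a running
# SecondaryNameNode does not count as NameNode.
check_daemons() {
    local running name status=0
    running=$(awk '{print $2}')   # keep only the class-name column
    for name in "$@"; do
        if ! printf '%s\n' "$running" | grep -qx "$name"; then
            echo "not running: $name" >&2
            status=1
        fi
    done
    return "$status"
}
# Usage: jps | check_daemons NameNode DataNode
```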
3. Browse the web interface for the NameNode; by default it is available at:
NameNode - http://localhost:50070/
4. Make the HDFS directories required to execute MapReduce jobs (create the user directory with hdfs dfs):
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
or
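$ bin/hdfs dfs -mkdir -p /user/username/ # -p creates any missing parent directories in one step
# List the result from the command line
$ bin/hdfs dfs -ls -R / # -R lists recursively; give a path explicitly, otherwise it defaults to . (the current directory, which in HDFS is the user's home directory /user/<username>)
5. Copy the input files into the distributed filesystem (you can just as well create your own input directory):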
$ bin/hdfs dfs -put etc/hadoop input
6. Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+' # (absolute paths are recommended here)
7. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
8. When you're done, stop the daemons with:
$ sbin/stop-dfs.sh
YARN configuration and startup
1. Configure the YARN environment file etc/hadoop/yarn-env.sh
# Set JAVA_HOME
export JAVA_HOME=/usr/java/jdk1.8.0_131/
2. Configure etc/hadoop/yarn-site.xml
<!-- Configure the NodeManager auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Set the ResourceManager hostname -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>cen-ubuntu</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
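3. Configure the etc/hadoop/slaves file (enter the hostname, one per line)
cen-ubuntu
4. Start YARN
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
Web interface
ResourceManager - http://cen-ubuntu:8088/
Running MapReduce on YARN
1. Configure etc/hadoop/mapred-env.sh (the MapReduce environment file)
Set JAVA_HOME
2. Configure etc/hadoop/mapred-site.xml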
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
3. Run MapReduce on YARN
bin/yarn jar ... (same as the steps above)
Note: the output directory must not already exist, otherwise the job fails with an error
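To make reruns painless, the old output directory can be removed before the job is submitted. A minimal sketch using `hdfs dfs -test -d` and `hdfs dfs -rm -r`; the function name `clean_output_dir` and the `HDFS_BIN` variable are assumptions introduced here, with `HDFS_BIN` defaulting to `bin/hdfs` relative to the installation directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete an HDFS directory if it already exists, so that
# rerunning a job does not fail. HDFS_BIN is an assumed variable pointing at
# the hdfs launcher.
HDFS_BIN=${HDFS_BIN:-bin/hdfs}
clean_output_dir() {
    local dir=$1
    # `dfs -test -d` exits with status 0 only when the path is a directory.
    if "$HDFS_BIN" dfs -test -d "$dir"; then
        "$HDFS_BIN" dfs -rm -r "$dir"
    fi
}
# Usage: clean_output_dir output
```

The deletion is permanent unless the HDFS trash feature is enabled, so this is only suitable for scratch output.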
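4. Start the job history service
$ sbin/mr-jobhistory-daemon.sh start historyserver
5. Enable log aggregation
Edit yarn-site.xml (see the official documentation); the relevant property is yarn.log-aggregation-enable
6. Set how long aggregated logs are kept (same file)
yarn.log-aggregation.retain-seconds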
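The property value is given in seconds, so a retention period stated in days has to be converted first. A small sketch of the arithmetic; the function name `days_to_seconds` is hypothetical, and the 7-day period is only an example, not a value taken from these notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a retention period in days to the number of
# seconds expected by yarn.log-aggregation.retain-seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}
days_to_seconds 7   # prints 604800
```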
At this point Hadoop is running successfully on top of HDFS. The road was bumpy, but we made it through. Elephants can dance. Keep going!