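I assumed others had already covered this thoroughly, so I was not going to write it up myself. But I still ran into some problems during my own installation, so I decided to summarize the process in my own way. Reference: https://blog.csdn.net/hliq5399/article/details/78193113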
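Fully distributed installation
Hadoop's local (standalone) and pseudo-distributed modes are of little use in real work, so they are omitted here.
Download the latest Hadoop release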
https://hadoop.apache.org/releases.html
Server role planning
In the earlier post on VirtualBox Host-Only network configuration I set up three virtual machines. Their roles are divided as follows:
master | slave1 | slave2 |
---|---|---|
NameNode | ResourceManager | |
DataNode | DataNode | DataNode |
NodeManager | NodeManager | NodeManager |
HistoryServer | | SecondaryNameNode |
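Configure the hostname
I entered the hostnames when installing the virtual machines, so no extra configuration was needed in my case.
To change a hostname manually (using master as the example; on the other machines use slave1 and slave2 respectively):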
[root@master hadoop]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=master
[root@master hadoop]# service network restart
Configure hosts
[root@master hadoop]# vi /etc/hosts
192.168.102.3 master
192.168.102.4 slave1
192.168.102.5 slave2
This again uses master as the example; configure the two slave machines the same way.
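Because the same three entries go on every node, the edit can be scripted. The following is a minimal sketch of an idempotent append; the helper name add_host_entry and the HOSTS_FILE variable are illustrative, and the demo writes to a scratch file rather than /etc/hosts.

```shell
# add_host_entry FILE IP NAME (hypothetical helper):
# append "IP NAME" to FILE unless NAME already appears there as a whole word.
add_host_entry() {
  if ! grep -qw -- "$3" "$1" 2>/dev/null; then
    printf '%s %s\n' "$2" "$3" >> "$1"
  fi
}

# Defaults to a scratch file; set HOSTS_FILE=/etc/hosts to apply for real.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}

add_host_entry "$HOSTS_FILE" 192.168.102.3 master
add_host_entry "$HOSTS_FILE" 192.168.102.4 slave1
add_host_entry "$HOSTS_FILE" 192.168.102.5 slave2

cat "$HOSTS_FILE"
```

Running it a second time leaves the file unchanged, so it is safe to repeat on each machine.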
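Set up passwordless SSH login
The machines in a Hadoop cluster access each other over SSH. Typing a password on every access is impractical, so passwordless SSH login must be configured between all machines.
1. Generate a key pair on master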
[root@master hadoop]# ssh-keygen -t rsa
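Press Enter at every prompt to accept the defaults. The public key file (id_rsa.pub) and the private key file (id_rsa) are then generated in the .ssh directory under the current user's home directory. (I used the root user directly, so there is no step for creating a hadoop user and group.)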
[root@master hadoop]# cd /root/.ssh/
[root@master .ssh]# ls
authorized_keys id_rsa id_rsa.pub known_hosts
2. Distribute the public key (the local machine needs it too)
[root@master hadoop]# ssh-copy-id master
[root@master hadoop]# ssh-copy-id slave1
[root@master hadoop]# ssh-copy-id slave2
3. Set up passwordless login from slave1 and slave2 to the other machines in the same way.
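Once the keys are distributed, it helps to confirm that no host still prompts for a password. The sketch below only prints one check command per host; the function name check_ssh_cmds is illustrative. BatchMode=yes makes ssh fail instead of prompting, and -n keeps ssh from reading standard input.

```shell
HOSTS="master slave1 slave2"

# check_ssh_cmds (hypothetical helper): print one non-interactive
# login check per host instead of running it.
check_ssh_cmds() {
  for h in $HOSTS; do
    echo "ssh -n -o BatchMode=yes -o ConnectTimeout=5 $h hostname"
  done
}

check_ssh_cmds
```

To run the checks, save the output to a file and execute it with sh on each of the three machines. Every command should print the remote hostname without asking for a password.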
Install Hadoop on master
We unpack and configure Hadoop on master first, then distribute it to the other two machines.
Unpack the Hadoop archive:
[root@master hadoop]# tar xzvf hadoop-2.9.2.tar.gz
Change the owner of the directory:
[root@master hadoop]# chown -R root:root hadoop-2.9.2
Configure the Hadoop JDK path. Set JAVA_HOME in hadoop-env.sh, mapred-env.sh and yarn-env.sh:
[root@master hadoop]# cd /opt/hadoop/hadoop-2.9.2/etc/hadoop
[root@master hadoop]# vi hadoop-env.sh
[root@master hadoop]# vi mapred-env.sh
[root@master hadoop]# vi yarn-env.sh
export JAVA_HOME="/opt/java/jdk1.8.0_191"
Configure core-site.xml:
[root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/tmp</value>
</property>
</configuration>
fs.defaultFS is the address of the NameNode.
hadoop.tmp.dir is Hadoop's temporary directory. By default, the NameNode and DataNode data files are stored in subdirectories beneath it. Make sure this directory exists; create it first if it does not.
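To double-check what a site file actually sets before creating the directory, a value can be pulled out with standard tools. This is a rough sketch: get_prop is a hypothetical helper that does a plain-text match rather than real XML parsing, and the demo works on a scratch copy of the configuration shown above.

```shell
# get_prop FILE NAME (hypothetical helper): print the <value> paired with
# <name>NAME</name>. A plain-text match, not a real XML parser.
get_prop() {
  tr -d '\n' < "$1" |
    sed -n "s|.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}

# Scratch copy of the core-site.xml content from this section.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/tmp</value>
</property>
</configuration>
EOF

tmp_dir=$(get_prop "$conf" hadoop.tmp.dir)
echo "hadoop.tmp.dir = $tmp_dir"
# On the real node, create it with: mkdir -p "$tmp_dir"
```

On a machine where Hadoop is already on the PATH, hdfs getconf -confKey hadoop.tmp.dir reports the effective value directly.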
Configure hdfs-site.xml:
[root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave2:50090</value>
</property>
</configuration>
dfs.namenode.secondary.http-address specifies the HTTP address and port of the SecondaryNameNode. In our plan slave2 is the SecondaryNameNode server, so it is set to slave2:50090.
Configure slaves:
[root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/slaves
master
slave1
slave2
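The slaves file lists the hosts on which DataNodes run. start-yarn.sh reads the same file to decide where to start the NodeManagers.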
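Every name in the slaves file has to resolve on every node, so it can be worth cross-checking it against the hosts file. A small sketch, assuming both are plain text with one entry per line; unresolved_hosts is a hypothetical helper, and the demo deliberately leaves slave2 out of the scratch hosts file to show a miss.

```shell
# unresolved_hosts SLAVES_FILE HOSTS_FILE (hypothetical helper):
# print every host listed in SLAVES_FILE that has no entry in HOSTS_FILE.
unresolved_hosts() {
  while read -r host; do
    [ -n "$host" ] || continue
    grep -qw -- "$host" "$2" || echo "$host"
  done < "$1"
}

# Scratch files standing in for etc/hadoop/slaves and /etc/hosts.
work=$(mktemp -d)
printf '%s\n' master slave1 slave2 > "$work/slaves"
printf '%s\n' '192.168.102.3 master' '192.168.102.4 slave1' > "$work/hosts"

unresolved_hosts "$work/slaves" "$work/hosts"   # prints: slave2
```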
Configure yarn-site.xml:
[root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>slave1</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
</property>
</configuration>
Following the plan, yarn.resourcemanager.hostname points the ResourceManager at slave1.
yarn.log-aggregation-enable controls whether log aggregation is enabled.
yarn.log-aggregation.retain-seconds sets how long the aggregated logs are kept on HDFS at most.
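The retention value is given in seconds, which is hard to read at a glance. As a quick illustration, shell arithmetic converts the configured 106800 into hours and minutes:

```shell
retain_seconds=106800
hours=$((retain_seconds / 3600))
minutes=$(((retain_seconds % 3600) / 60))
echo "${hours}h ${minutes}m"   # prints: 29h 40m
```

So with this setting aggregated logs are kept for a little under 30 hours.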
Configure mapred-site.xml:
Create mapred-site.xml by copying mapred-site.xml.template.
[root@master hadoop]# cp /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml
[root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
</configuration>
mapreduce.framework.name makes MapReduce jobs run on YARN.
mapreduce.jobhistory.address places the MapReduce history server on the master machine.
mapreduce.jobhistory.webapp.address sets the address and port of the history server's web page.
Distribute the Hadoop files
1. First create the directory that will hold Hadoop on the other two machines
[root@slave1 opt]# mkdir /opt/hadoop/
[root@slave2 opt]# mkdir /opt/hadoop/
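2. Distribute with scp
The share/doc directory under the Hadoop root holds the Hadoop documentation and is quite large. Deleting it before distribution is recommended: it saves disk space and speeds up the copy.
The doc directory is about 1.6 GB.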
[root@master hadoop]# scp -r /opt/hadoop/hadoop-2.9.2/ slave1:/opt/hadoop/
[root@master hadoop]# scp -r /opt/hadoop/hadoop-2.9.2/ slave2:/opt/hadoop/
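The directory creation, the doc cleanup and the two copies can be collected into one reviewable plan. The sketch below only prints the commands; distribute_cmds is a hypothetical helper, and the paths are the ones used in this post.

```shell
HADOOP_DIR=/opt/hadoop/hadoop-2.9.2
SLAVES="slave1 slave2"

# distribute_cmds (hypothetical helper): print the commands instead of
# running them, so the plan can be reviewed before anything is copied.
distribute_cmds() {
  echo "rm -rf $HADOOP_DIR/share/doc"
  for h in $SLAVES; do
    echo "ssh -n $h mkdir -p /opt/hadoop"
    echo "scp -r $HADOOP_DIR $h:/opt/hadoop/"
  done
}

distribute_cmds
```

To execute the plan on master, save the output to a file and run that file with sh.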
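Configure the Hadoop environment variables:
Run the following on all machines.
Note carefully:
1. If you install as root, edit /etc/profile (system-wide variables).
2. If you install as a regular user, edit ~/.bashrc (per-user variables).
[root@master hadoop]# vi /etc/profile
Append to the end of the file: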
export HADOOP_HOME=/opt/hadoop/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
After editing, reload the environment variables with source /etc/profile or source ~/.bashrc.
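Since this step is repeated on all machines, the two export lines can be appended by a script that is safe to rerun. A minimal sketch: append_once and the PROFILE variable are illustrative names, the demo targets a scratch file, and the PATH line is written without a trailing colon because an empty PATH entry stands for the current directory.

```shell
# Defaults to a scratch file; set PROFILE=/etc/profile (root) or
# PROFILE=$HOME/.bashrc (regular user) to apply for real.
PROFILE=${PROFILE:-$(mktemp)}

# append_once FILE LINE (hypothetical helper): add LINE unless an
# identical line is already present.
append_once() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

append_once "$PROFILE" 'export HADOOP_HOME=/opt/hadoop/hadoop-2.9.2'
append_once "$PROFILE" 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'

cat "$PROFILE"
```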
Format the NameNode
Run the format on the master (NameNode) machine:
[root@master bin]# hdfs namenode -format
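Formatting is meant to be done once. Formatting again generates a new cluster ID, and DataNodes that still hold data from the old one will then fail to register. The sketch below is a simple guard, assuming dfs.namenode.name.dir is left at its default so that with hadoop.tmp.dir=/opt/tmp the metadata lives in /opt/tmp/dfs/name; namenode_formatted is a hypothetical helper.

```shell
# namenode_formatted NAME_DIR (hypothetical helper): succeed if NAME_DIR
# already holds a formatted NameNode, detected by its current/VERSION file.
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Assumption: default dfs.namenode.name.dir under hadoop.tmp.dir=/opt/tmp.
NAME_DIR=${NAME_DIR:-/opt/tmp/dfs/name}

if namenode_formatted "$NAME_DIR"; then
  echo "already formatted: $NAME_DIR (skip hdfs namenode -format)"
else
  echo "not formatted yet: run hdfs namenode -format"
fi
```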
Start the cluster
1. Start HDFS
[root@master current]# start-dfs.sh
18/12/20 14:13:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [master]
master: starting namenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-namenode-master.out
slave1: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-slave1.out
slave2: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-slave2.out
master: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-master.out
Starting secondary namenodes [slave2]
slave2: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-slave2.out
18/12/20 14:13:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2. Start YARN
Start the YARN services on the slave1 (ResourceManager) machine:
[root@slave1 opt]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-resourcemanager-slave1.out
slave2: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-slave2.out
master: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-master.out
slave1: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-slave1.out
3. Start the job history server
Our plan runs the MapReduce job history service on master, so start it there.
[root@master sbin]# mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/hadoop/hadoop-2.9.2/logs/mapred-root-historyserver-master.out
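The three start commands run on two different machines, so it is easy to launch one in the wrong place. As a sketch, they can be written down as a single plan driven from one host over SSH. The function name start_cluster_cmds is illustrative, and full paths are used on the assumption that a non-interactive ssh session may not load the PATH set in /etc/profile.

```shell
SBIN=/opt/hadoop/hadoop-2.9.2/sbin

# start_cluster_cmds (hypothetical helper): print the start commands in
# the same order as the steps in this section, each on its planned host.
start_cluster_cmds() {
  echo "ssh -n master $SBIN/start-dfs.sh"
  echo "ssh -n slave1 $SBIN/start-yarn.sh"
  echo "ssh -n master $SBIN/mr-jobhistory-daemon.sh start historyserver"
}

start_cluster_cmds
```

To execute the plan, save the output to a file and run it with sh.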
4. Check the running processes on the three servers
[root@master logs]# jps
6180 Jps
4790 NodeManager
5694 NameNode
6079 JobHistoryServer
[root@slave1 opt]# jps
4592 DataNode
4048 ResourceManager
4155 NodeManager
4686 Jps
[root@slave2 hadoop]# jps
3744 SecondaryNameNode
3673 DataNode
3852 Jps
3373 NodeManager
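Comparing each jps listing with the role table by eye is error-prone. The sketch below reads jps output on standard input and prints any expected daemon that is absent; missing_daemons is a hypothetical helper, and the demo feeds it the slave2 listing from above.

```shell
# missing_daemons "NAME1 NAME2 ..." (hypothetical helper): read jps output
# on stdin and print each expected daemon name that is not listed.
missing_daemons() {
  running=$(awk '{ print $2 }')
  for d in $1; do
    # -x matches the whole name, so NameNode does not match SecondaryNameNode
    printf '%s\n' "$running" | grep -qx -- "$d" || echo "$d"
  done
}

sample='3744 SecondaryNameNode
3673 DataNode
3852 Jps
3373 NodeManager'

# Prints nothing: every daemon planned for slave2 is running.
printf '%s\n' "$sample" | missing_daemons "DataNode NodeManager SecondaryNameNode"
```

On a live node the call becomes jps | missing_daemons "..." with that node's expected roles.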
5. View the HDFS web page
http://192.168.102.3:50070/
6. View the YARN web page
http://192.168.102.4:8088/cluster
Test job
Here we use the wordcount example that ships with Hadoop to test running MapReduce on the cluster.
1. Prepare the MapReduce input file wc.input
[root@master logs]# cat /opt/data/wc.input
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
2. Create the input directory /input in HDFS
[root@master logs]# hdfs dfs -mkdir /input
3. Upload wc.input to HDFS
[root@master logs]# hdfs dfs -put /opt/data/wc.input /input/wc.input
4. Run the MapReduce demo that ships with Hadoop
[root@master logs]# yarn jar /opt/hadoop/hadoop-2.9.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /input/wc.input /output/
18/12/20 14:43:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/20 14:43:29 INFO client.RMProxy: Connecting to ResourceManager at slave1/192.168.102.4:8032
18/12/20 14:43:30 INFO input.FileInputFormat: Total input files to process : 1
18/12/20 14:43:31 INFO mapreduce.JobSubmitter: number of splits:1
18/12/20 14:43:31 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/20 14:43:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545286577071_0001
18/12/20 14:43:31 INFO impl.YarnClientImpl: Submitted application application_1545286577071_0001
18/12/20 14:43:31 INFO mapreduce.Job: The url to track the job: http://slave1:8088/proxy/application_1545286577071_0001/
18/12/20 14:43:31 INFO mapreduce.Job: Running job: job_1545286577071_0001
18/12/20 14:43:41 INFO mapreduce.Job: Job job_1545286577071_0001 running in uber mode : false
18/12/20 14:43:41 INFO mapreduce.Job: map 0% reduce 0%
18/12/20 14:43:47 INFO mapreduce.Job: map 100% reduce 0%
18/12/20 14:43:54 INFO mapreduce.Job: map 100% reduce 100%
18/12/20 14:43:55 INFO mapreduce.Job: Job job_1545286577071_0001 completed successfully
18/12/20 14:43:55 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=94
        FILE: Number of bytes written=396815
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=169
        HDFS: Number of bytes written=60
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3250
        Total time spent by all reduces in occupied slots (ms)=4671
        Total time spent by all map tasks (ms)=3250
        Total time spent by all reduce tasks (ms)=4671
        Total vcore-milliseconds taken by all map tasks=3250
        Total vcore-milliseconds taken by all reduce tasks=4671
        Total megabyte-milliseconds taken by all map tasks=3328000
        Total megabyte-milliseconds taken by all reduce tasks=4783104
    Map-Reduce Framework
        Map input records=4
        Map output records=11
        Map output bytes=115
        Map output materialized bytes=94
        Input split bytes=98
        Combine input records=11
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=94
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=136
        CPU time spent (ms)=1160
        Physical memory (bytes) snapshot=368492544
        Virtual memory (bytes) snapshot=4127322112
        Total committed heap usage (bytes)=165810176
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=71
    File Output Format Counters
        Bytes Written=60
5. View the output files
[root@master logs]# hdfs dfs -ls /output
18/12/20 14:45:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 3 root supergroup 0 2018-12-20 14:43 /output/_SUCCESS
-rw-r--r-- 3 root supergroup 60 2018-12-20 14:43 /output/part-r-00000
[root@master logs]# hdfs dfs -cat /output/part-r-00000
18/12/20 14:45:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hadoop 3
hbase 1
hive 2
mapreduce 1
spark 2
sqoop 1
storm 1
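To cross-check the job result without the cluster, the same count can be computed locally. This is only a sketch of the word-count logic with awk and sort, not how Hadoop computes it; local_wordcount is a hypothetical helper, and the four-line layout of the input is inferred from the counters Map input records=4 and Bytes Read=71.

```shell
# local_wordcount (hypothetical helper): read text on stdin and print
# "word<TAB>count" lines sorted by word.
local_wordcount() {
  awk '{ for (i = 1; i <= NF; i++) c[$i]++ }
       END { for (w in c) printf "%s\t%d\n", w, c[w] }' | LC_ALL=C sort
}

printf '%s\n' \
  'hadoop mapreduce hive' \
  'hbase spark storm' \
  'sqoop hadoop hive' \
  'spark hadoop' | local_wordcount
```

The seven lines it prints should match the content of /output/part-r-00000, and their total size is 60 bytes, in line with the Bytes Written=60 counter.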