Step 1: Prepare the environment
Install the JDK, then create the user and group:
useradd -m hadoop
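passwd hadoop    # set the password
Add the user hadoop to the hadoop group (the usermod command appears after the download steps), then download and unpack Hadoop: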
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
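tar -xvf hadoop-3.2.1.tar.gz -C /data/projects
mv /data/projects/hadoop-3.2.1 /data/projects/hadoop    # later steps assume this path; a symlink would also work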
sudo chown -R hadoop:hadoop /data/projects
usermod -a -G hadoop hadoop    # the first hadoop is the group name; -a appends the group, so the user hadoop keeps its existing supplementary groups
Single-node pseudo-distributed mode: set up passwordless SSH
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
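Running the append step more than once leaves duplicate lines in authorized_keys. A minimal sketch of a repeatable version follows; authorize_key is a hypothetical helper name (not part of OpenSSH), and the file paths are whatever you pass in.

```shell
# Append a public key to an authorized_keys file only if it is not already
# there, then restrict the file to its owner as in the step above.
# authorize_key is a hypothetical helper name, not an OpenSSH command.
authorize_key() {
    pub_file=$1
    auth_file=$2

    touch "$auth_file"

    key=$(cat "$pub_file")
    # -x: match the whole line, -F: treat the key as a fixed string
    if ! grep -qxF -e "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}
```

After ssh-keygen -t rsa, it could be called as authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys; ssh localhost should then log in without asking for a password.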
Change the hostname without rebooting:
hostname hadoop
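The configuration files in this guide refer to the machine by the name hadoop (fs.defaultFS, yarn.resourcemanager.hostname), so that name has to resolve to the machine's address. The sketch below assumes resolution goes through /etc/hosts and uses the IP 192.168.110.151 that appears elsewhere in these notes; ensure_hosts_entry is a hypothetical helper name. Note also that the hostname command only changes the running system; on systemd-based distributions, hostnamectl set-hostname hadoop makes the change persistent.

```shell
# Add "ip name" to a hosts file unless the name is already mapped there.
# ensure_hosts_entry is a hypothetical helper; the hosts file, IP and name
# are passed in by the caller.
ensure_hosts_entry() {
    hosts_file=$1
    ip=$2
    name=$3

    # Look for the name as a whole field on any non-comment line.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0    # already mapped, leave the file alone
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}
```

As root this could be run as ensure_hosts_entry /etc/hosts 192.168.110.151 hadoop, after which ping hadoop should reach the local machine.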
Configure the Hadoop environment variables, the same way as for the JDK:
# hadoop home
export HADOOP_HOME=/data/projects/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Edit the Hadoop configuration files in /data/projects/hadoop/etc/hadoop
1. Edit hadoop-env.sh and add the following:
[hadoop@hadoop hadoop]$ grep JAVA_HOME hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.8.0_221
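2. Edit core-site.xml
1. Set the default file system.
(The storage layer and the compute layer are loosely coupled, so they must be told to use HDFS, Hadoop's native distributed file system. The value is a URI of the form hdfs://<master node address>:<port>.)
2. Set Hadoop's common directory.
(This is the working directory for data produced by running Hadoop processes; NameNode, DataNode and the others create subdirectories under it for their data. In a production system, however, NameNode, DataNode and the other processes should each get their own directory, and those directories should be disk mount points, so that capacity can be extended by mounting more disks.)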
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/projects/hadoop/tmp</value>
  </property>
</configuration>
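A typo in a property name is silently ignored by Hadoop, so it can help to read the values back after editing. The following is a rough sketch for a quick check only: get_prop is a hypothetical helper, it is not a real XML parser, and it assumes every <name> is immediately followed by its <value>, as in the files in this guide.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# get_prop is a hypothetical helper for sanity checks, not a real XML parser.
get_prop() {
    conf_file=$1
    key=$2
    # Flatten the file to one line so the match works for any layout,
    # then print the value that follows the requested name.
    tr -d '\n' < "$conf_file" |
        sed -n "s|.*<name>$key</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}
```

For example, get_prop /data/projects/hadoop/etc/hadoop/core-site.xml fs.defaultFS should print the URI configured above. Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is the more authoritative check.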
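3. Edit hdfs-site.xml and set the replication factor
1. Serve the NameNode web UI on port 50070. (Hadoop 3.x defaults to 9870, and dfs.http.address is the deprecated name of dfs.namenode.http-address; the older key still works.)
2. (When a client stores a file in HDFS, it is stored as multiple replicas. The value is usually 3, but a pseudo-distributed setup has only one machine, so it has to be 1.)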
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>192.168.110.151:50070</value>
  </property>
</configuration>
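To make the effect of dfs.replication concrete: every block is stored that many times, so raw disk usage is roughly the logical file size multiplied by the replication factor. The small sketch below only does that arithmetic; raw_usage_mb is a hypothetical helper and it ignores metadata and block-size rounding.

```shell
# Rough raw disk usage in MB: logical size times the replication factor.
# raw_usage_mb is a hypothetical helper; it ignores metadata overhead.
raw_usage_mb() {
    size_mb=$1
    replication=$2
    echo $(( size_mb * replication ))
}

raw_usage_mb 500 1    # this guide's single-node setting, prints 500
raw_usage_mb 500 3    # a typical multi-node setting, prints 1500
```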
4. Edit mapred-site.xml
Specify which resource-scheduling framework MapReduce programs run on. If this is not set to yarn, MapReduce programs run only locally instead of across the cluster.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
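5. Edit yarn-site.xml
1. Specify the master of the YARN cluster, the ResourceManager (here, this machine).
2. Configure the worker nodes (NodeManagers) of the YARN cluster: specify shuffle as the mechanism that passes the intermediate results produced by map to reduce.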
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
6. Turn off the firewall
Format Hadoop (the HDFS NameNode):
Run hdfs namenode -format
2020-05-27 19:18:49,081 INFO util.GSet: 0.029999999329447746% max memory 839.5 MB = 257.9 KB
2020-05-27 19:18:49,081 INFO util.GSet: capacity = 2^15 = 32768 entries
2020-05-27 19:18:49,112 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1667952246-192.168.110.151-1590578329102
2020-05-27 19:18:49,131 INFO common.Storage: Storage directory /data/projects/hadoop/tmp/dfs/name has been successfully formatted.
2020-05-27 19:18:49,184 INFO namenode.FSImageFormatProtobuf: Saving image file /data/projects/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-05-27 19:18:49,367 INFO namenode.FSImageFormatProtobuf: Image file /data/projects/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2020-05-27 19:18:49,399 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-05-27 19:18:49,416 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2020-05-27 19:18:49,416 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.110.151
************************************************************/
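Formatting is meant to be done once. Formatting again generates a new cluster ID, and a DataNode that still holds data from the old one will typically refuse to start until its data directory is cleared. A minimal guard, assuming the name directory from the log above (/data/projects/hadoop/tmp/dfs/name) and relying on the NameNode writing current/VERSION there on a successful format; is_formatted and format_once are hypothetical helper names.

```shell
# True if the NameNode storage directory already holds a formatted file system.
# Relies on the current/VERSION file that a successful format writes.
# is_formatted and format_once are hypothetical helper names.
is_formatted() {
    name_dir=$1
    [ -f "$name_dir/current/VERSION" ]
}

# Format only when the directory has not been formatted yet.
format_once() {
    name_dir=$1
    if is_formatted "$name_dir"; then
        echo "already formatted: $name_dir"
    else
        hdfs namenode -format
    fi
}
```

With the paths used in this guide the call would be format_once /data/projects/hadoop/tmp/dfs/name.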
Start the services: cd /data/projects/hadoop/sbin and run:
[hadoop@hadoop sbin]$ start-dfs.sh
Starting namenodes on [hadoop]
hadoop: Warning: Permanently added 'hadoop' (ECDSA) to the list of known hosts.
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting secondary namenodes [hadoop]
[hadoop@hadoop sbin]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[hadoop@hadoop sbin]$ jps
57681 NameNode
58020 SecondaryNameNode
57800 DataNode
58712 Jps
58380 NodeManager
58255 ResourceManager
If all six entries are present (the five Hadoop daemons plus Jps itself), the setup succeeded.
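Checking the jps list by eye can be scripted. The sketch below reads jps output on standard input and names any of the five daemons that is missing; check_daemons is a hypothetical helper, and the daemon list is simply the one shown in the output above.

```shell
# Read jps output on stdin and report which Hadoop daemons are missing.
# Returns 0 when all five are present, 1 otherwise.
# check_daemons is a hypothetical helper name.
check_daemons() {
    jps_output=$(cat)
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w keeps NameNode from matching inside SecondaryNameNode
        if ! printf '%s\n' "$jps_output" | grep -qw -e "$daemon"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return $missing
}
```

Typical use is jps | check_daemons && echo "all daemons running". If a daemon is reported missing, its log under $HADOOP_HOME/logs is usually the first place to look.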