  • Setting up a Hadoop environment and running the wordcount program

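        Purpose: I have previously studied some basic machine learning algorithms. In real enterprise applications the algorithms are the core, while the runtime environment and the data processing platform are the foundation.

        Approach: build a simple Hadoop cluster (on virtual machines on my own laptop, because of hardware limits).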

    I. Base environment


    Windows 10

    VMware 15.0.0

    3 Ubuntu virtual machines (1 as master, the other 2 as slave1 and slave2)

    hadoop2.8.5

    jdk1.8

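    II. Setup steps


    1. Install VMware and then Ubuntu. Install a single VM first; once it is fully configured it can simply be cloned (not covered in detail here; see other guides).

    2. Basic Linux configuration

    a) Create a user named test to perform all installation steps:

          sudo useradd -m -s /bin/bash test

          sudo passwd test

    b) Install base software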

    1. Basic tools
          . sudo apt-get install vim    (text editor)
    
          . sudo apt-get install openssh-client openssh-server  (OpenSSH client and server, for logging in to the machines over ssh)
    
          . sudo apt-get install nfs-common  (for NFS mounting)
    
          . sudo apt-get install git  (git)
    
    2. Set up an NFS service on Ubuntu for mounting
          . sudo apt-get install nfs-kernel-server       (install the NFS server)
          
          . sudo mkdir /nfsroot
          
          . sudo chmod 777 /nfsroot         (create the /nfsroot folder as the export directory)
    
          . sudo vim /etc/exports           (configure the export directory)
    
           Add the line below to /etc/exports:
            
             /nfsroot *(rw,sync,no_root_squash)
        
          . sudo service nfs-kernel-server restart  (restart the NFS service)
         
    3. Set up a Samba service to share folders with Windows
         . sudo apt-get install samba smbclient     (install the necessary tools)
    
         . sudo vim /etc/samba/smb.conf             (configure the Samba server)
    
         . Add the following lines to /etc/samba/smb.conf:
         
            [nfsroot]
            comment = nfsroot
            path = /nfsroot
            public = yes
            guest ok = yes
            browseable = yes
            writeable = yes
            
        . sudo service smbd restart  (restart the samba service)

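    c) Set up passwordless SSH between the servers (public/private key authentication)

           ssh-keygen -t rsa # press Enter at every prompt

           cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # authorize the key

          Because the key pair and authorized_keys are cloned along with the VM, every node ends up trusting the same key. Once all nodes have been cloned, test the login:  ssh test@192.168.xx.xxx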
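
          Running the append step more than once leaves duplicate lines in authorized_keys, and sshd may ignore the file if its permissions are too open. The sketch below is one way to make the step repeatable; authorize_key is a hypothetical helper name, not part of OpenSSH, and the paths in the usage comment assume the default key location.

```shell
#!/usr/bin/env bash
# authorize_key PUBKEY_FILE AUTH_FILE   (hypothetical helper)
# Appends the public key to AUTH_FILE only if it is not already listed,
# then tightens the permissions on the file and its directory.
authorize_key() {
    local pubkey_file=$1 auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -qxF "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Typical use as user test, before cloning the VM:
#   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh test@master    # after cloning: should not ask for a password
```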

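    3. Install Java and Hadoop

             Download JDK 1.8, unpack it under /opt/java, and set the environment variables (test with: java -version)

             Download Hadoop 2.8.5, unpack it under /opt/hadoop, and set the environment variables (test with: hadoop version)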
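
             The environment variables themselves are not listed above. A minimal sketch, assuming the archives were unpacked directly into /opt/java and /opt/hadoop as described (the shell prompts later in this post show /opt/hadoop-2.8.5, so adjust the paths to the real location):

```shell
# Lines to append to ~/.bashrc of user test.
# Both paths are assumptions based on the directories named in this step.
export JAVA_HOME=/opt/java
export HADOOP_HOME=/opt/hadoop
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

# If the Hadoop daemons later complain that JAVA_HOME is not set, the same
# JAVA_HOME line can also be placed in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
```

             After reloading the shell (source ~/.bashrc), java -version and hadoop version should both resolve from any directory.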

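    4. Clone the configured Linux VM

          VMware can clone a virtual machine, and the clone keeps all of the configuration.

          Set the host name on each machine (sudo hostname only lasts until the next reboot; on Ubuntu releases that use systemd, sudo hostnamectl set-hostname <name> makes it persistent):

    sudo hostname master  (master node)

    sudo hostname slave1  (slave node)

    sudo hostname slave2  (slave node)

         Give each machine a static IP address and add name-resolution entries. Edit /etc/hosts (sudo vim /etc/hosts) and add: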

    127.0.0.1       localhost

    192.168.61.100   master
    192.168.61.101   slave1
    192.168.61.102   slave2

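      You can configure one machine first and then copy the configuration to the other two with scp. The same applies to the HDFS, YARN, and MapReduce configuration below.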
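
      The scp step could look like the sketch below. push_conf is a hypothetical helper name, and HADOOP_HOME defaulting to /opt/hadoop is an assumption; the host names and the test user come from this walkthrough.

```shell
# Hypothetical helper: push the Hadoop config directory from master to each slave.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}   # assumed install location
SLAVES="slave1 slave2"

push_conf() {
    local host
    for host in $SLAVES; do
        # -r copies the directory recursively; relies on the passwordless
        # SSH login for user test set up earlier
        scp -r "$HADOOP_HOME/etc/hadoop" "test@$host:$HADOOP_HOME/etc/" || return 1
    done
}

# Run on master after editing the XML files:
#   push_conf
```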

    5. Configure HDFS

           In the Hadoop installation directory, edit ./etc/hadoop/core-site.xml:

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/test/hadoop-2.8.5/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

    Then edit the HDFS settings: vim ./etc/hadoop/hdfs-site.xml

    <configuration>
    <property>
    <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/opt/hadoop-2.8.5/hdfs/name</value>
      <final>true</final>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/opt/hadoop-2.8.5/hdfs/data</value>
      <final>true</final>
    </property>
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>master:9001</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.permissions</name>
      <value>false</value>
    </property>
    </configuration>

    6. Configure YARN (./etc/hadoop/yarn-site.xml)

    <configuration>
    
    <!-- Site specific YARN configuration properties -->
    <property>
    <name>yarn.resourcemanager.address</name>
      <value>master:18040</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master:18030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>master:18088</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master:18025</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>master:18141</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>3.0</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>1</value>
    </property>
    <property>
      <name>yarn.nodemanager.localizer.address</name>
      <value>0.0.0.0:8040</value>
    </property>
    <property>
      <description>The address of the container manager in the NM.</description>
      <name>yarn.nodemanager.address</name>
      <value>0.0.0.0:8041</value>
    </property>
    <property>
      <description>NM Webapp address.</description>
      <name>yarn.nodemanager.webapp.address</name>
      <value>0.0.0.0:8042</value>
    </property>
    </configuration>

    7. Configure MapReduce (./etc/hadoop/mapred-site.xml, copied from mapred-site.xml.template)

    <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>1024</value>
    </property>
    
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
    </property>
    
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>1024</value>
    </property>
    
    </configuration>
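
    Two preparation steps are not shown in this post but are normally needed before a Hadoop 2.x cluster starts for the first time: listing the worker hosts in etc/hadoop/slaves, and formatting the NameNode. The sketch below assumes HADOOP_HOME points at the installation directory; write_slaves_file and first_start are hypothetical helper names.

```shell
# Sketch of first-start preparation, to be run on master as user test.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}   # assumed install location

write_slaves_file() {
    # One worker host name per line; the Hadoop 2.x start scripts read this
    # file to decide where to launch DataNode and NodeManager.
    printf '%s\n' slave1 slave2 > "$1"
}

first_start() {
    write_slaves_file "$HADOOP_HOME/etc/hadoop/slaves" || return 1
    # Format the NameNode once only: formatting again wipes the HDFS metadata.
    "$HADOOP_HOME/bin/hdfs" namenode -format || return 1
    # start-all.sh is deprecated in Hadoop 2.x and runs these two scripts.
    "$HADOOP_HOME/sbin/start-dfs.sh" && "$HADOOP_HOME/sbin/start-yarn.sh"
}

# Usage:
#   first_start
```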

    8. Test:

    On the master node, run ./sbin/start-all.sh

    jps should then show the NameNode on master and a DataNode on each slave (output below)

    test@master:/opt/hadoop-2.8.5$ jps
    8960 Jps
    7940 NameNode
    8373 ResourceManager
    8206 SecondaryNameNode

    The output on slave2 is:

    test@slave2:/opt/hadoop-2.8.5/logs$ jps
    7301 Jps
    6938 NodeManager
    6767 DataNode

    III. The wordcount program

             Once start-all.sh has finished, the wordcount example that ships with Hadoop can be run.

    1. Upload the input files to the wc_input directory on HDFS
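
    The post gives no command for the upload. A minimal sketch using the standard hdfs dfs subcommands, assuming hdfs is on the PATH; upload_input is a hypothetical helper name, and a.txt and b.txt stand in for whatever local text files you want counted:

```shell
# upload_input LOCAL_FILE...   (hypothetical helper)
# Creates /wc_input on HDFS if needed and copies the local files into it.
upload_input() {
    hdfs dfs -mkdir -p /wc_input || return 1
    hdfs dfs -put "$@" /wc_input/
}

# Usage with two placeholder files:
#   upload_input a.txt b.txt
#   hdfs dfs -ls /wc_input
```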

    2. Run the example program

    ./bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /wc_input /wc_output.out7

    3. The job output is as follows:

    18/10/21 16:13:18 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.61.100:18040
    18/10/21 16:13:20 INFO input.FileInputFormat: Total input files to process : 2
    18/10/21 16:13:20 INFO mapreduce.JobSubmitter: number of splits:2
    18/10/21 16:13:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540109557238_0001
    18/10/21 16:13:21 INFO impl.YarnClientImpl: Submitted application application_1540109557238_0001
    18/10/21 16:13:21 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1540109557238_0001/
    18/10/21 16:13:21 INFO mapreduce.Job: Running job: job_1540109557238_0001
    18/10/21 16:13:35 INFO mapreduce.Job: Job job_1540109557238_0001 running in uber mode : false
    18/10/21 16:13:35 INFO mapreduce.Job:  map 0% reduce 0%
    18/10/21 16:13:42 INFO mapreduce.Job:  map 50% reduce 0%
    18/10/21 16:13:46 INFO mapreduce.Job:  map 100% reduce 0%
    18/10/21 16:13:51 INFO mapreduce.Job:  map 100% reduce 100%
    18/10/21 16:13:52 INFO mapreduce.Job: Job job_1540109557238_0001 completed successfully
    18/10/21 16:13:52 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=93
                    FILE: Number of bytes written=473483
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=242
                    HDFS: Number of bytes written=39
                    HDFS: Number of read operations=9
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters
                    Launched map tasks=2
                    Launched reduce tasks=1
                    Data-local map tasks=2
                    Total time spent by all maps in occupied slots (ms)=7691
                    Total time spent by all reduces in occupied slots (ms)=3635
                    Total time spent by all map tasks (ms)=7691
                    Total time spent by all reduce tasks (ms)=3635
                    Total vcore-milliseconds taken by all map tasks=7691
                    Total vcore-milliseconds taken by all reduce tasks=3635
                    Total megabyte-milliseconds taken by all map tasks=7875584
                    Total megabyte-milliseconds taken by all reduce tasks=3722240
            Map-Reduce Framework
                    Map input records=3
                    Map output records=8
                    Map output bytes=71
                    Map output materialized bytes=99
                    Input split bytes=203
                    Combine input records=8
                    Combine output records=8
                    Reduce input groups=6
                    Reduce shuffle bytes=99
                    Reduce input records=8
                    Reduce output records=6
                    Spilled Records=16
                    Shuffled Maps =2
                    Failed Shuffles=0
                    Merged Map outputs=2
                    GC time elapsed (ms)=178
                    CPU time spent (ms)=2180
                    Physical memory (bytes) snapshot=721473536
                    Virtual memory (bytes) snapshot=5936779264
                    Total committed heap usage (bytes)=474480640
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters
                    Bytes Read=39
            File Output Format Counters
                    Bytes Written=39
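
    The log above reports counters only; the word counts themselves are written to the output directory on HDFS. With a single reducer they should be in a file named part-r-00000. To sanity-check them, the same counts can be computed locally with standard tools. The sketch below is an approximation of what the example job does (split on whitespace, count each token); local_wordcount is a hypothetical helper name.

```shell
# local_wordcount FILE...   (hypothetical helper)
# Prints "word<TAB>count" sorted by word, the same layout as the job output.
local_wordcount() {
    cat "$@" \
        | tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{print $2 "\t" $1}'
}

# Compare against the cluster result (file names are placeholders):
#   hdfs dfs -cat /wc_output.out7/part-r-00000
#   local_wordcount a.txt b.txt
```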
    

    Note: problems are inevitable during configuration, installation, and execution, so learn to read the logs to resolve them.

    References: https://blog.csdn.net/xiao_bai_9527/article/details/79167562

    https://blog.csdn.net/qinzhaokun/article/details/47804923

  • Original article: https://www.cnblogs.com/NeilZhang/p/9825675.html