  • [Cloudera Hadoop] CDH 4.0 Quick Start Guide (hands-on practice with the latest CDH 4.0, the enterprise Hadoop distribution)

    Cloudera has released the latest version, CDH 4.0.

    Before installing CDH 4.0, some system preparation is required. The specific steps are listed below.

    1. Verify the OS is supported by CDH4.0.

    2. Install JDK.

    ========================= Pre-installation Preparation: Start ===============================

    1. Prepare the operating system: since I plan to use Crowbar later to automatically install and deploy CDH4.0 and Cloudera Manager, I chose CentOS 6.2 64-bit Server and completed the installation following the standard CentOS 6.2 setup steps.

    2. After the operating system is installed, configure the network first: set a static IP address and configure the gateway (GW) and DNS so that the system configuration remains stable.

    2.1 Edit the IP address configuration file of the corresponding network interface
    # vi /etc/sysconfig/network-scripts/ifcfg-eth0

    #Set Static IP address
    DEVICE="eth0"
    NM_CONTROLLED="yes"
    ONBOOT=yes
    TYPE=Ethernet

    BOOTPROTO=static
    IPADDR=192.168.26.140
    NETMASK=255.255.255.0
    NETWORK=192.168.26.0
    BROADCAST=192.168.26.255

    DEFROUTE=yes
    IPV4_FAILURE_FATAL=yes
    IPV6INIT=no
    NAME="System eth0"
    UUID=5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
    HWADDR=00:0C:29:7A:CE:59
    PEERDNS=yes
    PEERROUTES=yes

    2.2 Change the gateway on CentOS
    Edit the configuration file that holds the gateway for the corresponding interface:
    [root@centos]# vi /etc/sysconfig/network

    Modify the following entries:
    NETWORKING=yes (whether the system uses networking; normally yes. If set to no, networking is disabled and many system services will fail to start)
    HOSTNAME=CDH4.0-Node-1 (the hostname of this machine; it must match the hostname configured in /etc/hosts, as shown in the sketch below)
    GATEWAY=192.168.26.2 (the IP address of the gateway this machine connects through; for example, 10.0.0.2)
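
    To keep the HOSTNAME above consistent with /etc/hosts, a minimal /etc/hosts entry might look like the sketch below (the IP address and hostname are simply the values used in this walkthrough; substitute your own):

    # vi /etc/hosts
    127.0.0.1        localhost localhost.localdomain
    192.168.26.140   CDH4.0-Node-1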

    2.3 Change the DNS server on CentOS
    # vi /etc/resolv.conf
    Modify the following entries:
    # Generated by NetworkManager
    domain localdomain
    search localdomain
    nameserver 192.168.26.2 

    2.4 The three key CentOS network configuration files

    [root@CDH4 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
    [root@CDH4 ~]# vi /etc/sysconfig/network
    [root@CDH4 ~]# vi /etc/resolv.conf
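
    After editing these three files, restart the network service so the changes take effect (a sketch for CentOS 6; adjust if your system manages networking differently):

    [root@CDH4 ~]# service network restart
    [root@CDH4 ~]# ifconfig eth0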

    2.5 Verify the CentOS network configuration:

    The gateway is reachable:

    [root@CDH4 ~]# ping 192.168.26.2
    PING 192.168.26.2 (192.168.26.2) 56(84) bytes of data.
    64 bytes from 192.168.26.2: icmp_seq=1 ttl=128 time=1.57 ms
    64 bytes from 192.168.26.2: icmp_seq=2 ttl=128 time=0.238 ms
    ^C
    --- 192.168.26.2 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1182ms
    rtt min/avg/max/mdev = 0.238/0.907/1.577/0.670 ms

    The egress (outbound) link is reachable:

    [root@CDH4 ~]# ping 192.168.88.1
    PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.
    64 bytes from 192.168.88.1: icmp_seq=1 ttl=128 time=2.64 ms
    64 bytes from 192.168.88.1: icmp_seq=2 ttl=128 time=3.20 ms
    ^C
    --- 192.168.88.1 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1630ms
    rtt min/avg/max/mdev = 2.644/2.924/3.204/0.280 ms

    The public Internet is reachable:

    [root@CDH4 ~]# ping www.baidu.com
    PING www.a.shifen.com (61.135.169.125) 56(84) bytes of data.
    64 bytes from 61.135.169.125: icmp_seq=1 ttl=128 time=33.9 ms
    64 bytes from 61.135.169.125: icmp_seq=2 ttl=128 time=33.3 ms

    3. Next, switch to a faster package mirror and upgrade the operating system to the latest packages.

    Regarding mirror speed, good mirrors within China include USTC (University of Science and Technology of China) and NetEase (mirrors.163.com).

    Switch the CentOS repository to the NetEase mirror (163.com):

    # Back up the original repo file
    # mv /etc/yum.repos.d/CentOS-Base.repo{,.bak}
    # Edit the repo file
    # vi /etc/yum.repos.d/CentOS-Base.repo

    # CentOS-Base.repo
    #
    # The mirror system uses the connecting IP address of the client and the
    # update status of each mirror to pick mirrors that are updated to and
    # geographically close to the client. You should use this for CentOS updates
    # unless you are manually picking other mirrors.
    #
    # If the mirrorlist= does not work for you, as a fall back you can try the
    # remarked out baseurl= line instead.
    #
    #

    [base]
    name=CentOS-$releasever - Base
    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
    baseurl=http://mirrors.163.com/centos/$releasever/os/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

    #released updates
    [updates]
    name=CentOS-$releasever - Updates
    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
    baseurl=http://mirrors.163.com/centos/$releasever/updates/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

    #additional packages that may be useful
    [extras]
    name=CentOS-$releasever - Extras
    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras
    baseurl=http://mirrors.163.com/centos/$releasever/extras/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

    #additional packages that extend functionality of existing packages
    [centosplus]
    name=CentOS-$releasever - Plus
    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=centosplus
    baseurl=http://mirrors.163.com/centos/$releasever/centosplus/$basearch/
    gpgcheck=1
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

    #contrib - packages by Centos Users
    [contrib]
    name=CentOS-$releasever - Contrib
    #mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=contrib
    baseurl=http://mirrors.163.com/centos/$releasever/contrib/$basearch/
    gpgcheck=1
    enabled=0
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6


    # yum clean all
    # yum makecache     # cache the repository metadata locally to speed up package search and install
    # yum install vim*  # quick test that the mirror is reachable and usable

    # yum update        # upgrade the system

    4. Install the JDK. Hadoop runs on the JVM, so this step is mandatory.

    4.1 Before installing CDH4.0, the Oracle JDK must be installed and must meet the requirements below: version 1.6.0_31 is recommended, all nodes in the cluster should run the same JDK version, and the JAVA_HOME environment variable must be set.

    Requirements:
    • CDH4 requires the Oracle JDK 1.6.0_8 at a minimum. Cloudera recommends version 1.6.0_31.

    After installing the JDK, and before installing and deploying CDH:

    • If you are deploying CDH on a cluster, make sure you have the same version of the Oracle JDK on each node.
    • Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as
      $ sudo env | grep JAVA_HOME

      It should be set to point to the directory where the JDK is installed, as shown in the example below.

    4.2 Install the Oracle JDK.

    Download the JDK: http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html

    Install the JDK.

    Set the environment variables: set JAVA_HOME and add $JAVA_HOME/bin to $PATH; a sketch for making these settings persistent follows the example below.

    • As the root user, set JAVA_HOME to the directory where the JDK is installed; for example:
       
      # export JAVA_HOME=<jdk-install-dir>
      # export PATH=$JAVA_HOME/bin:$PATH

      where <jdk-install-dir> might be something like /usr/java/jdk1.6.0_31, depending on the system configuration and where the JDK is actually installed.
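
      The two export commands above only affect the current shell. One way to make them persistent for every login shell, assuming a hypothetical install directory of /usr/java/jdk1.6.0_31 (adjust to your actual path), is a small profile script; this is a sketch, not part of the official guide:

      # vi /etc/profile.d/java.sh        # add the two export lines below
      export JAVA_HOME=/usr/java/jdk1.6.0_31
      export PATH=$JAVA_HOME/bin:$PATH
      # source /etc/profile.d/java.sh
      # echo $JAVA_HOME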

    5. Create a VM snapshot named 'OS Ready'; the system preparation is complete.

    ========================= Pre-installation Preparation: End ===============================

    ========================= Installing Cloudera Hadoop: Start ===============================


    Installing CDH4 with MRv1 on a Single Linux Node in Pseudo-distributed mode

    1. Download the CDH4 Package:

    For OS Version              Click this Link
    Red Hat/CentOS/Oracle 5     Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS 6 (32-bit)   Red Hat/CentOS 6 link (32-bit)
    Red Hat/CentOS 6 (64-bit)   Red Hat/CentOS 6 link (64-bit)

    2. Install the RPM:

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm

    3. Install CDH4 Hadoop with MRv1

    To install Hadoop with MRv1:

     
    $ sudo yum install hadoop-0.20-conf-pseudo


    ========================= Installing Cloudera Hadoop: End ===============================

    Before you install CDH4 on a single node, there are some important steps you need to do to prepare your system:

    1. Verify you are using a supported operating system for CDH4. See the next section: Supported Operating Systems for CDH4.
    2. If you haven't already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. See the section below: Install the Oracle Java Development Kit.
      Important
      On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES distribution; Hadoop will not run correctly with that version. Install the Oracle JDK following directions under Install the Oracle Java Development Kit.

    Supported Operating Systems for CDH4

    CDH4 supports the following operating systems:

    • For Red Hat-compatible systems, Cloudera provides:
      • 64-bit packages for Red Hat Enterprise Linux 5.7, CentOS 5.7, and Oracle Linux 5.6 with Unbreakable Enterprise Kernel.
      • 32-bit and 64-bit packages for Red Hat Enterprise Linux 6.2 and CentOS 6.2.
    • For SUSE systems, Cloudera provides 64-bit packages for SUSE Linux Enterprise Server 11 (SLES 11). Service pack 1 or later is required.
    • For Ubuntu systems, Cloudera provides 64-bit packages for the Long-Term Support (LTS) releases Lucid (10.04) and Precise (12.04).
    • For Debian systems, Cloudera provides 64-bit packages for Squeeze (6.0.3).
      Note
      Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.
      Important
      For production environments, 64-bit packages are recommended. Except as noted above, CDH4 provides only 64-bit packages.
    Note
    If you are using an operating system that is not supported by Cloudera's packages, you can also download source tarballs from Downloads.

    Install the Oracle Java Development Kit

    If you have already installed the Oracle JDK, skip this step and proceed to Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode.

    Install the Oracle Java Development Kit (JDK) before deploying CDH4.

    • To install the JDK, follow the instructions under Oracle JDK Installation. The completed installation must meet the requirements in the box below.
    • If you have already installed a version of the JDK, make sure your installation meets the requirements in the box below.
    Requirements:
    • CDH4 requires the Oracle JDK 1.6.0_8 at a minimum. Cloudera recommends version 1.6.0_31.

    After installing the JDK, and before installing and deploying CDH:

    • If you are deploying CDH on a cluster, make sure you have the same version of the Oracle JDK on each node.
    • Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as
      $ sudo env | grep JAVA_HOME

      It should be set to point to the directory where the JDK is installed, as shown in the example below.

    You may be able to install the Oracle JDK with your package manager, depending on your choice of operating system.

    Oracle JDK Installation

    Important
    The Oracle JDK installer is available both as an RPM-based installer (note the "-rpm" modifier before the bin file extension) for RPM-based systems, and as a binary installer for other systems. Make sure you install the jdk-6uXX-linux-x64-rpm.bin file for 64-bit systems, or jdk-6uXX-linux-i586-rpm.bin file for 32-bit systems.

    On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES distribution; Hadoop will not run correctly with that version. Install the Oracle JDK by following the instructions below.


    To install the Oracle JDK:

    1. Download one of the recommended versions of the Oracle JDK from this page, which you can also reach by going to the Java SE Downloads page and clicking on the Previous Releases tab and then on the Java SE 6 link. (These links and directions were correct at the time of writing, but the page is restructured frequently.)
    2. Install the Oracle JDK following the directions on the Java SE Downloads page.
    3. As the root user, set JAVA_HOME to the directory where the JDK is installed; for example:
      # export JAVA_HOME=<jdk-install-dir>
      # export PATH=$JAVA_HOME/bin:$PATH

      where <jdk-install-dir> might be something like /usr/java/jdk1.6.0_31, depending on the system configuration and where the JDK is actually installed.


    You can evaluate CDH4 by quickly installing Apache Hadoop and CDH4 components on a single Linux node in pseudo-distributed mode. In pseudo-distributed mode, Hadoop processing is distributed over all of the cores/processors on a single machine. Hadoop writes all files to the Hadoop Distributed File System (HDFS), and all services and daemons communicate over local TCP sockets for inter-process communication.

    Note
    CDH4 is based on Apache Hadoop 2.0, which (among other features) introduces a new generation of MapReduce — MapReduce 2.0, known as MRv2 or YARN. (See Apache Hadoop NextGen MapReduce (YARN) for more information about YARN.) CDH4 also supports an implementation of the previous version of MapReduce, now referred to as MapReduce version 1 (MRv1).
    Important
    For installations in pseudo-distributed mode, there are separate conf-pseudo packages for an installation that includes MRv1 (hadoop-0.20-conf-pseudo) or an installation that includes YARN (hadoop-conf-pseudo). Only one conf-pseudo package can be installed at a time: if you want to change from one to the other, you must uninstall the one currently installed.
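
    To check which of the two conf-pseudo packages (if either) is already present on an RPM-based system, a quick query such as the following can help (a sketch; on Ubuntu or Debian use dpkg -l instead):

    $ rpm -q hadoop-0.20-conf-pseudo hadoop-conf-pseudo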

    Installing CDH4 with MRv1 on a Single Linux Node in Pseudo-distributed mode

    Important
    If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. Follow these instructions.

    On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

    Download the CDH4 Package

    1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
      For OS Version              Click this Link
      Red Hat/CentOS/Oracle 5     Red Hat/CentOS/Oracle 5 link
      Red Hat/CentOS 6 (32-bit)   Red Hat/CentOS 6 link (32-bit)
      Red Hat/CentOS 6 (64-bit)   Red Hat/CentOS 6 link (64-bit)
    2. Install the RPM:
       
      $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm
    Note
    For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

    Install CDH4

    1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:

    2. Install Hadoop in pseudo-distributed mode:

      To install Hadoop with MRv1:
       
      $ sudo yum install hadoop-0.20-conf-pseudo

    Starting Hadoop and Verifying it is Working Properly:

    For MRv1, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, jobtracker, secondarynamenode, datanode, and tasktracker.

    To verify the files installed by the hadoop-0.20-conf-pseudo package on your system:

    • To view the files on Red Hat or SUSE systems:
      $ rpm -ql hadoop-0.20-conf-pseudo
    • To view the files on Ubuntu systems:
      $ dpkg -L hadoop-0.20-conf-pseudo

    The new configuration is self-contained in the /etc/hadoop/conf.pseudo.mr1 directory.

    Note
    The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.
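
    To see which configuration directory /etc/hadoop/conf currently resolves to, the alternatives tool can be queried; the sketch below assumes the alternative is registered under the name hadoop-conf (on Ubuntu or Debian use update-alternatives):

    $ sudo alternatives --display hadoop-conf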

    To start Hadoop, proceed as follows.

    Step 1: Format the NameNode.

    Before starting the NameNode for the first time you must format the file system.

    $ sudo -u hdfs hdfs namenode -format
    Note
    Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.
    Important
    In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH4, you must do this explicitly.

    Step 2: Start HDFS

    $ for service in /etc/init.d/hadoop-hdfs-* 
    > do
    > sudo $service start
    > done

    To verify services have started, you can check the web console. The NameNode provides a web console http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.
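
    Besides opening the web console in a browser, a quick command-line sanity check is possible; the sketch below assumes the jps tool from the JDK is on root's PATH and that curl is installed:

    $ sudo jps                                # should list NameNode, SecondaryNameNode and DataNode
    $ curl -s http://localhost:50070/ | head  # the NameNode web UI should respond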

    Step 3: Create the /tmp Directory

    Create the /tmp directory and set permissions:

    Important
    If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

    Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

    $ sudo -u hdfs hadoop fs -mkdir /tmp
    $ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
    Note
    This is the root of hadoop.tmp.dir (/tmp/hadoop-${user.name} by default) which is used both for the local file system and HDFS.

    Step 4: Create the MapReduce system directories:

    sudo -u hdfs hadoop fs -mkdir /var
    sudo -u hdfs hadoop fs -mkdir /var/lib
    sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
    sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
    sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
    sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
    sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
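
    If the hadoop fs -mkdir command in your CDH4 build accepts the -p flag to create parent directories (an assumption worth checking with hadoop fs -help first), the chain of mkdir calls above collapses to a single command:

    sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred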

    Step 5: Verify the HDFS File Structure

    $ sudo -u hdfs hadoop fs -ls -R /

    You should see:

    drwxrwxrwt   - hdfs supergroup          0 2012-04-19 15:14 /tmp
    drwxr-xr-x   - hdfs supergroup          0 2012-04-19 15:26 /user
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
    drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
    drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
    drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

    Step 6: Start MapReduce

    $ for service in /etc/init.d/hadoop-0.20-mapreduce-*
    > do
    > sudo $service start
    > done

    To verify services have started, you can check the web console. The JobTracker provides a web console http://localhost:50030/ for viewing and running completed and failed jobs with logs.
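
    At this point all five MRv1 daemons should be running. One way to confirm from the shell, assuming the init scripts support the status action, is:

    $ for service in /etc/init.d/hadoop-hdfs-* /etc/init.d/hadoop-0.20-mapreduce-*
    > do
    > sudo $service status
    > done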

    Step 7: Create User Directories

    Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

    $ sudo -u hdfs hadoop fs -mkdir  /user/<user>
    $ sudo -u hdfs hadoop fs -chown <user> /user/<user>

    where <user> is the Linux username of each user.

    Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

    sudo -u hdfs hadoop fs -mkdir /user/$USER
    sudo -u hdfs hadoop fs -chown $USER /user/$USER
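
    A small loop can apply the same two commands to several users in one pass; the usernames below are placeholders:

    $ for u in joe alice bob
    > do
    > sudo -u hdfs hadoop fs -mkdir /user/$u
    > sudo -u hdfs hadoop fs -chown $u /user/$u
    > done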

    Running an example application with MRv1

    1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
      sudo -u hdfs hadoop fs -mkdir /user/joe
      sudo -u hdfs hadoop fs -chown joe /user/joe

      Do the following steps as the user joe.

    2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
      $ hadoop fs -mkdir input
      $ hadoop fs -put /etc/hadoop/conf/*.xml input
      $ hadoop fs -ls input
      Found 3 items:
      -rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
      -rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
      -rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml
    3. Run an example Hadoop job to grep with a regular expression in your input data.
      $ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
    4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
      $ hadoop fs -ls
      Found 2 items
      drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
      drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output

      You can see that there is a new directory called output.

    5. List the output files.
      $ hadoop fs -ls output
      Found 3 items
      drwxr-xr-x  -  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_logs
      -rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output/part-00000
      -rw-r--r--  1  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_SUCCESS
    6. Read the results in the output file; for example:
      $ hadoop fs -cat output/part-00000 | head
      1       dfs.datanode.data.dir
      1       dfs.namenode.checkpoint.dir
      1       dfs.namenode.name.dir
      1       dfs.replication
      1       dfs.safemode.extension
      1       dfs.safemode.min.datanodes

    Installing CDH4 with YARN on a Single Linux Node in Pseudo-distributed mode

    Before you start, uninstall MRv1 if necessary

    If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.

    1. Stop the daemons:
      $ for service in /etc/init.d/hadoop-hdfs-* /etc/init.d/hadoop-0.20-mapreduce-*
      > do
      > sudo $service stop
      > done
    2. Remove hadoop-0.20-conf-pseudo:
    • On Red Hat-compatible systems:
      sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On SUSE systems:
      sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On Ubuntu or Debian systems:
      sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    Note
    In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.
    Important
    If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. Follow these instructions.

    On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

    Download the CDH4 Package

    1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
      For OS Version              Click this Link
      Red Hat/CentOS/Oracle 5     Red Hat/CentOS/Oracle 5 link
      Red Hat/CentOS 6 (32-bit)   Red Hat/CentOS 6 link (32-bit)
      Red Hat/CentOS 6 (64-bit)   Red Hat/CentOS 6 link (64-bit)
    2. Install the RPM:
      $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm
    Note
    For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

    Install CDH4

    1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

    2. Install Hadoop in pseudo-distributed mode:

      To install Hadoop with YARN:
      $ sudo yum install hadoop-conf-pseudo

    On SUSE systems, do the following:

    Download and install the CDH4 package

    1. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).

    2. Install the RPM:
      $ sudo rpm -i cloudera-cdh-4-0.noarch.rpm
    Note
    For instructions on how to add a CDH4 SUSE repository or build your own CDH4 SUSE repository, see Installing CDH4 On SUSE systems.

    Install CDH4

    1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    2. Install Hadoop in pseudo-distributed mode:

      To install Hadoop with YARN:
      $ sudo zypper install hadoop-conf-pseudo

    On Ubuntu and other Debian systems, do the following:

    Download and install the package

    1. Click one of the following:
      this link for a Squeeze system, or
      this link for a Lucid system
      this link for a Precise system.

    2. Install the package. Do one of the following:
      Choose Open with in the download window to use the package manager, or
      Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
      sudo dpkg -i Downloads/cdh4-repository_1.0_all.deb
    Note
    For instructions on how to add a CDH4 Debian repository or build your own CDH4 Debian repository, see Installing CDH4 On Ubuntu or Debian systems.

    Install CDH4

    1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
      • For Ubuntu Lucid systems:
        $ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
      • For Ubuntu Precise systems:
        $ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
      • For Debian Squeeze systems:
        $ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
    2. Install Hadoop in pseudo-distributed mode:

      To install Hadoop with YARN:
      $ sudo apt-get update
      $ sudo apt-get install hadoop-conf-pseudo

    Starting Hadoop and Verifying it is Working Properly

    For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.

    • To view the files on Red Hat or SUSE systems:
      $ rpm -ql hadoop-conf-pseudo
    • To view the files on Ubuntu systems:
      $ dpkg -L hadoop-conf-pseudo

    The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.

    Note
    The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.

    To start Hadoop, proceed as follows.

    Step 1: Format the NameNode.

    Before starting the NameNode for the first time you must format the file system.

    $ sudo -u hdfs hdfs namenode -format
    Note
    Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.
    Important
    In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH4, you must do this explicitly.

    Step 2: Start HDFS

    $ for service in /etc/init.d/hadoop-hdfs-* 
    > do
    > sudo $service start
    > done

    To verify services have started, you can check the web console. The NameNode provides a web console http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.

    Step 3: Create the /tmp Directory

    1. Remove the old /tmp if it exists:
      sudo -u hdfs hadoop fs -rmr /tmp
    2. Create a new /tmp directory and set permissions:
      sudo -u hdfs hadoop fs -mkdir /tmp
      sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

    Step 4: Create User, Staging, and Log Directories

    1. Create a user directory and set ownership:
      sudo -u hdfs hadoop fs -mkdir /user/mydir
      sudo -u hdfs hadoop fs -chown myuser:myuser /user/mydir
    2. Create the /var/log/hadoop-yarn directory and set ownership:
      sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
      sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
    3. Create the staging directory and set permissions:
      sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging
      sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
    4. Create the done_intermediate directory under the staging directory and set permissions:
      sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
      sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
    5. Change ownership on the staging directory and subdirectory:
      sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging

    Step 5: Verify the HDFS File Structure:

    Run the following command:

    $ sudo -u hdfs hadoop fs -ls -R /

    You should see the following directory structure:

    drwxrwxrwt   - hdfs   supergroup        0 2012-05-31 15:31 /tmp
    drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /tmp/hadoop-yarn
    drwxrwxrwt   - mapred mapred            0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
    drwxr-xr-x   - mapred mapred            0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
    drwxrwxrwt   - mapred mapred            0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
    drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /user
    drwxr-xr-x   - myuser myuser            0 2012-05-31 15:30 /user/mydir
    drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var
    drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var/log
    drwxr-xr-x   - yarn   mapred            0 2012-05-31 15:31 /var/log/hadoop-yarn

    Step 6: Start YARN

    sudo /etc/init.d/hadoop-yarn-resourcemanager start
    sudo /etc/init.d/hadoop-yarn-nodemanager start
    sudo /etc/init.d/hadoop-mapreduce-historyserver start
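
    To confirm the YARN daemons started, the same init scripts can be queried for status (a sketch, assuming they support the status action); the ResourceManager also exposes a web UI, typically at http://localhost:8088/:

    sudo /etc/init.d/hadoop-yarn-resourcemanager status
    sudo /etc/init.d/hadoop-yarn-nodemanager status
    sudo /etc/init.d/hadoop-mapreduce-historyserver status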

    Step 7: Create User Directories

    Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

    $ sudo -u hdfs hadoop fs -mkdir  /user/<user>
    $ sudo -u hdfs hadoop fs -chown <user> /user/<user>

    where <user> is the Linux username of each user.

    Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

    sudo -u hdfs hadoop fs -mkdir /user/$USER
    sudo -u hdfs hadoop fs -chown $USER /user/$USER

    Running an example application with YARN

      1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
        sudo -u hdfs hadoop fs -mkdir /user/joe
        sudo -u hdfs hadoop fs -chown joe /user/joe

        Do the following steps as the user joe.

      2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
        $ hadoop fs -mkdir input
        $ hadoop fs -put /etc/hadoop/conf/*.xml input
        $ hadoop fs -ls input
        Found 3 items:
        -rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
        -rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
        -rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml
      3. Set HADOOP_MAPRED_HOME for user joe:
        $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
      4. Run an example Hadoop job to grep with a regular expression in your input data.
        $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
      5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
        $ hadoop fs -ls
        Found 2 items
        drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
        drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output23


        You can see that there is a new directory called output23.

      6. List the output files.
        $ hadoop fs -ls output23
        Found 2 items
        drwxr-xr-x  -  joe supergroup     0 2009-02-25 10:33   /user/joe/output23/_SUCCESS
        -rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output23/part-r-00000
      7. Read the results in the output file.
        $ hadoop fs -cat output23/part-r-00000 | head
        1    dfs.safemode.min.datanodes
        1    dfs.safemode.extension
        1    dfs.replication
        1    dfs.permissions.enabled
        1    dfs.namenode.name.dir
        1    dfs.namenode.checkpoint.dir
        1    dfs.datanode.data.dir
