  • Hadoop Practice Diary from Scratch (Setting Up Hadoop and Hive)

    2014-3-10

    【Requirements】

    The new job involves processing massive amounts of data. The first step is to use tooling to produce some operational statistics. Hadoop was chosen so that, as the data volume grows, more machines can simply be added without touching the statistics logic.

    The Hadoop community is very active and new tools keep appearing around it. Below is a first look at some of the popular ones:

    • Data storage

    hadoop, comprising HDFS and MapReduce
    hbase, a big-table store, requires ZooKeeper
    zookeeper, distributed coordination and cluster management, abbreviated zk

    • Data transfer

    flume/scribe/Chukwa, distributed log collection systems that aggregate logs from many machines onto a single node
    sqoop, data transfer between traditional databases and HDFS/HBase

    • Main query interfaces

    hive, a SQL query interface
    pig, a scripting query interface
    Hadoop Streaming, MapReduce over standard input/output, with the logic written in a scripting language (shell/python/php/ruby)
    Hadoop Pipes, socket-based input/output, with the logic written in C++

    • Other supporting tools

    avro, a serialization tool
    oozie, chains several MapReduce jobs into a workflow
    snappy, a compression codec
    mahout, a machine learning library

    Of course, newer tools keep emerging, such as Spark and Impala. The current need is to stand up a single-node pseudo-distributed Hadoop, and then switch fairly smoothly to a multi-node cluster once the business data volume grows.

    Following the Internet philosophy of small, fast iterations, and without much hands-on experience yet, the plan is to start with a simple setup and then keep tuning it and adding new tools later.

    The initial setup includes: hadoop (required for everything), hive (convenient querying), a thin wrapper around Hadoop Streaming (so scripting languages can be used; a minimal streaming sketch appears after the YARN example below), sqoop (importing data from traditional databases), and pig (worth trying, not required).

    Later additions may include zookeeper, hbase (for tables with hundreds of millions of rows), mahout (machine learning), and so on.

    [PS] The working environment is 64-bit Ubuntu 12.04; offline experiments run on the desktop edition, the real deployment on the server edition.

    【Java 7】

    First, install Java. Ubuntu may ship with OpenJDK by default, but Oracle Java is preferable, so remove OpenJDK first:

    sudo apt-get purge openjdk*
    sudo apt-get install software-properties-common
    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java7-installer

    After installation, add JAVA_HOME to the environment variables. Among the many files for setting environment variables (/etc/profile, ~/.bashrc, etc.), Ubuntu recommends /etc/environment, but that caused all sorts of odd problems for me, possibly a formatting issue, so for now I recommend /etc/profile, which is also what Programming Hive uses.

    export JAVA_HOME="/usr/lib/jvm/java-7-oracle/"
    export JRE_HOME=$JAVA_HOME/jre  
    export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib  
    export PATH=$JAVA_HOME/bin:$PATH
    . /etc/profile
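
    A quick sanity check that the JDK and the environment variables are in place (the exact version string depends on the build the installer fetched):

    echo $JAVA_HOME   # should print /usr/lib/jvm/java-7-oracle/
    java -version     # should report a 1.7.0_xx HotSpot build
    which java        # should resolve under $JAVA_HOME/bin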

    【Hadoop Setup】

    There are several distributions: the official Apache release, free open-source distributions from Cloudera, MapR and the like, plus commercial editions that are out of reach anyway and not discussed here. One data point: roughly 75% of companies in China use Cloudera, mainly because it is convenient; see the linked notes on installing the official Hadoop release via Cloudera.

    I first tried the Apache Hadoop release, building the latest stable version 2.2.0 (as of 2014/3/10) on 64-bit Ubuntu, only to find that the prebuilt native libraries do not support 64-bit, so they have to be compiled locally. That rabbit hole goes deep: compiling a system like this is always missing something, either build tools or dependencies. In practice I hit bug after bug, problems kept piling up on the stack, and the cost of using it was considerable. It took about a day to get Hadoop up, and the result felt so cobbled together it might collapse at any moment. So I decided to switch to Cloudera's CDH4.

    Cloudera's official website lists the supported hardware and software, and my 64-bit Ubuntu 12.04 is on the list (checked with uname -a and cat /etc/issue).
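
    For reference, the two checks look like this (the exact output naturally depends on the machine):

    uname -a          # x86_64 in the output means a 64-bit kernel
    cat /etc/issue    # should show Ubuntu 12.04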

    The official pseudo-distributed installation guide comes down to just a few steps:

    wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
    sudo dpkg -i cdh4-repository_1.0_all.deb
    curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    sudo apt-get update
    sudo apt-get install hadoop-conf-pseudo # this lists the packages to be installed, including zookeeper; it can be slow depending on bandwidth, so run it under nohup in the background if needed
    sudo -u hdfs hdfs namenode -format # format the NameNode

    # Start HDFS

    for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
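
    Before moving on, it is worth confirming that the HDFS daemons actually came up. A rough check (jps run as root can see JVMs from other users; 50070 is the default NameNode web UI port):

    sudo jps | grep -iE 'namenode|datanode'
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070/   # expect 200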

    Create the /tmp directory and the YARN staging and log directories:

    sudo -u hdfs hadoop fs -rm -r /tmp
    sudo -u hdfs hadoop fs -mkdir /tmp
    sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
    sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging
    sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
    sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
    sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
    sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
    sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
    sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

    # Check the directories

    sudo -u hdfs hadoop fs -ls -R /

    The output should be:

    drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
    drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
    drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
    drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
    drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
    drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
    drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
    drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn

     

    # Start YARN

    sudo service hadoop-yarn-resourcemanager start 
    sudo service hadoop-yarn-nodemanager start 
    sudo service hadoop-mapreduce-historyserver start
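
    As with HDFS, a rough check that the YARN daemons are up (8088 is the default ResourceManager web UI port, 19888 the JobHistory server):

    sudo jps | grep -iE 'resourcemanager|nodemanager|jobhistoryserver'
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8088/    # expect 200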

    # Create User Directories

    
    
    sudo -u hdfs hadoop fs -mkdir /user/danny
    sudo -u hdfs hadoop fs -chown danny /user/danny

    In general the commands are:

    sudo -u hdfs hadoop fs -mkdir /user/<user>
    sudo -u hdfs hadoop fs -chown <user> /user/<user>

    # Running an example application with YARN

    hadoop fs -mkdir input
    hadoop fs -put /etc/hadoop/conf/*.xml input
    hadoop fs -ls input
    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
    hadoop fs -ls 
    hadoop fs -ls output23 
    hadoop fs -cat output23/part-r-00000 | head

    The output should be:

    1 dfs.safemode.min.datanodes
    1 dfs.safemode.extension
    1 dfs.replication
    1 dfs.permissions.enabled
    1 dfs.namenode.name.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.datanode.data.dir
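
    Since the plan above also calls for a thin wrapper around Hadoop Streaming, here is a minimal sketch of a streaming job run against the same input directory, reusing standard Unix tools as mapper and reducer. The jar path is the one shipped by the CDH4 hadoop-mapreduce package and may differ in other releases; output_streaming is just an illustrative output directory:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input input \
        -output output_streaming \
        -mapper /bin/cat \
        -reducer /usr/bin/wc
    hadoop fs -cat output_streaming/part-00000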

    【Hive】

    Install Hive and MySQL:

    sudo apt-get install hive hive-metastore hive-server
    sudo apt-get install mysql-server
    sudo service mysql start

    To change the root password if needed:

    $ sudo /usr/bin/mysql_secure_installation
    [...]
    Enter current password for root (enter for none):
    OK, successfully used password, moving on...
    [...]
    Set root password? [Y/n] y
    New password:
    Re-enter new password:
    Remove anonymous users? [Y/n] Y
    [...]
    Disallow root login remotely? [Y/n] N
    [...]
    Remove test database and access to it [Y/n] Y
    [...]
    Reload privilege tables now? [Y/n] Y
    All done!

     

    To make sure the MySQL server starts at boot

    This needs sysv-rc-conf, installed via apt-get (a replacement for chkconfig):

    sudo apt-get install sysv-rc-conf
    sudo sysv-rc-conf mysql on
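
    A simple way to confirm the boot links were created (runlevel 2 is the default multi-user runlevel on Ubuntu):

    ls /etc/rc2.d/ | grep -i mysql   # an S??mysql entry means it starts at boot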

      

    Create the metastore database, create a hive user, and grant privileges:

    $ mysql -u root -p
    Enter password:
    mysql> CREATE DATABASE metastore;
    mysql> USE metastore;
    mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
    mysql> create user 'hive'@'%' identified by 'hive';
    mysql> create user 'hive'@'localhost' identified by 'hive';
    mysql> revoke all privileges, grant option from 'hive'@'%';
    mysql> revoke all privileges, grant option from 'hive'@'localhost';
    mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'%';
    mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
    mysql> FLUSH PRIVILEGES;
    mysql> quit;
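
    To check that the hive user can actually reach the schema that was just loaded (the tables come from the hive-schema SQL script sourced above):

    mysql -u hive -phive metastore -e "SHOW TABLES;" | head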

    Install mysql-connector-java and symbolically link the jar into the /usr/lib/hive/lib/ directory:

    sudo apt-get install libmysql-java
    sudo ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar

    Configure the metastore service to communicate with the MySQL database by editing hive-site.xml:

    sudo cp /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak
    sudo vim /etc/hive/conf/hive-site.xml

    Change the following properties:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
      <description>the URL of the MySQL database</description>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive</value>
    </property>
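
    A typo in this file only shows up when the services are started, so it can be worth validating the XML first (assuming xmllint from the libxml2-utils package is available):

    xmllint --noout /etc/hive/conf/hive-site.xml && echo "hive-site.xml is well-formed"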
    

    Start the services and initialize the HDFS directories for Hive:

    sudo service hive-metastore start
    sudo service hive-server start
    
    sudo -u hdfs hadoop fs -mkdir /user/hive
    sudo -u hdfs hadoop fs -chown hive /user/hive
    sudo -u hdfs hadoop fs -mkdir /tmp
    sudo -u hdfs hadoop fs -chmod 777 /tmp # already exists
    sudo -u hdfs hadoop fs -chmod o+t /tmp
    sudo -u hdfs hadoop fs -mkdir /data
    sudo -u hdfs hadoop fs -chown hdfs /data
    sudo -u hdfs hadoop fs -chmod 777 /data
    sudo -u hdfs hadoop fs -chmod o+t /data
    sudo chown -R hive:hive /var/lib/hive
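
    Finally, a quick smoke test of the whole chain, CLI -> metastore -> MySQL -> HDFS. The table name is only illustrative, and /user/hive/warehouse is Hive's default warehouse location; depending on permissions this may need to be run as the hive user:

    hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, name STRING); SHOW TABLES;"
    hadoop fs -ls /user/hive/warehouse   # smoke_test should appear here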