zoukankan      html  css  js  c++  java
  • 简单示例用例(Simple Example Use Cases)--hive GettingStarted用例翻译

    1、MovieLens User Ratings

    First, create a table with tab-delimited text file format:

    首先,创建一个通过tab分隔的表:

    CREATE TABLE u_data (

      userid INT,

      movieid INT,

      rating INT,

      unixtime STRING)

    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

    STORED AS TEXTFILE;

    Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):

    然后,下载数据文件从下面方法:

    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

    or:

    curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

    Note:  If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

    Unzip the data files:

    解压缩这个文件:

    unzip ml-100k.zip

    And load u.data into the table that was just created:

    并且加载数据到刚刚创建的u_data表中:

    LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

    Count the number of rows in table u_data:

    统计表u_data的行数:

    SELECT COUNT(*) FROM u_data;

    Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

    Now we can do some complex data analysis on the table u_data:

    现在我们可以做一些复杂的数据分析针对表u_data:

    Create weekday_mapper.py:

    import sys

    import datetime

    for line in sys.stdin:

      line = line.strip()

      userid, movieid, rating, unixtime = line.split(' ')

      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

      print ' '.join([userid, movieid, rating, str(weekday)])

    Use the mapper script:

    使用这个脚本:

    CREATE TABLE u_data_new (

      userid INT,

      movieid INT,

      rating INT,

      weekday INT)

    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

    add FILE weekday_mapper.py;

    INSERT OVERWRITE TABLE u_data_new

    SELECT

      TRANSFORM (userid, movieid, rating, unixtime)

      USING 'python weekday_mapper.py'

      AS (userid, movieid, rating, weekday)

    FROM u_data;

    解释:这里通过python脚本清洗表u_data中数据,使用python脚本通过

    TRANSFORM (userid, movieid, rating, unixtime)   --输入字段

      USING 'python weekday_mapper.py'              --脚本处理

      AS (userid, movieid, rating, weekday)         --输出字段

    SELECT weekday, COUNT(*)

    FROM u_data_new

    GROUP BY weekday;

    Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

    2、Apache Weblog Data

    The format of Apache weblog is customizable, while most webmasters use the default.
    For default Apache weblog, we can create a table with the following command.

    More about RegexSerDe can be found here in HIVE-662 and HIVE-1719.

    CREATE TABLE apachelog (

    host STRING,

    identity STRING,

    user STRING,

    time STRING,

    request STRING,

    status STRING,

    size STRING,

    referer STRING,

    agent STRING)

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

    WITH SERDEPROPERTIES (

    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\[^\]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?"

    )

    STORED AS TEXTFILE;

  • 相关阅读:
    Deploying an Application on X11 Platforms
    使用xenu查找web站点死链接使用方法及结果分析 Binbby 博客园
    About | Channel 9
    Amber is an implementation of the Smalltalk language that runs on top of the JavaScript runtime.
    linux下qt静态编译_自由出土文物的空间_百度空间
    A year of Qt ecosystem growth with Digia nurturing Qt Commercial
    华汇超市二层的小餐馆可当MBA案例了
    js web tools
    [转]PAC Manager: Ubuntu 上强大的 SSH 帐号管理工具,可取代 SecureCRT_Miracle_百度空间
    qq云输入法也支持五笔了,太适合我了,特别是在ubuntu下有时候输入法
  • 原文地址:https://www.cnblogs.com/wcwen1990/p/7082922.html
Copyright © 2011-2022 走看看