  • Hive streaming (TRANSFORM) filtering: the official example (note the Python script used in the middle step)

    Simple Example Use Cases

    MovieLens User Ratings

    First, create a table with tab-delimited text file format:

    CREATE TABLE u_data (
      userid INT,
      movieid INT,
      rating INT,
      unixtime STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
    

    Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):

    wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

    or:

    curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

    Note:  If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

    Unzip the data files:

    unzip ml-100k.zip

    And load u.data into the table that was just created:

    LOAD DATA LOCAL INPATH '<path>/u.data'
    OVERWRITE INTO TABLE u_data;
    

    Count the number of rows in table u_data:

    SELECT COUNT(*) FROM u_data;
    

    Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
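
    Before or after loading, it can help to sanity-check the file locally, since LOAD DATA does not validate rows against the table schema. The following is a minimal sketch, not part of the original example: check_rows is a hypothetical helper, and the sample lines are made up to follow the u.data layout. It counts rows and flags lines that do not have four tab-separated fields.

```python
def check_rows(lines, expected_fields=4):
    """Return (total_rows, malformed_rows) for tab-delimited lines."""
    total = 0
    malformed = 0
    for line in lines:
        total += 1
        # A well-formed u.data row has userid, movieid, rating, unixtime.
        if len(line.rstrip('\n').split('\t')) != expected_fields:
            malformed += 1
    return total, malformed

# Illustrative sample lines (assumed layout); the second is missing a field.
sample = ['196\t242\t3\t881250949\n', '186\t302\t3\n']
print(check_rows(sample))
```

    To check the real file, pass an open file object instead, for example check_rows(open('u.data')); the total should then agree with the COUNT(*) result above.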

    Now we can do some complex data analysis on the table u_data:

    Create weekday_mapper.py:

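    import sys
    import datetime

    # Hive streams each row to stdin as one tab-separated line.
    for line in sys.stdin:
      line = line.strip()
      userid, movieid, rating, unixtime = line.split('\t')
      # isoweekday() returns 1 (Monday) through 7 (Sunday).
      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
      # print() with a single argument works under both Python 2 and 3.
      print('\t'.join([userid, movieid, rating, str(weekday)]))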
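
    The per-line logic of the mapper can be tried outside Hive before it is registered. This is a sketch under assumptions: map_line is a hypothetical function wrapping the same steps as the script, and the sample row is illustrative. Because fromtimestamp() uses the local time zone, the weekday for a given timestamp may differ between machines.

```python
import datetime

def map_line(line):
    """Turn 'userid\\tmovieid\\trating\\tunixtime' into a row ending in a weekday."""
    userid, movieid, rating, unixtime = line.strip().split('\t')
    # isoweekday(): Monday is 1, Sunday is 7; interpreted in the local time zone.
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# Illustrative sample row in the u.data layout (assumed, not taken from a run).
sample = '196\t242\t3\t881250949\n'
print(map_line(sample))
```

    The first three fields pass through unchanged; only the last field is replaced, which matches the column list in the AS clause of the TRANSFORM query.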
    

    Source (Hive Getting Started guide): https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations

    Use the mapper script:

    CREATE TABLE u_data_new (
      userid INT,
      movieid INT,
      rating INT,
      weekday INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';
    
    ADD FILE weekday_mapper.py;
    
    INSERT OVERWRITE TABLE u_data_new
    SELECT
      TRANSFORM (userid, movieid, rating, unixtime)
      USING 'python weekday_mapper.py'
      AS (userid, movieid, rating, weekday)
    FROM u_data;
    
    SELECT weekday, COUNT(*)
    FROM u_data_new
    GROUP BY weekday;
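
    The final query counts ratings per weekday. As a rough local equivalent for checking results on a small sample, the same aggregation can be sketched in Python. count_by_weekday is a hypothetical helper, and the rows are made-up data in the u_data_new layout. The output is sorted here only for readability; Hive does not guarantee any order for GROUP BY results without an ORDER BY.

```python
from collections import Counter

def count_by_weekday(rows):
    """Count rows per weekday, like SELECT weekday, COUNT(*) ... GROUP BY weekday."""
    counts = Counter()
    for row in rows:
        userid, movieid, rating, weekday = row.rstrip('\n').split('\t')
        counts[int(weekday)] += 1
    # Sorted by weekday for readability only.
    return sorted(counts.items())

# Made-up rows in the u_data_new layout: userid, movieid, rating, weekday.
rows = ['1\t10\t5\t1', '2\t11\t3\t1', '3\t12\t4\t7']
for weekday, n in count_by_weekday(rows):
    print('%d\t%d' % (weekday, n))
```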
  • Original article: https://www.cnblogs.com/wangziyi0513/p/10515898.html